LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Bruno Le Floch
Mon, 10 Oct 2011 11:07:12 -0400

Hello all,

We just added on CTAN two related modules: l3str (string manipulation)
and l3regex (regular expression matching and replacement). This is a
request for comments on what (?:is needed|would be nice to have) in
this general area. I will explain what is currently there, and
hopefully you can tell me I got everything wrong (in what I covered,
and in the doc). Example:

\str_input:Nn \l_foo_str { Hello, \ \#\$\% \x20 world! }
\regex_replace_all:nnN { \b (\w)+ \b } { (\0:\1) } \l_foo_str

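results in \l_foo_str containing "(Hello:o), #$% (world:d)!". Here \1
is the last letter of each word, because the repeated group (\w)+
keeps only its final iteration.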


The l3str module provides functions to get the length of a string,
extract substrings or individual characters, and test for string
equality (the current \str_if_eq:nnTF). Some support for encodings is
provided: percent encoding, conversion from utf-8 to a string of
bytes, and most functions of Heiko Oberdiek's pdfescape package.

The approach taken in l3str is that any user input is \detokenize-d
before the string function is used. Each function has ':n' and ':N'
variants accepting any token list (or token list variable) as an
input, escaping spaces for internal use, and applying the operation,
as well as an "ignore_spaces" variant which does not escape spaces.
For instance, \str_length:N and \str_length:n count the number of
characters in the results of \tl_to_str:N and \tl_to_str:n, while
\str_length_ignore_spaces:n counts non-space characters (and is
faster).
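
To make the difference between the two counting behaviours concrete,
here is a minimal sketch. It assumes the names used in later expl3
releases (\str_count:n and \str_count_ignore_spaces:n) rather than the
\str_length:... names described above, and a LaTeX kernel recent
enough to have expl3 preloaded.

```latex
\documentclass{article}
\ExplSyntaxOn
% Assumption: \str_count:n and \str_count_ignore_spaces:n are the
% later expl3 names for the length functions described in this post.
% In expl3 syntax "~" stands for a space, so the string is "a b".
% All characters are counted, spaces included: 3.
\typeout { with~spaces:~ \str_count:n { a~b } }
% Spaces are skipped while counting: 2.
\typeout { ignoring~spaces:~ \str_count_ignore_spaces:n { a~b } }
\ExplSyntaxOff
\begin{document}
\end{document}
```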

An input function is provided, where any non-letter character can be
escaped, and some sequences such as \n or \x{0A} are recognized (both
happen to be the new-line character ^^J).

- The names of the conversion functions are quite bad.
- Most encoding functions (as opposed to decoding) could be made
expandable. Is it worth it?
- There may be a need to support other encodings. Which ones?
- It is not clear how utf-8 input/output should be treated in pdftex.
- Missing functions? What would you wish for?
- Some day I may add printf-like string formatting. Is that useful?


The l3regex module allows for testing if a string matches a given
regular expression, counting matches, extracting submatches, splitting
at occurrences of a regular expression, and doing replacement (see
documentation for function names).
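
As a rough sketch of those operations, the following uses the
function names from later l3regex releases (\regex_match:nnTF,
\regex_count:nnN, \regex_extract_once:nnN, \regex_split:nnN,
\regex_replace_all:nnN). Those names, and the fact that they act on
token lists, are assumptions relative to the version announced here;
the \l_demo_... variables are made up for the example.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical scratch variables for this sketch.
\int_new:N \l_demo_int
\seq_new:N \l_demo_seq
\tl_new:N  \l_demo_tl

% Testing: does the text contain at least one digit?
\regex_match:nnTF { \d+ } { abc123 }
  { \typeout { match } }
  { \typeout { no~match } }

% Counting: "a1b22c" contains three single digits.
\regex_count:nnN { \d } { a1b22c } \l_demo_int
\typeout { digits:~ \int_use:N \l_demo_int }

% Extracting: the sequence receives the whole match, then each group.
\regex_extract_once:nnN { (\w+)@(\w+) } { user@host } \l_demo_seq
\typeout { extracted:~ \seq_use:Nn \l_demo_seq { ~|~ } }

% Splitting at each comma.
\regex_split:nnN { , } { a,b,c } \l_demo_seq
\typeout { split:~ \seq_use:Nn \l_demo_seq { ~|~ } }

% Replacing: wrap every run of digits in brackets.
\tl_set:Nn \l_demo_tl { a1b22c }
\regex_replace_all:nnN { \d+ } { [\0] } \l_demo_tl
\typeout { replaced:~ \tl_use:N \l_demo_tl }
\ExplSyntaxOff
\begin{document}
\end{document}
```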

Speed requirements forbid a back-tracking approach, hence
back-references cannot be supported. Only "truly regular" features are
implemented. The syntax is identical to PCRE (except for space
handling), hence very similar to Perl and to POSIX regular
expressions. Details on the syntax can be found in the doc.

If a regular expression is used a lot, a precompiled version of that
expression can be stored in a token list. In principle, that can be
written to a file, but it is not a very compact notation, so if that
turns out to be useful, we can improve it.
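
A minimal sketch of reusing a precompiled expression follows. It
assumes the interface of later l3regex releases, where a compiled
regex lives in a dedicated variable (\regex_new:N, \regex_set:Nn) and
is used through the N-type variants, rather than in a plain token
list as described above. The variable names are hypothetical.

```latex
\documentclass{article}
\ExplSyntaxOn
% Hypothetical variables for this sketch.
\regex_new:N \l_demo_word_regex
\tl_new:N    \l_demo_tl
\int_new:N   \l_demo_int

% Compile once ...
\regex_set:Nn \l_demo_word_regex { \b \w+ \b }

% ... then reuse the compiled form as often as needed.
\regex_count:NnN \l_demo_word_regex { Hello,~world! } \l_demo_int
\typeout { words:~ \int_use:N \l_demo_int }

\tl_set:Nn \l_demo_tl { Hello,~world! }
\regex_replace_all:NnN \l_demo_word_regex { [\0] } \l_demo_tl
\typeout { bracketed:~ \tl_use:N \l_demo_tl }
\ExplSyntaxOff
\begin{document}
\end{document}
```

Compiling once only pays off when the same expression is applied many
times, since the n-type functions recompile it on every call.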

- Newlines. Currently, "." matches every character; in Perl and PCRE
it does not match new lines. As you know, the situation with new
lines in TeX is a little bit odd, since they are converted to the
\endlinechar upon reading, and normally not tokenized, simply giving
rise to a space or a \par. Should we still decide that "." matches
neither the CR nor the LF character? Or should it simply not match
the \endlinechar?

- I had the idea of providing # as a shorthand for .*? (arbitrary
sequence of characters, lazy), mimicking what TeX does when finding a
macro parameter. Is it useful?

- Same question for caseless matching, and for look-ahead/look-behind
assertions.

- A facility for matching a balanced group (e.g., as xparse does for
optional arguments)? That is non-regular, and is difficult to
implement, so I will only look at it if it is really needed.
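
To illustrate the proposed # shorthand from the list above, here is a
small sketch comparing a TeX delimited parameter with the lazy and
greedy forms of ".*". It assumes the function names of later l3regex
releases; \demo_until_b:w and \l_demo_tl are hypothetical names.

```latex
\documentclass{article}
\ExplSyntaxOn
\tl_new:N \l_demo_tl % hypothetical scratch variable

% A delimited macro parameter grabs the shortest text up to the first
% "b", which is the behaviour the "#" shorthand would mimic.
\cs_new:Npn \demo_until_b:w #1 b { [#1] }
\typeout { delimited:~ \demo_until_b:w aXbYb } % [aX]Yb

% Lazy: ".*?" stops at the first "b", so only "aXb" is replaced.
\tl_set:Nn \l_demo_tl { aXbYb }
\regex_replace_once:nnN { a.*?b } { - } \l_demo_tl
\typeout { lazy:~ \tl_use:N \l_demo_tl } % -Yb

% Greedy: ".*" runs to the last "b", so the whole text is replaced.
\tl_set:Nn \l_demo_tl { aXbYb }
\regex_replace_once:nnN { a.*b } { - } \l_demo_tl
\typeout { greedy:~ \tl_use:N \l_demo_tl } % -
\ExplSyntaxOff
\begin{document}
\end{document}
```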


Comments, feature requests, bug reports etc. are all highly welcome.
Thanks for reading so far :).
-- 
Bruno Le Floch
