Joseph Wright wrote:
> Hello all,
> One of the questions that was raised recently on c.t.t concerning the
> currently available LaTeX3 modules was the lack of "strings".
> The first "big" question is what exactly is a string in a TeX context.
Indeed, there are several possible interpretations:
(1) Sequences of character tokens
(2) Sequences of character tokens with normalised catcodes
(3) Sequences of characters from some alphabet (possibly large),
representation not necessarily native
(4) Sequences of LICRs.
Which you want to use depends on what is being targeted, i.e., what
strings are going to be used for. \write, \special, and \csname are
probably the main consumers, and since these want (1), that's probably
the main thing to support.
But it is also important to (eventually) provide conversions between
different string-like concepts. One should not expect one size to fit all.
> You also have to worry about what happens about special characters (for
> example, how do you get % into a string). If you escape things at the
> input stage [say \% => % (catcode 12)] then a simple \detokenize will
> not work.
For manual entering of string data, one might well find that (3) or (4)
is most practical...
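For entering the awkward characters themselves, the escaping problem can
at least be solved at definition time: the time-honoured trick for
getting a catcode-12 % into a macro (essentially how the LaTeX kernel
obtains \@percentchar) is a local catcode change. A minimal sketch
(\percentchar is an illustrative name here, not an actual kernel macro):

    \begingroup
    \catcode`\%=12
    \gdef\percentchar{%}
    \endgroup

After this, \percentchar expands to a single percent character of
catcode 12, which passes unharmed through \write and \detokenize.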
> On features, things that seem to be popular:
> - Substring functions such as "x characters from one end", "first x
> characters", etc.
> - Search functions such as "where is string x in string y".
...whereas searching typically requires (2).
I would suggest that the core string module primarily operate on
strings of kind (1), possibly requiring (2) for some operations, and
provide the necessary conversion operation 1->2 (trusting the user to
apply it where necessary, rather than building it into each and every
operation just to be on the safe side).
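To make the 1->2 distinction concrete, a minimal sketch using only
e-TeX primitives (macro names are illustrative):

    \def\x{abc}                  % three letter tokens (catcode 11)
    \edef\y{\detokenize{abc}}    % three "other" tokens (catcode 12)
    % \x and \y print identically, but \ifx\x\y is false; this is
    % why searching and comparison want (2), i.e. tokens normalised
    % to predictable catcodes, before they run.

So \detokenize already supplies the 1->2 conversion; presumably the
module would mainly need to wrap it and document where it is required.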
Heiko Oberdiek wrote:
> * Encoding conversions, see package `stringenc'.
> Application: PDF (outlines and other text fields).
This is, at least for the input, rather (3) or (4). Or are you
anticipating character sets larger than ^^@--^^ff in the underlying
engine? In that case one conversely needs an "octet string" concept,
for \special and the like.
> * Matching (replacing) using regular expressions,
> see \pdfmatch and luaTeX.
> Matching is useful for extracting information pieces or
> validating option values, ...
Be aware that matching is one thing, extracting pieces of information
is somewhat trickier, and replacing is trickier still (from the
CS-theory point of view).
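To make that distinction concrete, a hedged sketch with pdfTeX's
\pdfmatch (syntax as I read it in the pdfTeX manual; I have not
battle-tested this):

    % \pdfmatch expands to 1 (match), 0 (no match), -1 (bad pattern):
    \pdfmatch subcount 3 {^([a-z]+)=([0-9]+)$}{width=42}
    % Extraction is then a second, separate step: \pdflastmatch n
    % expands to the n-th submatch in the form "position->text"
    % (n = 0 being the whole match).

So even there, matching and extracting are distinct operations with
distinct interfaces.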
> Unhappily \pdfmatch has still the status "experimental"
> and the regular expression language differs from Lua's.
The last time I looked, Lua's "regular expressions" were not regular
expressions[*] at all, but rather a kind of beefed-up glob pattern
(with a regexp-like syntax), so I wouldn't be sad if LaTeX were to
deviate from Lua in that respect. I would be sad, however, if something
that really isn't a regular expression were called one.
[*] There are several equivalent and perfectly formal definitions of
what it means to be "regular" as in "regular expression", the most
familiar of which is probably that a regular language is one that can
be recognised by a finite automaton. POSIX regexps are very close to
this (the only irregular feature being backreferences), whereas Perl's
"regexps" are way out in context-free-land. Lua's matching engine,
OTOH, is too weak to recognise arbitrary regular languages.
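Both halves of that claim can be made precise (a sketch, relying on
standard formal-language facts):

    % With a backreference, the POSIX BRE  ^\(a*\)b\1$  matches
    % exactly the language
    \[
      L = \{\, a^{n} b\, a^{n} : n \ge 0 \,\}
    \]
    % which the pumping lemma shows is not regular.  Conversely, the
    % regular language (ab|cd)* has no equivalent Lua pattern: Lua
    % quantifiers attach only to a single character class, and there
    % is no alternation operator.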