Joseph Wright wrote:
> Hello all,
>
> One of the questions that was raised recently on c.t.t concerning the
> currently available LaTeX3 modules was the lack of "strings"
> functionality.
[snip]
> The first "big" question is what exactly is a string in a TeX context.

Indeed, there are several possible interpretations:

(1) Sequences of character tokens
(2) Sequences of character tokens with normalised catcodes
(3) Sequences of characters from some alphabet (possibly large),
    representation not necessarily native
(4) Sequences of LICRs.

Which you want to use depends on what is being targeted, i.e., what
strings are going to be used for. \write, \special, and \csname are
probably the main consumers, and since these want (1), that's probably
the main thing to support. But it is also important to (eventually)
provide conversions between the different string-like concepts. One
should not expect one size to fit all.

> You also have to worry about what happens about special characters (for
> example, how do you get % into a string). If you escape things at the
> input stage [say \% => % (catcode 12)] then a simple \detokenize will
> not work.

For manual entering of string data, one might well find that (3) or (4)
is most practical...

> On features, things that seem to be popular:
> - Substring functions such as "x characters from one end", "first x
>   characters", etc.
> - Search functions such as "where is string x in string y".

...whereas searching typically requires (2).

I would suggest that a core string module would primarily operate on
the (1) kind of string, possibly requiring (2) for some operations, and
provide the necessary conversion operation 1->2 (trusting the user to
apply it where necessary, rather than building it into each and every
operation just to be on the safe side).

Heiko Oberdiek wrote:
> * Encoding conversions, see package `stringenc'.
>   Application: PDF (outlines and other text fields).

This is, at least for the input, rather (3) or (4).
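On the \detokenize point above, a small e-TeX sketch (the macro name
\temp is just for illustration) of why input-stage escaping and plain
\detokenize do not mix: if the convention is that \% in the input
denotes a catcode-12 %, then \detokenize gives the wrong answer, since
it renders the control symbol \% as two characters.

```tex
% \detokenize turns the control symbol \% into the two characters
% "\" and "%" (both catcode 12), rather than into the single
% catcode-12 "%" that the input-stage escaping convention wants.
\edef\temp{\detokenize{a\%b}}
\message{\meaning\temp}
% reports: macro:->a\%b  -- four characters, not three
```

So a module adopting such an escaping convention needs its own mapping
pass; \detokenize alone only normalises catcodes, interpretation (2).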
Or are you anticipating character sets larger than ^^@--^^ff for the
underlying engine? Then one conversely needs an "octet string" concept,
for \special and the like.

> * Matching (replacing) using regular expressions,
>   see \pdfmatch and luaTeX.
>   Matching is useful for extracting information pieces or
>   validating option values, ...

Be aware that matching is one thing, extracting information pieces a
somewhat trickier concept, and replacing even more so (from the CS
theory point of view).

> Unhappily \pdfmatch has still the status "experimental"
> and the regular expression language differs from Lua's.

The last time I looked, Lua's "regular expressions" were not regular
expressions[*] at all, but rather a kind of beefed-up glob pattern
(with a regexp-like syntax), so I wouldn't be sad if LaTeX were to
deviate from Lua in that respect. I would be sad if something were
called a regular expression that really isn't.

Lars Hellström

[*] There are several equivalent and perfectly formal definitions of
what it means to be "regular" as in "regular expression", the most
familiar of which is probably that a regular language is one that can
be recognised by a finite automaton. POSIX regexps are very close to
this (the only irregular feature being backreferences), whereas Perl's
"regexps" are way out in context-free-land. Lua's matching engine,
OTOH, is too weak to recognise arbitrary regular languages.
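P.S. For the record, the "first x characters" kind of substring
operation on strings of kind (1) is already expressible with classical
delimited-parameter macros; a minimal plain-TeX sketch (the names
\firstof etc. are made up for this example, and assume a nonempty,
brace-free argument):

```tex
% Split off the first character of a string via a macro whose second
% parameter is delimited by \relax; both helpers are fully expandable.
\def\firstof#1{\firstofaux#1\relax}
\def\firstofaux#1#2\relax{#1}
\def\restof#1{\restofaux#1\relax}
\def\restofaux#1#2\relax{#2}
\message{\firstof{hello} \restof{hello}}
% reports: h ello
```

A real module would of course have to cope with empty strings and with
catcode questions, which is where the 1->2 conversion comes in.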