On Wed, Feb 10, 2010 at 08:51:04AM +0000, Joseph Wright wrote:
> One of the questions that was raised recently on c.t.t concerning
> the currently available LaTeX3 modules was the lack of "strings"
> functionality. Looking on CTAN, I can find a number of packages
> providing some string-like functionality:
>
> - substr
> - coolstr
> - stringstrings
> - xstrings
>
> (plus some functions in other packages). I'm sure there are also others.
>
> Taking a look through them, I can find some similarities but also a
> number of differences. Before trying to create some kind of "l3str"
> package, I thought it might be useful to see what the feeling is
> about (a) if we need this at all (b) what constitutes a string (c)
> what functions are needed and (d) anything else!
>
> On (a) my feeling is that some kind of tools are needed, given the
> clear desire to have them (see my list above of current packages).
> However, perhaps others disagree as l3tl does provide a number of
> useful features already.
>
> The first "big" question is what exactly is a string in a TeX
> context.
Looking at the result of \string, \meaning, \detokenize, \jobname, ...
it's quite clear:
A string is a sequence of catcode 12 tokens, with the exception
that the space (charcode 32) has catcode 10.
The latter is rather unfortunate, because it makes string processing
unnecessarily troublesome. For example, a space cannot be caught
as an undelimited parameter.
==> Question: Catcode of space (10 or 12)?
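A minimal sketch of the problem (it assumes an e-TeX-enabled engine
for \detokenize; the names \first and \str are only for illustration):

  \def\first#1#2\stop{(#1)}
  \edef\str{\detokenize{ ab}}
  % TeX skips the catcode 10 space while looking for #1,
  % so #1 is "a" and the leading space is silently lost:
  \message{\expandafter\first\str\stop}

With a catcode 12 space, #1 would be the space itself.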
> If you look at the existing packages, they take differing
> approaches to handling items inside what they call strings. For
> example, some would consider "ab{cde}f" to be a string of four
> items: "a", "b", "cde" and "f", whereas other approaches would
> remove the "{" and "}" tokens.
This is IMHO more a question about converting a token list
to a string. Looking at TeX/pdfTeX:
\message, \write, \detokenize, \pdf(un)escapehex
* control words (command names made of letters) get an additional
space at the end
* most primitives are expanding; the exception is \detokenize.
\unexpanded can be used to prevent expansion.
A short test file to play with:
\def\macro{foobar}
\def\x{[\relax\macro]}
\def\msg#{\immediate\write16}
\msg{\string\x: \x}
\message{unexpanded message \string\x: \unexpanded\expandafter{\x}}
\message{expanded message \string\x: \x}
\msg{detokenize: \detokenize\expandafter{\x}}
\msg{meaning: \meaning\x}
\msg{\pdfunescapehex\expandafter{\pdfescapehex\expandafter{\x}}}
\csname @@end\endcsname\end
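A second sketch, for the "ab{cde}f" example from the question (it
assumes e-TeX for \detokenize; \third and \str are names invented
here). It contrasts how macro argument scanning sees the token list
and the converted string:

  \def\third#1#2#3#4\stop{(#3)}
  \edef\str{\detokenize{ab{cde}f}}
  % token list: the brace group is one item, so #3 is "cde"
  \message{\third ab{cde}f\stop}
  % string: the braces are catcode 12 characters, so #3 is "{"
  \message{\expandafter\third\str\stop}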
The token list "ab{cde}f" with usual catcodes would become
"ab{cde}f" as a string consisting of tokens with catcode 12.
> An obvious suggestion is that a
> string is something which has been \detokenize'd, but then you have
Comparing with \message, \write, \pdf(un)escapehex the more
natural approach would be an expanding \detokenize (\edef followed
by \detokenize). If expandability is needed, then there isn't
a direct approach in TeX or e-TeX. At least pdfTeX offers a way:
\pdfunescapehex\expandafter{\pdfescapehex{...}}
The primitive \expanded is promised for pdfTeX 1.50.
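A sketch of the non-expandable variant, \edef followed by \detokenize
(it needs e-TeX; \tmp and \str are scratch names chosen for this
example):

  \def\macro{foobar}
  \def\x{[\relax\macro]}
  \edef\tmp{\x}                            % expand fully first
  \edef\str{\detokenize\expandafter{\tmp}} % then convert to a string
  \message{\meaning\str}                   % macro:->[\relax foobar]

The two assignments are what make this approach unusable in
expansion-only contexts.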
> You also have to worry about what happens about special characters
> (for example, how do you get % into a string). If you escape things
> at the input stage [say \% => % (catcode 12)] then a simple
> \detokenize will not work.
I think that's beyond a string module. At TeX input level, you can only
input tokens subject to the current catcode settings.
The input of the string module would rather be token lists that
get converted to strings, basically catcode 12 tokens. And the output
would also be strings.
> On features, things that seem to be popular:
> - Substring functions such as "x characters from one end", "first x
> characters", etc.
> - Search functions such as "where is string x in string y".
> - Case-changing functions.
* For dealing with PDF or PostScript it is useful to have conversions
from and to PDF/PS names, PDF/PS strings and hex strings.
See pdfTeX primitives \pdfescapehex et al.
and packages `pdfescape' and `pdftexcmds'.
* Encoding conversions, see package `stringenc'.
Application: PDF (outlines and other text fields).
* Matching (replacing) using regular expressions,
see \pdfmatch and LuaTeX.
Matching is useful for extracting pieces of information or
validating option values, ...
Unfortunately \pdfmatch still has the status "experimental"
and its regular expression language differs from Lua's.
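For illustration, a small sketch using the pdfTeX primitives named
above (pdfTeX only; the pattern and the test values are invented for
this example). As far as I read the pdfTeX manual, \pdfmatch expands
to 1 on a match, 0 on no match and -1 for an invalid pattern; since
it is experimental, this may change:

  \message{\pdfescapehex{ab}}      % hex string: 6162
  \message{\pdfunescapehex{6162}}  % and back again: ab
  % validating an option value that must be an unsigned integer
  \message{\ifnum\pdfmatch{^[0-9]+$}{1234}=1 valid\else invalid\fi}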
Yours sincerely
Heiko