On 10/02/2010 10:09, Heiko Oberdiek wrote:
>> The first "big" question is what exactly is a string in a TeX
> Looking at the result of \string, \meaning, \detokenize, \jobname, ...
> it's quite clear:
> A string is a sequence of catcode 12 tokens with the exception
> that the space (charcode 32) has catcode 10.
I was not thinking of the engine's point of view (where of course you
are correct) but of the programmer's, given that the existing "strings"
packages don't seem to take the TeX approach to what constitutes a
string.
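
To make the engine-level definition concrete, here is a minimal plain
e-TeX sketch. The names \str and \test are just made up for the
illustration. It checks that an "a" coming out of \detokenize no longer
has the category code of a typed letter:

```tex
% Plain e-TeX sketch; \str and \test are hypothetical names.
\edef\str{\detokenize{a}}% one token: "a" with catcode 12
% Compare the category code of #1 with that of a typed letter "a".
\def\test#1{\ifcat a\noexpand#1letter\else other\fi}
\message{typed: \test a}% a typed "a" has catcode 11, so: letter
\message{string: \expandafter\test\str}% the detokenized "a": other
\bye
```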
> The latter is quite unfortunate, because it makes string processing
> unnecessarily troublesome. For example, a space cannot be caught
> as an undelimited parameter.
> ==> Question: Catcode of space (10 or 12)?
This occurred to me, too. I was leaning toward "everything catcode 12"
for that reason.
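
A small sketch of the problem, assuming plain e-TeX (the names
\firstchar, \str and \otherspace are hypothetical). An undelimited
argument skips a catcode-10 space but does grab a catcode-12 one, which
here is built with the usual \lowercase trick:

```tex
% Plain e-TeX sketch; \firstchar, \str and \otherspace are hypothetical names.
\def\firstchar#1#2\relax{[#1]}
\edef\str{\detokenize{ a b}}% space (catcode 10), a, space, b
\message{\expandafter\firstchar\str\relax}% shows [a]: the leading space is skipped

% Build a space with catcode 12 using the \lowercase trick.
\begingroup
\lccode`\*=32 % "*" (catcode 12) will lowercase to charcode 32
\lowercase{\endgroup\def\otherspace{*}}
\message{\expandafter\firstchar\otherspace a\relax}% the space is now grabbed as #1
\bye
```

With "everything catcode 12" the second behaviour would apply to every
space in a string, so a character-by-character loop would not need a
special case for spaces.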
>> You also have to worry about what happens about special characters
>> (for example, how do you get % into a string). If you escape things
>> at the input stage [say \% => % (catcode 12)] then a simple
>> \detokenize will not work.
> I think that's beyond a string module. At TeX input level, you can only
> input tokens under the authority of the current catcode settings.
> The input of the string module would rather be token lists, that
> get converted to strings, basically catcode 12 tokens. And the output
> would also be strings.
Again, I was thinking about what is already "out there". You'll see that
stringstrings, for example, takes a rather detailed approach to this
type of problem. Whether that is right I'm not sure!
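
For reference, one common way to get a single % with catcode 12 into a
macro without touching catcodes at the input stage is sketched below.
\gobble and \percentchar are hypothetical names, and the sketch assumes
the default \escapechar, so that \string\% yields a backslash followed
by the percent character:

```tex
% Plain TeX sketch; \gobble and \percentchar are hypothetical names.
% Assumes the default \escapechar, so \string\% yields two tokens.
\def\gobble#1{}
% \string runs first, then \gobble discards the backslash.
\edef\percentchar{\expandafter\gobble\string\%}% one token: % with catcode 12
\message{100\percentchar}
\bye
```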
> * For dealing with PDF or PostScript it is useful to have conversions
> from and to PDF/PS names, PDF/PS strings and hex strings.
> See pdfTeX primitives \pdfescapehex et al.
> and packages `pdfescape' and `pdftexcmds'.
> * Encoding conversions, see package `stringenc'.
> Application: PDF (outlines and other text fields).
At present we seem to have stayed away from encodings. My own preference
is to leave things to LaTeX2e when working as a package and to use the
"native" encoding only for the format (with UTF-8 engines available this
seems sensible to me).
> * Matching (replacing) using regular expressions,
> see \pdfmatch and luaTeX.
> Matching is useful for extracting information pieces or
> validating option values, ...
> Unfortunately \pdfmatch still has the status "experimental"
> and the regular expression language differs from Lua's.
I think we'll be staying away from this. XeTeX has no equivalent of
\pdfmatch, and as you say the LuaTeX version works differently from the
pdfTeX one. [At present, we only *require* e-TeX in any case, although
an engine with \(pdf)strcmp available is very useful.]
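
As a rough sketch of why the comparison primitive is handy: it is
expandable and yields -1, 0 or 1, so it can sit directly inside \ifnum.
The wrapper name \strcompare is hypothetical, \ifdefined needs e-TeX,
and an engine that offers neither primitive (LuaTeX, for example) would
need another route, such as the `pdftexcmds' package:

```tex
% Sketch only; \strcompare is a hypothetical name.
\ifdefined\pdfstrcmp
  \let\strcompare\pdfstrcmp % pdfTeX
\else\ifdefined\strcmp
  \let\strcompare\strcmp % XeTeX
\fi\fi
% Expands to -1, 0 or 1, so it can be used inside \ifnum.
\ifnum\strcompare{abc}{abd}<0 \message{abc sorts before abd}\fi
\bye
```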