At Tue, 5 Jun 2001 13:29:19 +0100, Chris Rowley wrote:
[...]
>Therefore, rather than attempting to categorise the necessary
>information and devise suitable ways to provide it, Frank and I came
>up with the idea of simply supplying a single logical label for every
>ICR string.  Since the first, and still the overwhelmingly most
>diverse, parts of this information came from the needs of multi-lingual
>documents, we called this label the `language' (maybe not a good
>choice).  Our thesis is that `every text string must have a
>language-label'.  The only property these labels need (and indeed are
>able) to have is that they \emph{can} help any application or
>sub-process to access the information it needs to process that text
>string.

I suggest that we use the term `context' rather than `language' here.
Quoting Webster's, `context' means:

   The part of a written discourse in which a certain word, phrase
   or passage appears, necessary to point the meaning, as, it is
   hard to tell the exact meaning of a word out of context.

[snip]
>[In order to distinguish these logical language-labels from anything
>else in the TeX world let us call them LLLs.]
>
>In the context of current TeX-related systems this
>means that:
>
>-- whenever a character token list (in an ICR) is constructed or
>   moved, then its LLL must go with it;

The most common event at which a character token list is formed is when a
command is grabbing one of its arguments. With the xparse package in full
control these arguments can be labelled under the current TeX engine, but
it is probably more reasonable to imagine that their attachment is handled
by primitive mechanisms in some extension of TeX. In this case, I suspect
the labels should be thought of as being nestable with separate markers for
beginning and end, so that each token list that is formed gets delimited by
matching begin and end labels that record the current context of the token
list they were extracted from. Thus if we have, in an English context,

   \subsubsection{The use of <begin-swedish>\"alv<end-swedish>}

(where the <..> denote such context labels), the token list becoming the
argument of \subsubsection would be

   <begin-english>The use of <begin-swedish>\"alv<end-swedish><end-english>

It then doesn't matter if this token list is inserted into a table of
contents that is set in a French context. Upon being written to an external
file, the labels should be converted to suitable markup.
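
To make the nesting concrete, here is a minimal sketch in Python (not TeX,
since no such primitives exist). Everything in it is an assumption made for
illustration: the names Begin, End, grab_argument and to_markup are
hypothetical, and \incontext is an invented stand-in for whatever "suitable
markup" would be used in an external file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Begin:
    context: str  # hypothetical context name, e.g. "english"


@dataclass(frozen=True)
class End:
    context: str


def grab_argument(tokens, current_context):
    """Delimit a grabbed token list with labels for the context it came from."""
    return [Begin(current_context), *tokens, End(current_context)]


def to_markup(tokens):
    """Convert labels to textual markup, checking that they nest properly."""
    out, stack = [], []
    for tok in tokens:
        if isinstance(tok, Begin):
            stack.append(tok.context)
            # \incontext is an invented command, not an existing one.
            out.append("\\incontext{" + tok.context + "}{")
        elif isinstance(tok, End):
            if not stack or stack.pop() != tok.context:
                raise ValueError(f"unmatched end label: {tok.context}")
            out.append("}")
        else:
            out.append(tok)
    if stack:
        raise ValueError(f"unclosed labels: {stack}")
    return "".join(out)


# The \subsubsection argument from the example, grabbed in an English context.
heading = ["The use of ", Begin("swedish"), '\\"alv', End("swedish")]
argument = grab_argument(heading, "english")
print(to_markup(argument))
# \incontext{english}{The use of \incontext{swedish}{\"alv}}
```

Because the outer labels travel with the token list, the result is the same
whatever context is current at the place where the list is finally used.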

An interesting question is whether these labels should be explicit tokens
or be hidden from the user (i.e., argument grabbing and things like
\futurelet look past them). Making them explicit tokens would probably
break tons of code.
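
The difference between the two alternatives can be sketched as follows, again
in Python and under stated assumptions: Label, peek_explicit and peek_hidden
are hypothetical names, and the token stream is a contrived case in which a
context switch sits between a command and its star. The check stands for the
kind of lookahead test that is usually built on \futurelet.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Label:
    kind: str     # "begin" or "end"
    context: str  # hypothetical context name


def peek_explicit(tokens, pos):
    """Lookahead when labels are ordinary tokens: return whatever comes next."""
    return tokens[pos] if pos < len(tokens) else None


def peek_hidden(tokens, pos):
    """Lookahead when labels are hidden: skip them and return the next real token."""
    for tok in tokens[pos:]:
        if not isinstance(tok, Label):
            return tok
    return None


# A starred command whose star happens to follow a begin label.
stream = ["\\section", Label("begin", "swedish"), "*", "{", '\\"alv', "}",
          Label("end", "swedish")]

print(peek_explicit(stream, 1) == "*")  # False: the label hides the star
print(peek_hidden(stream, 1) == "*")    # True: the star is found
```

Existing code that tests for a following star corresponds to the first
function, which is why explicit label tokens would break it.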

As for what the labels should be to the user, I think a scheme of making
them integers is pretty useless (how they are implemented is of course
another matter). A better idea would be to make them some kind of property
list, i.e., a container for diverse forms of information that are indexed
by name. Creating a new label value from an old one by copying its
properties and then changing some would be useful when defining dialects.
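
A minimal sketch of the copy-and-change idea, in Python for want of a real
implementation. The helper derive and all property names and values
(language, hyphenation, date_order, quotes) are hypothetical and chosen only
for illustration.

```python
def derive(parent, **changes):
    """Create a new label value by copying a parent's properties and overriding some."""
    child = dict(parent)   # copy, so the parent label is left untouched
    child.update(changes)
    return child


# Hypothetical property list for one context label.
british = {
    "language": "english",
    "hyphenation": "british",
    "date_order": "dmy",
    "quotes": ("`", "'"),
}

# A dialect only has to state what differs from its parent.
american = derive(british,
                  hyphenation="american",
                  date_order="mdy",
                  quotes=("``", "''"))

print(american["language"])    # english (inherited)
print(american["date_order"])  # mdy (overridden)
print(british["date_order"])   # dmy (parent unchanged)
```

A process that only needs, say, the hyphenation patterns looks up that one
name and can ignore every other property the label carries.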

The main problem I see with context labels is that of when they should be
attached, since one cannot do any context-dependent processing before the
context is determined. I can think of at least three different models:

1. Labels must be present in the input (e.g. encoded using control
characters). This might be nice from an implementation point of view, but
because of the problem of finding suitable editors it is probably only
realistic if a labelling system emerges that is accepted in a much wider
community than that of the users of TeX. This doesn't seem likely.

2. Do as today, i.e., context switches are initiated when commands are
executed. This has the problem that the context isn't completely known
until the text is being typeset, so one cannot do any irreversible
context-dependent processing until then. This seems a bit too restrictive
to me.

3. Have command-like markup for context-switching, but attach labels as
part of the tokenization. This has the merit of looking like current LaTeX
markup and allowing LaTeX to keep all ICR strings fully context-labeled,
but it would also mean that processing of markup is a two-step process
(first all language markup is processed, then all the rest). That doesn't
feel right.

Lars Hellström