LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

Marcel wrote (very wisely):

> I'd like to bring the discussion back to the ICR issue, in
> particular how a hypothetical successor to TeX should handle
> input encodings.

The subsequent discussion led me to write the following, which does
not in fact say much about input encodings; I hope that the reason
for this omission becomes clear to those who read further.


chris


Part the First:

First, I regret to say, some general theory about text characters and
languages.

Note that the ICR is a Character Representation, in the sense defined
in the Unicode standard, and not a glyph representation or an Enhanced
Character Representation, ie one in which each `character' contains
more information than simply its (encoded) name.

A string in such an ICR is useful because it is a standardised
representation in which each character representation has a fixed
meaning \emph{as a character}.

BUT, for all the many reasons given in this discussion and several
others, having nothing more than such strings is inadequate for many
of the sub-processes involved in a typical document-processing system.
A complete analysis of these processes and the information they
require is probably not feasible and certainly not useful since they
change rapidly with the requirements of document processing.  The
only thing that can be said is that further information about these
strings must be accessible by the sub-processes of the system.

Therefore, rather than attempting to categorise the necessary
information and devise suitable ways to provide it, Frank and I came
up with the idea of simply supplying a single logical label for every
ICR string.  Since the first, and still the overwhelmingly most
diverse, parts of this information came from the needs of multi-lingual
documents, we called this label the `language' (maybe not a good
choice).  Our thesis is that `every text string must have a
language-label'.  The only property these labels need (and indeed are
able) to have is that they \emph{can} help any application or
sub-process to access the information it needs to process that text
string.

From this it follows that all applications and sub-processes must also
ensure that this language-label is preserved with all text strings to
which it applies (ie no text string ever goes anywhere without its
language-label).  In a (near) mono-lingual document it may not be
necessary to explicitly supply the label if that label is the
document's main language but this is a (possibly non-robust)
implementation efficiency.

[In order to distinguish these logical language-labels from anything
else in the TeX world let us call them LLLs.]

In the context of current TeX-related systems this means that:

-- whenever a character token list (in an ICR) is constructed or
   moved, then its LLL must go with it;


Part the Second:

So what does this mean for an ICR for a potential TeX/Omega?

First the simple case, in which everything that is encoded in the ICR
is `strongly internal': it is never seen in external files or passed
to external applications.

It means that all ICR strings, and hence all token lists potentially
containing text, must have an LLL.


Thus to take one small example, based on Javier's:

  \newcommand{\foo}{\unichar{<Unicode code>}}

is what is needed simply to get something that expands to the single
ICR character:

  \unichar{<Unicode code>}

or it may expand further to a single (Unicode-encoded character) token
(the choice being made according to whether one is still restricted to
8-bit internals etc).

Since this character is used in widely distinct languages, in order
to use this within Mandarin you would then, I guess, need to put
something like this in your document:

  \begin{mandarin}
    ... \foo{} ...
  \end{mandarin}

or  \mandarintext {... \foo{} ...}.


But if you want \foo to be exclusively a bit of Mandarin text then you
could (or even should) define something like (syntax is probably
dreadful):

  \newcommand{\foo}{\languageIC{mandarin}{\unichar{<Unicode code>}}}

How clever the expansion of \languageIC needs to be will depend on how
such input will be used.
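
As a minimal sketch only, and assuming an engine whose \char accepts
code points above 255, the simplest possible \languageIC might just
record the label locally while its text is processed.  All the names
here are hypothetical: \unichar and \languageIC are the commands used
above, and \currentLLL is invented for this illustration.

  % Hypothetical sketch: none of these commands exists in LaTeX.
  \newcommand{\currentLLL}{}          % LLL of the text being processed
  \DeclareRobustCommand{\unichar}[1]{\char"#1\relax}  % #1: hex digits
  \DeclareRobustCommand{\languageIC}[2]{%
    \begingroup
      \renewcommand{\currentLLL}{#1}%  label visible while #2 is set
      #2%
    \endgroup}

Both commands are declared robust so that, in a moving argument, a
string should be carried along as \languageIC with its label still
attached rather than being expanded away.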


Part the Third:

All of the above is completely independent of what input scheme is
used, except that defining things like \foo defines part of a 7-bit
input scheme.  All such inputs, and all other input encodings, must at
first input time be normalised to their unique representation in the
ICR(*).  To this ICR representation of a text string, an LLL will need
to be added, explicitly or implicitly.

IMPORTANT: After that first time input conversion the input encoding
that was used is unknown and not needed; this is a vital property of
our ICR model.

(*) This could be done by several methods, depending on the
application doing it; within Omega the following will, we hope, be
available: input-translation; expansion; character-token-list OTPs.
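
For 8-bit input, LaTeX's inputenc package already gives a feel for
this normalisation step.  The declarations below are of the kind found
in its encoding definition files (quoted from memory, so treat the slot
numbers as illustrative): the same character arrives as a different
byte in each input encoding, but both are mapped to the same internal
form, after which nothing records which encoding was used.

  % latin1.def: byte 233 is e-acute
  \DeclareInputText{233}{\'e}
  % cp437.def: byte 130 is e-acute
  \DeclareInputText{130}{\'e}

Here the internal form is the 7-bit \'e rather than a Unicode-based
ICR, and no LLL is attached; only the many-to-one mapping is the point.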


Part the Fourth:

Unfortunately, in a multi-pass system such as LaTeX the LICR is not
`strongly internal': information that is internal to LaTeX must be
stored in external files.  If these could be made effectively internal
then this would not be a problem but there are two difficulties with
that approach:

-- although not explicitly, current LaTeX makes these files readable
   and editable (at least by English users:-);

-- very similar information is written out for use by other
   applications.

[Note that here I am not thinking of writing to the terminal since
that is a one-way process and the output is useless if not immediately
comprehensible (it is thus a really tricky problem).  The status of
the log file is unclear (I would perhaps treat it like the terminal).]

Thus this leads directly to these questions (two, since different
choices may be needed):

1.  How should LICR strings be written out to files used only by LaTeX
    itself?

2.  How should LICR strings be written out to files read by other
    applications?

My feeling is that the answer to 1. should, if possible, be something
independent of any input schemes in use.

It is not so clear that this is possible for 2. and there may be good
reasons why these two outputs should be the same.
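
For question 1, one possible shape of an answer is sketched below,
under the assumption that \languageIC and \unichar are robust commands
(declared with \DeclareRobustCommand).  \recordtext and \recordedtext
are names invented for this illustration; \protected@write and
\@auxout are the LaTeX kernel internals that \label itself uses.

  % Hypothetical sketch: \recordtext and \recordedtext are invented.
  \makeatletter
  % Must be defined before the .aux file is read on the next run.
  \newcommand{\recordedtext}[1]{}
  % Write #1 to the .aux file, leaving robust commands unexpanded.
  \newcommand{\recordtext}[1]{%
    \protected@write\@auxout{}{\string\recordedtext{#1}}}
  \makeatother

With this, \recordtext{\foo} in the document body should put the
string into the .aux file as \languageIC, its label and \unichar with
its code, all in 7-bit form, whatever input encoding the source used.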


Part the Fifth:

So have I removed the question: "do we need to record the input
encoding?"?  Or merely cleverly hidden it?