Hi all,
> trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
> Example:
> if é was in the cp437 code page (German DOS) it would
> be the 8bit char "82; that would become the 16bit token with
> number "0082 (which is NOT é in unicode = "00E9)
>
> if on the other hand é was in latin1 (where it is "E9) we
> get "00E9
>
> if the input was in utf8 you would not get unicode chars as
> the result but sequences of 16bit chars all starting with "00
> --- so a unicode character that is multibyte in utf8 would not
> become the correct unicode 16-bit token but would become a sequence
> of tokens each of the form "00
Well, you answered that :-). utf-8 is a very interesting case because
a char can be represented with one, two, or more bytes. If you say
\def\xx#1#2{}
\xx aé % two chars in utf8
then \xx doesn't understand it and gets #1 -> a, and #2 -> Ã (the
first byte of é, "C3), which is wrong. If we preprocess the document
to convert each utf8 sequence to a single 16-bit char then things
will work.
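To make the difference concrete, here is a small Python sketch (not
Omega code; the function names trans_a, preprocess_utf8 and hex16 are
mine, invented for illustration). It widens bytes as in trans A and
compares that with decoding utf8 before tokenising:

```python
def trans_a(raw: bytes) -> list[int]:
    """Trans A: each 8-bit input byte becomes the 16-bit token with the same number."""
    return list(raw)

def preprocess_utf8(raw: bytes) -> list[int]:
    """Decode UTF-8 first, then emit one code per character (BMP characters assumed)."""
    return [ord(ch) for ch in raw.decode("utf-8")]

def hex16(codes: list[int]) -> list[str]:
    """Format codes in TeX-style hex notation, e.g. "00E9."""
    return ['"%04X' % c for c in codes]

e_acute = "\u00e9"  # é, written as an escape to keep this file ASCII-only

print(hex16(trans_a(e_acute.encode("cp437"))))          # ['"0082']
print(hex16(trans_a(e_acute.encode("latin-1"))))        # ['"00E9']
print(hex16(trans_a(("a" + e_acute).encode("utf-8"))))  # ['"0061', '"00C3', '"00A9']
print(hex16(preprocess_utf8(("a" + e_acute).encode("utf-8"))))  # ['"0061', '"00E9']
```

With trans A alone, \xx would see three tokens for "aé"; after the
preprocessing step it sees two, which is what we want.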
Preprocessing is also important if we want to make it possible to
write macro names with non-ASCII chars, like Spanish \capítulo.
This feature may seem mostly useless in Latin scripts (I could
write \capitulo), but in other scripts it is essential, particularly
in those not written from left to right. Making non-ASCII chars
active means that we cannot provide this feature.
However, this kind of preprocessing poses some problems which I've
not solved yet (but I'm close, or so I hope). You provide an
example in a subsequent message:
[...]
> % the following fails (not surprisingly)
> % and can't be corrected later on
>
> \def\foo{ab
> \InputTranslation currentfile\OCPa
> cä}
> \show\foo
>
>
> the second \foo will now contain the tokens
>
> \foo=macro:
> ->ab \InputTranslation currentfile\OCPa c^^c3^^a4.
>
[..]
>
> Since we have been asked to provide input encoding changes for LaTeX within
> paragraphs, eg for individual words, something like this would happen if such
> a change appears, say, inside the argument of \section.
A system to coordinate preprocessing and "internal" processing is necessary.
> Discussion:
> ===========
>
> The problem really is transforms of type D which are using OICR1 and are thus
> likely to break in the sense that their encoding information is lost in the
> process.
Not a problem, really. Currently the meaning of \'e depends on the
font encoding, whose `state' is always available. The same idea
can be applied to input encodings so that the encoding information
is always available.
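As a rough illustration of that idea (a Python sketch only, not how
Omega or LaTeX does it; the Chunk class and decode_chunks are
hypothetical names), stored text can carry the input encoding that
was in force when it was read, so it can still be translated later:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Chunk:
    encoding: str  # input encoding in force when the bytes were read
    raw: bytes     # untranslated 8-bit input

def decode_chunks(chunks: list[Chunk]) -> str:
    """Translate each chunk using the encoding state stored with it."""
    return "".join(c.raw.decode(c.encoding) for c in chunks)

# The same word stored under two different input encodings.
stored = [Chunk("latin-1", b"caf\xe9 "), Chunk("cp437", b"caf\x82")]
print(decode_chunks(stored))
```

Both chunks come out as the same word, although the stored bytes
differ, because the encoding information was never lost.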
> bla bla \french{foo} bla bla
>
> the input encoding change would not be noticed until after "foo" has already
> been tokenised (incorrectly) --- yes, we know that this example could be made
> to work using Don's \footnote trick but as with LaTeX's \verb there will be
> situations in which even more elaborate implementations will still fail due to
> tokenisation happening before any macro expansion is possible.
For instance
\def\bar{bla bla \french{foo} bla bla}
doesn't work. However, by using OCPs it will work.
And finally a little table:
====================================================
            A      B      C      D        E
----------------------------------------------------
TeX    a)   "82    \'e    * - - - - - - > "E9
       b)   \'e    \'e    * - - - - - - > "E9
       c)   "82    "82    * - - - - - - > "82
====================================================
Omega  a)   "82    "82    "82    "00E9    "E9
       b)   \'e    \'e    "82    "00E9    "E9
====================================================
A: the source file.
B: a) and b) LICR, as created by \protected@edef or
   \protected@write, except c) if inputenc is not used because
   the font encoding is the same as the input encoding.
C: evaluation, with fully expanded tokens. In TeX this is
   the final step (= step E in Omega) with the font
   codes.
D: after input encoding translation.
E: after font encoding translation; the final step in
   Omega.
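The Omega row a) can be traced with a short Python sketch. This is
only an illustration under assumptions: the source is taken to be
cp437 (where é is "82), and FONT_SLOTS is a hypothetical one-entry
font table, not a real font encoding file.

```python
def input_translation(byte: int, input_encoding: str) -> int:
    """Step D: map one 8-bit input code to its Unicode code point."""
    return ord(bytes([byte]).decode(input_encoding))

# Hypothetical font table: Unicode code point -> slot in the font.
FONT_SLOTS = {0x00E9: 0xE9}

def font_translation(code_point: int) -> int:
    """Step E: map a Unicode code point to the font slot."""
    return FONT_SLOTS[code_point]

c = 0x82                             # step C: the char as read
d = input_translation(c, "cp437")    # step D
e = font_translation(d)              # step E
print('"%02X -> "%04X -> "%02X' % (c, d, e))  # "82 -> "00E9 -> "E9
```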
Cheers
Javier