At 14:20 +0100 2001/02/17, Frank Mittelbach wrote:

There appear to be two variations, one based on the original TeX, and one in which TeX has some kind of extensions.

As for the second approach, it seems to me that the internal representation should be 32-bit Unicode. As TeX does not seem well equipped to handle the encoding issues, one should then hook up a preprocessor providing the suitable translations. Thus

    whatever encoding -> preprocessor -> UTeX

This easy-to-write preprocessor can combine combining characters into single Unicode characters, if possible, or otherwise write them in a form that UTeX can easily handle, say by switching from postfix to prefix notation, or whatever. With further tweaking of the TeX engine it could even combine TeX combinations such as "--" and "---" into single Unicode characters.

It is further easy to write such translators for reading from and writing to files. Thus the picture becomes

                        trans C (eg Uppercasing)
                        | |
                        | |          Hardwired translations
                        V |
whatever --> decode --> Unicode -------> trans B --> ^^e9
                        ^ |
                        | |
                        | |
                        files: Choice of say utf8, Unicode

> > Omega can represent internally non ascii chars and hence
> > actual chars are used instead of macros (with a few exceptions).
> > Trivial as it can seem, this difference is in fact a HUGE
> > difference. For example, the path followed by é will be:
> >
> > é --an encoding ocp-|           |-- T1 font ocp--> ^^e9
> >                     +-> U+00E9 -+
> > \'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

Thus, one does not bother representing ASCII or any other such encoding internally anymore. Combinations such as \'e are no longer necessary, except for backwards compatibility. It will probably not make much difference what you do with those combinations, because documents that need absolute compatibility for archival purposes can use the old TeX; newer documents can be translated, if necessary, as needed. (That is, this will work if the incompatibilities are not too big.)
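
The preprocessor step described above can be sketched roughly as follows. This is only an illustration in Python, not an existing tool: the name `preprocess` and the table `TEX_LIGATURES` are hypothetical, Unicode NFC normalization stands in for "combine combining characters if possible", and the dash folding (which the text above assigns to a tweaked engine) is placed in the preprocessor purely for brevity. Sequences that have no precomposed form are left unchanged here.

```python
import unicodedata

# Hypothetical table of TeX input combinations and the single Unicode
# characters they could map to. The longest sequence comes first so that
# "---" is not consumed as "--" followed by "-".
TEX_LIGATURES = [("---", "\u2014"), ("--", "\u2013")]


def preprocess(raw: bytes, encoding: str) -> str:
    """Decode input in the given encoding and return composed Unicode text."""
    text = raw.decode(encoding)
    # NFC composes a base letter plus combining mark into one character
    # wherever Unicode defines a precomposed form.
    text = unicodedata.normalize("NFC", text)
    # Fold the dash combinations into en dash and em dash.
    for sequence, char in TEX_LIGATURES:
        text = text.replace(sequence, char)
    return text
```

For instance, `preprocess("e\u0301".encode("utf-8"), "utf-8")` would yield the single character U+00E9, whatever the input encoding was.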
> trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
> Example:
>   if é was in the cp437 code page (German DOS) it would
>   be the 8bit char "82; that would become the 16bit token with
>   number "0082 (which is NOT é in unicode = "00E9)
>
>   if on the other hand é was in latin1 (where it is "E9) we
>   get "00E9

Such problems would not happen within UTeX, as the preprocessor would provide the proper translation to \U000000E9.

  Hans Aberg
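
To make the cp437 versus latin1 example concrete, here is a small Python sketch contrasting the two behaviours. The function names `widen_bytes` and `to_u_notation` are hypothetical, and the eight-digit `\U` form is only assumed from the notation used above. Widening each byte (as in trans A) keeps the code page's number, while decoding through the declared encoding gives the same Unicode code point for either input.

```python
def widen_bytes(raw: bytes) -> list[int]:
    """Model of trans A: each 8-bit number becomes the same wider number."""
    return list(raw)


def to_u_notation(raw: bytes, encoding: str) -> str:
    """Decode via the declared encoding, then spell out each code point."""
    return "".join(f"\\U{ord(ch):08X}" for ch in raw.decode(encoding))


# é is byte 0x82 in cp437 and byte 0xE9 in latin1.
cp437_input = b"\x82"
latin1_input = b"\xe9"

# Widening keeps 0x82, which is not é in Unicode.
widened = widen_bytes(cp437_input)

# Decoding first maps both inputs to U+00E9.
from_cp437 = to_u_notation(cp437_input, "cp437")
from_latin1 = to_u_notation(latin1_input, "latin-1")
```

The design point is simply that the encoding has to be known at the decoding step; after that, nothing downstream needs to care which code page the file used.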