Philipp Stephani wrote:
> Current implementation strategies for strings in development environments
> define one Unicode encoding scheme (UTF-16 in nearly all cases like Windows,
> Java, Python, Qt, .NET, COM, Cocoa, Carbon; a few technologies like Gnome and
> Emacs choose UTF-8 instead) that is used exclusively for internal processing,
> and define "strings" as sequences of UTF-16 or UTF-8 code units. LaTeX could
> do the same, depending on the engine: UTF-8 for pdfTeX, UTF-16 for XeTeX.
> Other possibilities (e.g. LICR or UTF-32) are probably either too complicated
> or not flexible enough.
For the record, LuaTeX uses what you might call UTF-32 internally (a "character"
is a Unicode code point, no more, no less).
My humble opinion is that LaTeX3 should define a character as being whatever the
underlying engine thinks is a character. That is, a "character" should be a
"character token" (with the catcode ignored or, equivalently, normalised):
- for pdfTeX, an 8-bit number
- for XeTeX, a 16-bit number
- for LuaTeX, a number in the range 0 -- 0x10ffff
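To make the idea concrete, here is a minimal sketch of how a format could pick up the engine's notion of a maximum character code at load time, using the commonly available detection primitives \directlua (LuaTeX) and \XeTeXversion (XeTeX); the name \maxchar is made up for illustration:

```latex
% Hypothetical engine-dependent "largest character" definition.
\ifdefined\directlua
  \chardef\maxchar="10FFFF   % LuaTeX: any Unicode code point
\else\ifdefined\XeTeXversion
  \chardef\maxchar="FFFF     % XeTeX: a 16-bit code unit
\else
  \chardef\maxchar=255       % pdfTeX: an 8-bit number
\fi\fi
```

The point is that the format merely records what the engine already provides, rather than re-implementing a character model on top of it.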
This way, the format does not need to hack extensively around the engine's
limitations (as LaTeX2e does); it can let the engine do its job and concentrate
on its own job as a macro package. (Sort of the Unix philosophy: do one thing and do it well.)
I mean, LaTeX2e *had to* hack around the encoding limitations of pdfTeX because
there was no alternative, but now there are alternatives.
Just my 2 cents.