LATEX-L Archives

Mailing list for the LaTeX3 project


Marcel Oliver <[log in to unmask]>
Reply To:
Mailing list for the LaTeX3 project <[log in to unmask]>
Sun, 11 Feb 2001 17:45:44 +0100
Frank Mittelbach writes:
 > in 1992/3 when we worked on shaping the ideas of the LaTeX internal
 > representation we actually did discuss similar ideas but back then
 > abandoned them because of resource constraints (in the
 > software). Machines are nowadays bigger and faster so this isn't
 > really much of an argument there.
 > So... time for another attempt?

Yes, yes, yes!

 > The LaTeX internal character representation is a 7bit
 > representation not an 8bit one as UTF8. As such it is far less
 > likely to be mangled by incorrect conversion if files are exchanged
 > between different platforms. I have yet to see that UTF8 text
 > (without taking precaution and externally announcing that a file is
 > in UTF8) is really properly handled by any OS platform. Is it?

Not at the moment.  But there is a strong movement pushing for UTF8 as
_the_ encoding standard.  Support in bleeding edge versions of a lot
of software is actually quite good.  As far as I can see, UTF8 is the
only standard that has a reasonable chance of becoming the one that
"works without taking precaution".

It would be a pity if LaTeX missed the boat.  For some info, see

 > TeX is 7bit with a parser that accepts 8bit but doesn't by default
 > give it any meaning. On the other hand Omega is 16bit (or more
 > these days?) and could be viewed as internally using something like
 > Unicode for representation.

This is good because UTF8 is a proper superset of the 7-bit ASCII that
TeX currently takes as input.  Does anybody know about the state of
Omega?  16-bit Unicode is not the whole game, and it is also not
particularly attractive as an input encoding.
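
To make the superset point concrete, here is a minimal sketch in plain Python (nothing TeX-specific; all names in it are made up for illustration).  It checks that a 7-bit source line has the same bytes in ASCII and in UTF8, and that a character outside the 16-bit range still encodes fine in UTF8.

```python
# Sketch only: plain Python string encoding, nothing TeX-specific.

def utf8_bytes(text: str) -> bytes:
    """Encode text as UTF-8."""
    return text.encode("utf-8")

# A 7-bit source line is byte-for-byte the same in ASCII and in UTF-8.
ascii_source = r'\"a versus \unicode{00e4}'
same_bytes = utf8_bytes(ascii_source) == ascii_source.encode("ascii")

# U+00E4 (a with diaeresis) needs two bytes, both outside the 7-bit range.
a_umlaut = utf8_bytes("\u00e4")

# U+1D538 lies beyond the 16-bit range: it takes 4 bytes in UTF-8
# and a surrogate pair (two 16-bit units) in UTF-16.
double_struck_a = "\U0001D538"
fits_in_16_bits = ord(double_struck_a) <= 0xFFFF
utf16_units = len(double_struck_a.encode("utf-16-be")) // 2

print(same_bytes, a_umlaut, fits_in_16_bits, utf16_units)
```

So existing 7-bit input files would already be valid UTF8, while a fixed 16-bit unit cannot hold every code point on its own.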

 >  wouldn't it be better if the internal LaTeX representation would
 >  be Unicode in one or the other flavor?

Yes, because:

- A LaTeX-specific naming scheme will be essentially unmaintainable.

- LaTeX could eventually be made to output diagnostics and log files
  in UTF8.  For example, UTF8-enabled Xterms exist now, and will
  likely come as default on Linux distributions long before LaTeX3.
  Same for text editors.  So the infrastructure is getting to a state
  where it's possible to pull this off.

 >  - however, not clear is that the resulting names are easier to
 >    read, eg \unicode{00e4} viz \"a.

See remark about Xterms.

 >  - the current latex internal representation is richer than unicode
 >    for good or worse, eg \" is defined individually as
 >    representation for accenting the next char, which means that
 >    anything \"<base-char-in-the-internal-reps> is automatically
 >    also a member of it, eg \"g.

This is not necessarily a problem (cf. Roozbeh's remark about math
symbols which are not properly defined in Unicode).  As long as no
hyphenation or non-generic kerning issues are involved (and those
seem to be an issue only for natural language scripts), one could
still have named symbols as they exist now, whether they are part of
some special font, a combination of glyphs, the drawing of some box, etc.
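
The quoted point about \"g can be checked against Unicode's own composition rules.  The following is a rough sketch using Python's standard unicodedata module; the helper names compose and has_atomic_form are invented here for illustration.  It asks whether base letter plus combining diaeresis collapses to a single code point under NFC normalization.

```python
import unicodedata

DIAERESIS = "\u0308"  # COMBINING DIAERESIS

def compose(base: str, accent: str = DIAERESIS) -> str:
    """Return the NFC form of base followed by a combining accent."""
    return unicodedata.normalize("NFC", base + accent)

def has_atomic_form(base: str, accent: str = DIAERESIS) -> bool:
    """True if Unicode has a single precomposed code point for the pair."""
    return len(compose(base, accent)) == 1

# 'a' + diaeresis composes to U+00E4; 'g' + diaeresis stays a two-code-point sequence.
print(has_atomic_form("a"), has_atomic_form("g"))
```

So something like \"g is still expressible in Unicode, just not as an atomic character; it remains a base letter followed by a combining mark.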

 >  - the latter point could be considered bad since it allows to
 >    produce characters not in unicode but independently of what you
 >    feel about that the fact itself has consequences when defining
 >    mappings for font encodings. right now, one specifies the
 >    accents, ie \DeclareTextAccent\" and for those glyphs that
 >    exists as composites one also specifies the composite, eg
 >    \DeclareTextComposite{\"}{T1}{a}{...}  With the unicode approach
 >    as the internal representation there would be an atomic form ie
 >    \unicode{00e4} describing umlaut-a so if that has no
 >    representation in a font, eg in OT1, then one would need to
 >    define for each combination the result.

This seems to be the logical thing to do?!
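
As a rough illustration of what "define the result for each combination" could look like, here is a sketch in Python.  The glyph sets T1_LIKE and OT1_LIKE are toy assumptions, not the real encoding tables, and render_plan is a made-up helper.  The idea: use the atomic glyph if the font has it, otherwise fall back to the canonical decomposition into base plus accent.

```python
import unicodedata

# Toy glyph inventories (assumptions for illustration, NOT the real tables).
T1_LIKE = {"a", "g", "\u00e4", "\u0308"}   # has a precomposed a-diaeresis
OT1_LIKE = {"a", "g", "\u0308"}            # only base letters plus the accent

def render_plan(char: str, available: set) -> list:
    """Return the glyphs to typeset for char in a font offering `available`."""
    if char in available:
        return [char]                       # atomic glyph exists in the font
    parts = list(unicodedata.normalize("NFD", char))
    if len(parts) > 1 and all(p in available for p in parts):
        return parts                        # build it from base + accent
    raise LookupError(f"no representation for U+{ord(char):04X}")

print(render_plan("\u00e4", T1_LIKE))
print(render_plan("\u00e4", OT1_LIKE))
```

With such a generic fallback, per-combination definitions would only be needed where the decomposition is not good enough, which is roughly the reverse of the current \DeclareTextComposite setup.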