LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender: Mailing list for the LaTeX3 project <[log in to unmask]>

Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)

From: David Carlisle <[log in to unmask]>

Date: Mon, 12 Feb 2001 10:25:21 GMT

In-Reply-To: <[log in to unmask]> (message from Frank Mittelbach on Sun, 11 Feb 2001 20:38:40 +0100)

Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>

>  > What about symbol fonts like TC? What about math characters that are
>  > unified in Unicode (\rightarrow and \longrightarrow)? What about the
>  > things that are not yet in Unicode?

> yes, what about them?

It may be worth noting that Unicode 3.1 and 3.2 will (assuming the
current plans go through) have a lot more math characters (~1500 more,
if I recall correctly) than Unicode 3.0. Actually, one of the main
things currently missing is long arrows. We (the MathML working group)
are in touch with the Unicode folks to see if there's any chance of
those being added as well, although time is getting short for further
additions to 3.1 and 3.2.

There are always the private use areas, of course, for extra characters
that a TeX/Unicode system could use. (Although using private use
characters in a publicly distributed system is considered bad form,
sometimes it can't be avoided.)
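
To make "private use areas" concrete, here is a minimal sketch of a check
for whether a code point falls in one of them. The function name
is_private_use is made up for illustration; the ranges are the Basic
Multilingual Plane private use area plus the two supplementary private use
planes (15 and 16).

```python
def is_private_use(code_point: int) -> bool:
    """Return True if code_point lies in a Unicode private use area.

    Hypothetical helper for illustration only.
    """
    return (
        0xE000 <= code_point <= 0xF8FF          # BMP private use area
        or 0xF0000 <= code_point <= 0xFFFFD     # plane 15 private use
        or 0x100000 <= code_point <= 0x10FFFD   # plane 16 private use
    )


if __name__ == "__main__":
    for cp in (0x2192, 0xE000, 0xF8FF, 0x10FFFD):
        print(f"U+{cp:04X}: private use = {is_private_use(cp)}")
```

A system that assigns, say, a missing long arrow to such a code point has
to ship its own mapping, since no other software will agree on what that
code point means.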

Having built a TeX-based (rather than Omega-based) system (xmltex) that
does use utf8 as the internal form of all characters, I'd agree with the
comments made earlier that one of the hardest problems is Unicode
combining characters (xmltex doesn't deal with them at all by default).
xmltex of course makes almost no use of TeX's inbuilt csname parsing
for document files, as it reads XML syntax. It can run with all
characters being active (which would be needed to handle combining
characters), but normally ASCII characters are non-active, which is a big
time saving for languages that mainly use the Latin alphabet.
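
As a rough illustration of why combining characters are awkward, here is a
Python sketch (not xmltex code; all function names are invented for the
example). The first part splits a byte stream into UTF-8 sequences by
looking only at the lead byte, which is roughly the job an active lead
byte has to do when it grabs its continuation bytes. The second part shows
that a combining mark arrives after its base character, so a plain ASCII
base letter has already gone by unless it, too, is intercepted.

```python
import unicodedata


def utf8_sequence_length(lead_byte: int) -> int:
    """Length of the UTF-8 sequence that starts with lead_byte."""
    if lead_byte < 0x80:
        return 1  # ASCII: a single byte, could stay non-active
    if 0xC2 <= lead_byte <= 0xDF:
        return 2
    if 0xE0 <= lead_byte <= 0xEF:
        return 3
    if 0xF0 <= lead_byte <= 0xF4:
        return 4
    raise ValueError(f"not a valid UTF-8 lead byte: {lead_byte:#x}")


def split_utf8(data: bytes) -> list:
    """Split data into per-character byte sequences using lead bytes only."""
    sequences = []
    i = 0
    while i < len(data):
        n = utf8_sequence_length(data[i])
        sequences.append(data[i:i + n])
        i += n
    return sequences


def group_with_combining(text: str) -> list:
    """Attach each combining mark to the base character before it."""
    clusters = []
    for ch in text:
        if clusters and unicodedata.combining(ch) != 0:
            clusters[-1] += ch  # the mark modifies the previous character
        else:
            clusters.append(ch)
    return clusters


if __name__ == "__main__":
    precomposed = "\u00e9"   # e-acute as one code point
    decomposed = "e\u0301"   # ASCII e followed by COMBINING ACUTE ACCENT
    print(split_utf8(precomposed.encode("utf-8")))
    print(split_utf8(decomposed.encode("utf-8")))
    print(group_with_combining(decomposed + "x"))
```

In the decomposed case the first sequence is just the ASCII byte for "e";
nothing about it signals that an accent follows, which is why every
character would need to be active to handle such input.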

In xmltex one is more or less forced to use utf8, as you are accepting
XML character streams with no explicit markup. For a system using
TeX-style markup, however, it isn't at all clear that the benefits would
outweigh the costs. Changing to a utf8 internal form would make LaTeX
slower (a lot slower if it handled combining characters), and for the
majority of existing users it would have no advantage (so they would use
the old system, and not update).

For specific uses (in particular typesetting sources derived from XML)
it is possible to layer utf8 support over the current base. There the
costs are worth it, as there is really no alternative to utf8 support:
either you code the utf8 support in TeX, as in xmltex (there is similar
code in the cjk package, and I think there's a utf8.sty on CTAN), or you
have an external program do the translation.
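
For the second route, a minimal sketch of what such an external
translation step might look like, written in Python. It is an illustration
under stated assumptions, not any existing tool: the ACCENTS table and the
to_tex function are made up here, only a handful of accents are covered,
and anything outside the table is rejected rather than guessed at.

```python
import unicodedata

# Hypothetical table mapping combining marks to LaTeX accent commands.
ACCENTS = {
    "\u0300": "\\`",   # grave
    "\u0301": "\\'",   # acute
    "\u0302": "\\^",   # circumflex
    "\u0303": "\\~",   # tilde
    "\u0308": '\\"',   # diaeresis
}


def to_tex(text: str) -> str:
    """Rewrite accented Latin letters as LaTeX accent macros.

    ASCII passes through unchanged; unsupported characters raise
    ValueError so that nothing is silently dropped.
    """
    out = []
    for ch in text:
        if ord(ch) < 0x80:
            out.append(ch)
            continue
        decomposed = unicodedata.normalize("NFD", ch)
        if (
            len(decomposed) == 2
            and ord(decomposed[0]) < 0x80
            and decomposed[1] in ACCENTS
        ):
            base, mark = decomposed
            out.append(f"{ACCENTS[mark]}{{{base}}}")
        else:
            raise ValueError(f"no translation for U+{ord(ch):04X}")
    return "".join(out)


if __name__ == "__main__":
    print(to_tex("caf\u00e9"))     # prints caf\'{e}
    print(to_tex("M\u00fcller"))   # prints M\"{u}ller
```

The output is plain 7-bit TeX markup, so the existing base system can
process it without knowing anything about utf8.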

If the TeX engine changes to Omega (or an Omega-like system that is
similarly based on Unicode) then the rules change completely, and
Unicode as the internal form becomes a much more attractive prospect.
However, I'm not sure that we are quite ready to switch all TeX
distributions to Omega as the default engine for LaTeX, are we?


David
