LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

> BOMs?

Byte Order Mark (mainly used in UTF-16 to distinguish between the big-
and little-endian flavours, but Microsoft tools in particular tend to
stick one on UTF-8 files as well).

I don't think that anything special needs to be done for these,
since the BOM (if it isn't recognised as a BOM) will be recognised as
ZERO WIDTH NO-BREAK SPACE (U+FEFF), which means that for a typesetting
system there isn't really a lot that needs to be done
(except, of course, for the top-level file, where perhaps the utf8 will not
be set up early enough, and typesetting even zero-width characters
before \documentclass doesn't work).
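
For the top-level file case, one workaround is to strip the BOM before
TeX ever sees the file. A minimal sketch, assuming the file is handled as
raw bytes; the function name strip_utf8_bom is made up for illustration:

```python
import codecs


def strip_utf8_bom(data: bytes) -> bytes:
    """Return data without a leading UTF-8 BOM (bytes EF BB BF), if present."""
    if data.startswith(codecs.BOM_UTF8):
        return data[len(codecs.BOM_UTF8):]
    return data
```

Only a BOM at the very start is removed; a U+FEFF later in the file is
left alone, since there it is just a zero-width character.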

More serious problems (which make me wonder if it's worth the effort of
supporting utf8 in a standard TeX) are combining characters.
In xmltex you can make these work by making every possible base
character active and look ahead for a following combiner, but that is
turned off by default as it's not exactly fast or robust.
In LaTeX you can't do much other than make a combining accent generate an
error, as you can't really make the base ASCII characters active if you
are using \abc-style markup.

It's easy to make a prepass with (say) Perl to get rid of the
combining characters and replace them by TeX accent markup, but if you
are doing that you can replace all of the utf8 (and utf16 as well) by
traditional TeX markup. This is slightly less portable but a whole lot
more robust than doing it in TeX.
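
A rough sketch of such a prepass, written in Python rather than Perl.
The table and the function name to_tex_accents are illustrative
assumptions; only a handful of combining characters are covered, and
special cases such as dotless i or stacked accents are not handled
properly:

```python
# Small illustrative table: combining character -> TeX accent command.
# A real prepass would need many more entries.
COMBINING_TO_TEX = {
    "\u0300": "\\`",   # COMBINING GRAVE ACCENT
    "\u0301": "\\'",   # COMBINING ACUTE ACCENT
    "\u0302": "\\^",   # COMBINING CIRCUMFLEX ACCENT
    "\u0303": "\\~",   # COMBINING TILDE
    "\u0308": '\\"',   # COMBINING DIAERESIS
    "\u0327": "\\c",   # COMBINING CEDILLA
}


def to_tex_accents(text: str) -> str:
    """Replace 'base + combining character' by TeX accent markup."""
    out = []
    for ch in text:
        if ch in COMBINING_TO_TEX and out:
            # Wrap the previously emitted item in the accent command.
            base = out.pop()
            out.append("%s{%s}" % (COMBINING_TO_TEX[ch], base))
        else:
            out.append(ch)
    return "".join(out)
```

Combining characters that are not in the table pass through unchanged,
so they would still need to be caught (or reported) separately.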

The second thing that I have never really fixed in xmltex in this area
is that the style of mapping the input character to an internal csname,
which you then map to a typesetting instruction, is fine for supporting
small European-based character sets, but it soon gets to be a pain if
you are supporting large Asian character sets.

The CJK package's utf8 support has an option of mapping utf8-encoded input
straight to a set of 8-bit fonts encoded to map easily from utf8.
This seems much more reasonable for supporting large Unicode fonts:
split them up as 8-bit fonts so TeX can see them, and trivially map to the
right font/character from the utf8 sequences. I never got this working
in xmltex though (as modifying anything in xmltex is a pain; it's not
the most documented piece of code ever produced).
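
To illustrate the arithmetic behind that idea (in Python, not TeX): the
code point decoded from the utf8 sequence is split into a high part that
selects one 256-slot font and a low byte that selects the slot in it.
The font naming scheme and the family name "unifont" below are
assumptions for the sketch, not necessarily what the CJK package
actually uses:

```python
from typing import Tuple


def utf8_to_font_slot(seq: bytes, family: str = "unifont") -> Tuple[str, int]:
    """Map one UTF-8 encoded character to (8-bit font name, slot 0..255)."""
    ch = seq.decode("utf-8")
    if len(ch) != 1:
        raise ValueError("expected exactly one character")
    cp = ord(ch)
    # High part of the code point picks the font, low byte picks the slot.
    font = "%s%02x" % (family, cp >> 8)
    return font, cp & 0xFF
```

For example, U+4E2D (utf8 bytes E4 B8 AD) would land in slot 0x2D of a
font called "unifont4e" under this hypothetical naming.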


David
