## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)
From: Frank Mittelbach
Reply To: Mailing list for the LaTeX3 project
Date: Sun, 11 Feb 2001 20:38:40 +0100

I asked the question:

> >  wouldn't it be better if the internal LaTeX representation would be Unicode
> >  in one or the other flavor?

Roozbeh replied:

> What about symbol fonts like TC? What about math characters that are
> unified in Unicode (\rightarrow and \longrightarrow)? What about the
> things that are not yet in Unicode?

as I outlined already in other replies, I don't think that Unicode or UTF8 is
the answer as far as LICR (the LaTeX internal character representation) is
concerned. it can only provide a partial answer:

- it clearly can't provide the answer for characters that do not exist in Unicode
- and it clearly can't provide the answer for math

however, LICR (or the part I'm talking about) isn't really concerned with math,
which needs a far richer, or let's say different, handling anyway; and which
on the other hand doesn't need some of the mechanisms needed for text
representations, like being aware of certain types of font attribute changes
etc.

> >  - however, not clear is that the resulting names are easier to read, eg
> >    \unicode{00e4} viz \"a.
>
> They are worse than you may think. They are always hard to read. My real
> work is related to Unicode Arabic script, and after two years of full
> dedication, I can't recall more than a few codes. I always need a table at
> hand. I have much less experience with Knuthian names of math symbols, but
> I'm sure I can recall the names of more than 95% of them without any
> problem.

so you agree with me, they aren't easy to read :-) but then, being "internal",
this only matters in some circumstances, and Oliver put forward some good
arguments for cases where something like UTF8 might actually be easier to read.

> >  - with intermediate forms like data written to files this could be a pain and
> >    people in Russia, for example, already have this problem when they see
> >    something like \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.  In case
> >    of unicode as the internal representation this would be true for all
> >    languages (except English) while currently the Latin based ones are still
> >    basically okay.
>
> This is a place where UTF8 helps a lot. People can use Unicode text
> editors to see the files, or use the widely available convertors like
> iconv to convert to theoretically every charset.

yes and no. I tried to explain that there are limitations imposed by the current
implementation of the major underlying formatter (ie TeX) which you can't
easily overcome; and even if you do, the result then needs a long time to
actually get deployed at sites that have not much use for anything other
than ASCII plus perhaps a few accents.
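
To make the two forms under discussion concrete, here is a minimal Python sketch (not part of LaTeX) that turns the command sequence quoted above into Unicode text and then into UTF8 and KOI8-R bytes, which is roughly what a tool like iconv would work with. The table `LICR_TO_UNICODE` and the function `licr_to_text` are hypothetical names made up for this illustration; the table covers only the commands in the example, and it assumes `\cyr` is just a font/encoding switch that produces no character.

```python
import re

# Hypothetical, partial table for illustration only:
# LICR-style command names -> the Unicode characters they stand for.
LICR_TO_UNICODE = {
    "cyr": "",          # assumed to be a font/encoding switch, no glyph
    "CYRA": "\u0410",   # CYRILLIC CAPITAL LETTER A
    "cyrn": "\u043d",   # CYRILLIC SMALL LETTER EN
    "cyro": "\u043e",   # CYRILLIC SMALL LETTER O
    "cyrt": "\u0442",   # CYRILLIC SMALL LETTER TE
    "cyra": "\u0430",   # CYRILLIC SMALL LETTER A
    "cyrc": "\u0446",   # CYRILLIC SMALL LETTER TSE
    "cyri": "\u0438",   # CYRILLIC SMALL LETTER I
    "cyrya": "\u044f",  # CYRILLIC SMALL LETTER YA
}


def licr_to_text(licr: str) -> str:
    """Replace each known \\name with its character; keep unknown commands as-is."""
    def replace(match: re.Match) -> str:
        return LICR_TO_UNICODE.get(match.group(1), match.group(0))

    return re.sub(r"\\([A-Za-z]+)", replace, licr)


word = licr_to_text(r"\cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya")
utf8_bytes = word.encode("utf-8")    # two bytes per Cyrillic letter
koi8_bytes = word.encode("koi8_r")   # one byte per letter in a legacy charset
```

The readable word only appears once the file is viewed in an editor that understands the chosen byte encoding; the command form stays pure ASCII, which is the trade-off discussed in this thread.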

> Unicode also has the equivalent of \", it only appears after the letter.
> So the problem of an accented letter not in Unicode is not a real problem,
> these letters can also be made in Unicode. But I don't know what you are
> going to do with the combining accent appearing after the letter.

ahh, here is the remark I was searching for an hour ago:

nothing really, and that is a problem as long as I want to stick with TeX and a
bit of its parsing machinery. and that means I can't make use of this concept.
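
The ordering mismatch can be shown outside TeX with a small Python sketch using the standard `unicodedata` module. It decomposes a precomposed letter into base letter plus combining accent (accent last), then reorders each pair into an accent-first command. The function `to_tex_accents` and its three-entry table are hypothetical and only illustrate the look-back a converter would need; the braced form `\"{a}` is used here as an equivalent of `\"a`.

```python
import unicodedata

precomposed = "\u00e4"  # LATIN SMALL LETTER A WITH DIAERESIS, one code point
decomposed = unicodedata.normalize("NFD", precomposed)
# decomposed is now "a" followed by U+0308 COMBINING DIAERESIS:
# the accent comes AFTER the letter it modifies.

# Hypothetical, partial table: combining mark -> TeX accent command.
COMBINING_TO_TEX = {
    "\u0308": '\\"',  # diaeresis
    "\u0301": "\\'",  # acute
    "\u0300": "\\`",  # grave
}


def to_tex_accents(text: str) -> str:
    """Rewrite base+combining pairs as accent-first commands, e.g. \\"{a}."""
    out = []
    for ch in unicodedata.normalize("NFD", text):
        if ch in COMBINING_TO_TEX and out:
            # The accent arrives after its base, so the base already emitted
            # has to be taken back and wrapped in the accent command.
            base = out.pop()
            out.append(COMBINING_TO_TEX[ch] + "{" + base + "}")
        else:
            out.append(ch)
    return "".join(out)


example = to_tex_accents("K\u00e4se")
```

The `out.pop()` step is the crux: a consumer reading left to right has already seen, and possibly processed, the base letter by the time the accent shows up, whereas an accent command written before the letter can simply take what follows as its argument.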

frank