## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


 Subject: Re: LaTeX's internal char prepresentation (UTF8 or Unicode?) From: Javier Bezos <[log in to unmask]> Reply To: Mailing list for the LaTeX3 project <[log in to unmask]> Date: Mon, 12 Feb 2001 21:45:58 +0100 Content-Type: text/plain Parts/Attachments: text/plain (149 lines)
```Some random quick remarks. I'm trying to read the huge number
of messages.

What is the purpose of the LICR? Apparently, it's only an
intermediate step before creating the final output. That
may be true in TeX, but not in Omega, because there the LICR can
be processed by external tools (spelling, syntax, etc.).
There are lots of tools using Unicode and very likely there
will be more in the future. However, there are only a handful
of tools understanding the current LICR and it's unlikely
there will be more (LICR macros are eventually expanded and therefore
cannot be processed anyway; the very fact that Unicode chars
are actual `letter' chars is critical). So, having true
Unicode text (perhaps with tags, which can be removed if
necessary) at some point in the internal processing is imo
an essential feature in future extensions to TeX. And indeed
Omega is an extension which can cope with that; I wouldn't like
to renounce that.

Another aim of Omega is handling language-specific typographical
features without explicit markup. For instance: German "ck,
Spanish "rr, Portuguese f{}i, Arabic ligatures, etc. Of course,
vf can handle that, but must I create several hundred
vf files only to remove the fi ligature? Omega translation
processes can handle that very easily.

[Marcel:]
>  > Anyway, Frank, I just got your last mail in my inbox (need to read the
>  > details more carefully), and I think we agree that it's worth
>  > exploring if there would be a substantial advantage for having some
>  > engine with Unicode internal representation.
> [Frank:]
> it surely is, though i'm not convinced that the time has come, given that the
> current LICR actually is as powerful (or more powerful in fact) than unicode
> ever can be.

[Roozbeh:]
>  > Please note that with different scripts, we have different font
>  > classifications also. I'm not sure if the NFSS model is suitable for
>  > scripts other than Latin, Cyrillic, and Greek (ok, there are some others
>  > here, like Armenian).
> [Frank:]
> i grant you that the way I developed the model was by looking at fonts and
> their concepts available for languages close to Latin and so it is quite
> likely that it is not suitable for scripts which are quite different.
>
> However to be able to sensibly argue this I beg you to give us some insight
> about these classifications and why you think NFSS would be unable to model
> them (or say not really suitable)

I think that Roozbeh refers to the fact that the Arabic script does
not follow the Western classification of fonts (serif, sans serif,
typewriter).

The draft I've written for lambda will allow one to say:

\scriptproperties{latin}{rmfamily = ptmr, sffamily = phvr}
\scriptproperties{greek}{rmfamily = grtimes, sffamily = grhelv}

(names are invented) but as you can see, it still uses the rm/sf/tt
model. If I switch from latin to greek and the
current font is sf (ie, phvr), then the greek text is written using
grhelv, but what is the sf equivalent in Arabic script?

Javier
_________________________________________________________________
Javier Bezos                    | TeX y tipografia

PS. I would also like to apologize for discussing a set of macros which
has not been made public yet, but remember it's only a
draft and many things are liable to change (and maybe
the final code will be quite different. As we Spaniards say,
perhaps "no lo reconocerá ni la madre que lo parió" [not even the
mother who bore it will recognize it]). Anyway,
I'm going to reproduce part of a small text I sent to the Omega
list some time ago. I would like to note that I didn't intend to
move the discussion from the Omega-dev list to this one -- it just
happened.

==========
Let's now explain how TeX handles non-ASCII characters. TeX
can read Unicode files, as xmltex demonstrates, but non-ASCII
chars cannot be represented internally by TeX this way. Instead,
it uses macros which are generated by inputenc, and which are
expanded in turn into a true character (or a TeX macro) by
fontenc:

é --- inputenc --> \'{e}  --- fontenc --> ^^e9

That's true even for Cyrillic, Arabic, etc. characters!

Omega can represent non-ASCII chars internally and hence
actual chars are used instead of macros (with a few exceptions).
Trivial as it may seem, this difference is in fact a HUGE
difference. For example, the path followed by é will be:

é --an encoding ocp-|           |-- T1 font ocp-->  ^^e9
                    +-> U+00E9 -+
\'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

It's interesting to note that fontenc is used as a sort of
input method! (Very likely, a package with the same
functionality but with a different name will be used.)

For that to be accomplished using ocp's we must note that we
can divide them into two groups: those generating Unicode from
an arbitrary input, and those rendering the resulting Unicode
using suitable (or maybe just available :-) ) fonts. The
Unicode text may thus be analyzed and transformed by external
ocp's at the right place. Lambda further divides these two
groups into four (to repeat, these proposals are liable to
change):

1a) encoding: converts the source text to Unicode.
1b) input: sets input conventions. Keyboards have a limited
number of keys, and hands a limited number of fingers.
The goal of this group is to provide an easy way to enter
Unicode chars using the most basic keys of keyboards
(which means ASCII chars in Latin ones). Examples could
be:
*  --- => em-dash  (a well known TeX input convention).
*  ij => U+0133 (in Dutch).
*  no => U+306E [the corresponding hiragana char]

Now we have the Unicode (with TeX tags) memory representation
which has to be rendered:

2a) writing: contextual analysis, ligatures, spaced punctuation
marks, and so on.
2b) font: conversion from Unicode to the local font encoding or
the appropriate TeX macros (if the character is not available in
the font).

This scheme fits well with the Unicode Design Principles,
which state that Unicode deals with memory representation
and not with text rendering or fonts (which is left to "appropriate
standards"). Hence, most so-called Unicode fonts cannot
properly render text in many scripts because they lack the
required glyphs.

There are some additional processes to "shape" changes (case,
script variants, etc.).
```
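The four groups described in the message (1a encoding, 1b input, 2a writing, 2b font) can be illustrated with a small sketch. This is not Omega or Lambda code and does not use OCPs; it is a toy Python model under stated assumptions. All function and table names are hypothetical, the tables hold only a few entries, and using ZERO WIDTH NON-JOINER to mark the Portuguese f{}i ligature break is an assumption made for illustration.

```python
import unicodedata

# 1b) Input conventions: ASCII sequences mapped to Unicode chars.
# Hypothetical tables; only the examples named in the message.
COMMON_INPUT = [("---", "\u2014")]           # --- => em-dash
LANGUAGE_INPUT = {"dutch": [("ij", "\u0133")]}  # ij => U+0133

# 2b) Toy font tables: Unicode char -> slot in the font encoding.
# Only U+00E9 -> ^^e9 in T1 is taken from the message.
FONT_SLOTS = {"T1": {"\u00e9": 0xE9}, "OT1": {}}

# Fallback macros for chars the font lacks (illustrative subset).
ACCENT_MACROS = {"\u0301": r"\'", "\u0300": r"\`", "\u0302": r"\^"}
ZWNJ = "\u200c"  # assumption: marks "do not ligate here"


def encoding_stage(raw: bytes, source_encoding: str) -> str:
    """1a) Convert the source bytes to Unicode."""
    return raw.decode(source_encoding)


def input_stage(text: str, language: str) -> str:
    """1b) Apply input conventions for the given language."""
    for src, dst in COMMON_INPUT + LANGUAGE_INPUT.get(language, []):
        text = text.replace(src, dst)
    return text


def writing_stage(text: str, language: str) -> str:
    """2a) Language-dependent typography, here only Portuguese f{}i."""
    if language == "portuguese":
        text = text.replace("fi", "f" + ZWNJ + "i")
    return text


def font_stage(text: str, font_encoding: str) -> list:
    """2b) Map Unicode to font slots (ints) or TeX macros (strs)."""
    slots = FONT_SLOTS[font_encoding]
    out = []
    for ch in text:
        if ch in slots:
            out.append(slots[ch])
        elif ch == ZWNJ:
            out.append("{}")          # empty group breaks the ligature
        elif ord(ch) < 0x80:
            out.append(ord(ch))       # ASCII maps to itself in this toy
        else:
            base_and_accent = unicodedata.normalize("NFD", ch)
            if len(base_and_accent) == 2 and base_and_accent[1] in ACCENT_MACROS:
                macro = ACCENT_MACROS[base_and_accent[1]]
                out.append(macro + "{" + base_and_accent[0] + "}")
            else:
                raise ValueError(f"no glyph or macro for U+{ord(ch):04X}")
    return out


def render(raw: bytes, source_encoding: str, language: str, font_encoding: str) -> list:
    """Run the four stages in order: 1a, 1b, 2a, 2b."""
    text = encoding_stage(raw, source_encoding)
    text = input_stage(text, language)
    text = writing_stage(text, language)
    return font_stage(text, font_encoding)
```

With these toy tables, `render(b"caf\xe9", "latin-1", "spanish", "T1")` ends in the slot `0xE9`, while the same input with `"OT1"` ends in the macro string `\'{e}`, mirroring the two branches of the é diagram. The point of the split is that the text between stages 1b and 2a is plain Unicode, so an external tool could inspect or transform it there.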