LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender:	Mailing list for the LaTeX3 project <[log in to unmask]>
Subject:	Re: Multilingual Encodings Summary 2.2
From:	Javier Bezos <[log in to unmask]>
Date:	Fri, 11 May 2001 22:24:51 +0100
Reply-To:	Mailing list for the LaTeX3 project <[log in to unmask]>

Frank wrote:

>you are
> looking at the thing from the current omega implementation.

I think this single sentence summarizes the actual situation.  And
considering that, I didn't do a bad job after all ;-) Most of your
reservations are related to the way omega works, and in fact I paused
the development of lambda just because of that -- it's obvious that I
can only work with known things, and the future omega is still an
unknown.

> LaTeX conceptually has only three levels: source, ICR, Output

However, I still think that it's necessary to separate code processing
from text processing.  Both concepts are mixed up by TeX (and
therefore LaTeX), which makes, say, uppercasing a tricky thing.  Remember
that \uppercase only changes the explicit chars in the code, and that
\MakeUppercase first expands the argument and then applies a set of
transformations (including math but not including text hidden in
protected macros!).  Well, since ocp's are applied to text after
expansion (not including math but including actual text even if
hidden) we are doing things at the right place and in the right way.
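
To make the \uppercase / \MakeUppercase difference concrete, here is a
minimal sketch, assuming a standard LaTeX2e kernel and plain ASCII input.
The names \myword and \hidden are only illustrative, and the results in
the comments are what the behaviour described above leads one to expect,
not something taken from the lambda code.

```latex
\documentclass{article}
% Illustrative macros (not from any package):
\def\myword{abc}                   % ordinary, expandable macro
\DeclareRobustCommand\hidden{abc}  % robust ("protected") macro
\begin{document}

% \uppercase only touches the explicit character tokens it is given;
% \myword is not expanded first, so its text should stay lowercase.
\uppercase{abc \myword}

% \MakeUppercase expands its argument before changing case,
% so the text coming from \myword should be uppercased as well.
\MakeUppercase{abc \myword}

% Text hidden in a robust command is not expanded by \MakeUppercase,
% so it should escape the case change.
\MakeUppercase{abc \hidden}

\end{document}
```

An ocp, being applied after expansion, would see the text of all three
macros alike, which is the point made above.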

Another problem is whether input encoding belongs to code transformations
or text transformations.  Very likely you are right when you say that
after full expansion it's too late and when reading the source file it's
too early.  An intermediate step seems more sensible, which makes the
\'e stuff discussed in the recent messages wrong. Another useful addition
could be an ocp-aware variant of \edef (or a similar device). And
regarding font transformations, they should be handled by fonts, but
the main problem is that metric information (ie, tfm) cannot be
modified from within TeX, except for a few parameters; I really wonder
if allowing more changes, mainly ligatures, is feasible (that
solution would be better than font ocp's and vf's, I think).

>  my requirement for a usable internal representation is that I can take a
>  single element of it at any time and it has a welldefined meaning (and a
>  single one).

Semantically or visually?  Unicode chars have a well-defined semantic
meaning but visually (glyphs) they are undefined, and rendering can be
language dependent.

>> at the LICR level means that the auxiliary files use the Unicode encoding;
>> if the editor is not a Unicode one these files become unmanageable and
>> messy.
>
> not true. the OICR has to be unicode (or more exactly unique and well-defined
> in the above sense, can be 20bits for all i care) if Omega ever should go off
> the ground. but the interface to the external world could apply a well-defined
> output translation to something else before writing.

:-/ I meant from the user's point of view.  (Perhaps the reply was
too quick...)  What I mean is that any LaTeX file ("main" or
auxiliary) should follow the LaTeX syntax in a form closer to the
"representation" selected by the user (by "representation" I mean
input encoding and maybe a set of macros).

======
Lars wrote:

> No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
> then it will be written to a file as \'e.

Right.  Exactly because of that we should not convert text to Unicode
at this stage; otherwise we must change the definition depending on
the file to be read.  We must only move LaTeX code and its context
information without changing  it, so that if it is read correctly in the
main file, it will be read correctly in the auxiliary file.
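
The behaviour Lars describes can be tried with a small sketch like the
following, assuming a standard LaTeX2e kernel. The stream \demofile, the
wrapper \writedemo and the .demo extension are hypothetical names chosen
for the example; only \protected@write is the kernel's own.

```latex
\documentclass{article}
\newwrite\demofile               % hypothetical output stream
\makeatletter
% \protected@write sets \protect so that robust commands such as \'
% are not expanded while the text is written.
\newcommand\writedemo[1]{\protected@write\demofile{}{#1}}
\makeatother
\begin{document}
\immediate\openout\demofile=\jobname.demo
% The write is deferred until the page is shipped out.
Some text.\writedemo{caf\'e}
\end{document}
```

After the run, the .demo file should contain the accent still in its
\'e form, ie, the code has been moved without being converted.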

> Assuming that \InputEncoding is some alias for the \InputTranslation
> primitive, that's roughly what I meant; maybe translation from latin-1 was

Oops!  Sorry.  \InputTranslation is the right command.

> a bit off the target. OTOH you seem to assume below that \InputEncoding
> should also handle translations which are just as untechnical!!?

Not me. Frank. I just pointed out the problem, but Frank seems to
be aware of it.

>>The main
>>problem of it is that it doesn't translate macros:
>>\def\myE{É}
>>\InputEncoding <an encoding>
>> \myE
>>
>>only the explicit É is transcoded.
>
> Isn't that a bit like saying "the main problem is that changing the
> \catcode of @ doesn't change the categories of @ tokens in macros"?

It's more like the \uppercase problem above, ie, \lowercase{É\myE}
returns éÉ, very likely not what we wanted.

>>\terminar{enumeración} % <- that's transcoded using iso hebrew!
>
> But such characters (the Spanish as well as the Hebrew) aren't allowed in
> names in LaTeX!

But they should be allowed in the future if we want a true
multilingual environment.

> E.g. normalization of Unicode is something which should happen on the input
> side, since LaTeX has occationally a need to determine if two pieces of
> text are equal (cf. the xinitials package).

Agreed. See what I say about \edef above.

> It seems to me that what you are trying to do is to use a modified LaTeX
> kernel which still does 8-bit input and output (in particular: it encodes
> every character it puts onto an hlist as an 8-bit quantity) on top of the
> Omega 16-bit (or whatever it is right now) typesetting engine. Whereas this
> is more powerful than the current LaTeX in that it can e.g. do
> language-specific ligature processing without resorting to
> language-specific fonts, it is no better at handling the problems related
> to _multilinguality_ because it still cannot handle character sets that
> spans more than one (8-bit) encoding. How would for example the proposed
> code deal with the (nonsensical but legal) input
>    a\'{e}\k{e}\cyrya\cyrdje\cyrsacrs\cyrphk\textmu?

I don't understand why you say that. In fact I don't understand what you
say :-) -- it looks very complicated to me. Anyway, it can handle two-byte
encodings and utf8, and language style files are written using utf8
(which are directly converted to Unicode without any intermediate
step). Regarding the last line, you can escape the current encoding with
the \unichar macro (which is somewhat tricky to avoid killing
ligatures/kerning). As I say in the readme file, applying that trick
to utf8 didn't work.

Actually, this preliminary lambda doesn't convert \'e to é, but to
e U+0301 (ie, the corresponding combining char). In the internal
Unicode step, accents are normalized in this way and then recombined
by the font ocp. The definition of \' in the la.sd file is very simple:

\DeclareScriptCommand\'[1]{#1\unichar{"0301}}

Very likely, this is one of the parts deserving improvement.

Regards
Javier
___________________________________________________________
Javier Bezos              | TeX y tipografia
jbezos at wanadoo dot es  | http://perso.wanadoo.es/jbezos/
...........................................................
CervanTeX   http://apolo.us.es/CervanTeX/CervanTeX.html
