LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Subject:
From:
Frank Mittelbach <[log in to unmask]>
Reply To:
Mailing list for the LaTeX3 project <[log in to unmask]>
Date:
Thu, 10 May 2001 20:59:31 +0200
Javier wrote in reply to Lars:

 > Quick answers to a couple of points. Lars says:
 >
 > >The comparison in Section 3.2.1 of how characters are processed in TeX and
 > >Omega respectively also seems strange. In Omega case (b), column C, we see
 > >that the LICR character \'e is converted to an 8-bit character "82 before
 > >some OTP converts it to the Unicode character "00E9 in column D. Surely
 > >this can't be right---whenever LICR is converted to anything it should be
 > >to full Unicode, since we will otherwise end up in an encoding morass much
 > >worse than that in current LaTeX.

in my opinion this whole section is incorrect too, or at best half correct
(sorry). the real problem is that this whole area inside omega is, at the
current point in time, nowhere near consistent, because the OTPs have been
hooked into the wrong places. that in turn is due to the technical ease of
opening up the code at the points where it was opened up, the near
impossibility of doing it elsewhere without rewriting the whole of TeX, and
the fact that the whole method was originally intended for far simpler tasks
and not for the grand picture.

conceptually LaTeX has a well-defined ICR (though with a somewhat clumsy
implementation due to the technical limitations of TeX), while at the current
point in time Omega has no such beast.

For LaTeX the line c) in your table simply doesn't exist (it is not supported
code), and the columns actually do not make much sense for LaTeX either: they
only reflect the missing concept of an internal encoding in Omega, i.e. you
are looking at the thing from the current omega implementation.

LaTeX conceptually has only three levels: source, ICR, Output

and something like step C lives along the way from ICR -> Output, but it is
only a technicality of no conceptual importance. all the reasoning about and
manipulation of text is done in only one form, the ICR. and step D is, in
LaTeX, the transformation from source to LICR via inputenc.

For omega one would expect that to be the same, except that the OICR would be
something like U+00E9; but it isn't, as it takes a long while for text to get
to this form (if ever!!!!!).

you can say it differently as follows:

 my requirement for a usable internal representation is that I can take a
 single element of it at any time and it has a well-defined meaning (and a
 single one).

now for the LICR this is the case, but for Omega it is (right now) not.
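To illustrate the requirement, here is a small sketch in Python (not Omega code; the values are just examples): in a codepoint-based representation every element is one character with exactly one meaning, while in a raw byte stream a single element cannot be interpreted without its context.

```python
# A sketch (illustrative values, not Omega code) of the requirement:
# every single element of the internal representation must have
# exactly one well-defined meaning, independent of context.

# Codepoint stream: each element *is* one character.
licr_like = [0x00E9, 0x0065]                  # é, e
assert all(0 <= cp <= 0x10FFFF for cp in licr_like)
print(''.join(chr(cp) for cp in licr_like))   # -> ée

# Byte stream: a single element is ambiguous without context.
byte_stream = [0x00, 0xE9]
# Is 0xE9 a Latin-1 'é' on its own, or the low byte of the 16-bit
# value 0x00E9?  The element alone cannot tell you, so this
# representation is not well defined in the above sense.
```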


as a result one ends up having to explain all those problems of
misinterpreting the internal forms when you do this or that at a certain
stage (like storing text in a token register and reusing it at some other
point, or never passing it to the hlist builder, where the OTPs actually
execute).


from your second ascii drawing in that section one would get the impression
(for a moment) that Omega has a well-defined OICR which is U+00E9, but as we
know this is unfortunately not the case, though it should be!!!
(and to be honest, seeing the word "fontenc" on the left makes me shudder,
though I understand why Javier put it there originally; I think it is a
horrible misinterpretation of what fontenc conceptually does)

 > Surely it's right :-). Remember that  is not an active character in
 > lambda and that ocp's are applied after expansion. Let's consider

but OCPs should work on the OICR and not on undefined byte sequences!
like here:

 > the input \'e. It's expanded to the character sequence "82 "82 "82,
 > which is fine.

which is not fine, not fine at all,
because of this:

 > If we define \'e as "00E9 the expansion is "82 "00 "E9
 > "82, which is definitely wrong. Further, converting the input to Unicode

it is not just the latter that is wrong: the whole thing is wrong
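To make the problem concrete, here is a hedged sketch in Python (not actual Omega behaviour; the slot value 0x82 is taken from the quoted example) of why an expansion like "82 "00 "E9 "82 is hopeless: once 8-bit characters are interleaved with a 16-bit value split into bytes, no single reading of the stream recovers the intended text.

```python
# Hedged sketch (not actual Omega behaviour) of why mixing 8-bit
# characters with a 16-bit value split into bytes is hopeless.
# Assume 0x82 is the 8-bit font slot for é, and \'e has been
# redefined to emit the Unicode value 0x00E9 as two bytes.

stream = [0x82, 0x00, 0xE9, 0x82]   # the expansion described above

# Reading 1: treat every element as one 8-bit character.
as_bytes = bytes(stream)
print(len(as_bytes))                # -> 4 "characters", not 2

# Reading 2: treat byte pairs as 16-bit big-endian values.
as_u16 = [(stream[i] << 8) | stream[i + 1] for i in range(0, 4, 2)]
print([hex(v) for v in as_u16])     # -> ['0x8200', '0xe982']

# Neither reading recovers the two intended characters; the stream
# has no single well-defined interpretation.
```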

 > at the LICR level means that the auxiliary files use the Unicode encoding;
 > if the editor is not a Unicode one these files become unmanageable and
 > messy.

not true. the OICR has to be Unicode (or, more exactly, unique and
well-defined in the above sense; it can be 20 bits for all i care) if Omega
is ever to get off the ground. but the interface to the external world could
apply a well-defined output translation to something else before writing.

that could be utf8, but in fact it could be anything, as long as it is
definable and controllable, so that you know what the file ends up containing
and inputting it back again would turn it right back into the OICR.
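A sketch in Python of this round-trip requirement (the choice of UTF-8 and the sample text are illustrative): the internal form is a sequence of codepoints, the external translation is invertible, and reading the file back must reproduce the internal form exactly.

```python
# Sketch of the round-trip requirement (encoding and text are
# illustrative choices): internally the OICR is a sequence of
# Unicode codepoints; on write a well-defined, invertible output
# translation is applied; reading the file back must reproduce
# the internal form exactly.

oicr = "Voil\u00e0 un \u00e9l\u00e9phant"   # internal form

external = oicr.encode("utf-8")             # output translation
# ... bytes written to the external file, read back later ...
roundtrip = external.decode("utf-8")        # input translation

assert roundtrip == oicr                    # the invariant asked for
```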


 > LICR should preserve, IMO, the current LaTeX conventions, and \'e
 > should be written to these files in exactly that way.

not sure what you mean by "current LaTeX conventions": the current LaTeX
convention is that external files are always written in a special 7-bit
representation of the LICR (involving things like \IeC). not wonderful, but
conceptually clean.
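As a hedged analogue in Python (the escape syntax below is invented for illustration and is not \IeC's actual syntax): the internal form is escaped into a pure-ASCII representation before writing, and unescaping restores it exactly, which is what makes the convention conceptually clean.

```python
import re

# Hedged analogue (Python, not TeX; the escape syntax is invented
# for illustration and is NOT \IeC's) of writing external files in
# a 7-bit representation of the internal form.

def to_7bit(s: str) -> str:
    # escape every non-ASCII character into a pure-ASCII form
    return ''.join(c if ord(c) < 0x80 else '\\u{%04X}' % ord(c)
                   for c in s)

def from_7bit(s: str) -> str:
    # invert the escaping, restoring the internal form
    return re.sub(r'\\u\{([0-9A-F]{4})\}',
                  lambda m: chr(int(m.group(1), 16)), s)

internal = "caf\u00e9"
written = to_7bit(internal)
assert written.isascii()              # safe for any 7-bit channel
assert from_7bit(written) == internal # lossless round trip
```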


 > Or in other words,
 > any file to be read by LaTeX should follow the "external" LaTeX
 > conventions and only transcoded in the mouth.

????

 > >As I understand the Omega draft documentation, there can be no more than
 > >one OTP (the \InputTranslation) acting on the input of LaTeX at any time
 > >and that OTP in only meant to handle the basic conversion from the external
 > >encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
 > >Unicode. All this happens way before the input gets tokenized, so there is
 >
 > In fact, \InputEncoding was not intended for that, but only for
 > "technical" translations which applies to the whole document
 > as one byte -> two byte or little endian -> big endian. The main
 > problem of it is that it doesn't translate macros:
 > \def\myE{}
 > \InputEncoding <an encoding>
 > \myE

\InputEncoding is the point where one needs to go from the external source
encoding to the OICR, and that is precisely the wound: the current
\InputEncoding isn't doing this job fully (and, to be fair, it is not clear
how to do it properly).

but in my opinion it is absolutely essential that this all gets disentangled
and Omega ends up with a proper OICR model. only then could it become usable
in a broader sense.

cheers
frank

ps: I would really like to thank Oliver a lot for doing this compilation. The
fact that we don't agree with some points in it only means that the processes
are so complicated that we haven't yet understood them properly and so need to
work further on them (and a document like this does help)
pps: what might help as well is to identify the parts we feel are
controversial and actually mark them (perhaps with some marginal
notes\marginpar{FMi: bla bla}\marginpar{JLo: Fmi talks rubbish}\marginpar{LHe:
they both seem to have no idea what they are talking about} :-)
ppps: i'm off to GUTenberg, so don't be surprised if flaming replies go
unanswered by me for a while
