LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender: Mailing list for the LaTeX3 project

Subject: Re: Multilingual Encodings Summary 2.2

From: Lars Hellström

Date: Thu, 10 May 2001 20:59:08 +0200

At 19.00 +0200 01-05-10, jbezos wrote:
>Quick answers to a couple of points. Lars says:
>
>>The comparison in Section 3.2.1 of how characters are processed in TeX and
>>Omega respectively also seems strange. In Omega case (b), column C, we see
>>that the LICR character \'e is converted to an 8-bit character "82 before
>>some OTP converts it to the Unicode character "00E9 in column D. Surely
>>this can't be right---whenever LICR is converted to anything it should be
>>to full Unicode, since we will otherwise end up in an encoding morass much
>>worse than that in current LaTeX.
>
>Surely it's right :-). Remember that é is not an active character in
>Lambda and that OCPs are applied after expansion. Let's consider
>the input é\'eé. It's expanded to the character sequence "82 "82 "82,
>which is fine. If we define \'e as "00E9 the expansion is "82 "00 "E9
>"82, which is definitely wrong.

Aren't you completely confusing the text file format with the internal
format (OICR or whatever) here? If \'e is defined anything like it is today
(via \DeclareTextComposite) it should expand to a category 11 token with
character code "00E9, i.e., the token ^^^^00e9. I see no indication in the
docs that Omega would convert such a token to two tokens.
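
For illustration, this is roughly how the composite is declared for the T1 font encoding in current LaTeX. The second, commented-out line is only a hypothetical sketch of a Unicode-internal analogue: the encoding name `UCS` is an assumption made for this example and is not taken from the Omega or Lambda documentation.

```latex
% As found in t1enc.def: in the T1 font encoding, the accent \' applied
% to e selects slot 233 (= "E9), a single character.
\DeclareTextComposite{\'}{T1}{e}{233}

% Hypothetical analogue for a Unicode-based internal encoding.
% "UCS" is an assumed name used only for this illustration.
% \DeclareTextComposite{\'}{UCS}{e}{"00E9}
```

In both cases the point is that the composite stands for one character code, not for a sequence of bytes.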

>Further, converting the input to Unicode
>at the LICR level means that the auxiliary files use the Unicode encoding;

No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
then it will be written to a file as \'e.
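
A minimal sketch of that behaviour, assuming a standard LaTeX kernel: `\protected@edef` expands its argument with `\protect` set the way it is when material is written to a file, so the accent command should survive unexpanded. The macro name `\demo` is hypothetical.

```latex
\documentclass{article}
\begin{document}
\makeatletter
% Inside \protected@edef, \protect is not \@typeset@protect, so the
% encoding-specific text command \' is kept as a command instead of
% being turned into a font slot.
\protected@edef\demo{caf\'e}
\makeatother
% Show the stored replacement text on the terminal and in the log;
% it should still contain \'e.
\typeout{\meaning\demo}
\end{document}
```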

>if the editor is not a Unicode one these files become unmanageable and messy.
>LICR should preserve, IMO, the current LaTeX conventions, and é\'eé
>should be written to these files in exactly that way.

Well, if you're not naughtily assuming that input encoding equals font
encoding, and hence use inputenc to interpret the non-ASCII-characters
(which I assume were és when you sent them), then the above text is
_currently_ written to the .aux file as \'e\'e\'e, not as é\'eé.

>Or in other words,
>any file to be read by LaTeX should follow the "external" LaTeX
>conventions and only transcoded in the mouth.
>
>>As I understand the Omega draft documentation, there can be no more than
>>one OTP (the \InputTranslation) acting on the input of LaTeX at any time
>>and that OTP is only meant to handle the basic conversion from the external
>>encoding (ASCII, latin-1, UTF-8, or whatever) to the internal 32-bit
>>Unicode. All this happens way before the input gets tokenized, so there is
>
>In fact, \InputEncoding was not intended for that, but only for
>"technical" translations which apply to the whole document,
>such as one byte -> two byte or little endian -> big endian.

Assuming that \InputEncoding is some alias for the \InputTranslation
primitive, that's roughly what I meant; maybe translation from latin-1 was
a bit off the target. OTOH you seem to assume below that \InputEncoding
should also handle translations which are just as untechnical!!?

>The main
>problem of it is that it doesn't translate macros:
>\def\myE{ }
>\InputEncoding <an encoding>
> \myE
>
>only the explicit   is transcoded.

Isn't that a bit like saying "the main problem is that changing the
\catcode of @ doesn't change the categories of @ tokens in macros"?
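
The \catcode half of that analogy can be sketched as follows; the macro names `\my@text` and `\demo` are hypothetical. Category codes are attached when the characters are read, so a later change does not touch tokens already stored in a macro.

```latex
\documentclass{article}
\begin{document}
\makeatletter          % @ now has category code 11 (letter)
\def\my@text{stored}   % defines the control sequence \my@text
\def\demo{\my@text}    % the body is tokenized now, as the single token \my@text
\makeatother           % @ has category code 12 (other) again
% \demo still expands to the token \my@text, although typing \my@text
% at this point would be read as \my followed by the characters @text.
\typeout{\demo}        % writes "stored" to the terminal and log
\end{document}
```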

> However, that can be desirable
>under some circumstances, but you know in advance which encodings
>will be used.

That this is known sounds like a very dangerous assumption to me.

>More dangerous is the following:
>
>\comenzar{enumeración} % Spanish interface with, say, MacRoman
>                       % \comenzar means \begin
>\InputEncoding <iso hebrew>
>
>\terminar{enumeración} % <- that's transcoded using iso hebrew!

But such characters (the Spanish as well as the Hebrew) aren't allowed in
names in LaTeX!

Lars Hellström
