LATEX-L Archives

Mailing list for the LaTeX3 project


Lars Hellström <[log in to unmask]>
Sat, 12 May 2001 17:40:32 +0200
At 23.24 +0200 01-05-11, Javier Bezos wrote:
>Frank wrote:
>> LaTeX conceptually has only three levels: source, ICR, Output
>However, I still think that it's necessary to separate code processing
>from text processing.  Both concepts are mixed up by TeX (and
>therefore LaTeX) making, say, uppercasing a tricky thing.  Remember
>that \uppercase only changes chars in the code, and that
>\MakeUppercase first expands the argument and then applies a set of
>transformations (including math but not including text hidden in
>protected macros!).  Well, since ocp's are applied to text after
>expansion (not including math but including actual text even if
>hidden) we are doing things at the right place and in the right way.

The problem with current Omega is that it only provides text processing via
OCPs, but no code processing. Uppercasing as a stylistic variation is
clearly text processing and appears to be handled well (and in the right
place) by the current Omega. With TeX it is best handled using special
fonts, but current LaTeX has no interface for that and it would require a
lot of fonts. If uppercasing is done for some other reason then Omega is no
better than TeX.
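The difference Javier points at can be shown in two lines (a sketch,
assuming the standard LaTeX2e behaviour of the time):

```latex
% \uppercase is a TeX primitive: it remaps explicit character tokens
% via their \uccode values, but a macro like \ae passes through unchanged.
\uppercase{\ae sop}       % typesets "æSOP"  (the \ae is not touched)
% \MakeUppercase first expands its argument, then applies LaTeX's
% uppercasing table (\@uclclist), which also maps \ae -> \AE.
\MakeUppercase{\ae sop}   % typesets "ÆSOP"
```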

>Another problem is whether input encoding belongs to code transformations
>or text transformations.  Very likely you are right when you say that
>after full expansion it's too late and when reading the source file is
>too early.  An intermediate step seems more sensible, which would make
>the \'e approach discussed in the recent messages wrong.  Another useful
>addition could be an ocp-aware variant of \edef (or a similar device).

Indeed such a device is needed. Ideally it should work in the mouth (so
that it could be used without messing up the kerning).

>regarding font transformation, they should be handled by fonts, but
>the main problem is that metric information (ie, tfm) cannot be
>modified from within TeX, except a few parameters; I really wonder
>if allowing more changes, mainly ligatures, is feasible (that
>solution would be better than font ocp's and vf's, I think).

I don't understand this. What kind of font transformations are you
referring to?

>>  my requirement for a usable internal representation is that I can take a
>>  single element of it at any time and it has a well-defined meaning (and a
>>  single one).
>Semantically or visually?

I suspect Frank considers meaning to be a semantic concept, not a visual one.

>>> at the LICR level means that the auxiliary files use the Unicode encoding;
>>> if the editor is not a Unicode one these files become unmanageable and
>>> messy.
>> not true. the OICR has to be unicode (or more exactly unique and
>> in the above sense, can be 20bits for all i care) if Omega ever should
>>go off
>> the ground. but the interface to the external world could apply a
>> output translation to something else before writing.
>:-/ I meant from the user's point of view.  (Perhaps the reply was
>too quick...)  What I mean is that any LaTeX file ("main" or
>auxiliary) should follow the LaTeX syntax in a form closer to the
>"representation" selected by the user (by "representation" I mean
>input encoding and maybe a set of macros).

The problem is that in multilingual documents there may not be a single
such representation---the user can change input encoding just about
anywhere in a document. This is why current LaTeX converts everything to
LICR before it is written to the .aux file: the elements of the input
encoding (as Frank called them above) do not have a single well-defined
meaning. What has been discussed is that one might use some form of
Unicode (most likely UTF-8) in these files instead.
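Concretely (a sketch of current LaTeX2e behaviour; the file contents
below are illustrative, not copied from a real run):

```latex
% With \usepackage[latin1]{inputenc}, a section title typed as
%     \section{Caf\'e}     % or with a literal é in the source
% reaches the .toc via \protected@write, by which point the
% encoding-specific shorthand has been reduced to its LICR form:
%     \contentsline {section}{\numberline {1}Caf\'e}{1}
% The LICR element \'e means the same thing no matter which input
% encoding happened to be active where the title appeared.
```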

>Lars wrote:
>> No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
>> then it will be written to a file as \'e.
>Right.  Exactly because of that we should not convert text to Unicode
>at this stage; otherwise we must change the definition depending on
>the file to be read.

We do already change e.g. the \catcode of @ for when .aux files are read.
Changing the input encoding is much more work, but not different in principle.
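For comparison, the catcode change in question is tiny (standard LaTeX2e):

```latex
% While LaTeX reads auxiliary files, @ is temporarily a letter so that
% internal names such as \@writefile parse as single control words:
\makeatletter   % \catcode`\@=11 (letter)
% ... .aux material containing \@writefile, \@setckpt, ... is read here
\makeatother    % \catcode`\@=12 (other) again
```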

>We must only move LaTeX code and its context
>information without changing  it, so that if it is read correctly in the
>main file, it will be read correctly in the auxiliary file.

I believe one of the main problems for multilinguality in LaTeX today is
that there is no way of recording (or maybe even of determining) the
current context so that this information can be moved around with every
piece of code affected by it. Hence most current commands strive instead to
convert the code to a context-free representation (the LICR) by use of
protected expansion.
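The mechanism meant here can be sketched with the LaTeX2e kernel's
\protected@edef (roughly; the inputenc details are more involved):

```latex
% \protected@edef is an \edef in which \protect is temporarily set to
% \@unexpandable@protect, so robust commands survive expansion as
% tokens, while encoding-dependent input (active characters, inputenc
% shorthands) reduces to its context-free LICR form.
\protected@edef\@tempa{Caf\'e}% \@tempa now holds LICR tokens,
                              % independent of the input encoding
```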

>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>> names in LaTeX!
>But they should be allowed in the future if we want a true
>multilingual environment.

Why? They are not part of any text, but part of the markup!

>> It seems to me that what you are trying to do is to use a modified LaTeX
>> kernel which still does 8-bit input and output (in particular: it encodes
>> every character it puts onto an hlist as an 8-bit quantity) on top of the
>> Omega 16-bit (or whatever it is right now) typesetting engine. Whereas this
>> is more powerful than the current LaTeX in that it can e.g. do
>> language-specific ligature processing without resorting to
>> language-specific fonts, it is no better at handling the problems related
>> to _multilinguality_ because it still cannot handle character sets that
>> span more than one (8-bit) encoding. How would for example the proposed
>> code deal with the (nonsensical but legal) input
>>    a\'{e}\k{e}\cyrya\cyrdje\cyrsacrs\cyrphk\textmu?
>I don't understand why you say that.

Because of the example in the summary:
           A       B       C        D         E
TeX   a)   "82     \'e     *    - - - - - >   "E9
      b)   \'e     \'e     *    - - - - - >   "E9
      c)   "82     "82     *    - - - - - >   "82
Omega a)   "82     "82     "82     "00E9      "E9
      b)   \'e     \'e     "82     "00E9      "E9

The last line shows \'e being converted to an 8-bit quantity "82
(apparently the input encoding equivalent) before it is converted to
Unicode. LaTeX lives between columns A and C, so there is no hint of any
non-8-bit processing being done.

>In fact I don't understand what you
>say :-) -- it looks very complicated to me. Anyway, it can handle
>two-byte encodings and utf8, and language style files are written
>using utf8 (which are directly converted to Unicode without any intermediate

That's what I would have expected, but the example gives no hint of this.

>Regarding the last line, you can escape the current encoding with
>the \unichar macro (which is somewhat tricky to avoid killing
>ligatures/kerning). As I say in the readme file, applying that trick
>to utf8 didn't work.

Isn't the \char primitive in Omega able to produce arbitrary characters
(at least arbitrary characters in the basic multilingual plane)?
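If that primitive behaves like its TeX counterpart, the escape would be
as simple as this (an assumption about Omega's \char, untested here):

```latex
% Omega's \char accepts numbers beyond 8 bits, so one would expect
\char"00E9   % é = U+00E9, assuming the current font provides it
% to inject an arbitrary BMP code point regardless of the active
% input encoding (assumption: this bypasses the input OCPs, which
% may be why a separate \unichar macro was provided instead).
```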

>Actually, this preliminary lambda doesn't convert \'e to é, but to
>e U+0301 (ie, the corresponding combining char). In the internal
>Unicode step, accents are normalized in this way and then recombined
>by the font ocp. The definition of \' in the file is very simple:
>Very likely, this is one of the parts deserving improvements.

It looks quite reasonable to me, and it is certainly much better than the
processing depicted in the example. Does this mean that the example should
rather be

    A     B        C          D        E
   \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9

(using the ^^ notation for non-ASCII characters)?

Lars Hellström