LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Subject:	Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)
From:	Hans Aberg
Reply To:	Mailing list for the LaTeX3 project
Date:	Sat, 17 Feb 2001 17:24:45 +0100

At 14:20 +0100 2001/02/17, Frank Mittelbach wrote:

There appear to be two variations: one based on the original TeX, and one
based on TeX with some kind of extensions.

As for the second approach, it seems to me that the internal representation
should be 32-bit Unicode. As TeX does not seem well equipped to handle the
encoding issues, one should then hook up a preprocessor providing the
suitable translations. Thus
    whatever encoding -> preprocessor -> UTeX
This easy-to-write preprocessor can merge combining characters into single
Unicode characters where possible, or otherwise write them in a form that
UTeX can easily handle, say by switching from postfix to prefix notation,
or whatever. With further tweaking of the TeX engine it could even convert
TeX combinations such as "--" and "---" into single Unicode characters.
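
To make this concrete, here is a rough sketch of such a preprocessor in
Python. All names are mine and purely illustrative. I assume that NFC
normalization stands in for "combining combining characters", that "--" and
"---" map to the en dash and em dash, and that "prefix notation" simply
means emitting any leftover combining mark before its base character.

```python
import unicodedata

# Assumed mapping of TeX input combinations to single Unicode characters.
# The longest sequence comes first so "---" is not consumed as "--" plus "-".
TEX_COMBINATIONS = [("---", "\u2014"), ("--", "\u2013")]


def preprocess(raw: bytes, encoding: str) -> str:
    """whatever encoding -> Unicode text suitable for a Unicode-aware TeX."""
    text = raw.decode(encoding)
    # Merge base + combining mark into one precomposed character if one exists.
    text = unicodedata.normalize("NFC", text)
    for sequence, char in TEX_COMBINATIONS:
        text = text.replace(sequence, char)
    return text


def to_prefix(text: str) -> str:
    """Move combining marks that survived NFC in front of their base char."""
    out = []
    for ch in text:
        if unicodedata.combining(ch) and out:
            base = out.pop()
            out.append(ch)    # the mark now precedes ...
            out.append(base)  # ... the character it applies to
        else:
            out.append(ch)
    return "".join(out)
```

For instance, "e" followed by U+0301 comes out as the single character
U+00E9, whereas "x" followed by U+0301 has no precomposed form and is left
for to_prefix to reorder.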

It is further easy to write such translators for reading/writing to files.
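
As a sketch of what I mean (Python, hypothetical helper names), the file
translators only need to know which encoding was chosen for the file; the
engine itself sees nothing but Unicode strings. I use "utf-32" here as a
stand-in for a plain 32-bit Unicode file format.

```python
def write_unicode(path: str, text: str, encoding: str = "utf-8") -> None:
    """Translate internal Unicode text to the chosen file encoding."""
    with open(path, "w", encoding=encoding) as handle:
        handle.write(text)


def read_unicode(path: str, encoding: str = "utf-8") -> str:
    """Translate a file in the chosen encoding back to internal Unicode."""
    with open(path, "r", encoding=encoding) as handle:
        return handle.read()
```

The same text round-trips unchanged whether the file is written as utf-8 or
utf-32, as long as reader and writer agree on the choice.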

Thus the picture becomes

                         - trans C (eg Uppercasing)
                         |      |
                         |      | Hardwired translations
                         V      |
 whatever --> decode --> Unicode ------->  trans B  --> ^^e9
                         ^      |
                         |      |
                         |      |
                       files: Choice of say utf8, Unicode

> > Omega can represent internally non ascii chars and hence
> > actual chars are used instead of macros (with a few exceptions).
> > Trivial as it can seem, this difference is in fact a HUGE
> > difference. For example, the path followed by é will be:
> >
> >  é --an encoding ocp-|           |-- T1 font ocp-->  ^^e9
> >                      +-> U+00E9 -+
> >  \'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

Thus, one does not bother representing ASCII or any other such encoding
internally anymore. Combinations such as \'e are no longer necessary,
except for backwards compatibility. It will probably not make much
difference what you do with those combinations, because documents that
need absolute compatibility for archival purposes can use the old TeX,
while newer documents can be translated when needed. (That is, this will
work if the incompatibilities are not too big.)

> trans A = tokenising 8 bit numbers as the corresponding 16bit numbers
>           Example:
>           if é was in the cp437 code page (German DOS) it would
>           be the 8bit char "82; that would become the 16bit token with
>           number "0082 (which is NOT é in unicode = "00E9)
>
>           if on the other hand  é was in latin1 (where it is "E9) we
>           get "00E9

Such problems would not happen within UTeX, as the preprocessor would
provide the proper translation to \U000000E9.
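
The difference can be checked with a few lines of Python (a sketch only;
the function names are mine, and the \U notation is just printed as an
8-digit hex code point). Widening the raw byte reproduces the "trans A"
problem, while decoding with the declared code page gives the same
character for both inputs.

```python
def naive_widen(raw: bytes) -> str:
    """Trans A: reuse each 8-bit number as the code point of the same value."""
    return "".join(chr(byte) for byte in raw)


def decode_input(raw: bytes, encoding: str) -> str:
    """Preprocessor: translate using the code page the input was written in."""
    return raw.decode(encoding)


def u_notation(text: str) -> str:
    """Render each character as \\U followed by 8 hex digits."""
    return "".join(f"\\U{ord(ch):08X}" for ch in text)


print(u_notation(naive_widen(b"\x82")))              # \U00000082
print(u_notation(decode_input(b"\x82", "cp437")))    # \U000000E9
print(u_notation(decode_input(b"\xe9", "latin-1")))  # \U000000E9
```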

  Hans Aberg
