LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender: Mailing list for the LaTeX3 project <[log in to unmask]>

Subject: Re: LaTeX's internal char representation (UTF8 or Unicode?)

From: Hans Aberg <[log in to unmask]>

Date: Sun, 18 Feb 2001 21:18:11 +0100

Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>

Just as an input, here is a system for "chef" preprocessing (i.e., to make
it palatable before it reaches TeX's mouth, in order to avoid internal
indigestion) that comes to my mind:

Every file that UTeX reads is required to carry, in say its first 4 bytes,
information about its general encoding, interpretable as ASCII (padded
with spaces), say
BYTE   -- eight-bit mixed encoding
UT8    -- UTF8
UT16   -- UTF16
U16    -- Unicode 16
U32    -- Unicode 32
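
As a rough illustration of the header idea, here is a minimal Python sketch
that splits off the 4-byte, space-padded tag. The tag names are the ones
listed above; the function name read_header, the table HEADER_CODECS and the
choice of Python codecs (in particular treating "Unicode 16" as UTF-16) are
assumptions made for the example only.

```python
# Hypothetical mapping from the 4-byte tag to a Python codec name.
# Tag names follow the list above; the codec choices are assumptions.
HEADER_CODECS = {
    "BYTE": None,       # mixed 8-bit: per-segment encodings are declared later
    "UT8": "utf-8",
    "UT16": "utf-16",
    "U16": "utf-16",    # assumption: "Unicode 16" read as 16-bit code units
    "U32": "utf-32",
}


def read_header(data: bytes) -> tuple[str, bytes]:
    """Split off the space-padded ASCII tag; return (tag, remaining bytes)."""
    if len(data) < 4:
        raise ValueError("file too short to carry an encoding tag")
    tag = data[:4].decode("ascii").rstrip(" ")
    if tag not in HEADER_CODECS:
        raise ValueError(f"unknown encoding tag: {tag!r}")
    return tag, data[4:]
```

For the four Unicode tags the remaining bytes could then be decoded in one
go; only BYTE needs the further declarations.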

For the last four, no further information is needed. In the first case,
BYTE, a series of lines follows, each one indicating an encoding that
might be used and its start sequence. -- It might be difficult to foresee a
suitable start sequence for every possible file, so one could allow
individual choices for each file. It could look like:
ASCII   <as>
Latin-1 <we>
Russian <ru>
...
(Or whatever official names one decides to have for the different encodings.)
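
A declaration block of this shape could be read into a lookup table roughly
as follows. This is only a sketch: the function name parse_declarations is
made up, and the rule that a blank line ends the block is an assumption,
since nothing above says how the list is terminated.

```python
def parse_declarations(lines):
    """Map each start sequence (e.g. '<we>') to its declared encoding name."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line:
            break  # assumption: a blank line ends the declaration block
        # The start sequence is the last whitespace-separated field.
        name, marker = line.rsplit(None, 1)
        table[marker] = name
    return table
```

Keying the table on the start sequence rather than on the encoding name
matches what the preprocessor has to look up while scanning.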

The preprocessor then zips through the file, looking for the indicated
character combinations, in this case <as> (7 bit), <we> (Western European),
<ru> (Russian), etc., and applies the encodings to the characters that
follow. So for example, one would have to write
  <as> ...
  \def\bar{bla bla \french{<we>f��<as>} bla bla}
as \french does not handle character encodings anymore, but only other
aspects that might be related to the use of French in TeX.
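
The scanning step might be sketched like this in Python. Everything named
here is hypothetical: the function preprocess, the table MARKER_CODECS, the
choice of KOI8-R for "Russian", and the decision to drop the start sequences
from the output and to reject text that precedes the first one.

```python
import re

# Assumed mapping from start sequences to Python codecs; the actual
# encodings are left open above ("Russian" is taken to be KOI8-R here).
MARKER_CODECS = {b"<as>": "ascii", b"<we>": "latin-1", b"<ru>": "koi8-r"}


def preprocess(body: bytes, codecs=MARKER_CODECS) -> str:
    """Decode a mixed 8-bit body, switching codec at each start sequence."""
    pattern = re.compile(b"(" + b"|".join(re.escape(m) for m in codecs) + b")")
    # With a capturing group, split() keeps the markers:
    # [text before first marker, marker, text, marker, text, ...]
    parts = pattern.split(body)
    if parts[0]:
        raise ValueError("text before the first start sequence has no encoding")
    out = []
    for marker, chunk in zip(parts[1::2], parts[2::2]):
        out.append(chunk.decode(codecs[marker]))
    return "".join(out)
```

Called on bytes such as b"<as>\\french{<we>\xe9t\xe9<as>}" (an invented
Latin-1 word, not the one from the example above), this yields the Unicode
string \french{été} with the markup removed.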

It is an inconvenience that one has to do this explicit markup, but it is
only needed when one is mixing encodings. In the example above, ASCII and
Latin-1 agree on the bottom 7 bits (i.e., those characters have the same
Unicode translation), so the markup would not have been needed if one used
Latin-1 throughout.
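
The agreement on the bottom 7 bits is easy to check with Python's built-in
codecs. The variable names and the sample Latin-1 word below are invented
for the illustration.

```python
# All 128 seven-bit values decode to the same Unicode characters
# under ASCII and under Latin-1.
seven_bit = bytes(range(128))
same_below_128 = seven_bit.decode("ascii") == seven_bit.decode("latin-1")

# A line kept in Latin-1 throughout therefore needs no switch to ASCII.
line = b"bla bla \\french{\xe9t\xe9} bla bla".decode("latin-1")
```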

One spin-off, though, is that it is fairly easy to convert such files to
Unicode, once Unicode editors become available. -- If, in the example
above, one instead required the <we>...<as> markup to be part of the
\french command, it would become more complicated to translate such a file
to a Unicode format.

  Hans Aberg
