LATEX-L Archives

Mailing list for the LaTeX3 project


Mailing list for the LaTeX3 project <[log in to unmask]>
Hans Aberg <[log in to unmask]>
Mon, 14 May 2001 12:19:00 +0200
Mailing list for the LaTeX3 project <[log in to unmask]>
text/plain (83 lines)
At 10:10 +0100 2001/05/14, Robin Fairbairns wrote:
>in practice, most people know what encodings their files are in.  and
>if they're into unicode, and encoding in utf-8 or utf-16, the chance
>that they'll also be using another encoding is likely rather small;

The chance of encountering mixed sets of encodings is very big if one
downloads files from the Internet or is maintaining an archive.

In addition, one might want to allow the use of mixed encodings in the same
file (for example, Cyrillic plus Latin).

So whatever scheme you come up with, I think it must fulfill the requirement
that the encodings used can somehow be easily identified. The manual
version that you suggest may, I think, end up becoming a pain in the neck.

> if
>they're using latin-1 in parallel, it'll be consumed quite happily by
>a utf-8 decoder.  imposing a schema file on *everything* is wild

So allow a default encoding other than 32-bit Unicode to be set, then.

>>(If Omega
>> uses C++ for IO, one can use something called a codecvt. Or use pipes,
>> where available.)
>no.  omega does (shame) use clunky old c++ for some parts of its

One should not use old C++, but the current C++ standard.

> but it uses its own ocp mechanism for transforming
>encodings.  macro coding to switch ocps at input time is trivial, but
>not attractive for the normal case of using the same encoding all the

At 23:58 +0200 2001/05/13, Lars Hellström wrote:
>Read Sections 8--12 (Section 12 in particular) of the Omega draft
>documentation---that will answer you question more thoroughly that I bother
>to do right now. Marcel's summary contains a reference to it. But in short
>the equivalent functionality is already implemented (without resorting to
>language or platform specific mechanisms such as those you mention).

One reason that Omega is not using C/C++ for code conversions might be that
it apparently is quite difficult:

Most people do not realize that there is no way in C/C++ to ensure that one
writes a byte of exactly 8 bits -- this is in fact platform (or rather,
compiler) dependent. All one knows is that a C/C++ byte has at least 8 bits,
even though most (but not all) compilers use 8-bit C/C++ bytes.

Also, there is no way to ensure that there is an integral type with exactly
32 bits. And it is (currently) said to be quite painful to write Unicode
programs on many platforms.

However, I got the following suggestion on the C++ newsgroups for a C++
approach:
On each compiler, get hold of an integral type with at least 32 bits, and
use that as the character type in the program. Then, if one knows that the
C/C++ byte is 8 bits (for example by checking the CHAR_BIT macro), one
also knows that file IO takes place in 8-bit chunks.

Then one evidently can write something called a codecvt (code converter)
that makes the file IO translations transparent to the one writing the
C/C++ program.

One advantage could be speed. And if such codecvt's are relatively easy to
write (perhaps there are libraries), it becomes easy to add translations
for many different formats without making the C/C++ program itself any
more complicated.

One interesting possibility, which I do not think has been discussed here,
is the ability to read compressed files without unpacking them. One way to
compress a file is to make a statistical character-frequency analysis and
then build a variable-bit-length character translation table. If the
compression scheme is right, it might be easy to write a codecvt for that
too. (Java uses .zip files that way, I think, which also allows files to
be bundled.)

  Hans Aberg