At 14:28 +0100 2001/02/13, Marcel Oliver wrote:
>2. Internal Representation and Output Encoding:
>
>2.1. Problems with Current TeX:
...
>This leads to a number of problems.
>
>- A sufficiently general internal multilingual representation may be
>  impossible to maintain, unless it is Unicode in disguise.

If we are speaking about tweaking TeX's internals, what is needed is a
stream of characters, where the characters can be subjected to various
operations, such as comparisons. The exact internal representation is
irrelevant.

If the implementation uses, say, C++, one could easily implement such
characters so that they can polymorphically change their internal
representation. It could then be a mixture of 1-4 byte formats.
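A minimal sketch of such a mixed-width stream, to make the idea concrete.
The class name CharStream, the one-byte width prefix and the little-endian
byte order are all assumptions made for this illustration; they are not
anything TeX or an existing library does. Callers only ever see 32-bit
values, so the storage format stays hidden behind the interface.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stream that stores each character in 1-4 bytes,
// preceded by a one-byte width tag (illustrative format only).
class CharStream {
public:
    // Append one character, using the narrowest width that holds it.
    void push(std::uint32_t c) {
        unsigned w = width_for(c);
        bytes_.push_back(static_cast<std::uint8_t>(w));
        for (unsigned i = 0; i < w; ++i)
            bytes_.push_back(static_cast<std::uint8_t>((c >> (8 * i)) & 0xFF));
    }

    // Decode the whole stream back into uniform 32-bit values.
    std::vector<std::uint32_t> decode() const {
        std::vector<std::uint32_t> out;
        std::size_t pos = 0;
        while (pos < bytes_.size()) {
            unsigned w = bytes_[pos++];
            std::uint32_t c = 0;
            for (unsigned i = 0; i < w; ++i)
                c |= static_cast<std::uint32_t>(bytes_[pos++]) << (8 * i);
            out.push_back(c);
        }
        return out;
    }

    // Bytes actually used, including the width tags.
    std::size_t storage_bytes() const { return bytes_.size(); }

private:
    static unsigned width_for(std::uint32_t c) {
        return c < 0x100u ? 1 : c < 0x10000u ? 2 : c < 0x1000000u ? 3 : 4;
    }
    std::vector<std::uint8_t> bytes_;
};
```

Comparisons and other operations would be done on the decoded 32-bit
values, so code using the stream never depends on the byte layout.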

However, if one decided to implement such polymorphism by allocating
each character separately in the free store, it would be slow, and each
character would typically take up 1 (computer-)word to indicate the size
of the allocation, plus the character itself rounded up to a whole word,
which is another word. That is 2 words, or at least 8 bytes, for each
character. And the latest Macs with G3 & G4 use 64 and 128 bit words.
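The arithmetic above can be written out as a small sketch. It assumes,
as the paragraph does, exactly one word of allocator bookkeeping per
allocation; real allocators differ, so the numbers are an estimate only.
The function names are made up for this example.

```cpp
#include <cassert>
#include <cstddef>

// Round n up to the next multiple of word (word must be > 0).
constexpr std::size_t round_up(std::size_t n, std::size_t word) {
    return (n + word - 1) / word * word;
}

// Estimated bytes for one separately allocated character:
// one word recording the allocation size (assumed overhead)
// plus the character payload padded to a whole word.
constexpr std::size_t heap_bytes_per_char(std::size_t payload,
                                          std::size_t word) {
    return word + round_up(payload, word);
}

// With 4-byte words, a 1-byte and a 4-byte character both cost 8 bytes.
static_assert(heap_bytes_per_char(1, 4) == 8, "1-byte char, 32-bit word");
static_assert(heap_bytes_per_char(4, 4) == 8, "4-byte char, 32-bit word");
// With 8-byte words the same character costs 16 bytes.
static_assert(heap_bytes_per_char(4, 8) == 16, "4-byte char, 64-bit word");
```

By comparison, a flat array of 32-bit characters costs 4 bytes per
character regardless of the word size.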

So this suggests that what one should use, for the internal representation,
are 32-bit characters, encoded in some way that makes each character
in the semantic sense unique. (That is, if a group of input characters is
to be regarded as a single semantic entity, it should be replaced with a
unique 32-bit code.) -- Space will be sufficient on today's computers, and
using more compact formats will not be faster, as the CPU's internals will
probably compute in larger words anyway. (That is, if one uses 16-bit
characters, they will probably first be translated into 32-bit words or
larger, the CPU operations will then be performed, and the results then
translated back into 16-bit characters. It will be just as fast to work with
32-bit characters directly, or perhaps even faster if it decreases the need
for making round-offs. Strictly speaking, which one is faster can only
be determined by using a profiler, but 32-bit characters seem to be OK.)
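One way to picture the "unique 32-bit code" idea is a table that interns
groups of input characters. This is only a sketch: the class name
SemanticCharTable is hypothetical, and handing out new codes from
0x110000 upward (just above the Unicode code space) is an assumption
made here so that they cannot collide with ordinary code points.

```cpp
#include <cassert>
#include <map>
#include <string>

// Hypothetical table mapping a group of input code points to a
// single 32-bit code, so later comparisons are plain integer ones.
class SemanticCharTable {
public:
    // Returns the unique code for a non-empty group of code points.
    // A single code point is its own code; a longer group gets a
    // freshly assigned code the first time it is seen.
    char32_t intern(const std::u32string& group) {
        if (group.size() == 1) return group[0];
        auto it = codes_.find(group);
        if (it != codes_.end()) return it->second;
        char32_t code = next_++;
        codes_.emplace(group, code);
        return code;
    }

private:
    std::map<std::u32string, char32_t> codes_;
    char32_t next_ = 0x110000;  // assumed start of the private range
};
```

For instance, interning the two-element group "e" followed by U+0301
(combining acute accent) twice yields the same code both times, and that
code differs from the code for a plain "e".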

  Hans Aberg