LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Hans Aberg <[log in to unmask]>
Tue, 5 Jun 2001 17:21:20 +0200
text/plain (132 lines)
At 13:29 +0100 2001/06/05, Chris Rowley wrote:
>... rather than attempting to categorise the necessary
>information and devise suitable ways to provide it, Frank and I came
>up with the idea of simply supplying a single logical label for every
>ICR string.  Since the first, and still the overwhelmingly most
>diverse, parts of this information came from the needs of multi-lingual
>documents, we called this label the `language' (maybe not a good
>choice).  Our thesis is that `every text string must have a
>language-label'.

So this is what I arrived at when playing around with these ideas:

A "language" is a set of parameters on which the typesetting procedures
depend when typesetting what is considered a human textual language.

For example, US and UK English differ in whether the primary quotes should be
``...'' or `...', so if one is supposed to enter quotes as, say, \quote{...}
and let the language sort it out, then US and UK English are different
languages. So first, one may classify some such common languages.
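As a minimal sketch of this idea (the function name, language labels, and table are all made up for illustration, not any actual LaTeX3 interface): the rendering of a \quote{...}-style command is looked up from the current language's parameter set.

```python
# Hypothetical sketch: a \quote-like operation whose rendering depends on
# the current "language" parameter set. The labels and quote strings below
# are illustrative; they use TeX input conventions for the quote characters.
QUOTES = {
    "en-US": ("``", "''"),   # US English: double quotes first
    "en-UK": ("`", "'"),     # UK English: single quotes first
}

def quote(text, language):
    """Render quoted text using the quote style of the given language."""
    open_q, close_q = QUOTES[language]
    return open_q + text + close_q
```

On this model the two Englishes really are distinct "languages", since the same input \quote{...} renders differently under each label.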

But then it might be possible to customize: for example, in Swedish one
writes dates as 2001-06-05, but I happen to prefer the format 2001/06/05
(even though I rarely write anything in Swedish). So if dates are entered,
say, by \date{2001}{06}{05} and the rendering is sorted out by the choice of
language, I have created a new "language" by customizing the Swedish date
format.
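The customization step can be sketched as deriving a new parameter set from an existing one (all names here are hypothetical; "language" is just a dictionary of typesetting parameters, as in the definition above):

```python
# Sketch: dates entered as \date{2001}{06}{05}; the "language" decides the
# rendering. Overriding one parameter of Swedish yields a new "language".
LANGUAGES = {
    "swedish": {"date_sep": "-"},
}

def derive(base, **overrides):
    """Create a new 'language' by overriding parameters of an existing one."""
    params = dict(LANGUAGES[base])
    params.update(overrides)
    return params

def render_date(year, month, day, language_params):
    """Render a \date-style input according to the language's separator."""
    sep = language_params["date_sep"]
    return f"{year}{sep}{month}{sep}{day}"

my_swedish = derive("swedish", date_sep="/")  # Swedish, but with my preference
```

So render_date("2001", "06", "05", ...) gives 2001-06-05 under standard Swedish and 2001/06/05 under the customized variant.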

The exact details are really a question of implementation (which I describe
merely to focus the topic a little): for the sake of efficiency, one could
keep a lookup table of the languages in use, keyed by, say, a 32-bit number.
Key 0 could mean old-TeX compatibility, keys 1-65536 could be user-defined
languages (with numbers varying from document to document), 65537 US English,
65538 UK English, and so on with the other classified languages.

This makes it easy to stamp the language context everywhere in a lightweight
way, and also to add more language parameters by expanding the lookup table.
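A sketch of this numeric-key table follows; the key ranges are those proposed above, while the table contents and function names are invented for illustration:

```python
# Sketch of the numeric-key language lookup table described above.
OLD_TEX_COMPAT = 0                 # key 0: old-TeX compatibility
USER_MIN, USER_MAX = 1, 65536      # user-defined languages, per document
US_ENGLISH = 65537
UK_ENGLISH = 65538

language_table = {
    OLD_TEX_COMPAT: {"name": "old-TeX compatibility"},
    US_ENGLISH: {"name": "US English", "quotes": ("``", "''")},
    UK_ENGLISH: {"name": "UK English", "quotes": ("`", "'")},
}

def register_user_language(key, params):
    """Add a user-defined language; these keys vary from document to document."""
    if not (USER_MIN <= key <= USER_MAX):
        raise ValueError("user-defined language keys must lie in 1-65536")
    language_table[key] = params
```

Because each text string then carries only a small integer, the stamp is cheap, and enriching a language means growing its entry in the table rather than touching the stamped text.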

>...In order to distinguish these logical language-labels from anything
>else in the TeX world let us call them LLLs.
...
>-- whenever a character token list (in an ICR) is constructed or
>   moved, then its LLL must go with it;

In addition to merely stamping the language label on a string, I think one
may possibly have to stack it; that is, if there is a French quote within
English text, then from within the French quote one can know that it is
inside an English quote.
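The stacking idea can be sketched as follows (a plain stack of labels; the class and method names are made up):

```python
# Sketch: language labels kept on a stack, so text inside a French quote
# embedded in English text can still see the enclosing English context.
class LanguageStack:
    def __init__(self):
        self._stack = []

    def push(self, lang):
        self._stack.append(lang)

    def pop(self):
        return self._stack.pop()

    def current(self):
        """The innermost (active) language context."""
        return self._stack[-1]

    def enclosing(self):
        """The language context surrounding the current one, if any."""
        return self._stack[-2] if len(self._stack) > 1 else None
```

So inside the French quote, current() is French while enclosing() still reports English.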

>But if you want \foo to be exclusively a bit of Mandarin text then you
>could (or even should) define something like (syntax is probably
>dreadful):
>
>  \newcommand{\foo}{\languageIC{mandarin}{\unichar{<Unicode code>}}}
>
>How clever the expansion of \languageIC needs to be will depend on how
>such input will be used.

My guess is that you are saying here that a language can also restrict the
characters available in it. For example, if somebody tries to use Greek
letters in an English text, something is wrong.

One can also think of having a dictionary that checks words, comparing them
whenever possible against the given language context. One would then get a
warning if a French word that is not in the English dictionary appears in an
English context.
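Such a check might look like this (the dictionaries and the words in them are of course made up):

```python
# Sketch: warn when a word is absent from the dictionary of the current
# language context. Tiny toy dictionaries for illustration only.
DICTIONARIES = {
    "english": {"the", "cat", "sat"},
    "french": {"le", "chat"},
}

def check_words(words, context):
    """Return a warning for each word not found in the context's dictionary."""
    dictionary = DICTIONARIES[context]
    return [f"warning: '{w}' not in {context} dictionary"
            for w in words if w not in dictionary]
```

Here a French word like "chat" appearing in an English context would produce a warning, while correctly labelled French text would pass silently.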

>All of the above is completely independent of what input scheme is
>used,

I can think of hybrids: one could have the option of indicating a default
language when opening a file. If this language context differs from that of
the place where the file was opened, the language contexts simply stack up.

But perhaps this effect can easily be achieved by some other commands.

>IMPORTANT: After that first time input conversion the input encoding
>that was used is unknown and not needed; this is a vital property of
>our ICR model.

This is also the model I have in mind: once the input has been read and
translated into Unicode plus whatever other parameters are needed, the input
encoding becomes a non-issue as far as further TeX/LaTeX processing goes.

>1.  How should LICR strings be written out to files used only by LaTeX
>    itself?
>
>2.  How should LICR strings be written out to files read by other
>    applications?
>
>My feeling is that the answer to 1. should, if possible, be something
>independent of any input schemes in use.
>
>It is not so clear that this is possible for 2. and there may be good
>reasons why these two outputs should be the same.

Oops. I did not think about these. But I think the ideal would be for 1 and
2 to be the same.

If the language context is stamped everywhere (at least on text), and one
should be able to pick it up again, I see two possible solutions:

One is to define an encoding specifying the hierarchy of languages. For
example (pseudo-code):
  begin_English English text \quote{begin_French French text end_French}
    ... end_English
Here begin_English, end_English, begin_French, and end_French are file
markers of the chosen encoding scheme (which could be something compact,
like special Unicode characters not used for anything else).
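This first method can be sketched as a serializer for nested language spans; readable begin_/end_ strings stand in for whatever compact markers the real scheme would use:

```python
# Sketch of the marker method: serialize nested language spans with explicit
# begin/end markers. A real scheme might use reserved Unicode characters;
# readable marker strings are used here for illustration.
def serialize(lang, parts):
    """parts is a list of plain strings or nested (lang, parts) spans."""
    out = [f"begin_{lang}"]
    for p in parts:
        if isinstance(p, tuple):
            out.append(serialize(*p))   # recurse into the nested span
        else:
            out.append(p)
    out.append(f"end_{lang}")
    return " ".join(out)
```

Nesting in the data directly reproduces the stacked language contexts when the file is read back.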

The other method that comes to mind is to write two files: one with the
characters stripped of language context, and another indicating the language
contexts and where they start and end in the first file. This is a common
way to handle, say, styled text, at least in old pre-X MacOS (where the
additional information goes into the so-called resource fork), but it makes
the result virtually impossible for humans to edit by hand.
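The two-file method amounts to splitting labelled text into a bare character stream plus a span table (offsets into the first stream); this sketch ignores nesting for simplicity:

```python
# Sketch of the two-file method: one stream of bare characters plus a
# separate record of language spans (start, end, language), in the spirit
# of the old MacOS resource fork. Nested contexts are ignored here.
def split_streams(spans):
    """spans: list of (language, text). Returns (plain_text, span_table)."""
    text = ""
    table = []
    for lang, chunk in spans:
        start = len(text)
        text += chunk
        table.append((start, len(text), lang))
    return text, table
```

The plain-text file stays readable on its own, but as noted above, hand-editing it silently invalidates the offsets in the companion file.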

>So have I removed the question: "do we need to record the input
>encoding?"?  Or merely cleverly hidden it?

This is the picture I have in mind too: once the input has been processed
properly, the input encoding is no longer present anywhere. The original TeX
(judging from your discussions here) does not really seem to be built to
handle multiple input encodings at once.

The same holds in other programming contexts, such as C/C++: I think it
would be difficult to handle more than one internal character encoding in
the same program. So from the point of view of efficiency, I think it is
safest to stick to just one internal character encoding, if possible.

  Hans Aberg
