At 13:29 +0100 2001/06/05, Chris Rowley wrote:
>... rather than attempting to categorise the necessary
>information and devise suitable ways to provide it, Frank and I came
>up with the idea of simply supplying a single logical label for every
>ICR string. Since the first, and still the overwhelmingly most
>diverse, parts of this information came from the needs of multi-lingual
>documents, we called this label the `language' (maybe not a good
>choice). Our thesis is that `every text string must have a
>language-label'.

So this is what I arrived at when playing around with these ideas in my mind:

A "language" is a set of parameters on which the typesetting procedures depend when typesetting what is considered a human textual language. For example, US and UK English differ in whether the primary quotes should be ``...'' or `...', so if one is supposed to enter quotes as, say, \quote{...} and let the language sort it out, then US and UK English are different languages.

So first, one may classify some such common languages. But then it should also be possible to customize them: for example, in Swedish one writes dates as 2001-06-05, but I happen to prefer the format 2001/06/05 (even though I rarely write anything in Swedish). So if dates are entered as, say, \date{2001}{06}{05} and the rendering is sorted out by the choice of language, then by customizing the Swedish date format I have created a new "language".

The details are really a question of implementation (which I describe merely to focus a little on the topic): for the sake of efficiency, one could decide to keep a lookup table of the languages in use, keyed by, say, a 32-bit number. The key 0 could mean old-TeX compatibility, keys 1-65536 could be user-defined languages (varying from document to document), 65537 US English, 65538 UK English, and so on for the other classified languages.
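To make the idea concrete, here is a minimal sketch in Python (purely illustrative; none of the names or the parameter sets come from TeX or LaTeX) of such a lookup table: 32-bit keys map to parameter sets, with key 0 reserved for old-TeX compatibility, 1-65536 for per-document user-defined languages, and 65537 upward for the classified languages.

```python
OLD_TEX = 0                      # old-TeX compatibility
USER_MIN, USER_MAX = 1, 65536    # per-document user-defined key range
CLASSIFIED_MIN = 65537           # first classified-language key

# Each "language" is just a set of typesetting parameters.
LANGUAGES = {
    CLASSIFIED_MIN:     {"name": "US English", "quotes": ("``", "''")},
    CLASSIFIED_MIN + 1: {"name": "UK English", "quotes": ("`", "'")},
    CLASSIFIED_MIN + 2: {"name": "Swedish",    "date": "{y}-{m}-{d}"},
}

_next_user_key = USER_MIN

def define_language(base_key, **overrides):
    """Create a user-defined 'language' by customizing an existing one."""
    global _next_user_key
    assert _next_user_key <= USER_MAX, "user-defined key range exhausted"
    params = dict(LANGUAGES[base_key])
    params.update(overrides)
    key = _next_user_key
    _next_user_key += 1
    LANGUAGES[key] = params
    return key

def render_date(key, y, m, d):
    """Render a date according to the language's date-format parameter."""
    return LANGUAGES[key]["date"].format(y=y, m=m, d=d)

# Customizing the Swedish date format yields a new "language"
# with its own (user-range) key:
my_swedish = define_language(CLASSIFIED_MIN + 2, date="{y}/{m}/{d}")
```

The point of the sketch is only that stamping a string with one small integer key is cheap, while the table behind it can grow arbitrarily many parameters.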
This makes it easy to stamp the language context everywhere in a light-weight manner, and also to add more language parameters by expanding the language lookup table.

>...In order to distinguish these logical language-labels from anything
>else in the TeX world let us call them LLLs. ...
>-- whenever a character token list (in an ICR) is constructed or
> moved, then its LLL must go with it;

In addition to merely stamping the language label on a string, I think one may possibly have to stack it; that is, if there is a quote of French within English, then from within the French quote one can know that it is within an English quote.

>But if you want \foo to be exclusively a bit of Mandarin text then you
>could (or even should) define something like (syntax is probably
>dreadful):
>
> \newcommand{\foo}{\languageIC{mandarin}{\unichar{<Unicode code>}}}
>
>How clever the expansion of \languageIC needs to be will depend on how
>such input will be used.

My guess is that you are saying here that a language can also restrict the characters available in it. For example, if somebody tries to use Greek letters in an English text, something is wrong. One can also think of having a dictionary that checks for words, comparing each word against the given language context whenever possible. Then one would get a warning if a French word that is not in the English dictionary appears in an English context.

>All of the above is completely independent of what input scheme is
>used,

I can think of hybrids: one could have the option to indicate a default language when opening a file. If this language context differs from the context in which the file was opened, the language contexts merely stack up. But perhaps this effect can easily be achieved by some other commands.

>IMPORTANT: After that first time input conversion the input encoding
>that was used is unknown and not needed; this is a vital property of
>our ICR model.
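The stacking idea above can be sketched as follows (again illustrative Python, not anything from TeX): entering a quote pushes the inner language, leaving it pops, and the enclosing context stays visible on the stack.

```python
class LanguageContext:
    """A stack of language labels, innermost on top."""

    def __init__(self, default):
        self.stack = [default]

    def push(self, language):
        self.stack.append(language)

    def pop(self):
        return self.stack.pop()

    def current(self):
        return self.stack[-1]

    def enclosing(self):
        # Everything below the top: e.g. a French quote inside
        # English text can see that it sits within English.
        return self.stack[:-1]

ctx = LanguageContext("English")
ctx.push("French")           # entering a French quote within English text
inner = ctx.current()        # "French"
outer = ctx.enclosing()      # ["English"] -- visible from inside the quote
ctx.pop()                    # leaving the quote restores English
```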
This is also the model I have in my mind: once the input has been translated into Unicode+ plus other eventual parameters, the input encoding becomes a non-issue as far as further TeX/LaTeX processing goes.

>1. How should LICR strings be written out to files used only by LaTeX
> itself?
>
>2. How should LICR strings be written out to files read by other
> applications?
>
>My feeling is that the answer to 1. should, if possible, be something
>independent of any input schemes in use.
>
>It is not so clear that this is possible for 2. and there may be good
>reasons why these two outputs should be the same.

Oops. I had not thought about these. But I think the ideal would be that 1 & 2 are the same. If the language context is stamped everywhere (at least on text), and one should be able to pick it up again, I see two possible solutions.

One is to define an encoding specifying the hierarchy of languages. For example (pseudo-code):

  begin_English English text \quote{begin_French French text end_French} ... end_English

Here begin_English, end_English, begin_French and end_French are file markers of the chosen encoding scheme (which could be something compact, like special Unicode characters not used for anything else).

The other method that comes to my mind is to write two files: one containing the characters without language stamps, and another indicating the language contexts and where they start and end in the first file. This is a common way to handle, say, styled text, at least in old pre-X MacOS (where the additional information goes into the so-called resource fork), but it makes it virtually impossible for humans to edit the files by hand.

>So have I removed the question: "do we need to record the input
>encoding?"? Or merely cleverly hidden it?

This is the picture I have in my mind too: once the input has been processed properly, the input encoding is no longer present anywhere.
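The first method might look like this (an illustrative Python sketch; the marker code points and names are my own assumptions, not part of any existing LaTeX encoding). It uses Unicode private-use characters as compact begin/end markers and nests runs to preserve the language hierarchy:

```python
BEGIN = "\uE000"   # begin-language marker (private-use code point)
END = "\uE001"     # end-language marker
SEP = "\x1f"       # separates the language name from the content

def emit(run):
    """Serialize a nested (language, items) run into one marked-up string.

    Each run is (language, list of items), where an item is either a
    plain text string or another nested run.
    """
    language, items = run
    out = [BEGIN, language, SEP]
    for item in items:
        out.append(emit(item) if isinstance(item, tuple) else item)
    out.append(END)
    return "".join(out)

# A French quote nested inside English text:
doc = ("English",
       ["English text ",
        ("French", ["French text"]),
        " more English"])

serialized = emit(doc)
```

Because the markers nest, a reader of the file can recover not only each string's language but also the full stack of enclosing contexts, matching the stamping-plus-stacking model above.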
The original TeX (judging from your discussions here) does not really seem to be built to handle multiple input encodings being present at the same time. But also in other programming, such as in C/C++, I think it will be difficult to handle more than one internal character encoding in the same program. So from the point of view of efficiency, I think it is safest to stick to just one internal character encoding, if possible.

  Hans Aberg