Marcel wrote (very wisely):

> I'd like to bring the discussion back to the ICR issue, in
> particular how a hypothetical successor to TeX should handle
> input encodings.

The subsequent discussion led me to write the following, which does not in fact say much about input encodings; I hope that the reason for this omission becomes clear to those who read further.

chris

Part the First:

First, I regret to say, some general theory about text characters and languages.

Note that the ICR is a Character Representation, in the sense defined in the Unicode standard, and not a glyph representation or an Enhanced Character Representation, ie one in which each `character' contains more information than simply its (encoded) name.

A string in such an ICR is useful because it is a standardised representation in which each character representation has a fixed meaning \emph{as a character}.

BUT, for all the many reasons given in this discussion and several others, having nothing more than such strings is inadequate for many of the sub-processes involved in a typical document-processing system.

A complete analysis of these processes and the information they require is probably not feasible, and certainly not useful, since they change rapidly with the requirements of document processing. The only thing that can be said is that further information about these strings must be accessible to the sub-processes of the system.

Therefore, rather than attempting to categorise the necessary information and devise suitable ways to provide it, Frank and I came up with the idea of simply supplying a single logical label for every ICR string. Since the first, and still by far the most diverse, parts of this information came from the needs of multi-lingual documents, we called this label the `language' (maybe not a good choice).

Our thesis is that `every text string must have a language-label'.
The only property these labels need (and indeed are able) to have is that they \emph{can} help any application or sub-process to access the information it needs to process that text string.

From this it follows that all applications and sub-processes must also ensure that this language-label is preserved with all text strings to which it applies (ie no text string ever goes anywhere without its language-label).

In a (near) mono-lingual document it may not be necessary to supply the label explicitly if that label is the document's main language, but this is a (possibly non-robust) implementation efficiency.

[In order to distinguish these logical language-labels from anything else in the TeX world let us call them LLLs.]

In the context of current TeX-related systems this means that:

-- whenever a character token list (in an ICR) is constructed or moved, then its LLL must go with it.

Part the Second:

So what does this mean for an ICR for a potential TeX/Omega?

First the simple case, in which everything that is encoded in the ICR is `strongly internal': it is never seen in external files or passed to external applications.

It means that all ICR strings, and hence all token lists potentially containing text, must have an LLL.

Thus to take one small example, based on Javier's:

    \newcommand{\foo}{\unichar{<Unicode code>}}

is what is needed simply to get something that expands to the single ICR character:

    \unichar{<Unicode code>}

or it may expand further to a single (Unicode-encoded character) token (the choice being made according to whether one is still restricted to 8-bit internals etc).

Since this character is used in widely distinct languages, in order to use this within Mandarin you would then, I guess, need to put something like this in your document:

    \begin{mandarin} ... \foo{} ... \end{mandarin}

or

    \mandarintext{... \foo{} ...}.
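A minimal sketch of these two forms, purely for illustration. Everything here is an assumption rather than existing LaTeX or Omega interface: \currentLLL is a hypothetical macro holding the label in force, \unichar is a stand-in that merely prints the code point alongside that label, and 4E2D is an arbitrary example code point.

```latex
\documentclass{article}

% Hypothetical: \currentLLL holds the LLL in force for the surrounding text.
\newcommand{\currentLLL}{english}

% Hypothetical stand-in for \unichar: it only prints the code point together
% with the label in force, so that the association is visible in the output.
\newcommand{\unichar}[1]{\texttt{[U+#1/\currentLLL]}}

% Environment form: the label applies until \end{mandarin}.
\newenvironment{mandarin}{\renewcommand{\currentLLL}{mandarin}}{}

% Command form: the label applies to the argument only.
\newcommand{\mandarintext}[1]{{\renewcommand{\currentLLL}{mandarin}#1}}

% The example macro from the text, with an illustrative code point.
\newcommand{\foo}{\unichar{4E2D}}

\begin{document}
\foo{}                                  % main-language label
\begin{mandarin} \foo{} \end{mandarin}  % environment form
\mandarintext{\foo{}}                   % command form
\end{document}
```

Both forms rely on TeX grouping, so the label reverts when the environment or argument ends. Note that the label here is only in force around the text; it is not attached to the token list itself, which is what the thesis really asks for.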
But if you want \foo to be exclusively a bit of Mandarin text then you could (or even should) define something like (syntax is probably dreadful):

    \newcommand{\foo}{\languageIC{mandarin}{\unichar{<Unicode code>}}}

How clever the expansion of \languageIC needs to be will depend on how such input will be used.

Part the Third:

All of the above is completely independent of what input scheme is used, except that defining things like \foo defines part of a 7-bit input scheme.

All such inputs, and all other input encodings, must at first input time be normalised to their unique representation in the ICR(*). To this ICR representation of a text string, an LLL will need to be added, explicitly or implicitly.

IMPORTANT: After that first-time input conversion the input encoding that was used is unknown and not needed; this is a vital property of our ICR model.

(*) This could be done by several methods, depending on the application doing it; within Omega the following will, we hope, be available: input-translation; expansion; character-token-list OTPs.

Part the Fourth:

Unfortunately, in a multi-pass system such as LaTeX the LICR is not `strongly internal': information that is internal to LaTeX must be stored in external files.

If these could be made effectively internal then this would not be a problem, but there are two difficulties with that approach:

-- although not explicitly, current LaTeX makes these files readable and editable (at least by English users:-);

-- very similar information is written out for use by other applications.

[Note that here I am not thinking of writing to the terminal since that is a one-way process and the output is useless if not immediately comprehensible (it is thus a really tricky problem). The status of the log file is unclear (I would perhaps treat it like the terminal).]

This leads directly to these questions (two, since different choices may be needed):

1. How should LICR strings be written out to files used only by LaTeX itself?

2. How should LICR strings be written out to files read by other applications?

My feeling is that the answer to 1. should, if possible, be something independent of any input schemes in use. It is not so clear that this is possible for 2., and there may be good reasons why these two outputs should be the same.

Part the Fifth:

So have I removed the question "do we need to record the input encoding?" Or merely cleverly hidden it?
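As a footnote to question 1 of Part the Fourth, here is one possible shape for an answer, offered only as a sketch. The file extension .lll and the names \lllfile and \writelabelled are hypothetical, \unexpanded assumes an e-TeX-capable engine, and 4E2D is an arbitrary example code point. The idea is simply that the label is written out together with the unexpanded string, so that the two cannot be separated on the way through the file.

```latex
\documentclass{article}

% Hypothetical file and macro names; none of this is existing LaTeX interface.
\newwrite\lllfile
\immediate\openout\lllfile=\jobname.lll

% \writelabelled{<LLL>}{<ICR string>}: record the string and its label
% together, without expanding the string.
\newcommand{\writelabelled}[2]{%
  \immediate\write\lllfile{\string\languageIC{#1}{\unexpanded{#2}}}}

\begin{document}
% \unichar is never executed here, only written, so it need not be defined.
\writelabelled{mandarin}{\unichar{4E2D}}
\immediate\closeout\lllfile
\end{document}
```

Whether the same form would also serve for question 2 is exactly the open point: another application would have to understand both the \languageIC wrapper and the \unichar notation.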