## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Subject:

Re: LaTeX's internal char prepresentation (UTF8 or Unicode?)

From:

Date:

Sat, 17 Feb 2001 14:20:55 +0100

A few people will unfortunately get this posting twice, since it is sent both to LATEX-L and to the Omega developers (several of whom are on LaTeX-L); sorry for that. We thought this advisable as we make a number of suggestions regarding extensions/changes to Omega's character token processing. (Any technical discussion of these suggestions should probably be confined to the Omega developers list, though.)

In the discussion below, LICR stands for LaTeX Internal Character Representation.

---------------------------------------------

Javier,

I use 'we' a lot here because Chris is looking over my shoulder as I type and I am pretending that he agrees with me :-). [meta remark: this was typed by Chris -Frank]

> Which is the purpose of the LICR? Apparently, it's only an
> intermediate step before creating the final output.

That is not at all the way we understand this process. What I'm referring to is the "only" in your statement; clearly it is a step in the sequence from source to final output.

The LICR is the representation that is to be used when, as Chris put it, LaTeX reasons about the character data and manipulates it. Part of this process is rearranging data and adding additional information to it. For example, the collection of a TOC is something we think should happen while LaTeX keeps all its data in LICR form.

As a consequence, we consider that writing to a file with the purpose of reading that file back in has to happen within the LICR context, since only then can LaTeX reprocess this data properly. (There are other forms of writing to files, or to the terminal, where LaTeX should (conceptually) leave the LICR and convert the data to a suitable output representation.)

So our model is something like this:

                        ----- trans C (eg uppercasing)
                        |   |
                        |   |
    (eg 8bit,utf8)      V   |
     é --- trans A --> LICR +------> trans B --> ^^e9
                        ^   |
                        |   |
                        |   |
                        ----- trans D (eg generating TOC)

With current LaTeX based on TeX we have

    trans A = the inputenc method

    LICR    = LaTeX Internal Character Representation:
              a unique representation of characters by 7bit character tokens
              plus expansion-invariant cs-names (which in external files are
              also represented by 7bit strings)

    trans B = fontenc translations when typesetting (producing hlists)

    trans C = \MakeUppercase etc

    trans D = writing to .aux files and reading them back in,
              putting things into marks and manipulating them, etc

So entering the LICR is done via one process (trans A), then all the reasoning and data manipulation happens within the LICR context, and only at the final stage do we leave the LICR, eg when you typeset something (aka spots on paper :-) or display a message on the terminal.

Due to TeX limitations some of the system output (eg the log file) can't be fully controlled (eg an overfull hbox message displays data from after trans B instead of displaying it with a transformation suitable to the target device).

> can be true in TeX, but not in Omega because the LICR can
> be processed by external tools (spelling, syntax, etc.)
> There are lots of tools using Unicode and very likely there
> will be more in a future. However, there are only a handful
> of tools understanding the current LICR and it's unlikely

It is true that

  a) Omega does offer more general support for manipulating the data
  b) external tools that directly understand the LICR will be few.

> there will be more (they are eventually expanded and therefore
> cannot be processed anyway, the very fact that unicode chars
> are actual 'letter' chars is critical).

That is not our understanding, however. Transformation of the LICR is supposed to happen only when leaving its domain, eg for the final typesetting step.

Having an LICR that consists of Unicode chars clearly makes it simpler for an external tool to manipulate data and send it back to the system; but nothing in general prevents the LICR (as in current LaTeX) from being more than just Unicode characters. All that is needed is to provide the external tool with a translation so that it can understand the data.

> So, having true
> Unicode text (perhaps with tags, which can be removed if
> necessary) at some part of the internal processing is imo
> an essential feature in future extensions to TeX.

Agreed.

> And indeed
> Omega is an extension which can cope with that; I wouldn't like
> renounce that.

We think not (yet), at least not with the code that is currently available (to us, from the texlive 5d CD).

You wrote:

> Omega can represent internally non ascii chars and hence
> actual chars are used instead of macros (with a few exceptions).
> Trivial as it can seem, this difference is in fact a HUGE
> difference.
> For example, the path followed by é will be:
>
>  é --an encoding ocp-|           |-- T1 font ocp--> ^^e9
>                      +-> U+00E9 -+
>  \'e -fontenc (!)----|           |- OT1 font ocp -> \OT1\'{e}

What you are describing there is, in our understanding, effectively a replacement for trans B in our diagram above; ie our understanding of what is currently possible in Omega looks roughly like this:

                         ------                     ------ trans C
                         |    |                     |    |
                         |    |                     |    |
    (8 bit number)       V    | (produce hlist)     V    |
     é --- trans A ---> OICR1 +--- trans B ------> OICR2 +--> trans E
                         ^    |
                         |    |
                         |    |
                         ------ trans D (eg generating TOC)

    trans A = tokenising 8bit numbers as the corresponding 16bit numbers

              Example:
              if é was in the cp437 code page (German DOS) it would be the
              8bit char "82; that would become the 16bit token with number
              "0082 (which is NOT é in Unicode = "00E9)

              if on the other hand é was in latin1 (where it is "E9) we
              get "00E9

              if the input was in utf8 you would not get Unicode chars as
              the result but sequences of 16bit chars all starting with "00;
              so a Unicode character that is multibyte in utf8 would not
              become the correct Unicode 16bit token but would become a
              sequence of tokens, each again starting with "00

    OICR1   = Omega Internal Character Representation 1:
              16bit representation of characters for which one can't tell,
              without additional external information, which character is
              referred to.

    trans B = the process when Omega is producing hlists, ie only when it
              forms paragraphs or hboxes. Only after that point, or rather
              while doing that, can ocps be used to transform the OICR1
              further.
              To turn it into OICR2 one would at this stage apply what you
              called "an encoding ocp" and/or transform commands from the
              LICR, eg \'e, to OICR2, ie a Unicode char. But to be able to
              transform OICR1 into OICR2 you need to have the original
              encoding information still present.

    OICR2   = Omega Internal Character Representation 2:
              16bit representation of characters as Unicode positions (or so
              we hope, if the transformation from OICR1 to OICR2 worked)

    trans C = As an example, \MakeUppercase now works on the OICR2
              representation as an ocp; it is an interesting question whether
              it should happen this late in the process (on typeset stuff).
              (By the way, the primitive \uppercase would, we think, work on
              the tokens before producing the hlist, ie on OICR1 tokens.)

    trans D = writing to .aux files and reading them back in; putting things
              into marks and manipulating them, etc. All these
              transformations work on OICR1, see below.

Discussion:
===========

The problem really is transforms of type D, which use OICR1 and are thus likely to break in the sense that the encoding information is lost in the process. So we think that the translation from external source data in some encoding to the OICR should happen not at trans B via encoding ocps but at trans A, so that OICR1 = OICR2.

Note that this translation from external encoding to OICR would work on streams and not on finite (token) lists, so it should have slightly different characteristics compared with ocps.

We are sure it will be difficult to provide control of such a translation process at trans A if the control is to come from within the source document and be usable by authors and/or packages; eg changing the input encoding midway in an argument could have restrictions similar to those on, say, \catcode changes in such places.

E.g., you couldn't do

    \def\french#1{{\inputencoding{latin1}#1}}

because then for

    bla bla \french{foo} bla bla

the input encoding change would not be noticed until after "foo" has already been tokenised (incorrectly). Yes, we know that this example could be made to work using Don's \footnote trick, but as with LaTeX's \verb there will be situations in which even more elaborate implementations still fail, because tokenisation happens before any macro expansion is possible.

Another problem of the current model seems to be that, even if trans A did the encoding transformation to Unicode (ie we have only a single OICR), transformations of type D (ie transformations of character token strings) can't be controlled by a mechanism similar to the one that is available for transformations of type C. In one case we have ocps; in the other area, when we work on structural issues like building a TOC or arranging data for page representation, no such mechanism is available.

Thus it seems interesting to think about whether or not a similar concept (not necessarily the same!) should be made available for this part of the process. In other words, the concept of ocps makes perfect sense for character string manipulation, but in current Omega one has to [pretend] to typeset something to have them available, while a large amount of document processing is concerned with character string manipulation not related to typesetting at all.

As a small example, when displaying an error message (a transform of type D) one should transform character data from OICR back to the encoding used by the (OS interface to the) display device.

I hope this explains a bit more about our understanding of the LICR and how we think it could be generalised for a system that internally uses Unicode characters and string transformation processes.

cheers
Frank
(with Chris editing and criticising :-)
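
Illustration of the trans A example given earlier in this message (é arriving as cp437 "82, as latin1 "E9, or as utf8). This is only a minimal Python sketch, not Omega code: the function names `naive_trans_a`, `decoding_trans_a` and `show` are hypothetical. The first function models the behaviour described in the message, where each input byte is widened to a 16bit token with the same number and the input encoding is ignored. The second models the suggested alternative, where the translation to Unicode already happens at trans A so that OICR1 = OICR2.

```python
from __future__ import annotations


def naive_trans_a(data: bytes) -> list[int]:
    """Model of trans A as described in the message: every 8bit input byte
    becomes the 16bit token with the same number; the encoding is ignored."""
    return list(data)


def decoding_trans_a(data: bytes, encoding: str) -> list[int]:
    """Assumed alternative: decode the byte stream at trans A, so that every
    resulting token is a Unicode code point."""
    return [ord(ch) for ch in data.decode(encoding)]


def show(tokens: list[int]) -> str:
    """Format tokens in TeX-style hex notation, eg "00E9."""
    return " ".join(f'"{t:04X}' for t in tokens)


E_ACUTE = "\u00e9"  # é

if __name__ == "__main__":
    for enc in ("cp437", "latin-1", "utf-8"):
        raw = E_ACUTE.encode(enc)
        print(enc, "| naive:", show(naive_trans_a(raw)),
              "| decoded:", show(decoding_trans_a(raw, enc)))
```

With the naive mapping the same character yields "0082 for cp437, "00E9 for latin1 and the two tokens "00C3 "00A9 for utf8, so later stages cannot tell which character was meant unless the encoding information is kept alongside. With decoding at the input stage all three sources give "00E9.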