Frank wrote:

> you are looking at the thing from the current omega implementation.

I think this single sentence summarizes the actual situation. And considering
that, I didn't do a bad job after all ;-) Most of your reservations are
related to the way Omega works, and in fact I paused the development of
Lambda just because of that -- it's obvious that I can only work with known
things, and the future Omega is still an unknown.

> LaTeX conceptually has only three levels: source, ICR, Output

However, I still think that it's necessary to separate code processing from
text processing. Both concepts are mixed up by TeX (and therefore LaTeX),
making, say, uppercasing a tricky thing. Remember that \uppercase only
changes chars in the code, and that \MakeUppercase first expands the
argument and then applies a set of transformations (including math, but not
including text hidden in protected macros!). Well, since ocp's are applied
to text after expansion (not including math, but including actual text even
if hidden), we are doing things at the right place and in the right way.

Another problem is whether input encoding belongs to code transformations or
text transformations. Very likely you are right when you say that after full
expansion it's too late and when reading the source file it's too early. An
intermediate step seems more sensible, thus making the \'e stuff discussed
in the recent messages wrong. Another useful addition could be an ocp-aware
variant of \edef (or a similar device).

And regarding font transformations, they should be handled by fonts, but the
main problem is that metric information (ie, tfm) cannot be modified from
within TeX, except for a few parameters; I really wonder if allowing more
changes, mainly ligatures, is feasible (that solution would be better than
font ocp's and vf's, I think).

> my requirement for a usable internal representation is that I can take a
> single element of it at any time and it has a well-defined meaning (and a
> single one).
Semantically or visually? Unicode chars have a well-defined semantic
meaning, but visually (as glyphs) they are undefined, and rendering can be
language dependent.

>> at the LICR level means that the auxiliary files use the Unicode encoding;
>> if the editor is not a Unicode one these files become unmanageable and
>> messy.
>
> not true. the OICR has to be unicode (or more exactly unique and well-defined
> in the above sense, can be 20bits for all i care) if Omega ever should go off
> the ground. but the interface to the external world could apply a well-defined
> output translation to something else before writing. :-/

I meant from the user's point of view. (Perhaps the reply was too quick...)
What I mean is that any LaTeX file ("main" or auxiliary) should follow the
LaTeX syntax in a form closer to the "representation" selected by the user
(by "representation" I mean input encoding and maybe a set of macros).

======

Lars wrote:

> No it wouldn't. If \protect is not \@typeset@protect when \'e is expanded
> then it will be written to a file as \'e.

Right. Exactly because of that we should not convert text to Unicode at this
stage; otherwise we must change the definition depending on the file to be
read. We must only move LaTeX code and its context information without
changing it, so that if it is read correctly in the main file, it will be
read correctly in the auxiliary file.

> Assuming that \InputEncoding is some alias for the \InputTranslation
> primitive, that's roughly what I meant; maybe translation from latin-1 was

Oops! Sorry. \InputTranslation is the right command.

> a bit off the target. OTOH you seem to assume below that \InputEncoding
> should also handle translations which are just as untechnical!!?

Not me. Frank. I'm just pointing out the problem, but Frank seems to be
aware of it.

>> The main problem of it is that it doesn't translate macros:
>>
>>   \def\myE{É}
>>   \InputEncoding <an encoding>
>>   É \myE
>>
>> only the explicit É is transcoded.
> Isn't that a bit like saying "the main problem is that changing the
> \catcode of @ doesn't change the categories of @ tokens in macros"?

It's more like the \uppercase problem above, ie, \lowercase{É\myE} returns
éÉ, very likely not what we wanted.

>> \terminar{enumeraciÛn} % <- that's transcoded using iso hebrew!
>
> But such characters (the Spanish as well as the Hebrew) aren't allowed in
> names in LaTeX!

But they should be allowed in the future if we want a true multilingual
environment.

> E.g. normalization of Unicode is something which should happen on the input
> side, since LaTeX occasionally has a need to determine if two pieces of
> text are equal (cf. the xinitials package).

Agreed. See what I say about \edef above.

> It seems to me that what you are trying to do is to use a modified LaTeX
> kernel which still does 8-bit input and output (in particular: it encodes
> every character it puts onto an hlist as an 8-bit quantity) on top of the
> Omega 16-bit (or whatever it is right now) typesetting engine. Whereas this
> is more powerful than the current LaTeX in that it can e.g. do
> language-specific ligature processing without resorting to
> language-specific fonts, it is no better at handling the problems related
> to _multilinguality_ because it still cannot handle character sets that
> span more than one (8-bit) encoding. How would for example the proposed
> code deal with the (nonsensical but legal) input
> a\'{e}\k{e}\cyrya\cyrdje\cyrsacrs\cyrphk\textmu?

I don't understand why you say that. In fact I don't understand what you say
:-) -- it looks very complicated to me. Anyway, it can handle two-byte
encodings and utf8, and language style files are written using utf8 (which
is directly converted to Unicode without any intermediate step). Regarding
the last line, you can escape the current encoding with the \unichar macro
(which is somewhat tricky to avoid killing ligatures/kerning).
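To make that escape concrete, here is a minimal, hypothetical sketch of how
\unichar might appear in a source file. This is Lambda-specific (it will not
compile under standard LaTeX), and the argument syntax is an assumption
based on the \unichar{"0301} usage in the preliminary implementation:

```latex
% Hypothetical Lambda source sketch -- NOT standard LaTeX.
% \unichar{"XXXX} is assumed to insert the Unicode character U+XXXX
% directly, bypassing whatever input encoding is currently in force.
a\unichar{"0301}   % 'a' followed by COMBINING ACUTE ACCENT (U+0301)
\unichar{"05D0}    % HEBREW LETTER ALEF, unreachable from an 8-bit Latin encoding
```

In the scheme described here, the font ocp would then be responsible for
recombining a base letter and a following combining accent into a single
glyph, which is presumably where the ligature/kerning trickiness comes in.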
As I say in the readme file, applying that trick to utf8 didn't work.
Actually, this preliminary Lambda doesn't convert \'e to é, but to
e U+0301 (ie, the corresponding combining char). In the internal Unicode
step, accents are normalized in this way and then recombined by the font
ocp. The definition of \' in the la.sd file is very simple:

  \DeclareScriptCommand\'[1]{#1\unichar{"0301}}

Very likely, this is one of the parts deserving improvement.

Regards
Javier

___________________________________________________________
Javier Bezos                 | TeX y tipografia
jbezos at wanadoo dot es     | http://perso.wanadoo.es/jbezos/
...........................................................
CervanTeX   http://apolo.us.es/CervanTeX/CervanTeX.html