
Re: LaTeX's internal char representation (UTF8 or Unicode?)

Date: Sun, 18 Feb 2001 21:18:11 +0100


Just as an input, here is a system for "chef" preprocessing (i.e., to make the input palatable before it reaches TeX's mouth, in order to avoid internal indigestion) that comes to my mind:

Every file that UTeX reads is required to have, in say the first 4 bytes, information about its general encoding, interpretable as ASCII, say (padded with spaces)

    BYTE -- eight-bit (byte) mixed encoding
    UT8  -- UTF8
    UT16 -- UTF16
    U16  -- Unicode 16
    U32  -- Unicode 32

For the last four, no further information is needed, but in the first case, BYTE, one then has a series of lines, each one indicating an encoding that might be used and a start sequence. -- It might be difficult to foresee a suitable start sequence for every possible file, so one could allow individual choices for each file. It could look like:

    ASCII
    Latin-1
    Russian
    ...

(Or whatever official names one decides to have for the different encodings.)

The preprocessor then zips through the file, looking for the indicated character combinations, in this case (7 bit), (Western European), (Russian), etc., and applies the encodings to the characters that follow. So for example, one would have to write

    ...
    \def\bar{bla bla \french{fôô} bla bla}

as \french does not handle character encodings anymore, but only other aspects that might be related to the use of French in TeX.

It is an inconvenience that one has to do this explicit markup, but it only happens if one is mixing encodings. In the example above, ASCII and Latin-1 agree on the bottom 7 bits (i.e., those characters have the same Unicode translation), so the markup would not have been needed if one used Latin-1 all the time.

One spin-off, though, is that it is fairly easy to convert such files to Unicode, once Unicode editors become available. -- If, in the example above, one requires that the ... markup be a part of the \french command, then it becomes more complicated to translate such a file to a Unicode format.

  Hans Aberg