## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


 Subject: Re: Strings, and regular expressions From: Bruno Le Floch <[log in to unmask]> Reply To: Mailing list for the LaTeX3 project <[log in to unmask]> Date: Thu, 13 Oct 2011 05:56:14 -0400 Content-Type: text/plain Parts/Attachments: text/plain (71 lines)
```
> Afaik that is exactly the problem, that there is no explanation
> (= documentation/specification) on the XeTeX side. In most cases
> the contents of \special expects bytes, and converting those 8-bit
> bytes to UTF-8 destroys the contents.

Right, so I'll have to look at the precise usage you make of this.

>> Putting non-basic encodings in .def files is probably the best
>> approach indeed. The time it takes for me to code all that is probably
>> the only disadvantage, since that means postponing the floating point
>> module.
>
> I wouldn't do it manually. There are mapping files for Unicode:
>   http://unicode.org/Public/MAPPINGS/
> In a project I am using these mappings together with a perl script
> to generate the .def files.

Thank you for the link. It seems that the simplest approach would be
to use the tables provided there directly as the .def files: set
\catcode`\#=14 and a few other default catcodes, then input the
file, looping over the lines. Are all of the lines of the form

0xHH    0xHHHH    # comment

(or comment lines), where H is a hexadecimal digit? In other words,
are all those encodings 8-bit only, mapping only to Unicode code
points below 65536?

> Of course not; the size is 2N.
> * All safe characters could be used, and then the size decreases
>   (e.g. ASCII85, ...). But the problem is to find safe characters.
>   In particular, this set might change.
> * Some kind of compression could be applied.

Right. I was mostly thinking of speed, with my comments on \lowercase.
Space is not an issue internally to a TeX run (unless you start
manipulating really massive strings). It can be a problem when writing
to the PDF file.

> It could even be made expandable in linear time
> with a large lookup table (256 entries).

Right. I was thinking in terms of UTF-8 for some reason, and the
lookup table would be too big.

> And there is engine support (\pdfunescapehex).

Good.

>> A safe format where
>> more characters are left as-is might be faster? Also, this doesn't
>> allow storage of Unicode data (unless we use an UTF, but the overhead
>> of decoding the UTF may be large). Do you think we could devise a more
>> efficient method?
>
> I think that depends on the Unicode support of the engine.

Right. I need to give some serious thought to optimization in all
those encoding translations and string storage schemes. Give me a few
weeks to form a good idea.

> Very short (e.g. label/anchor names, ...) up to very huge (e.g. object
> stream data of images, ...).

That's tough, then :).

Regards,
Bruno
```