On Wed, Oct 12, 2011 at 01:19:24PM -0400, Bruno Le Floch wrote:

> > Most important for PDF strings:
> >
> > * PDFDocEncoding
> > * UTF-16
> >
> > (hyperref also uses "ascii-print" in case of XeTeX because of
> > encoding problems with \special.)
>
> Can you elaborate on the encoding problems that XeTeX has? (a link to
> an explanation would be great)

Afaik that is exactly the problem: there is no explanation
(= documentation/specification) on the XeTeX side. In most cases the
contents of \special are expected to be bytes; converting 8-bit bytes
to UTF-8 destroys the contents.

> >> I guess that most "iso-..." and "cp..." encodings are an overkill for
> >> a kernel.
> >
> > They should be loadable as files similar to LaTeX's .def files
> > for inputenc or fontenc. Then the kernel can provide a base set
> > and others can be provided by other projects. But I don't see
> > the disadvantage if such a base set is not minimal.
>
> Putting non-basic encodings in .def files is probably the best
> approach indeed. The time it takes for me to code all that is probably
> the only disadvantage, since that means postponing the floating point
> module.

I wouldn't do it manually. There are mapping files for Unicode:

  http://unicode.org/Public/MAPPINGS/

In a project I am using these mappings together with a perl script to
generate the .def files.

> > * String escaping, provided by \pdfescapestring.
> > * Name escaping, provided by \pdfescapename.
> > * Hex strings, provided by \pdfescapehex.
>
> I've coded all three already, based on one of your previous mails to
> LaTeX-L (when, two years ago, Joseph had mentioned strings here).

Good.

> > The latter is also useful for other contexts, e.g. for protecting
> > arbitrary string data in auxiliary files.
>
> It is definitely a safe way of storing data, but is it the most
> efficient?

Of course not, the size is 2N (two hex digits per byte).

* All safe characters could be used, then the size decreases
  (e.g. ASCII85, ...). But the problem is to find safe characters.
  Especially, this set might change.

* Some kind of compression could be applied.

Using hex strings is just simple, fast and easy to implement
(see the sketches appended at the end of this mail).

> Decoding it requires setting the lccode of ^^@ to the
> number found, then \lowercase{\edef\result{\result^^@}} for every
> character, quadratic in the length of the string.

It could even be made expandable and linear in time by using a large
lookup table (256 entries). And there is engine support
(\pdfunescapehex).

> A safe format where
> more characters are as is seems possibly faster? Also, this doesn't
> allow storage of Unicode data (unless we use an UTF, but the overhead
> of decoding the UTF may be large). Do you think we could devise a more
> efficient method?

I think that depends on the Unicode support of the engine.

> Are there cases where many ^^@ (byte 0) must be output in a row? If
> not, we can lowercase characters other than ^^@ to produce the
> relevant bytes, for instance
>
> \lccode0=... \lccode1=... [...] \lccode255=...
> \lowercase{\edef\result{\result^^@^^A...^^?}}
>
> The most efficient decoding method depends on how long the string is.
> What is the typical length of strings that we should optimize for?

Very short (e.g. label/anchor names, ...) up to very large (e.g. object
stream data of images, ...).

Yours sincerely
  Heiko Oberdiek
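
For illustration, a minimal plain-TeX sketch of the hex round trip
discussed above; it assumes a pdfTeX that provides the expandable
primitives \pdfescapehex and \pdfunescapehex, and the test string is
arbitrary:

  % Encode a string as a hex string and decode it again (run with pdftex).
  % Both primitives are expandable, so the results can be built in an \edef.
  \edef\hex{\pdfescapehex{Hello, world!}}% two hex digits per input byte
  \edef\back{\pdfunescapehex{\hex}}      % back to the original bytes
  \message{hex = \hex}
  \message{back = \back}
  \bye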
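
And a minimal sketch of the \lccode/\lowercase decoding step described
above, appending one decoded byte per \lowercase call (hence the
quadratic behaviour for long strings); the byte values 72 and 105 are
just examples:

  % Make the carrier character ^^@ (byte 0) usable; in plain TeX it is
  % "ignored" (catcode 9) by default.
  \catcode`\^^@=12
  \def\result{}
  % Append the byte 72 ("H"): give the carrier that lccode and lowercase.
  \lccode`\^^@=72
  \lowercase{\edef\result{\result^^@}}
  % Append the byte 105 ("i") the same way.
  \lccode`\^^@=105
  \lowercase{\edef\result{\result^^@}}
  \message{result = \result}% -> Hi
  \bye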