On Wed, Oct 12, 2011 at 01:19:24PM -0400, Bruno Le Floch wrote:

> > Most important for PDF strings:
> >
> > * PDFDocEncoding
> > * UTF-16
> >
> > (hyperref also uses "ascii-print" in case of XeTeX because of
> > encoding problems with \special.)
> Can you elaborate on the encoding problems that XeTeX has? (a link to
> an explanation would be great)

AFAIK that is exactly the problem: there is no explanation
(= documentation/specification) on the XeTeX side. In most cases
the contents of \special are expected to be bytes; converting
8-bit bytes to UTF-8 destroys the contents.
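As an illustration (a Python sketch, not XeTeX code), reinterpreting
raw 8-bit bytes as text and re-encoding them as UTF-8 changes every
byte above 0x7F, which is exactly what corrupts a byte-oriented
\special payload:

```python
import binascii

# Hypothetical illustration: a \special payload is a byte stream.
# If each 8-bit byte is reinterpreted as a Unicode code point and
# re-encoded as UTF-8, every byte >= 0x80 grows to two bytes.
payload = bytes([0x28, 0xE9, 0x29])            # 3 bytes: "(", 0xE9, ")"
mangled = payload.decode("latin-1").encode("utf-8")
print(binascii.hexlify(payload).decode())      # 28e929
print(binascii.hexlify(mangled).decode())      # 28c3a929
```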

> >> I guess that most "iso-..." and "cp..." encodings are an overkill for
> >> a kernel.
> >
> > They should be loadable as files similar to LaTeX's .def files
> > for inputenc or fontenc. Then the kernel can provide a base set
> > and others can be provided by other projects. But I don't see
> > the disadvantage if such a base set is not minimal.
> Putting non-basic encodings in .def files is probably the best
> approach indeed. The time it takes for me to code all that is probably
> the only disadvantage, since that means postponing the floating point
> module.

I wouldn't do it manually. There are mapping files for Unicode;
in a project I am using these mappings together with a Perl script
to generate the .def files.

> > * String escaping, provided by \pdfescapestring.
> > * Name escaping, provided by \pdfescapename.
> > * Hex strings, provided by \pdfescapehex.
> I've coded all three already, based on one of your previous mails to
> LaTeX-L (when, two years ago, Joseph had mentioned strings here).


> > The latter is also useful for other contexts, e.g. for protecting
> > arbitrary string data in auxiliary files.
> It is definitely a safe way of storing data, but is it the most
> efficient?

Of course not; the size is 2N.
* All safe characters could be used, then the size decreases
  (e.g. ASCII85, ...). But the problem is finding safe characters;
  in particular, this set might change.
* Some kind of compression could be applied.

Using hex strings is just simple, fast and easy to implement.
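For comparison, a rough sketch in Python of the size overhead
(hexlify is the stdlib analogue of \pdfescapehex; base85 is one of
the denser "safe character" schemes mentioned above):

```python
import base64
import binascii

data = bytes(range(256))               # arbitrary binary string data, N = 256
hexed = binascii.hexlify(data)         # hex string: size is exactly 2N
b85 = base64.b85encode(data)           # ASCII85-style: size is about 1.25N
print(len(data), len(hexed), len(b85))  # 256 512 320
```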

> Decoding it requires setting the lccode of ^^@ to the
> number found, then \lowercase{\edef\result{\result^^@}} for every
> character, quadratic in the length of the string.

It could even be made expandable in linear time
with a large lookup table (256 entries).

And there is engine support (\pdfunescapehex).
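A sketch in Python (not TeX) of the lookup-table idea: precompute a
table for all 256 byte values once, then each hex pair is decoded by
a single lookup, so the whole string is processed in one linear pass
instead of one \lowercase round trip per character:

```python
# 256-entry lookup table; this sketch assumes lowercase hex input.
TABLE = {"%02x" % n: n for n in range(256)}

def unhexlify(s):
    # One table lookup per hex pair -> linear in the string length.
    return bytes(TABLE[s[i:i + 2]] for i in range(0, len(s), 2))

print(unhexlify("48656c6c6f"))   # b'Hello'
```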

> A safe format where
> more characters are as is seems possibly faster? Also, this doesn't
> allow storage of Unicode data (unless we use an UTF, but the overhead
> of decoding the UTF may be large). Do you think we could devise a more
> efficient method?

I think that depends on the Unicode support of the engine.

> Are there cases where many ^^@ (byte 0) must be output in a row? If
> not, we can lowercase characters other than ^^@ to produce the
> relevant bytes, for instance
> \lccode0=... \lccode1=... [...] \lccode255=...
> \lowercase{\edef\result{\result^^@^^A...^^?}}
> The most efficient decoding method depends on how long the string is.
> What is the typical length of strings that we should optimize for?

From very short (e.g. label/anchor names, ...) up to very large
(e.g. object stream data of images, ...).

Yours sincerely
  Heiko Oberdiek