LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

On Tue, Oct 11, 2011 at 11:07:13PM -0400, Bruno Le Floch wrote:

> > hyperref already reencodes bookmark strings with setting
> > pdfencoding=auto. The bookmark string is constructed in
> > Unicode encoding. Then the reencoding to PDFDocEncoding is tried.
> > If successful the result string is used, otherwise the Unicode string.
> > For the reencoding, package stringenc is used, and it doesn't need
> > to be expandable for hyperref.
> 
> Thank you Heiko. The stringenc package provides _many_ different
> encodings. Can you point me to which are useful for pdf purposes? 

Most important for PDF strings:

* PDFDocEncoding
* UTF-16

(hyperref also uses "ascii-print" in case of XeTeX because of
encoding problems with \special.)

> I guess that most "iso-..." and "cp..." encodings are an overkill for
> a kernel.

They should be loadable as files similar to LaTeX's .def files
for inputenc or fontenc. Then the kernel can provide a base set
and others can be provided by other projects. But I don't see
the disadvantage if such a base set is not minimal.

Then, when strings are written to PS/PDF, they need further
escaping:
* String escaping, provided by \pdfescapestring.
* Name escaping, provided by \pdfescapename.
* Hex strings, provided by \pdfescapehex.
The latter is also useful in other contexts, e.g. for protecting
arbitrary string data in auxiliary files. In a hex string, special
characters like '{', '}', '\', '#', ... do no harm.
  These pdfTeX features are provided for LuaTeX by package `pdftexcmds',
and package `pdfescape' provides them for other engines.
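
To illustrate why hex strings are safe for auxiliary files, here is a
minimal Python sketch of the idea behind \pdfescapehex. It is not the
pdfTeX implementation; the function names escape_hex and unescape_hex
are hypothetical, and the uppercase hex digits are an assumption about
the output form.

```python
def escape_hex(data: bytes) -> str:
    """Return two hex digits per input byte (uppercase is an assumption)."""
    return data.hex().upper()


def unescape_hex(text: str) -> bytes:
    """Invert escape_hex: decode pairs of hex digits back into bytes."""
    return bytes.fromhex(text)


# A string full of characters that are special to TeX.
raw = b"{\\foo}#%"
protected = escape_hex(raw)

# The protected form contains only the characters 0-9 and A-F,
# so it can be written to and read back from a file without harm.
print(protected)
print(unescape_hex(protected) == raw)
```

The escaped form is twice as long as the input, which is the price for
containing nothing but hex digits.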

> Also, when you say "Unicode encoding", I presume that this means
> native strings for XeTeX and LuaTeX, but what about pdfTeX? Do you use
> "UTF-16" (if so, LE or BE?), or some other UTF?

In the context of bookmarks and other PDF strings "Unicode"
means UTF-16 (hyperref uses BE, but there is a byte order mark).
And the strings are a sequence of bytes. The big chars of XeTeX or
LuaTeX don't help, because they get written as UTF-8.
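
A rough Python sketch of the byte sequence meant here, together with
the "try the 8-bit encoding first, otherwise Unicode" fallback from
the quoted text. Python has no PDFDocEncoding codec, so the sketch
assumes a simplified stand-in: only printable ASCII (0x20-0x7E) counts
as encodable, although the real PDFDocEncoding covers more characters.
The function names are hypothetical and this is not hyperref's code.

```python
import codecs


def to_utf16be_with_bom(text: str) -> bytes:
    """Encode as UTF-16BE bytes, prefixed with the byte order mark FE FF."""
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")


def encode_pdf_text(text: str) -> bytes:
    """Use the 8-bit form if possible, otherwise fall back to UTF-16BE.

    Assumption: printable ASCII stands in for PDFDocEncoding here.
    """
    if all(0x20 <= ord(ch) <= 0x7E for ch in text):
        return text.encode("ascii")
    return to_utf16be_with_bom(text)


print(encode_pdf_text("Intro"))           # stays a plain 8-bit string
print(encode_pdf_text("\u20ac 5").hex())  # Euro sign forces UTF-16BE
```

In both cases the result is a sequence of bytes, which still needs the
string or hex escaping described above before it is written to the PDF.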

Yours sincerely
  Heiko Oberdiek