LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

As long-time subscribers to this list may recall, I have at times in the 
past written about the need for a proper representation of (input) 
character strings: documenting code means treating identifiers and other 
character data as text, even though they may contain all sorts of 
troublesome characters. Back in 2000 I released the xdoc2 package, which 
has this as one of its functional areas. It has served me well, but time 
has passed and this is one area that has been due for modernisation.

I am now making a (hopefully brief) "public beta" release of a package 
called `harmless'. It is a low-level package containing only functions for 
turning user input into "harmless character strings", and functions for 
making use of such harmless character strings. The package can (for the 
duration of the beta period, at least) be downloaded from

http://www.mdh.se/polopoly_fs/1.57096!/Menu/general/column-content/attachment/harmless.zip

and its typeset documentation is at

http://www.mdh.se/polopoly_fs/1.57095!/Menu/general/column-content/attachment/harmless.pdf

My reason for posting about it here on LATEX-L is twofold. For one, I think 
it would be a good idea for the expl3 documenting system to migrate to this 
more robust foundation, as I seem to recall there being identifiers here and 
there in expl3 that the present system utterly fails to handle correctly. 
The second, more immediate reason is however that I'd like a second opinion 
as to how well I've managed to follow the expl3 naming conventions; if there 
is something I misnamed, then it would be much better to fix it /before/ 
uploading a v1.0 to CTAN than after (even if that is mostly for the sake of 
principles; I don't expect a huge following anytime soon).

And just to be clear: harmless is a LaTeX2e package rather than an expl3 
package, and it does not require anything expl3, but it seeks to follow 
expl3 coding conventions in the two respects of source catcodes and control 
sequence naming. The first is a matter of forward compatibility, as there 
are plenty of places in the code where an extra space makes a lot of 
difference, and getting all of them right when converting to an 
ASCII-space-is-ignored \catcode setting (as would probably happen some day) 
would be difficult; better then to use that setting from day one. And if 
going that far with spaces and ~, one might as well do all of : and _ too, 
and use them rather than @ when naming one's own programming level commands. 
(I did not, however, use expl3-style names for TeX primitives and LaTeX2e 
commands, because doing a bulk search-and-replace of control sequence names 
will instead be pretty straightforward when that day comes.)
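
To make the two conventions concrete, here is a minimal sketch of how a 
LaTeX2e package file could set them up by hand, without loading expl3. It 
is not taken from harmless: the package name demo and the commands 
\demo@restore, \demo_show_pair:nn and \DemoPair are hypothetical, chosen 
only for illustration.

```latex
% demo.sty: hypothetical sketch, not code from the harmless package.
\NeedsTeXFormat{LaTeX2e}
\ProvidesPackage{demo}
% Record the current settings so they can be restored at the end.
\edef\demo@restore{%
  \catcode 32=\the\catcode 32\relax
  \catcode 58=\the\catcode 58\relax
  \catcode 95=\the\catcode 95\relax
  \catcode 126=\the\catcode 126\relax
  \endlinechar=\the\endlinechar\relax
}
\catcode 32=9\relax    % ASCII space is ignored
\catcode 126=10\relax  % ~ acts as a space
\catcode 95=11\relax   % _ is a letter
\catcode 58=11\relax   % : is a letter
\endlinechar=32\relax  % line ends become spaces, which are now ignored

% Spaces in the source no longer matter; ~ gives an explicit space.
\newcommand \demo_show_pair:nn [2] { ( #1 ,~ #2 ) }

% Document-level wrapper, since _ and : are not letters in the document.
\newcommand \DemoPair [2] { \demo_show_pair:nn {#1} {#2} }

\demo@restore
\endinput
```

After \usepackage{demo}, a document would call \DemoPair{a}{b}; the name 
\demo_show_pair:nn is only usable while the catcodes above are in force.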

Features of the harmless package include:

  * Not dependent on \catcode changes.
  * Supports both 8-bit and Unicode character models (and the latter is not 
restricted to the BMP).
  * Can convert an arbitrary sequence of tokens into a harmless character 
string.
  * Supports both 8-bit and UTF-8 as input encoding (see also below on the 
use of LICR commands as escapes for Unicode text).
  * Harmless character strings are robust: they can be written to a file and 
then \input again by TeX without distortion.
  * Harmless character strings can be typeset by way of LaTeX internal 
character representation.
  * Harmless character strings can be converted to a number of "data" 
formats, including:
   - PDFText
   - XML character data
   - raw character string
   - x-url-encoding
   - UTF-8 and UTF-16
   - sanitized sequence of TeX character tokens (usable in a \csname)
  * Specific commands in text to be turned into harmless character strings 
may be used as escapes for hard-to-type things or meta items. Predefined 
sets of escapes (not active by default, but possible to activate through a 
single command) include:
   - backslash + one of space, #, $, %, &, backslash, ^, {, and } for that 
particular character
   - LICR commands for accents and non-A--Z letters (accents make Unicode 
combining characters)
   - \texorpdfstring
   - accent+base combinations for combining characters
Users can define additional escapes and meta items through a convenient 
interface.
  * As an advanced feature, the body of an environment may be turned into a 
harmless character sequence, but that doesn't have a document-level 
interface.
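
To illustrate why a sequence of character tokens that is usable in a 
\csname matters, here is a minimal sketch that does not use the package 
at all. It assumes an e-TeX engine (for \detokenize), and the helper names 
\demodefine and \demouse are hypothetical. It is not the harmless 
interface; \detokenize is only a crude stand-in for the conversions listed 
above.

```latex
\documentclass{article}
% Hypothetical helpers: store and fetch a value under an arbitrary key.
% \detokenize turns the key into plain character tokens, so that
% characters such as _ and ^ are safe inside \csname ... \endcsname.
\newcommand\demodefine[2]{%
  \expandafter\def\csname demo:\detokenize{#1}\endcsname{#2}}
\newcommand\demouse[1]{\csname demo:\detokenize{#1}\endcsname}
\begin{document}
\demodefine{foo_bar^baz}{stored value}
\demouse{foo_bar^baz}
\end{document}
```

This simple approach breaks down as soon as the key contains characters 
that cannot be typed directly in an argument under normal catcodes, such 
as % or an unbalanced brace.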

That's not quite all the features, but it covers the bulk of it. So... 
comments, anyone?

Lars Hellström


PS: In case someone wonders "Why XML?", I might add that I sort-of promised 
that last year, in a proper research paper no less 
(http://ceur-ws.org/Vol-1010/paper-22.pdf). That application of the harmless 
package now has a working prototype 
(http://openmath.org/pipermail/om/2014-March/001835.html), even if it is far 
from feature-complete.