As long-time subscribers of this list may perhaps recall, I have at times in the past written about the need for a proper representation of (input) character strings, because documenting code means one needs to treat identifiers and other character data as text even though they may contain all sorts of troublesome characters. Back in 2000 I released the xdoc2 package which has this as one of its functional areas, and it has served me well, but time has passed and this is one area that has been due for a modernisation.

What I'm now making a (hopefully brief) "public beta" release of is a package called `harmless', which is a low-level package containing only functions for turning user input into "harmless character strings", and functions for making use of such harmless character strings. The package can (for the duration of the beta period, at least) be downloaded from

  http://www.mdh.se/polopoly_fs/1.57096!/Menu/general/column-content/attachment/harmless.zip

and its typeset documentation is at

  http://www.mdh.se/polopoly_fs/1.57095!/Menu/general/column-content/attachment/harmless.pdf

My reason for posting about it here on LATEX-L is twofold. For one, I think it would be a good idea for the expl3 documenting system to migrate to this more robust foundation, as I seem to recall there being identifiers here and there in expl3 that the present system utterly fails to handle correctly. The second, more immediate reason is however that I'd like a second opinion as to how well I've managed to follow the expl3 naming conventions; if there is something I misnamed, then it would be much better to fix it /before/ uploading a v1.0 to CTAN than after (even if that is mostly for the sake of principles; I don't expect a huge following anytime soon).

And just to be clear: harmless is a LaTeX2e package rather than an expl3 package, and it does not require anything expl3, but it seeks to follow expl3 coding conventions in the two respects of source catcodes and control sequence naming. The first is a matter of forward compatibility, as there are plenty of places in the code where an extra space makes a lot of difference, and getting all of them right when converting to an ASCII-space-is-ignored \catcode setting (as would probably happen some day) would be difficult; better then to use that setting from day one. And if going that far with spaces and ~, one might as well do all of : and _ too, and use them rather than @ when naming one's own programming level commands. (I did not, however, use expl3-style names for TeX primitives and LaTeX2e commands, because doing a bulk search-and-replace of control sequence names will instead be pretty straightforward when that day comes.)

Features of the harmless package include:

* Not dependent on \catcode changes.
* Supports both 8-bit and Unicode character models (and the latter is not restricted to the BMP).
* Can convert an arbitrary sequence of tokens into a harmless character string.
* Supports both 8-bit and UTF-8 as input encoding (see also below on the use of LICR commands as escapes for Unicode text).
* Harmless character strings are robust: they can be written to a file and then \input again by TeX without distortion.
* Harmless character strings can be typeset by way of LaTeX internal character representation.
* Harmless character strings can be converted to a number of "data" formats, including:
  - PDFText
  - XML character data
  - raw character string
  - x-url-encoding
  - UTF-8 and UTF-16
  - sanitized sequence of TeX character tokens (usable in a \csname)
* Specific commands in text to be turned into harmless character strings may be used as escapes for hard-to-type things or meta items.
  Predefined sets of escapes (not active by default, but possible to activate through a single command) include:
  - backslash + one of space, #, $, %, &, backslash, ^, {, and } for that particular character
  - LICR commands for accents and non-A--Z letters (accents make Unicode combining characters)
  - \texorpdfstring
  - accent+base combinations for combining characters
  Users can define additional escapes and meta items, using a convenient interface.
* As an advanced feature, the body of an environment may be turned into a harmless character sequence. But that doesn't have a document-level interface.

That's not quite all the features, but it covers the bulk of it. So... comments, anyone?

Lars Hellström

PS: In case someone wonders "Why XML?", I might add that I sort-of promised that last year, in a proper research paper no less (http://ceur-ws.org/Vol-1010/paper-22.pdf). That application of the harmless package now has a working prototype (http://openmath.org/pipermail/om/2014-March/001835.html), even if it is far from feature-complete.
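
For readers unfamiliar with the source catcode conventions mentioned above (ASCII space ignored, ~ as the space, : and _ usable in command names), the following is a minimal sketch of how such a regime can be set up in plain LaTeX2e without loading expl3. It is only an illustration under stated assumptions: the command name \demo_show:n is hypothetical and is not taken from the harmless package, whose actual setup code may well differ.

```latex
\documentclass{article}
\begin{document}

% Hypothetical sketch; \demo_show:n is an invented name, not part of harmless.
\begingroup
  \endlinechar=32\relax  % line ends become character 32 (ignored below)
  \catcode`\ =9\relax    % ASCII space: ignored
  \catcode`\~=10\relax   % tilde: ordinary space
  \catcode`\:=11\relax   % colon: letter
  \catcode`\_=11\relax   % underscore: letter
  \gdef \demo_show:n #1 { Argument:~ \texttt {#1} . }
\endgroup

% Outside the group the usual catcodes are back, so reach the macro via \csname.
\csname demo_show:n\endcsname{harmless}

\end{document}
```

Within the group, stray spaces and line ends in the definition do not reach the output, and the only space that gets typeset is the one written as ~. Since \csname accepts character tokens regardless of their category codes, the macro remains reachable from ordinary document catcodes.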