As long-time subscribers to this list may recall, I have at times in
the past written about the need for a proper representation of (input)
character strings, because documenting code means one needs to treat
identifiers and other character data as text even though they may contain
all sorts of troublesome characters. Back in 2000 I released the xdoc2
package which has this as one of its functional areas, and it has served me
well, but time has passed and this is one area that has been due for a
modernisation.
What I'm now making a (hopefully brief) "public beta" release of is a
package called `harmless', which is a low-level package containing only
functions for turning user input into "harmless character strings", and
functions for making use of such harmless character strings. The package can
(for the duration of the beta period, at least) be downloaded from
http://www.mdh.se/polopoly_fs/1.57096!/Menu/general/column-content/attachment/harmless.zip
and its typeset documentation is at
http://www.mdh.se/polopoly_fs/1.57095!/Menu/general/column-content/attachment/harmless.pdf
My reason for posting about it here on LATEX-L is twofold. For one, I think
it would be a good idea for the expl3 documenting system to migrate to this
more robust foundation, as I seem to recall there being identifiers here and
there in expl3 that the present system utterly fails to handle correctly.
The second, more immediate reason is however that I'd like a second opinion
as to how well I've managed to follow the expl3 naming conventions; if there
is something I misnamed, then it would be much better to fix it /before/
uploading a v1.0 to CTAN than after (even if that is mostly for the sake of
principles; I don't expect a huge following anytime soon).
And just to be clear: harmless is a LaTeX2e package rather than an expl3
package, and it does not require anything expl3, but it seeks to follow
expl3 coding conventions in the two respects of source catcodes and control
sequence naming. The first is a matter of forward compatibility, as there
are plenty of places in the code where an extra space makes a lot of
difference, and getting all of them right when converting to an
ASCII-space-is-ignored \catcode setting (as would probably happen some day)
would be difficult; better then to use that setting from day one. And if
going that far with spaces and ~, one might as well do all of : and _ too,
and use them rather than @ when naming one's own programming level commands.
(I did not, however, use expl3-style names for TeX primitives and LaTeX2e
commands, because doing a bulk search-and-replace of control sequence names
will instead be pretty straightforward when that day comes.)
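For readers who have not seen such a setting, here is a minimal sketch of a catcode regime of that kind, set up by hand in plain LaTeX2e without loading expl3. It is not taken from harmless: the \demo_greeting:n and \DemoGreeting names are made up for the example, and the \endlinechar line is an extra assumption (it mimics what expl3 does so that line ends are ignored as well).

```latex
\documentclass{article}
\begingroup
  \catcode32=9\relax    % ASCII space is ignored
  \catcode126=10\relax  % ~ acts as a space
  \catcode58=11\relax   % : is a letter
  \catcode95=11\relax   % _ is a letter
  \endlinechar=32\relax % line ends become (ignored) spaces
  % Hypothetical programming-level command, named with : and _ rather than @
  \gdef \demo_greeting:n #1 { Hello,~#1! }
  % Hypothetical document-level wrapper
  \gdef \DemoGreeting #1 { \demo_greeting:n {#1} }
\endgroup
\begin{document}
\DemoGreeting{world}
\end{document}
```

None of the spaces inside the two definitions end up in the replacement text; only the ~ produces a space, so the document typesets "Hello, world!". The group restores the original catcodes, while the \gdef definitions survive it.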
Features of the harmless package include:
* Not dependent on \catcode changes.
* Supports both 8-bit and Unicode character models (and the latter is not
  restricted to the BMP).
* Can convert an arbitrary sequence of tokens into a harmless character
  string.
* Supports both 8-bit and UTF-8 as input encoding (see also below on the
  use of LICR commands as escapes for Unicode text).
* Harmless character strings are robust: they can be written to a file and
  then \input again by TeX without distortion.
* Harmless character strings can be typeset by way of the LaTeX internal
  character representation (LICR).
* Harmless character strings can be converted to a number of "data"
  formats, including:
  - PDFText
  - XML character data
  - raw character string
  - x-url-encoding
  - UTF-8 and UTF-16
  - sanitized sequence of TeX character tokens (usable in a \csname)
* Specific commands in text to be turned into harmless character strings
  may be used as escapes for hard-to-type things or meta items. Predefined
  sets of escapes (not active by default, but possible to activate through a
  single command) include:
  - backslash + one of space, #, $, %, &, backslash, ^, {, and } for that
    particular character
  - LICR commands for accents and non-A--Z letters (accents make Unicode
    combining characters)
  - \texorpdfstring
  - accent+base combinations for combining characters
  Users can define additional escapes and meta items, using a convenient
  interface.
* As an advanced feature, the body of an environment may be turned into a
  harmless character sequence, although that has no document-level
  interface.
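For anyone unfamiliar with \texorpdfstring from the list of escapes above, this is its ordinary use with the hyperref package. The example shows only hyperref's standard behaviour; how harmless interprets the command is not shown here and should be looked up in the package documentation.

```latex
\documentclass{article}
\usepackage{hyperref}
\begin{document}
% The typeset heading uses $\alpha$; the PDF bookmark uses the plain
% string "alpha", since bookmarks cannot contain math.
\section{The \texorpdfstring{$\alpha$}{alpha} case}
Body text.
\end{document}
```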
That's not quite all the features, but it covers the bulk of them. So...
comments, anyone?
Lars Hellström
PS: In case someone wonders "Why XML?", I might add that I sort-of promised
that last year, in a proper research paper no less
(http://ceur-ws.org/Vol-1010/paper-22.pdf). That application of the harmless
package now has a working prototype
(http://openmath.org/pipermail/om/2014-March/001835.html), even if it is far
from feature-complete.