Date: | Thu, 23 Feb 2006 19:27:45 +0100 |
I've recently filed the following feature suggestion in the LaTeX bugs
database:
latex/3844 (2006-02-22): UTF-8 sanitation in inputenc
I'm opening this thread in order to stir up some discussion and add
some additional information. The original report was prompted by the
following problem:
---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\begin{document}
^^c3^^a4
\lowercase{^^c3^^a4} % fails
\MakeLowercase{^^c3^^a4}
\end{document}
---------- %< ----------
The letter ä is not a special case; the problem affects the entire
0xC3 subrange, and one could think of similar problems with other
bytes, too.
As mentioned in the report, \MakeUppercase and \MakeLowercase are not
affected by this because of the preprocessing with \protected@edef
and \@uclclist. This will implicitly `decode' the UTF-8 sequence so
that all the primitives ever get to see are LICRs.
If a UTF-8 character is prefixed with \noexpand (or \string or
\protect), however, the raw UTF-8 sequence still gets through to the
primitive. This case might trigger an additional problem: since the
catcodes of consecutive bytes are not sanitized, the second byte
might be an active character which fires within the \[log in to unmask]
My suggestion was: why not set the uppercase and lowercase codes of
all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't
apply to UTF-8 anyway (at least not with an 8-bit engine...), why
take the risk of having it backfire?
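A minimal sketch of what zeroing the codes could look like, assuming
an 8-bit engine. \zerocasecodes is a hypothetical helper name, not
something inputenc provides. The two ranges are the UTF-8
continuation bytes (0x80-0xBF) and lead bytes (0xC2-0xF4):
---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\makeatletter
% Hypothetical helper (not part of inputenc): set \lccode and \uccode
% to zero for every byte from #1 through #2, inclusive.
\newcommand*\zerocasecodes[2]{%
  \@tempcnta=#1\relax
  \loop
    \lccode\@tempcnta=0\relax
    \uccode\@tempcnta=0\relax
  \ifnum\@tempcnta<#2\relax
    \advance\@tempcnta\@ne
  \repeat}
\makeatother
\zerocasecodes{"80}{"BF}% continuation bytes
\zerocasecodes{"C2}{"F4}% lead bytes
\begin{document}
\lowercase{^^c3^^a4}
\end{document}
---------- %< ----------
With the codes at zero, \lowercase and \uppercase leave the raw bytes
alone, so the sequence is passed through unchanged rather than turned
into a different byte sequence.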
There is one thing I didn't mention in the report. Since inputenc may
switch the input encoding mid-stream, the codes would also need to be
restored before a new encoding is initialized. So the issue at stake
is really: should there be a central uc/lccode management in
inputenc?
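To illustrate the bookkeeping this would involve, here is a rough
sketch of saving and restoring the codes of the upper half of the
8-bit range. \savecasecodes and \restorecasecodes are hypothetical
names chosen for this example; nothing like them exists in inputenc:
---------- %< ----------
\makeatletter
% Hypothetical helper: record the current \lccode and \uccode of
% bytes 128..255 as a list of assignments in \restorecasecodes.
\newcommand*\savecasecodes{%
  \def\restorecasecodes{}%
  \@tempcnta=128
  \loop
    \edef\restorecasecodes{\restorecasecodes
      \lccode\the\@tempcnta=\the\lccode\@tempcnta\relax
      \uccode\the\@tempcnta=\the\uccode\@tempcnta\relax}%
  \ifnum\@tempcnta<255
    \advance\@tempcnta\@ne
  \repeat}
\makeatother
% Possible usage around an encoding switch:
\savecasecodes     % before the encoding changes any codes
% ... encoding-specific \lccode/\uccode settings go here ...
\restorecasecodes  % before the next encoding is initialized
---------- %< ----------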
This would also make it possible to fix 8-bit encodings that
primitive case-changing operations currently can't handle. In
8-bit encodings such as latin1, latin9, winansi, etc., there are a
few exceptions to the general rule that the encoding positions of
uppercase and lowercase letters differ by 32. Primitive case-changing
operations will produce surprising results in such cases.
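As a sketch of what an encoding-specific fix might look like, take
latin9, where OE/oe sit at 0xBC/0xBD and Y/y with diaeresis at
0xBE/0xFF. The settings below are my own illustration, not existing
inputenc code, and are kept inside a group to limit side effects:
---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\begingroup
% Assumed fix: pair up the latin9 positions that break the
% "differ by 32" rule.
\lccode"BC="BD \uccode"BD="BC % OE and oe
\lccode"BE="FF \uccode"FF="BE % Y and y with diaeresis
\lowercase{^^bc} \uppercase{^^bd}
\lowercase{^^be} \uppercase{^^ff}
\endgroup
\end{document}
---------- %< ----------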
Here's an example (you may need to recode this for the characters in
the first two columns to come out right):
---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\centering\Large
Default settings:
\begin{tabular}{c@{$\neq$}c@{\hspace{2em}}c@{$\neq$}c}
¼ & \uppercase{½} & ^^bc & \uppercase{^^bd}\\
½ & \lowercase{¼} & ^^bd & \lowercase{^^bc}\\
¾ & \uppercase