LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

From: Philipp Lehman
Date: Thu, 23 Feb 2006 19:27:45 +0100

I've recently filed the following feature suggestion in the LaTeX bugs
database:

	latex/3844 (2006-02-22): UTF-8 sanitation in inputenc

I'm opening this thread in order to stir up some discussion and add
some further information. The original report was prompted by the
following problem:

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\begin{document}

^^c3^^a4

\lowercase{^^c3^^a4} % fails

\MakeLowercase{^^c3^^a4}

\end{document}
---------- %< ----------

The letter ä is not a special case; it's a problem across the entire
0xC3 subrange, and one could think of similar problems with other
bytes, too.

As mentioned in the report, \MakeUppercase and \MakeLowercase are not 
affected by this because of the preprocessing with \protected@edef 
and \@uclclist. This will implicitly `decode' the UTF-8 sequence so 
that all the primitives ever get to see are LICRs.

If a UTF-8 character is prefixed with \noexpand (or \string or 
\protect), however, the raw UTF-8 sequence still gets through to the 
primitive. This case might trigger an additional problem: since the 
catcodes of consecutive bytes are not sanitized, the second byte 
might be an active character which fires within the \[log in to unmask]

My suggestion was: why not set the uppercase and lowercase codes of 
all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't 
apply to UTF-8 anyway (at least not with an 8-bit engine...), why 
take the risk of having it backfire?
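
A rough sketch of what that could look like. The helper name
\ZeroHighByteCaseCodes is made up for illustration and is not part of
inputenc. For simplicity it covers the whole range "80.."FF rather than
only the bytes that can occur in valid UTF-8. Since \lccode values also
feed into hyphenation, the example keeps the change inside a group:

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\makeatletter
% Hypothetical helper, not part of inputenc: set the \lccode and
% \uccode of every byte from "80 to "FF to zero, so that \lowercase
% and \uppercase leave these bytes unchanged.
\newcommand*\ZeroHighByteCaseCodes{%
  \count@="80
  \loop
    \lccode\count@=\z@
    \uccode\count@=\z@
  \ifnum\count@<"FF
    \advance\count@\@ne
  \repeat}
\makeatother
\begin{document}

% The group keeps the zeroed codes local to this test.
{\ZeroHighByteCaseCodes
 \lowercase{^^c3^^a4}}

\end{document}
---------- %< ----------

With both codes at zero, the primitives pass these bytes through
untouched, so the UTF-8 sequence reaches the inputenc decoder intact.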

There is one thing I didn't mention in the report. Since inputenc may 
switch the input encoding mid-stream, the codes would also need to be 
restored before a new encoding is initialized. So the issue at stake 
is really: should there be a central uc/lccode management in 
inputenc?
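
To illustrate the save/restore part, here is a minimal sketch. The
macro names \SaveHighByteCaseCodes and \RestoreHighByteCaseCodes and
the saved@lc@<n>/saved@uc@<n> storage scheme are assumptions made for
this example. Nothing like this exists in inputenc, and where exactly
such calls would go when the encoding is switched is left open:

---------- %< ----------
\makeatletter
% Hypothetical helper: remember the current \lccode and \uccode of
% every byte from "80 to "FF in macros named saved@lc@<n> and
% saved@uc@<n>, where <n> is the decimal byte value.
\newcommand*\SaveHighByteCaseCodes{%
  \count@="80
  \loop
    \expandafter\edef\csname saved@lc@\the\count@\endcsname
      {\the\lccode\count@}%
    \expandafter\edef\csname saved@uc@\the\count@\endcsname
      {\the\uccode\count@}%
  \ifnum\count@<"FF
    \advance\count@\@ne
  \repeat}
% Hypothetical helper: put the remembered values back. This must only
% be used after \SaveHighByteCaseCodes, otherwise the saved macros are
% undefined and TeX will complain about a missing number.
\newcommand*\RestoreHighByteCaseCodes{%
  \count@="80
  \loop
    \lccode\count@=\csname saved@lc@\the\count@\endcsname\relax
    \uccode\count@=\csname saved@uc@\the\count@\endcsname\relax
  \ifnum\count@<"FF
    \advance\count@\@ne
  \repeat}
\makeatother
---------- %< ----------

The idea would be to save once before the first encoding touches the
codes and to restore right before a new encoding is initialized, so
that every encoding starts from the same state.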

This would also make it possible to fix 8-bit encodings which
currently can't be handled correctly by primitive case-changing
operations. In 8-bit encodings such as latin1, latin9, winansi, etc.,
there are a few exceptions to the general rule that the encoding
positions of uppercase and lowercase letters differ by 32. Primitive
case-changing operations will produce surprising results in such cases.
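
As a sketch of what a per-encoding fix might look like for latin9,
where Œ/œ sit at 0xBC/0xBD and Ÿ/ÿ at 0xBE/0xFF: the helper
\DeclareCasePair is hypothetical and not an inputenc command, and the
settings are scoped to a group because \lccode values also influence
hyphenation.

---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
% Hypothetical helper: declare byte #1 as the uppercase and byte #2
% as the lowercase member of a case pair in the input encoding.
\newcommand*\DeclareCasePair[2]{%
  \uccode#1=#1 \lccode#1=#2 %
  \uccode#2=#1 \lccode#2=#2 }
\begin{document}

{\DeclareCasePair{"BC}{"BD}% OE / oe
 \DeclareCasePair{"BE}{"FF}% Ydieresis / ydieresis
 \uppercase{^^bd} \lowercase{^^bc} \uppercase{^^ff}}

\end{document}
---------- %< ----------

With these pairs in place, the primitives map each byte to its partner
in the input encoding rather than to whatever the default codes
specify.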

Here's an example (you may need to recode this for the characters in 
the first two columns to come out right):

---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\centering\Large

Default settings:

\begin{tabular}{c@{$\neq$}c@{\hspace{2em}}c@{$\neq$}c}
Œ & \uppercase{œ} & ^^bc & \uppercase{^^bd}\\
œ & \lowercase{Œ} & ^^bd & \lowercase{^^bc}\\
 & \uppercase
