LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Subject:	[latex/3844] uc/lccode controls in inputenc?
From:	Philipp Lehman <[log in to unmask]>
Reply To:	Mailing list for the LaTeX3 project <[log in to unmask]>
Date:	Thu, 23 Feb 2006 19:27:45 +0100
Content-Type:	text/plain
Parts/Attachments:	text/plain (73 lines)

I've recently filed the following feature suggestion in the LaTeX bugs
database:

	latex/3844 (2006-02-22): UTF-8 sanitation in inputenc

I'm opening this thread in order to stir up some discussion and add
some additional information. The original report was prompted by the
following problem:

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\begin{document}

^^c3^^a4

\lowercase{^^c3^^a4} % fails

\MakeLowercase{^^c3^^a4}

\end{document}
---------- %< ----------

The letter ä is not a special case; the problem affects the entire
0xC3 subrange, and one could think of similar problems with other
bytes, too.

As mentioned in the report, \MakeUppercase and \MakeLowercase are not
affected by this because of the preprocessing with \protected@edef
and \@uclclist. This will implicitly `decode' the UTF-8 sequence so
that all the primitives ever get to see are LICRs.

If a UTF-8 character is prefixed with \noexpand (or \string or
\protect), however, the raw UTF-8 sequence still gets through to the
primitive. This case might trigger an additional problem: since the
catcodes of consecutive bytes are not sanitized, the second byte
might be an active character which fires within the \[log in to unmask]

My suggestion was: why not set the uppercase and lowercase codes of
all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't
apply to UTF-8 anyway (at least not with an 8-bit engine...), why
take the risk of having it backfire?
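
A minimal sketch of what zeroing the codes would achieve, under some
assumptions: the wrapper name \SafeLowercase is made up for this
illustration and is not an inputenc command, and the range 0x80-0xFF
is a deliberate over-approximation of the bytes UTF-8 actually uses.
The codes are cleared only inside a group, because TeX also consults
\lccode when hyphenating, so a global change could have side effects
elsewhere.

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\makeatletter
% \SafeLowercase is a hypothetical name, not part of inputenc.
\newcommand*\SafeLowercase[1]{%
  \begingroup
  \count@="80 % assumed range: every byte from 0x80 to 0xFF
  \loop
    \lccode\count@=\z@ % lccode 0: \lowercase leaves this byte alone
  \ifnum\count@<"FF
    \advance\count@\@ne
  \repeat
  % \lowercase converts its argument first; the \endgroup inside
  % then restores the previous lccodes before typesetting.
  \lowercase{\endgroup#1}}
\makeatother
\begin{document}

% ASCII letters are lowercased, the UTF-8 bytes pass through intact
\SafeLowercase{ABC ^^c3^^a4}

\end{document}
---------- %< ----------

With this setup the primitive no longer mangles the byte sequence,
but it also does not change the case of the non-ASCII letter; that
still requires \MakeLowercase.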



There is one thing I didn't mention in the report. Since inputenc may
switch the input encoding mid-stream, the codes would also need to be
restored before a new encoding is initialized. So the issue at stake
is really: should there be a central uc/lccode management in
inputenc?
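
One way such save-and-restore bookkeeping might look is sketched
below. All \ie@... names are hypothetical and exist nowhere in
inputenc, and the sketch assumes that only the codes of bytes 0x80 to
0xFF need to be tracked. \ie@savecodes records the current codes as a
list of assignments in \ie@restorecodes, which can be executed before
the next encoding is set up.

---------- %< ----------
\documentclass{minimal}
\makeatletter
% All \ie@... names are hypothetical; inputenc defines none of them.
\newcommand*\ie@savecodes{%
  \begingroup
  \def\ie@saved{}%
  \count@="80 % assumed range: bytes 0x80 to 0xFF
  \loop
    % append "\lccode<n>=<code>\relax\uccode<n>=<code>\relax"
    \edef\ie@saved{\ie@saved
      \lccode\the\count@=\the\lccode\count@\relax
      \uccode\the\count@=\the\uccode\count@\relax}%
  \ifnum\count@<"FF
    \advance\count@\@ne
  \repeat
  % carry the collected assignments out of the group
  \expandafter\endgroup
  \expandafter\def\expandafter\ie@restorecodes\expandafter{\ie@saved}}

\ie@savecodes    % before an encoding changes any codes
\lccode"C3=\z@   % stand-in for encoding-specific settings
\typeout{lccode of 0xC3 while modified: \the\lccode"C3}
\ie@restorecodes % before the next encoding is initialized
\typeout{lccode of 0xC3 after restoring: \the\lccode"C3}
\makeatother
\begin{document}
\end{document}
---------- %< ----------

The two \typeout lines write the code of byte 0xC3 to the log, so the
effect of the restore step can be checked there.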



This would also make fixes possible for 8-bit encodings which
currently can't be handled by primitive case-changing operations. In
8-bit encodings such as latin1, latin9, winansi, etc., there are a
few exceptions to the general rule that the encoding positions of
uppercase and lowercase letters differ by 32. Primitive case-changing
operations will produce surprising results in such cases.

Here's an example (you may need to recode this for the characters in
the first two columns to come out right):

---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\centering\Large

Default settings:

\begin{tabular}{c@{$\neq$}c@{\hspace{2em}}c@{$\neq$}c}
Œ & \uppercase{œ} & ^^bc & \uppercase{^^bd}\\
œ & \lowercase{Œ} & ^^bd & \lowercase{^^bc}\\
Ÿ & \uppercase
