## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Content-Type: text/plain; charset="iso-8859-15"
Date: Thu, 23 Feb 2006 19:27:45 +0100
Content-Disposition: inline
Reply-To: Mailing list for the LaTeX3 project
MIME-Version: 1.0
Content-Transfer-Encoding: 8bit
Sender: Mailing list for the LaTeX3 project
From: Philipp Lehman
Parts/Attachments: text/plain (73 lines)

I've recently filed the following feature suggestion in the LaTeX bugs database:

  latex/3844 (2006-02-22): UTF-8 sanitation in inputenc

I'm opening this thread in order to stir up some discussion and add some additional information. The original report was prompted by the following problem:

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\begin{document}
^^c3^^a4
\lowercase{^^c3^^a4} % fails
\MakeLowercase{^^c3^^a4}
\end{document}
---------- %< ----------

The letter ä is not a special case; it's a problem across the entire 0xC3 subrange, and one could think of similar problems with other bytes, too. As mentioned in the report, \MakeUppercase and \MakeLowercase are not affected by this because of the preprocessing with \protected@edef and \@uclclist. This will implicitly 'decode' the UTF-8 sequence so that all the primitives ever get to see are LICRs. If a UTF-8 character is prefixed with \noexpand (or \string or \protect), however, the raw UTF-8 sequence still gets through to the primitive. This case might trigger an additional problem: since the catcodes of consecutive bytes are not sanitized, the second byte might be an active character which fires within the \[log in to unmask]

My suggestion was: why not set the uppercase and lowercase codes of all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't apply to UTF-8 anyway (at least not with an 8-bit engine...), so why take the risk of having it backfire?

There is one thing I didn't mention in the report.
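
(As an aside, to make the suggestion above concrete: the following is only a rough user-level sketch of what zeroing the codes could look like. It is not how inputenc is implemented and not a proposed patch. The macro name \zeroutfcasecodes is hypothetical, and covering the whole range 0x80 to 0xFF, rather than only the bytes that can actually occur in UTF-8, is a simplifying assumption.)

---------- %< ----------
\documentclass{minimal}
\usepackage[utf8]{inputenc}
\makeatletter
% Hypothetical helper, not part of inputenc: set \uccode and \lccode
% to zero for every byte from 0x80 (128) to 0xFF (255), so that
% \uppercase and \lowercase leave these bytes untouched.
\newcommand*\zeroutfcasecodes{%
  \count@=128
  \loop
    \uccode\count@=0
    \lccode\count@=0
  \ifnum\count@<255
    \advance\count@\@ne
  \repeat}
\makeatother
\zeroutfcasecodes
\begin{document}
% With both codes at zero, the primitive should pass the two bytes
% through unchanged, so the active character sees a valid sequence.
\lowercase{^^c3^^a4}
\end{document}
---------- %< ----------

Note that \lccode is also consulted by TeX's hyphenation, so a blanket change like this may have side effects with 8-bit font encodings; the sketch ignores that.
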
Since inputenc may switch the input encoding mid-stream, the codes would also need to be restored before a new encoding is initialized. So the issue at stake is really: should there be central uc/lccode management in inputenc? This would also make it possible to fix those 8-bit encodings which currently can't be handled by primitive case-changing operations.

In 8-bit encodings such as latin1, latin9, winansi, etc., there are a few exceptions to the general rule that the encoding positions of uppercase and lowercase letters differ by 32. Primitive case-changing operations will produce surprising results in such cases. Here's an example (you may need to recode this for the characters in the first two columns to come out right):

---------- %< ----------
\documentclass{article}
\usepackage[latin9]{inputenc}
\usepackage[T1]{fontenc}
\begin{document}
\centering\Large
Default settings:
\begin{tabular}{c@{$\neq$}c@{\hspace{2em}}c@{$\neq$}c}
Œ & \uppercase{œ} & ^^bc & \uppercase{^^bd}\\
œ & \lowercase{Œ} & ^^bd & \lowercase{^^bc}\\
Ÿ & \uppercase `