LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Date:
Fri, 24 Feb 2006 11:21:03 +0100
From:
Frank Mittelbach <[log in to unmask]>
I suggested to Philipp that we discuss this here, as I have the feeling that
there are a number of problems with his suggested approach, and I hope to hear
a few more opinions.

let's start with the original problem.
 > ---------- %< ----------
 > \documentclass{minimal}
 > \usepackage[utf8]{inputenc}
 > \begin{document}
 > 
 > ^^c3^^a4
 > 
 > \lowercase{^^c3^^a4} % fails
 > 
 > \MakeLowercase{^^c3^^a4}
 > 
 > \end{document}
 > ---------- %< ----------

using \lowercase or \uppercase in LaTeX is a general problem, which is why
those two commands are explicitly not supported in general context but only in
very well-defined coding where the input to those primitives is known.

LaTeX goes a long way to internally only use LICR sequences which then do not
have any such problem (and which is why \MakeLowercase first turns the input
to LICR before applying the TeX primitive).
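a minimal sketch of the difference (assuming [utf8]{inputenc} and [T1]{fontenc}
are loaded; the glyph is just an example):

---------- %< ----------
% assumes \usepackage[utf8]{inputenc} and \usepackage[T1]{fontenc}
\MakeLowercase{\"A} % LICR input: lowercased on the LICR level to \"a
\MakeLowercase{Ä}   % utf8 input: decoded to the LICR \"A first, then lowercased
%\lowercase{Ä}      % would hand the raw UTF-8 bytes to the primitive
---------- %< ----------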

so one question to ask is: do the scenarios mentioned in:

 > If a UTF-8 character is prefixed with \noexpand (or \string or 
 > \protect), however, the raw UTF-8 sequence still gets through to the 
 > primitive. 

represent valid LaTeX input/coding, or does whatever is being attempted have to
be handled through interfaces designed to work correctly?

To answer this it would be good to explicitly show what kinds of reasons there
would be to \string, \noexpand or \protect some UTF-8 char that then results in
this behavior.

however, that is not to say that LaTeX should not protect against erroneous
input if that can be done in a safe way.

so let's have a look at the suggestions:

 > My suggestion was: why not set the uppercase and lowercase codes of 
 > all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't 
 > apply to UTF-8 anyway (at least not with an 8-bit engine...), why 
 > take the risk of having it backfire?

because ...

lc codes are unfortunately not only used for lowercasing text; they are also
used for hyphenation. More precisely, they are used for hyphenation of the
LICRs that result from changing the UTF-8 input to the final glyph in the font
encoding. Thus if we turned all lc codes for the upper half to zero, say
goodbye to hyphenation of most languages when typeset in the T1 font encoding.
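to illustrate the coupling (the slot "E4 is the T1 position of ä, purely for
illustration):

---------- %< ----------
% \lccode does double duty: it drives the \lowercase primitive AND marks
% a character as usable in hyphenated words
\lccode"E4="E4 % T1 slot of ä: nonzero lccode, so words containing ä hyphenate
\lccode"E4=0   % zeroed: TeX no longer hyphenates any word containing ä
---------- %< ----------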

furthermore
 
 > There is one thing I didn't mention in the report. Since inputenc may 
 > switch the input encoding mid-stream, the codes would also need to be 
 > restored before a new encoding is initialized. So the issue at stake 
 > is really: should there by a central uc/lccode management in 
 > inputenc?

again, lc/uc is not really only a property of the input encoding; it is
foremost a property of the output encoding, due to the unfortunate overloading
with hyphenation. And it goes one step further: the values for that are --- at
least with std TeX --- only looked at at the very end of the paragraph, but
inputenc can be changed in mid-paragraph.

inputenc currently solves this problem by considering the input encoding as
something that is removed as the very first step, by turning chars into LICRs.
From then on, all you deal with is a) 7-bit input, which is transparent to
writing out and reading in, and b) uc/lc on the LICR level, which is then only
dependent on the output encoding.
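a minimal sketch of that first step, using the documented inputenc declaration
(slot 228 is the latin1 position of ä, purely illustrative):

---------- %< ----------
% inputenc's model: remove the input encoding immediately by mapping
% each 8-bit input character to an encoding-independent LICR
\DeclareInputText{228}{\"a} % latin1 byte 228 (ä) -> LICR \"a
% from here on the token stream is 7-bit plus LICR commands, so uc/lc
% handling depends only on the output (font) encoding
---------- %< ----------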

 > This would also make fixes for 8-bit encodings possible which 
 > currently can't be handled by primitive case-changing operations. In 
 > 8-bit encodings such as latin1, latin9, winansi, etc., there are a 
 > few exceptions to the general rule that the encoding positions of 
 > uppercase and lowercase letters differ by 32. Primitive case-changing 
 > operations will produce surprising results in such cases.

they don't, as the case changing is not primitive. they only produce surprising
results if the translation from input encoding to LICR is broken, e.g. because
people used \uppercase rather than \MakeUppercase, or in case they use a font
encoding which doesn't obey the LaTeX requirement of using the only allowed
uc/lc table (which is the one compatible with T1).
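the ß/SS pair is a convenient illustration, since no \uccode table could
express that mapping at all:

---------- %< ----------
% \MakeUppercase operates on LICRs, so it can express mappings that the
% primitive \uccode table cannot
\MakeUppercase{stra\ss e} % -> STRASSE (\ss is mapped to SS)
%\uppercase{stra\ss e}    % the primitive cannot touch the \ss token
---------- %< ----------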

 > Here's an example (you may need to recode this for the characters in 
 > the first two columns to come out right):
 > 
 > ---------- %< ----------
 > \documentclass{article}
 > \usepackage[latin9]{inputenc}
 > \usepackage[T1]{fontenc}
 > \begin{document}
 > \centering\Large
 > 
 > Default settings:
 > 
 > \begin{tabular}{c@{$\neq$}c@{\hspace{2em}}c@{$\neq$}c}
 > Œ & \uppercase{œ} & ^^bc & \uppercase{^^bd}\\
 > œ & \lowercase{Œ} & ^^bd & \lowercase{^^bc}\\
 > Ÿ & \uppercase
 > 

precisely: it uses the unsupported \lowercase and \uppercase and would not show
any defect if using \MakeLowercase and \MakeUppercase.

so my feeling here is

 a) that's not the way to improve the situation
 b) the problem really only exists because of using those two primitives which
    are explicitly forbidden in LaTeX
 c) that the model used by inputenc to manage this is actually fine
 d) what could be improved is to set the chars involved in UTF-8 to catcode 12
    while that encoding is active; however, whether that is really worth the
    effort is doubtful as, so far, I only see this guarding against incorrectly
    coded input or packages
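for what it's worth, a heavily hedged sketch of what (d) might look like;
whether this interacts safely with the active-character setup of utf8.def is
exactly the open question:

---------- %< ----------
% sketch of suggestion (d): while utf8 is active, demote the UTF-8
% continuation bytes (0x80-0xBF) to catcode 12 ("other") so that a stray
% byte reaching TeX unexpanded fails loudly instead of being mis-cased
\makeatletter
\count@="80
\loop\ifnum\count@<"C0
  \catcode\count@=12 %
  \advance\count@\@ne
\repeat
\makeatother
---------- %< ----------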

comments welcome
frank
