On Friday, 24 February 2006 11:21:03 +0100,
Frank Mittelbach writes:
 > I suggested to Philipp that we discuss this here as I have the feeling that
 > there are a number of problems associated with his suggested approach and I
 > hope to hear a few more opinions.


[...]
 > so let's have a look at the suggestions:
 > 
 >  > My suggestion was: why not set the uppercase and lowercase codes of 
 >  > all bytes used in UTF-8 to zero? The concept of uc/lccodes doesn't 
 >  > apply to UTF-8 anyway (at least not with an 8-bit engine...), why 
 >  > take the risk of having it backfire?
 > 
 > because ...
 > 
 > lc codes are unfortunately not only used for lowercasing text; they are also
 > used for hyphenation. But they are used for hyphenation of the LICRs that
 > result from changing the UTF-8 to the final glyph in the font encoding. Thus if
 > we turned all lc codes for the upper half to zero, goodbye to hyphenation of
 > most languages when typeset in the T1 font encoding.

Some background: when TeX breaks a paragraph into lines and a word has
to be hyphenated, it uses the current values in the \lccode table to
"normalize" all glyph codes to their lowercase codes.  If a glyph has
a zero lccode value, TeX ends the word at that character and tries to
hyphenate only the part before it (cf. tex.web, section
"@<Skip to node |hb|, putting letters into |hu| and |hc|@>",
where |hu| holds the original glyph codes and |hc| the normalized
(= lowercased) codes; the line "if lc_code(c)=0 then goto done3;"
stops the collection of further glyphs for hyphenation).
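
A small plain TeX experiment can make this visible.  It is only a
sketch: the test word and the choice of the letter x are arbitrary,
and the break points reported in the log depend on the loaded patterns
and on \lefthyphenmin and \righthyphenmin.

```tex
% plain TeX; \showhyphens reports the hyphenation points it finds in the log
\showhyphens{hyphenaxtion}   % every letter has a nonzero \lccode

\begingroup
  \lccode`\x=0               % x no longer counts as a letter for hyphenation
  \showhyphens{hyphenaxtion} % only the letters before the x are considered
\endgroup
\bye
```

Comparing the two log entries should show that no break is offered at
or after the x in the second case.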


 > furthermore
 >  
 >  > There is one thing I didn't mention in the report. Since inputenc may 
 >  > switch the input encoding mid-stream, the codes would also need to be 
 >  > restored before a new encoding is initialized. So the issue at stake 
 >  > is really: should there be a central uc/lccode management in 
 >  > inputenc?
 > 
 > again the lc/uc is not really only a property of the inputenc; it is foremost a
 > property of the output encoding due to the unfortunate overloading with
 > hyphenation. And it goes one step further: the values for that are --- at least
 > with std TeX --- only looked at at the very end of the paragraph but inputenc
 > can be changed in mid-paragraph.
[...]

With e-TeX (or pdf-(e)-TeX) instead of standard TeX, the new integer
parameter \savinghyphcodes can be set to a positive value while the
hyphenation \patterns are read to create the format files.  The lccode
values in effect when the \patterns{...} are added are then saved for
the current language.  If saved lccode values exist for a language,
they are used for hyphenation instead of the current \lccode values.

To make this work, the assignment to this new e-TeX parameter has to
be added where the hyphenation patterns are loaded.  In addition, one
has to check the "correct" lccode settings for each language at that time.
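
As a rough sketch of what such a setup might look like when a format
is built: the language number 1 and the file name myhyph are made up
for illustration, the two lccode assignments stand in for a complete
per-language table (here the T1 slots of adieresis and Adieresis), and
an e-TeX-based engine running INITEX in extended mode is assumed.

```tex
% Sketch only: hypothetical language number and pattern file name.
\begingroup
  \language=1            % hypothetical number of the language being loaded
  % lccodes as the patterns should see them, shown for two T1 slots:
  \lccode"E4="E4         % adieresis maps to itself
  \lccode"C4="E4         % Adieresis maps to adieresis
  \savinghyphcodes=1     % positive value: save the lccodes with the patterns
  \input myhyph          % hypothetical file containing \patterns{...}
\endgroup
```

The group keeps the \lccode and \savinghyphcodes assignments local,
while the patterns and the codes saved with them end up in the format.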


But this will not solve the problem for standard TeX.  For TeX the
current LaTeX approach is fine.


-bernd