I've just looked again at the hyphenation patterns available for TeX, and every
time I do that I'm shocked by what I find (in several respects).

One of the big problems I see is that for most patterns out there it is
not at all clear for which font encoding they have been produced.

With very few notable exceptions, all of them are encoded using some sort of
hard-coded table, i.e., ^^ notation, so that they are valid only for a single
font encoding.

This seems very unfortunate, since if they were stored in a different
format it would be possible to apply them to different font encodings.

Take, for example, T1 and LY1, both of which contain all the glyphs needed
for a number of languages. A pattern set for French, or German, or
Polish, or ... should therefore be usable with fonts in either encoding. But
unfortunately it is not, because the patterns refer to things like ^^b9,
meaning \'z (this is an example from the plhyph.tex file).

If we replaced such patterns with patterns looking like .\'c\'z8, we would
be able to reuse the patterns for several font encodings, provided the internal
LaTeX representations \'c etc. do the right thing within the \patterns
command.

Analysing the behaviour of the font-encoding-specific commands, we can see the
following:

 \DeclareTextComposite and \DeclareTextSymbol are fine, as they expand in the
 mouth (to use old TeX terminology) to a font position, which is what we want.

 But top-level \DeclareTextCommand definitions (such as \L in OT1) are likely
 a problem, and so, most likely, is anything done via
 \DeclareTextCompositeCommand.

 Finally, we have \DeclareTextAccent, which is also not suitable by default
 in a \patterns declaration, since it results in a call to the \accent
 primitive.

So before discussing what could be done here, let me first explain what is
currently being done in some of the hyphenation files.

A concept found in several files is to surround potentially problematic
patterns with a command \n which, depending on the encoding used, is defined
either as \def\n#1{} (i.e., ignore) or as \def\n#1{#1} (i.e., use).
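In a pattern file prepared this way, the setup looks roughly like this (the
German pattern shown is just for illustration; real files wrap many patterns):

  % driver, before reading the pattern file:
  \def\n#1{#1}    % T1: keep patterns containing umlauts etc.
  %\def\n#1{}     % OT1: silently drop such patterns instead

  % inside the pattern file:
  \patterns{%
    ...
    \n{.r\"u8stet}%
    ...
  }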

In other words, you have a hyphenation file, say for German, which can be used
with the T1 encoding but also with the OT1 encoding, simply by removing all
patterns that contain references to umlauts or sharp s. I'm not sure whether
the resulting pattern file is the best possible for an encoding like OT1
(actually I doubt it), but it is certainly accurate enough to be usable.
Perhaps Bernd Raichle can comment on this.

The problem with this approach, in my eyes, is that you have to know
beforehand which of the patterns are impossible to use in a certain encoding.
While that is trivial if you design the file for a fixed (preferably small)
number of encodings, it is in practice impossible if you do not know the
encodings it will be applied to.

So what would be the alternatives?

========================

Here is my idea, which most likely needs some further refinement.

Suppose we have the pattern \patterns{.r\"u8stet} (which is taken from the
German hyphenation file).

Suppose further that for each encoding we have defined a code point which is
not a letter, say, the position of `!' (the latter might be a bad choice, I
don't know). For encodings which do not fill all 256 slots, we should be
able to choose a code point outside the encoding itself. Let's call this
character X.

During pattern reading we then map the \add@accent command (which is what is
finally used when an internal representation for an accent is not also a
composite) to

  \def\add@accent#1#2{X}

So what we get is the pattern \patterns{.rX8stet}, which is an impossible
combination (especially if X lies outside the encoding range).
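Putting the pieces together, the reading process would run roughly like this
(a hypothetical sketch, with X standing for whatever dummy slot the encoding
reserves):

  % hypothetical setup while the patterns are being read:
  \def\add@accent#1#2{X}   % every real \accent call collapses to X

  % with T1 active: \"u is a composite and expands to its T1 slot,
  %   so \patterns{.r\"u8stet} stores the intended pattern;
  % with OT1 active: \"u goes through \add@accent, so TeX stores
  %   .rX8stet -- a pattern that can never match.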

\DeclareTextCommand and \DeclareTextCompositeCommand would need to be handled
in a similar fashion (which would require some small changes to the internals
of NFSS, since at the moment \DeclareTextSymbol internally calls
\DeclareTextCommand etc., which would then no longer be appropriate).

The downside of this approach is that for encodings in which a large number
of patterns would become invalid this way, we unnecessarily store spurious
patterns.

On the positive side, I can then go and say

\fontencoding{LY1}\selectfont
\input german-hyphenation-patterns

and automatically get a set of patterns suitable for LY1.

The reason I bring this up is that if one extends the notion of "current
encoding" to something like "current encodings suitable for a language", then
one needs hyphenation patterns for all language/font-encoding combinations
(at least that would be desirable).

The technical support for this approach doesn't seem very difficult to
provide, but I don't know whether there would be enough people willing to
actually look at the hyphenation files out there and bring them into a
suitable source form. The latter would, in my opinion, be necessary to some
extent anyway: there are a number of such files which would result in a very
strangely behaving LaTeX format if one actually added them.

frank