## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Subject:

hyphenation morass

From:

Date:

Fri, 9 Feb 2001 16:38:29 +0100

I've just looked again at the hyphenation patterns available for TeX, and every time I do that I'm shocked by what I find (in several respects).

One of the big problems I see is that for most patterns out there it is absolutely not clear for which font encoding they were produced. With very few notable exceptions, all of them are encoded using some sort of hard-coded table, i.e. ^^ notation, so that they are valid only for a single font encoding.

This seems very unfortunate, since if they were stored in a different format it would be possible to apply them to different font encodings. Take, for example, T1 and LY1, both of which contain all the glyphs needed for a number of languages. A pattern set for French, or German, or Polish, or ... should therefore be usable with any font in either encoding. But unfortunately they are not, because they refer to things like

  ^^b9

meaning \'z (this is an example from the plhyph.tex file). If we replaced such patterns with patterns looking like

  .\'c\'z8

we would be able to reuse the patterns for several font encodings, provided the internal LaTeX representations \'c etc. do the right thing within the \patterns command.

Analysing the behaviour of the font-encoding-specific commands, we can see the following:

 - \DeclareTextComposite and \DeclareTextSymbol are fine, as they expand in the mouth (using old TeX terminology) to a font position, which is what we want.
 - But top-level \DeclareTextCommand declarations (such as \L in OT1) are likely a problem, and so, most likely, is anything done via \DeclareTextCompositeCommand.
 - Finally we have \DeclareTextAccent, which is also not suitable by default in a \patterns declaration, since it results in a call to the \accent primitive.

So before discussing what could be done here, let me first explain what is currently being done in some of the hyphenation files.
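The contrast above between a hard-coded slot and a symbolic form can be made concrete with a small sketch. It assumes an 8-bit TeX building a format, read with the usual LaTeX category codes, and it uses the T1 slots "A2 (c-acute) and "B9 (z-acute). The lookup scheme and the names acute@c and acute@z are hypothetical and stand in for whatever the encoding-specific declarations would provide; they are not LaTeX's actual internals.

```latex
% Sketch only: a purely expandable stand-in for the \' composites of T1.
% The names acute@c / acute@z and this lookup scheme are assumptions.
\begingroup
  \lccode"A2="A2 % T1 slot of c-acute; \patterns needs a nonzero \lccode
  \lccode"B9="B9 % T1 slot of z-acute (in case the format has not set it)
  \expandafter\def\csname acute@c\endcsname{^^a2}
  \expandafter\def\csname acute@z\endcsname{^^b9}
  \def\'#1{\csname acute@#1\endcsname}% macro expansion only, no \accent
  \patterns{.\'c\'z8}% stores the same pattern as \patterns{.^^a2^^b98}
\endgroup
```

Because \patterns expands macros while reading its argument, only the two slot definitions would have to change for another encoding that contains the same glyphs; the pattern text itself stays untouched.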
A concept found in several files is to surround potentially problematic patterns with a command \n which, depending on the encoding used, is defined either as

  \def\n#1{}    (i.e. ignore)

or as

  \def\n#1{#1}  (i.e. use)

In other words, you have a hyphenation file, say for German, which can be used with T1 encoding but also with OT1 encoding, by simply removing all patterns which contain references to umlauts or sharp s. I'm not sure that the resulting pattern set is the best possible for an encoding like OT1 (actually I doubt it), but it is certainly accurate enough to be usable. Perhaps Bernd Raichle can comment on this.

The problem with this approach, in my eyes, is that you have to know beforehand which of the patterns are impossible to use in a certain encoding. While this is trivial if you design the file for a fixed (preferably small) number of encodings, it is in practice impossible if you do not know the encodings it should be applied to.

So what would be the alternatives?
========================

Here is my idea, which most likely would need some further refinement.

Suppose we have the pattern

  \patterns{.r\"u8stet}

(which is taken from the German hyphenation file). Suppose further that for each encoding we have defined a code point which is not a letter, say the position of !' (this might be a bad choice, I don't know). For encodings which do not encode all 256 characters we should be able to choose a code point outside the encoding itself.
Let's call this character X. During pattern reading we then map the \add@accent command (which is what is finally used for an internal representation of an accent that is not also a composite) to

  \def\add@accent#1#2{X}

so what we get is the pattern

  \patterns{.rX8stet}

which is an impossible combination (especially if X lies outside the encoding range).

\DeclareTextCommand and \DeclareTextCompositeCommand would need to be handled in a similar fashion (which would require some small changes to the internals of NFSS, since at the moment \DeclareTextSymbol internally calls \DeclareTextCommand etc., which would then not be appropriate).

The downside of this approach is that for encodings which make a large number of patterns invalid this way we unnecessarily store spurious patterns.

On the positive side, I can go and say

  \fontencoding{LY1}\selectfont
  \input german-hyphenation-patterns

and automatically get a set of patterns suitable for LY1.

The reason I bring this up is that if one extends the notion of "current encoding" to something like "current encodings suitable for a language", then one needs to have hyphenation patterns for all language/font combinations (at least that would be desirable).

The technical support for this approach doesn't seem to be very difficult to provide, but I don't know if there would be enough people willing to actually look at the hyphenation files out there and bring them into a suitable source form. The latter would be necessary in my opinion to some extent anyway: there are a number of such files which would result in a very strangely behaving LaTeX format if one actually added them.

frank
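For illustration, the X substitution described in the message could look roughly as follows. This is a sketch under assumptions, not a tested implementation: it presumes an 8-bit TeX building a format with the usual LaTeX category codes, it picks slot "BD as X purely as an example, and it replaces the real \add@accent and \" by simplified local stand-ins. One detail the sketch has to add is that TeX only accepts characters with a nonzero \lccode inside \patterns, so X is given one for the duration of the group.

```latex
% Sketch only: slot "BD as X and the simplified \" are hypothetical choices.
\begingroup
  \catcode`\@=11 % allow @ in control sequence names
  \lccode"BD="BD % X needs a nonzero \lccode while \patterns is read
  \def\add@accent#1#2{^^bd}% an accent that would need \accent becomes X
  \def\"#1{\add@accent{127}{#1}}% stand-in for an encoding lacking u-umlaut
  \patterns{.r\"u8stet}% stored as .rX8stet
\endgroup % the local \lccode setting and the redefinitions end here
```

The pattern data itself is stored globally, so it survives the group, while the temporary \lccode setting does not. If X has \lccode 0 in normal use, it can never be part of a word that TeX tries to hyphenate, and the stored pattern never matches.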