## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Date: Sat, 4 Mar 2006 23:26:28 +0100
Reply-To: Mailing list for the LaTeX3 project
Subject: Re: LICR objects
From: Heiko Oberdiek

```
On Sat, Mar 04, 2006 at 10:14:16PM +0100, Lars Hellström wrote:

> On Saturday, 4 March 2006 at 17:15, Heiko Oberdiek wrote:
> >Hello,
> >
> >I am interested in a mapping from Unicode to LICR, therefore I should
> >understand what a LICR really is.
> >
> >Literature:
> >[TLC2] Frank Mittelbach et al., The LaTeX Companion, 2nd edition.
> >
> >LICR is an abbreviation for "LaTeX internal character representation"
> >(TLC2, 7.11.1)
> >
> >LaTeX is based on TeX, thus is the following assumption correct?
> >
> >(1) LICR consists of a sequence of one or more TeX tokens.
> >
> >Conclusion:
> >(2) LICR cannot be empty.
> >That would mean ignoring characters cannot be handled
> >by an empty LICR.
>
> Ignoring a character can't be done by mapping it to the empty token
> sequence, you mean? This would seem to imply that it is important to
> record the fact that there was a character there. Why would one need
> this?

I don't, but this is used in next.def, where 0xFE and 0xFF aren't
part of the NextStep encoding:
  \DeclareInputText{254}{}
  \DeclareInputText{255}{}
Thus actually an empty "LICR" is used here.

> >Starting at the basics:
> >
> >TLC2, table 7.31 "LICR objects represented with single characters"
> >I am sure about:
> >
> >(3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11
> >This means uppercase and lowercase ASCII letters with catcode 11.
> >
> >(4) LICR-other := 0_12, ..., 9_12,
> >      ._12, ,_12, ;_12, :_12, ?_12, !_12, '_12, `_12,
> >      *_12, +_12, -_12, =_12,
> >      (_12, )_12, [_12, ]_12, /_12, @_12
> >
> >Regarding catcodes: TeX does not differentiate between
> >A_11 or A_12, if the letter A is typeset. Thus is
> >A_12 also a LICR and does "A" have more than one LICR?
>
> Hmm... it is probably safe to use them interchangeably (as I recall it,
> there is in ltoutenc.dtx a command for defining text commands that
> would typeset them via tokens whose catcode are the same for letters
> and symbols, so there is probably no difference in the boxes that are
> generated), but they're not exactly the same. E.g. \ifx would
> distinguish A_11 and A_12.

Yes, for typesetting I don't remember a difference between
catcodes 11 and 12. But the token representations of the
LICRs are different.

> \mid and \vert are math commands, hence not LICRs. \{ branches
> depending on whether you're in math mode or not, so it is a higher
> level command than the LICR ones.

That means, the command tokens in LICR are limited to
commands defined by the nfss2 \Declare... commands?

> \$ I don't know. I wouldn't want to
> have it as LICR, but I'm not sure what Frank thinks.

\$ is also higher level and not defined by \Declare...
and therefore I would assume no LICR.

> >Thus the entry for U+02C6 in utf8enc.dfu is not really correct:
> >  \DeclareUnicodeCharacter{02C6}{\textasciicircum}
> >  U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
> >"\^" would be more correct, except that grabbing the
> >argument isn't too trivial in case of utf-8 characters
> >consisting of several bytes.
>
> Aren't you thinking of the COMBINING circumflex accent here?

Yes.

> MODIFIER characters are more phonetic alphabet thingies.

Thanks.

> >Does the en dash have two LICRs, "\textendash" and "--"?
> >
> >What is the LICR of "fi"?
> >  U+FB01 LATIN SMALL LIGATURE FI
> >The ligature mechanism depends on the used fonts, "fi" is not
> >always available. What is better?
> >  \DeclareUnicodeCharacter{FB01}{\textfi}
> >  \ProvideTextCommandDefault{\textfi}{fi}
> >vs.
> >  \DeclareUnicodeCharacter{FB01}{fi}
>
> Definitely the latter. As I understand it, these ligatures are in
> unicode mostly for compatibility with legacy encodings (and perhaps for
> font designers who need to assign something to these glyphs). At least
> as far as TeX is concerned, "fi" doesn't carry any semantic information
> different from "f" "i".

Example: Assuming there is a word "deaffish" and the author does
not want a ligature ffi spanning both word parts. Therefore,
having a good editor, he uses the Unicode sequence U+0066 U+FB01
to specify the correct and desired ligature.
Using the latter case of \DeclareUnicodeCharacter{FB01},
TeX would get "ffi" and then form the wrong ligature.

Yours sincerely
  Heiko
```
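
The remark in the thread that `\ifx` would distinguish A_11 from A_12, although both typeset the same glyph, can be checked with a small test file. The following is only a minimal sketch and not part of the original message: the macro names `\Aletter` and `\Aother` are hypothetical, and it relies on the TeX primitive `\string` returning a character token with catcode 12.

```latex
\documentclass{article}
\begin{document}

% \Aletter holds the token A with catcode 11 (letter), as read normally.
\def\Aletter{A}

% \string applied to a character token returns that character with
% catcode 12 (other), so \Aother holds A_12. (\Aother is a hypothetical name.)
\edef\Aother{\string A}

% Both macros typeset the same glyph.
Typeset: \Aletter\ and \Aother.

% \ifx on two macros compares their replacement texts token by token,
% including category codes, so this test should take the \else branch.
\ifx\Aletter\Aother
  The two token lists are identical.
\else
  The two token lists differ.
\fi

\end{document}
```

Under these assumptions the output should show two identical letters A followed by the sentence from the `\else` branch, which matches the point made in the thread: identical in the typeset result, different as token representations.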