On Sat, Mar 04, 2006 at 10:14:16PM +0100, Lars Hellström wrote:
> On Saturday, 4 March 2006, at 17:15, Heiko Oberdiek wrote:
> >Hello,
> >
> >I am interested in a mapping from Unicode to LICR; therefore I
> >should understand what a LICR really is.
> >
> >Literature:
> >[TLC2] Frank Mittelbach et al., The LaTeX Companion, 2nd edition.
> >
> >LICR is an abbreviation for "LaTeX internal character representation"
> >(TLC2, 7.11.1)
> >
> >LaTeX is based on TeX, thus is the following assumption correct?
> >
> >(1) LICR consists of a sequence of one or more TeX tokens.
> >
> >Conclusion:
> >(2) LICR cannot be empty.
> >That would mean that ignoring characters cannot be handled
> >by an empty LICR.
>
> Ignoring a character can't be done by mapping it to the empty token
> sequence, you mean? This would seem to imply that it is important to
> record the fact that there was a character there. Why would one need
> this?
I don't, but this is used in next.def, where 0xFE and 0xFF aren't
part of the NextStep encoding:
\DeclareInputText{254}{}
\DeclareInputText{255}{}
Thus actually an empty "LICR" is used here.
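
A minimal sketch of how this could be observed, assuming next.def is
installed and still contains the two declarations quoted above. The
test document and the use of the ^^fe notation are illustrative
assumptions, not taken from next.def itself:

```latex
\documentclass{article}
% Load the NeXTSTEP input encoding; per the lines quoted above,
% next.def maps bytes 254 and 255 to an empty token list.
\usepackage[next]{inputenc}
\begin{document}
% ^^fe is TeX's notation for the byte 0xFE. If it is mapped to
% nothing, this line should typeset simply as "ab".
a^^feb
\end{document}
```
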
> >Starting at the basics:
> >
> >TLC2, table 7.31 "LICR objects represented with single characters"
> >I am sure about:
> >
> >(3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11
> >This means uppercase and lowercase ASCII letters with catcode 11.
> >
> >(4) LICR-other := 0_12, ..., 9_12,
> > ._12, ,_12, ;_12, :_12, ?_12, !_12, '_12, `_12,
> > *_12, +_12, -_12, =_12,
> > (_12, )_12, [_12, ]_12, /_12, @_12
> >
> >Regarding catcodes: TeX does not differentiate between
> >A_11 and A_12 when the letter A is typeset. Thus, is
> >A_12 also a LICR, and does "A" have more than one LICR?
>
> Hmm... it is probably safe to use them interchangeably (as I recall it,
> there is in ltoutenc.dtx a command for defining text commands that
> would typeset them via tokens whose catcodes are the same for letters
> and symbols, so there is probably no difference in the boxes that are
> generated), but they're not exactly the same. E.g. \ifx would
> distinguish A_11 and A_12.
Yes, for typesetting I don't remember a difference between
catcodes 11 and 12. But the token representations of the LICRs
are different.
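
A small sketch of that difference, for illustration only. The macro
names \letterA and \otherA are made up for this example; \string is
used here only as a convenient way to obtain a catcode-12 "A":

```latex
\documentclass{article}
\begin{document}
% \letterA holds A with catcode 11 (letter), as normally read.
\def\letterA{A}
% \otherA holds A with catcode 12 (other): \string turns a
% character token into the same character with catcode 12.
\edef\otherA{\string A}
% \ifx compares the replacement texts token by token, including
% catcodes, so the two macros are reported as different.
\ifx\letterA\otherA
  \typeout{same tokens}
\else
  \typeout{different tokens}
\fi
% Both are typeset with the same glyph:
\letterA\otherA
\end{document}
```
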
> \mid and \vert are math commands, hence not LICRs. \{ branches
> depending on whether you're in math mode or not, so it is a higher
> level command than the LICR ones.
That means, the command tokens in LICR are limited to
commands defined by the nfss2 \Declare... commands?
> \$ I don't know. I wouldn't want to
> have it as LICR, but I'm not sure what Frank thinks.
\$ is also higher level and not defined by \Declare...
and therefore I would assume no LICR.
> >Thus the entry for U+02C6 in utf8enc.dfu is not really correct:
> > \DeclareUnicodeCharacter{02C6}{\textasciicircum}
> > U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
> >"\^" would be more correct, except that grabbing the
> >argument isn't too trivial in case of utf-8 characters
> >consisting of several bytes.
>
> Aren't you thinking of the COMBINING circumflex accent here?
Yes.
> MODIFIER characters are more phonetic alphabet thingies.
Thanks.
> >Does the en dash have two LICRs, "\textendash" and "--"?
> >
> >What is the LICR of "fi"?
> > U+FB01 LATIN SMALL LIGATURE FI
> >The ligature mechanism depends on the fonts used; "fi" is not
> >always available. What is better?
> > \DeclareUnicodeCharacter{FB01}{\textfi}
> > \ProvideTextCommandDefault{\textfi}{fi}
> >vs.
> > \DeclareUnicodeCharacter{FB01}{fi}
>
> Definitely the latter. As I understand it, these ligatures are in
> unicode mostly for compatibility with legacy encodings (and perhaps for
> font designers who need to assign something to these glyphs). At least
> as far as TeX is concerned, "fi" doesn't carry any semantic information
> different from "f" "i".
Example: Assuming there is a word "deaffish" and the
author does not want a ligature ffi spanning both word parts.
Therefore, having a good editor, he uses the Unicode sequence
U+0066 U+FB01 to specify the correct and desired ligature.
Using the latter case of \DeclareUnicodeCharacter{FB01},
TeX would get "ffi" and then form the wrong ligature.
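
For comparison, a sketch of how the intended result (an "f" followed
by a separate "fi" ligature) can be written by hand in LaTeX source.
It uses the standard \textcompwordmark command and does not show how
a Unicode mapping would or should achieve the same thing:

```latex
\documentclass{article}
\begin{document}
% Plain input: with a font that has an ffi ligature, TeX joins
% "ffi" across the two word parts.
deaffish

% \textcompwordmark (compound word mark) is placed between the two
% parts, so the first "f" stays separate and only "fi" can form a
% ligature.
deaf\textcompwordmark fish
\end{document}
```
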
Yours sincerely
Heiko