Heiko Oberdiek writes:
> I am interested in a mapping Unicode to LICR, therefore I should
> understand what a LICR really is.
wouldn't we all? :-)
> Literature:
> [TLC2] Frank Mittelbach et al., The LaTeX Companion, 2nd edition.
>
> LICR is an abbreviation for "LaTeX internal character representation"
> (TLC2, 7.11.1)
up front I guess I should say that in writing TLC2 I wasn't attempting to
write a computer science paper and so some of the explanations in there
oversimplify for the sake of presentability. And of course they don't put down
all of our/my thinking about these things.
"character" in LICR is meant in a similar way as characters are used in
Unicode, ie the abstract concept, not the visual "glyph" representation (not
that Unicode is consistent here, but that's a different story)
> LaTeX is based on TeX, thus is the following assumption correct?
>
> (1) LICR consists of a sequence of one or more TeX tokens.
that's true but a fairly trivial conclusion from the fact that TeX is used,
but not a definition of what an LICR is. (eg not every dog is a poodle)
- the main characteristic of an LICR is that it is transparent to visual
representation attributes, ie if X is the/an LICR for the text character foo
then this is true in all circumstances, eg regardless of font changes,
writing to files, reading back in ...
- there is in fact potentially more than one LICR for the same abstract
character
- the model deals with characters for text, it doesn't cover math in
TeX as that has completely different underlying models.
a conclusion from the above, and from the fact that a lot of TeX's typesetting
happens without much intervention (eg char -> glyph is usually handled
by taking the char number as a slot in the current font table), is that if X is
an LICR then it needs to automagically adjust its "meaning" whenever outer
attributes change.
For the chars A-Z and so on this is trivially true by the LaTeX doctrine that
text fonts have to have those characters in the same (ASCII) slot positions
(which makes them simple LICR objects in table 7.31). For everything else it
really means that LICRs have to be font encoding specific commands only.
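to illustrate what "font encoding specific command" means in practice: the
encoding definition files declare the same command once per encoding, roughly
along these lines (slot numbers quoted from memory of ot1enc.def and t1enc.def,
so treat them as illustrative rather than authoritative):

```latex
% the same LICR resolves to a different slot depending on the
% font encoding that is current when it is typeset
\DeclareTextSymbol{\ss}{OT1}{25}
\DeclareTextSymbol{\ss}{T1}{255}
\DeclareTextSymbol{\ae}{OT1}{26}
\DeclareTextSymbol{\ae}{T1}{230}
```

so \ss or \ae keeps meaning the same character after a font encoding change,
which a bare 8-bit slot number would not.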
> Conclusion:
> (2) LICR cannot be empty.
> That would mean ignoring characters cannot be handled
> by an empty LICR.
I have no idea why that would be a conclusion from the first point.
Anyway, assuming there is an abstract character called "nothing here" then yes
it could be represented as an empty LICR as that is transparent as required
(it always does nothing). However, I wouldn't have thought of it as an LICR
so far, and the fact that the next encoding has two such slots with {} on the
right side ... that doesn't prove anything other than that nobody is perfect
and that this encoding was never used or looked at much.
> The variety of TeX tokens is large.
> Are there restrictions?
yes see above
> Starting at the basics:
>
> TLC2, table 7.31 "LICR objects represented with single characters"
> I am sure about:
>
> (3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11
> This means uppercase and lowercase ASCII letters with catcode 11.
>
> (4) LICR-other := 0_12, ..., 9_12,
> ._12, ,_12, ;_12, :_12, ?_12, !_12, '_12, `_12,
> *_12, +_12, -_12, =_12,
> (_12, )_12, [_12, ]_12, /_12, @_12
>
> Regarding catcodes: TeX does not differentiate between
> A_11 or A_12, if the letter A is typeset. Thus is
> A_12 also a LICR and does "A" have more than one LICR?
interesting question whether it means A_11 or just A. in other words: does the
fact that low-level TeX can assign different catcodes have to be carried
through to the higher-level LaTeX model?
is A_12 something that is meaningful in LaTeX? can a LaTeX user generate it to
a degree that it matters? I don't really think so.
so take your pick: either
A_11 is the LICR and A_12 something like an LICR alias that largely behaves
like A_11 but in certain circumstances will be transformed into A_11 (eg
when writing out and reading back in)
or they are both LICRs representing the character A
or A_12 is not a LaTeX concept and programmers making use of that fact are
on their own and have to know how to get from A_12 to A_11
> I interpret the characters below "Not Represented as Characters"
> are not LICRs for all catcodes:
> (5) LICR <> $, ^, _, {, }, #, &, %, \, ~,
> <, >, |, "
the wording "for all catcodes" takes you out of LaTeX again (from a model
perspective). LaTeX doesn't have catcodes, nor does it allow users of LaTeX to
access or change them. the underlying technology may use these things, but
nothing on the model level does.
otherwise yes: none of these are LICRs. the first line because they are
standard LaTeX syntax tokens and not able to represent characters at all in
the official setup, and the second line because they do not have the property
of representing the same character in all situations (eg if you change from
one text font to another)
> There exist other representations of these characters, e.g.
> $: \textdollar, \$
> {: \textbraceleft, \{
> |: \textbar, \mid, \vert
> Are both LICRs or which is? I suspect the \text... versions.
\text... ; as Lars already remarked, \mid and \vert are not even close to LICRs
as they are math commands, completely unrelated to text commands, and do not at
all have the property of representing a char in all text contexts
\$ \{ are a bit more tricky. they are established input methods and eventually
resolve into an LICR when used in a text context.
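a sketch of how that resolution looks, modelled on the kernel definitions
(quoted from memory, so the exact form in latex.ltx may differ):

```latex
% input-level commands: pick the math symbol in math mode,
% otherwise hand over to the text LICR
\DeclareRobustCommand{\$}{\ifmmode\mathdollar\else\textdollar\fi}
\DeclareRobustCommand{\{}{\ifmmode\lbrace\else\textbraceleft\fi}
```

in text, \$ thus ends up as \textdollar, which is the object that carries
the encoding-specific meaning.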
>
> My guesses in case that a character is mapped to one LICR only:
> $: \textdollar (U+0024 DOLLAR SIGN)
> ^: \textasciicircum (U+005E CIRCUMFLEX ACCENT)
> _: \textunderscore (U+005F LOW LINE)
> {: \textbraceleft (U+007B LEFT CURLY BRACKET)
> }: \textbraceright (U+007D RIGHT CURLY BRACKET)
> #: \# (U+0023 NUMBER SIGN)
> &: \& (U+0026 AMPERSAND)
> %: \% (U+0025 PERCENT SIGN)
> \: \textbackslash (U+005C REVERSE SOLIDUS)
> ~: \textasciitilde (U+007E TILDE)
> <: \textless (U+003C LESS-THAN SIGN)
> >: \textgreater (U+003E GREATER-THAN SIGN)
> |: \textbar (U+007C VERTICAL LINE)
yes and no. ideally yes, but the way the model works it is possible to define
more than one LICR representing the same character. we have tried to avoid
this and so far I think in the official set of supported encodings this is not
the case.
> Thus the entry for U+02C6 in utf8enc.dfu is not really correct:
> \DeclareUnicodeCharacter{02C6}{\textasciicircum}
> U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
> "\^" would be more correct, except that grabbing the
> argument isn't too trivial in case of utf-8 characters
> consisting of several bytes.
modifier letters in unicode are a bunch of very special things and aren't
really representable at all in the TeX/LaTeX world ... but that is a different
story.
that particular entry could indeed be wrong
> Next issue: ligatures, e.g.
> U+2013 EN DASH
> utf8enc.dfu: \DeclareUnicodeCharacter{2013}{\textendash}
> What about "--" for the en dash?
> Does the en dash have two LICRs, "\textendash" and "--"?
TLC2 is deliberately a bit vague on this. but in some sense they are two LICRs
representing the same character. the only way to avoid this would be to
disallow -- on input, which seems counterproductive. but it is a boundary case:
eg !` is explicitly mentioned as not being considered an LICR as it isn't
universally supported.
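a small test document for comparing the two inputs (a sketch, assuming a
standard text encoding such as T1 in which the current font supplies the
-- ligature):

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% "--" relies on the font's ligature table,
% \textendash on an encoding-specific declaration
pages 10--20 \par
pages 10\textendash 20
\end{document}
```

with the standard fonts both lines should come out the same; with a font that
lacks the ligature only the \textendash line would be expected to survive.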
> What is the LICR of "fi"?
> U+FB01 LATIN SMALL LIGATURE FI
> The ligature mechanism depends on the used fonts, "fi" is not
> always available. What is better?
> \DeclareUnicodeCharacter{FB01}{\textfi}
> \ProvideTextCommandDefault{\textfi}{fi}
> vs.
> \DeclareUnicodeCharacter{FB01}{fi}
difficult to say. in my opinion Unicode is at fault for making those ligatures
characters and a million others not, so I would probably go with the second
alternative
> At last the remaining tokens are:
> (a) commands, short form (\^, \ , \., ...)
> (b) commands with letter names (\c, \textperiodcentered)
> (c) balanced curly braces with standard catcodes 1 and 2 for
> arguments.
> Is the list complete?
not sure what this list should signify. what can appear in the second arg of
\DeclareUnicodeCharacter?
> Question for (b). All names I found in utf8enc.dfu or the other
> input encoding files usually use A-Za-z only. The exception
> is \@tabacckludge with "@" in the name. Is this correct for all LICRs?
it is correct for all LICRs that they are 7-bit, in the sense that reading and
writing them under different input encodings will not make them break or
change. As LaTeX internally defines @ to be catcode 11 while reading its own
files, @ is allowed as a possibility. However, LICRs are meant to work as safe
input methods for users, which is why all the publicly declared LICRs do not
have @ in their name.
And before anybody asks: \@tabacckludge\' is nothing more than a hack, as \'
isn't technically a proper LICR because of the famous overloading of \' inside
tabbing. A more correct way would have been to define a \textacc... LICR and
make \' and the others point to it outside tabbing. However, it happened
differently, and though there was some discussion on normalizing this, it never
came to that as the current solution worked well enough.
frank