Hello,

I am interested in a mapping from Unicode to LICR, so I should
understand what a LICR really is.

Literature:
[TLC2] Frank Mittelbach et al., The LaTeX Companion, 2nd edition.

LICR is an abbreviation for "LaTeX internal character representation"
(TLC2, 7.11.1).

LaTeX is based on TeX, so is the following assumption correct?

(1) A LICR consists of a sequence of one or more TeX tokens.

Conclusion:

(2) A LICR cannot be empty.

That would mean that ignoring a character cannot be handled by an
empty LICR.

The variety of TeX tokens is large. Are there restrictions?
Starting with the basics: TLC2, table 7.31 "LICR objects represented
with single characters". I am sure about:

(3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11

This means uppercase and lowercase ASCII letters with catcode 11.

(4) LICR-other := 0_12, ..., 9_12, ._12, ,_12, ;_12, :_12, ?_12, !_12,
    '_12, `_12, *_12, +_12, -_12, =_12, (_12, )_12, [_12, ]_12, /_12,
    @_12

Regarding catcodes: TeX does not differentiate between A_11 and A_12
when the letter A is typeset. So is A_12 also a LICR, and does "A"
have more than one LICR?

I interpret the characters listed under "Not Represented as
Characters" as not being LICRs, whatever their catcode:

(5) LICR <> $, ^, _, {, }, #, &, %, \, ~, <, >, |, "

There exist other representations of these characters, e.g.

  $: \textdollar, \$
  {: \textbraceleft, \{
  |: \textbar, \mid, \vert

Are both LICRs, or which one is? I suspect the \text... versions.
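To make the comparison of the two forms concrete, here is a minimal test file one could use. It is only a sketch: the choice of the T1 font encoding is my assumption, and the file merely typesets the candidates side by side; it does not decide which of them counts as the LICR.

```latex
\documentclass{article}
\usepackage[T1]{fontenc} % assumption: T1, so that all glyphs below exist
\begin{document}
% Each line typesets the long \text... form, then the short form.
\textdollar\ vs.\ \$ \par
\textbraceleft\ vs.\ \{ \par
\textbraceright\ vs.\ \} \par
\textunderscore\ vs.\ \_ \par
% \textbar is a text-mode command; \vert (like \mid) is a math-mode
% command, so it is used inside $...$ here.
\textbar\ vs.\ $\vert$ \par
\end{document}
```

If both columns look the same in the output, that only shows that the forms are typeset alike in this encoding, not that both are LICRs.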

My guesses, in case a character is mapped to one LICR only:

  $: \textdollar       (U+0024 DOLLAR SIGN)
  ^: \textasciicircum  (U+005E CIRCUMFLEX ACCENT)
  _: \textunderscore   (U+005F LOW LINE)
  {: \textbraceleft    (U+007B LEFT CURLY BRACKET)
  }: \textbraceright   (U+007D RIGHT CURLY BRACKET)
  #: \#                (U+0023 NUMBER SIGN)
  &: \&                (U+0026 AMPERSAND)
  %: \%                (U+0025 PERCENT SIGN)
  \: \textbackslash    (U+005C REVERSE SOLIDUS)
  ~: \textasciitilde   (U+007E TILDE)
  <: \textless         (U+003C LESS-THAN SIGN)
  >: \textgreater      (U+003E GREATER-THAN SIGN)
  |: \textbar          (U+007C VERTICAL LINE)

Thus the entry for U+02C6 in utf8enc.dfu is not really correct:

  \DeclareUnicodeCharacter{02C6}{\textasciicircum}
  U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT

"\^" would be more correct, except that grabbing the argument is not
trivial for UTF-8 characters that consist of several bytes.

Next issue: ligatures, e.g. U+2013 EN DASH.

utf8enc.dfu:
  \DeclareUnicodeCharacter{2013}{\textendash}

What about "--" for the en dash? Does the en dash have two LICRs,
"\textendash" and "--"?

What is the LICR of "fi"?
  U+FB01 LATIN SMALL LIGATURE FI
The ligature mechanism depends on the fonts used; the "fi" ligature is
not always available. Which is better?

  \DeclareUnicodeCharacter{FB01}{\textfi}
  \ProvideTextCommandDefault{\textfi}{fi}

vs.

  \DeclareUnicodeCharacter{FB01}{fi}

Finally, the remaining tokens are:
(a) commands, short form (\^, \ , \., ...)
(b) commands with letter names (\c, \textperiodcentered)
(c) balanced curly braces with standard catcodes 1 and 2 for
    arguments.

Is the list complete?

Question on (b): all names I found in utf8enc.dfu and the other input
encoding files use A-Za-z only. The exception is \@tabacckludge, with
"@" in the name. Does this hold for all LICRs?

Yours sincerely
  Heiko