LATEX-L Archives

Mailing list for the LaTeX3 project


Sender: Mailing list for the LaTeX3 project
From: Heiko Oberdiek
Date: Sat, 4 Mar 2006 17:15:41 +0100

I am interested in a mapping from Unicode to LICR, so I need to
understand what a LICR really is.

[TLC2] Frank Mittelbach, The LaTeX Companion, 2nd edition.

LICR is an abbreviation for "LaTeX internal character representation"
(TLC2, 7.11.1)

LaTeX is based on TeX, so are the following assumptions correct?

(1) LICR consists of a sequence of one or more TeX tokens.

(2) LICR cannot be empty.
That would mean that characters to be ignored cannot be handled
by an empty LICR.

The variety of TeX tokens is large.
Are there restrictions?

Starting at the basics:

TLC2, table 7.31 "LICR objects represented with single characters"
I am sure about:

(3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11
This means uppercase and lowercase ASCII letters with catcode 11.

(4) LICR-other := 0_12, ..., 9_12,
                  ._12, ,_12, ;_12, :_12, ?_12, !_12, '_12, `_12,
                  *_12, +_12, -_12, =_12,
                  (_12, )_12, [_12, ]_12, /_12, @_12

Regarding catcodes: TeX does not differentiate between
A_11 and A_12 when the letter A is typeset. So is
A_12 also a LICR, and does "A" have more than one LICR?
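
A minimal sketch of the catcode question, assuming a plain article
document (this test file is only an illustration, not taken from TLC2;
\lettertok and \othertok are hypothetical helper names). Both character
tokens should typeset the same glyph, while \ifx still tells them apart:

```latex
\documentclass{article}
\begin{document}
% A tokenized with its default catcode 11 (letter)
A

% A tokenized with catcode 12 (other); the change is local to the group
{\catcode`\A=12 A}

% Store one token of each kind in a macro and compare the macros.
% The helper names avoid an uppercase A, because A is not a letter
% while the second definition is being read.
\def\lettertok{A}
{\catcode`\A=12 \gdef\othertok{A}}

% \ifx compares the replacement texts including catcodes,
% so this should take the \else branch.
\ifx\lettertok\othertok same\else different\fi
\end{document}
```

So the typeset result does not answer the question; it is a matter of
how the LICR is defined.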

I take it that the characters listed under "Not Represented as
Characters" are not LICRs, whatever their catcode:
(5) LICR <> $, ^, _, {, }, #, &, %, \, ~,
            <, >, |, "
There exist other representations of these characters, e.g.
  $: \textdollar, \$
  {: \textbraceleft, \{
  |: \textbar, \mid, \vert
Are all of these LICRs, or which one is? I suspect the \text... versions.

My guesses, in case a character is mapped to exactly one LICR:
  $: \textdollar (U+0024 DOLLAR SIGN)
  ^: \textasciicircum (U+005E CIRCUMFLEX ACCENT)
  _: \textunderscore (U+005F LOW LINE)
  {: \textbraceleft (U+007B LEFT CURLY BRACKET)
  }: \textbraceright (U+007D RIGHT CURLY BRACKET)
  #: \# (U+0023 NUMBER SIGN)
  &: \& (U+0026 AMPERSAND)
  %: \% (U+0025 PERCENT SIGN)
  \: \textbackslash (U+005C REVERSE SOLIDUS)
  ~: \textasciitilde (U+007E TILDE)
  <: \textless (U+003C LESS-THAN SIGN)
  >: \textgreater (U+003E GREATER-THAN SIGN)
  |: \textbar (U+007C VERTICAL LINE)
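
To check that the guessed commands above at least exist and typeset
something, a small test file can be used. This is only a sketch: it
assumes the T1 font encoding, and it says nothing about whether these
commands are the official LICRs.

```latex
\documentclass{article}
\usepackage[T1]{fontenc} % assumption: T1; other encodings may differ
\begin{document}
\textdollar{} % U+0024 DOLLAR SIGN
\textasciicircum{} % U+005E CIRCUMFLEX ACCENT
\textunderscore{} % U+005F LOW LINE
\textbraceleft{} % U+007B LEFT CURLY BRACKET
\textbraceright{} % U+007D RIGHT CURLY BRACKET
\#{} % U+0023 NUMBER SIGN
\&{} % U+0026 AMPERSAND
\%{} % U+0025 PERCENT SIGN
\textbackslash{} % U+005C REVERSE SOLIDUS
\textasciitilde{} % U+007E TILDE
\textless{} % U+003C LESS-THAN SIGN
\textgreater{} % U+003E GREATER-THAN SIGN
\textbar{} % U+007C VERTICAL LINE
\end{document}
```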

Thus the entry for U+02C6 in utf8enc.dfu is not really correct:
"\^" would be more correct, except that grabbing the
argument is not trivial when the following UTF-8 character
consists of several bytes.

Next issue: ligatures, e.g.
  U+2013 EN DASH
  utf8enc.dfu: \DeclareUnicodeCharacter{2013}{\textendash}
What about "--" for the en dash?
Does the en dash have two LICRs, "\textendash" and "--"?
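
The two candidates can be compared in a small test file. This is a
sketch assuming the T1 font encoding; whether "--" yields an en dash
depends on the ligature table of the font in use, so the typewriter
line may come out differently from the roman one.

```latex
\documentclass{article}
\usepackage[T1]{fontenc} % assumption: T1 encoding
\begin{document}
% Explicit command, as used in the \DeclareUnicodeCharacter entry
1\textendash 9

% Two hyphens, combined into an en dash only if the font has the ligature
1--9

% A typewriter font may lack the -- ligature, but still has \textendash
{\ttfamily 1\textendash 9 versus 1--9}
\end{document}
```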

What is the LICR of "fi"?
The ligature mechanism depends on the fonts in use; the "fi" ligature
is not always available. Which is better?

Finally, the remaining tokens are:
(a) commands, short form (\^, \ , \., ...)
(b) commands with letter names (\c, \textperiodcentered)
(c) balanced curly braces with standard catcodes 1 and 2 for
Is the list complete?

Question on (b): the names I found in utf8enc.dfu and the other
input encoding files use A-Za-z only. The one exception
is \@tabacckludge, with "@" in its name. Does this hold for all LICRs?

Yours sincerely
  Heiko