LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Date: Sat, 4 Mar 2006 22:14:16 +0100

Subject: Re: LICR objects

From: Lars Hellström

On Saturday, 4 March 2006, at 17:15, Heiko Oberdiek wrote:
> Hello,
>
> I am interested in a mapping Unicode to LICR, therefore I should
> understand what a LICR really is.
>
> Literature:
> [TLC2] Frank Mittelbach et al., The LaTeX Companion, 2nd edition.
>
> LICR is an abbreviation for "LaTeX internal character representation"
> (TLC2, 7.11.1)
>
> LaTeX is based on TeX, thus is the following assumption correct?
>
> (1) LICR consists of a sequence of one or more TeX tokens.
>
> Conclusion:
> (2) LICR cannot be empty.
> That would mean that ignoring characters cannot be handled
> by an empty LICR.

Ignoring a character can't be done by mapping it to the empty token 
sequence, you mean? This would seem to imply that it is important to 
record the fact that there was a character there. Why would one need 
this?

> The variety of TeX tokens is large.
> Are there restrictions?

Plenty. I'd expect everything not explicitly allowed (perhaps by 
belonging to a family of commands, of course) is forbidden/unsupported.

> Starting at the basics:
>
> TLC2, table 7.31 "LICR objects represented with single characters"
> I am sure about:
>
> (3) LICR-letter := A_11, ..., Z_11, a_11, ..., z_11
> This means uppercase and lowercase ASCII letters with catcode 11.
>
> (4) LICR-other := 0_12, ..., 9_12,
>                   ._12, ,_12, ;_12, :_12, ?_12, !_12, '_12, `_12,
>                   *_12, +_12, -_12, =_12,
>                   (_12, )_12, [_12, ]_12, /_12, @_12
>
> Regarding catcodes: TeX does not differentiate between
> A_11 and A_12 when the letter A is typeset. So is
> A_12 also a LICR, and does "A" have more than one LICR?

Hmm... it is probably safe to use them interchangeably (as I recall,
ltoutenc.dtx has a command for defining text commands that typesets
them via tokens whose catcodes are the same for letters and symbols,
so there is probably no difference in the boxes that are generated),
but they're not exactly the same. E.g. \ifx would distinguish A_11
and A_12.
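
A minimal sketch of that distinction, assuming a plain article document. The macro names \Aletter and \Aother are made up for this example, and \string is used only as a convenient way to get an "A" with catcode 12:

```latex
\documentclass{article}
\begin{document}
\def\Aletter{A}           % replacement text: A with catcode 11 (letter)
\edef\Aother{\string A}   % \string yields an A with catcode 12 (other)

% Both macros typeset the same glyph ...
\Aletter\ and \Aother\ look the same on the page.

% ... but \ifx compares catcodes as well as character codes,
% so the two macro meanings are not equal.
\ifx\Aletter\Aother identical\else different\fi
\end{document}
```

The last line should come out as "different", even though the two glyphs printed before it are the same.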

> I take it that the characters listed under "Not Represented as Characters"
> are not LICRs, whatever their catcode:
> (5) LICR <> $, ^, _, {, }, #, &, %, \, ~,
>             <, >, |, "
> There exist other representations of these characters, e.g.
>   $: \textdollar, \$
>   {: \textbraceleft, \{
>   |: \textbar, \mid, \vert
> Are both LICRs, or which one is? I suspect the \text... versions.

\mid and \vert are math commands, hence not LICRs. \{ branches 
depending on whether you're in math mode or not, so it is a higher 
level command than the LICR ones. \$ I don't know. I wouldn't want to 
have it as LICR, but I'm not sure what Frank thinks.
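
To illustrate what "branches depending on math mode" means, here is a sketch of such a higher-level command. The name \leftbracesketch is made up for this example and is not a kernel command. It picks a math command or a text command via \ifmmode, which is the kind of decision a LICR-level command would not make:

```latex
\documentclass{article}
% Hypothetical command: chooses between a math and a text command
% depending on the mode it is used in.
\DeclareRobustCommand{\leftbracesketch}{%
  \ifmmode \lbrace \else \textbraceleft \fi}
\begin{document}
Text mode: \leftbracesketch\quad
Math mode: $\leftbracesketch x \rbrace$
\end{document}
```

Under this view only \textbraceleft would be a candidate for the LICR; the wrapper sits one level above it.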

> My guesses in case that a character is mapped to one LICR only:
>   $: \textdollar (U+0024 DOLLAR SIGN)
>   ^: \textasciicircum (U+005E CIRCUMFLEX ACCENT)
>   _: \textunderscore (U+005F LOW LINE)
>   {: \textbraceleft (U+007B LEFT CURLY BRACKET)
>   }: \textbraceright (U+007D RIGHT CURLY BRACKET)
>   #: \# (U+0023 NUMBER SIGN)
>   &: \& (U+0026 AMPERSAND)
>   %: \% (U+0025 PERCENT SIGN)
>   \: \textbackslash (U+005C REVERSE SOLIDUS)
>   ~: \textasciitilde (U+007E TILDE)
>   <: \textless (U+003C LESS-THAN SIGN)
>   >: \textgreater (U+003E GREATER-THAN SIGN)
>   |: \textbar (U+007C VERTICAL LINE)
>
> Thus the entry for U+02C6 in utf8enc.dfu is not really correct:
>   \DeclareUnicodeCharacter{02C6}{\textasciicircum}
>   U+02C6 MODIFIER LETTER CIRCUMFLEX ACCENT
> "\^" would be more correct, except that grabbing the
> argument isn't too trivial in case of utf-8 characters
> consisting of several bytes.

Aren't you thinking of the COMBINING circumflex accent here? MODIFIER 
characters are more phonetic alphabet thingies.

> Next issue: ligatures, e.g.
>   U+2013 EN DASH
>   utf8enc.dfu: \DeclareUnicodeCharacter{2013}{\textendash}
> What about "--" for the en dash?

It has the malfeature that it always kerns as hyphen on the left. 
\textendash is the only way to get the true thing.
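
A small test file for comparing the two forms, assuming T1-encoded fonts. The letter pair is an arbitrary choice, and whether any difference is visible depends on the kerning table of the font in use:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\begin{document}
% Ligature input: TeX meets a hyphen first, so any kern against the
% preceding letter is the one the font specifies for the hyphen.
T--V

% Explicit symbol: the en dash glyph is requested directly.
T\textendash V
\end{document}
```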

> Does the en dash have two LICRs, "\textendash" and "--"?
>
> What is the LICR of "fi"?
>   U+FB01 LATIN SMALL LIGATURE FI
> The ligature mechanism depends on the used fonts, "fi" is not
> always available. What is better?
>   \DeclareUnicodeCharacter{FB01}{\textfi}
>   \ProvideTextCommandDefault{\textfi}{fi}
> vs.
>   \DeclareUnicodeCharacter{FB01}{fi}

Definitely the latter. As I understand it, these ligatures are in
Unicode mostly for compatibility with legacy encodings (and perhaps for
font designers who need to assign something to these glyphs). At least
as far as TeX is concerned, "fi" doesn't carry any semantic information
different from "f" "i".
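
A sketch of the second variant in a complete file, assuming pdfLaTeX with UTF-8 input and T1 fonts (the example word is arbitrary). The precomposed character is mapped to the two plain letters, and the font's own ligature table then decides whether an fi glyph is used:

```latex
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[utf8]{inputenc}
% Map U+FB01 LATIN SMALL LIGATURE FI to the plain letters f and i.
\DeclareUnicodeCharacter{FB01}{fi}
\begin{document}
% The first word contains U+FB01, the second the letters f and i.
ﬁnal and final
\end{document}
```

With this mapping both words go through the same ligature mechanism, so they should be typeset alike in any given font.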

Lars Hellström