Short info on what this discussion is about:

We were discussing the possibility of adding UTF-8 inputenc support to
LaTeX. The existing package ucs.sty is deemed too big and too
resource-consuming for inclusion in the kernel. This discussion has
now moved to LATEX-L.

Frank wrote:
> it seems important to me to follow up the question Chris has posted
> about what are input and what are output (font) encodings.

Yes, I do understand this difference. But when adding UTF-8 support,
it is probably unwise to load all supported UTF-8 sequences.
Therefore I proposed adding to each font encoding a declaration of
which Unicode range is to be loaded for it. To clarify this, here is
an example.

If we have code like the following:

\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}

the file t2aenc.def could contain a line like:

\FontencUnicodeRange{"400-"4FF}

and at \AtBeginDocument the UTF-8 sequences would be loaded only for
the ranges declared by the font encodings, so the user no longer has
to decide which sequences to load. If UTF-8 is not in use, the
\FontencUnicodeRange declarations are ignored.

Of course, the fontencoding->Unicode-Range mappings could also be in
some extra file, thus removing the need to change the existing
fontencodings.
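
A minimal sketch of how such declarations could be collected and
processed. The internal names \fontenc@unicode@ranges and
\fontenc@report@range are made up for illustration and exist nowhere
yet. The sketch only records the ranges and reports them at
\begin{document}; the actual loading of UTF-8 definitions is left out:

```latex
\makeatletter
% Hypothetical list of declared ranges; starts out empty.
\let\fontenc@unicode@ranges\@empty
% Each declaration appends \do{<range>} to the list.
\newcommand*\FontencUnicodeRange[1]{%
  \g@addto@macro\fontenc@unicode@ranges{\do{#1}}}
% Hypothetical stand-in for the real loader: it only writes the
% range to the log.
\newcommand*\fontenc@report@range[1]{%
  \typeout{UTF-8 definitions wanted for range #1}}
% Process all collected ranges at \begin{document}, inside a group
% so that the meaning of \do is restored afterwards.
\AtBeginDocument{%
  \begingroup
  \let\do\fontenc@report@range
  \fontenc@unicode@ranges
  \endgroup}
\makeatother
```

With \FontencUnicodeRange{"400-"4FF} issued in the preamble, the log
would then show one line for that range.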

> commands, eg instead of
> \DeclareInputText{164}{\textcurrency}
> we probably need something like
> [...]
> \DeclareUTFeightInputText{<whatever-number-or-identification>}{\textcurrency}

Code for this can be extracted from utf8.def, which comes with
ucs.sty. Interested people could have a look at the following macros
in this file (unfortunately still mostly undocumented):

\utf@viii@map{number} constructs, for the UTF-8 sequence of the given
Unicode number, a macro named \u8-n-BCD, where n is the first byte of
the sequence (as a decimal number) and BCD are the one, two or three
further bytes (as characters). At present the macro's content is just
the number, but the code can easily be changed to define it as
anything given (e.g. \textcurrency).
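
As a rough illustration of the naming scheme: the currency sign
U+00A4 has the UTF-8 bytes "C2 "A4, i.e. 194 followed by the byte
164. Assuming the continuation byte is stored in the name as a plain
character (the \string is only a precaution here, not necessarily
what utf8.def does), the mapping from the quoted example could be set
up by hand roughly like this:

```latex
% Defines the macro named "u8-194-<byte A4>" to expand to \textcurrency.
% Assumes the byte "A4 is not an invalid character in the current setup.
\expandafter\def\csname u8-194-\string^^a4\endcsname{\textcurrency}
```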

\utf@viii@undef{number}{char}{char}{char} calculates the Unicode
number for some UTF-8 sequence (given again as number, char, char,
char, with \@nil instead of the chars for shorter sequences.)
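
For a three-byte sequence this calculation is plain arithmetic on the
byte values. As a worked example (not taken from utf8.def), take the
bytes "E3 "81 "82 = 227 129 130 and strip the marker bits by
subtracting 224 from the lead byte and 128 from each continuation
byte. With e-TeX's \numexpr this can be checked directly:

```latex
\documentclass{article}
\begin{document}
% (lead - 224) * 64^2 + (second - 128) * 64 + (third - 128)
\the\numexpr (227-224)*4096 + (129-128)*64 + (130-128) \relax
\end{document}
```

This typesets 12354, which is "3042.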

A UTF-8 sequence starter would then have to be defined approximately
as follows (here the example for the sequence starter "E3 = 227):

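% Assumes the byte "E3 has been made an active character and that @
% is a letter. TeX's ^^ notation needs lowercase hex digits, and \ifx
% needs \expandafter so that it sees the constructed macro name.
\def^^e3#1#2{\expandafter\ifx\csname u8-227-#1#2\endcsname\relax
  \utf@viii@undef{227}#1#2\@nil\else
  \csname u8-227-#1#2\endcsname\fi}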

\utf@viii@make does the job of defining such macros (with some
additional code).

Chris wrote:
> I tried to understand Dominique's approach and to compare it with
> David's but both, as on CTAN, consist of undocumented code ... so
> I gave up.  Have you looked at David's code?

My code is documented (though only partly). The comments can be found
in utf8.dtx, or in the files in the CVS archive (see
http://www.unruh.de/DniQ/latex/unicode/). I don't know David's code;
could you give me a CTAN location?

DniQ.