LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Subject:	Re: latex/3480: Support for UTF-8 missing in inputenc.sty
From:	Dominique Unruh <[log in to unmask]>
Reply To:	Mailing list for the LaTeX3 project <[log in to unmask]>
Date:	Thu, 5 Dec 2002 23:57:12 +0100

Short info on what this discussion is about:

We were discussing the possibility of adding UTF-8 inputenc support to
LaTeX. The existing package ucs.sty is deemed too big and too
resource-consuming for inclusion in the kernel. The discussion has now
moved to LATEX-L.

Frank wrote:
> it seems important to me to follow up the question Chris has posted
> about what are input and what are output (font) encodings.

Yes, I do understand this difference. But when adding UTF-8 support,
it is probably unwise to load all supported UTF-8 sequences. I
therefore proposed adding to each font encoding the information about
which Unicode range is to be loaded for it. To clarify this, here is
an example.

If we have code like the following:

\usepackage[utf8]{inputenc}
\usepackage[T2A]{fontenc}

the file t2aenc.def could contain a line like:

\FontencUnicodeRange{"400-"4FF}

and at \AtBeginDocument, UTF-8 sequences would be loaded only for the
ranges given by the font encodings, thus relieving the user of the
need to decide which sequences to load. If UTF-8 is not needed, the
\FontencUnicodeRange declarations are ignored.

Of course, the fontencoding->Unicode-Range mappings could also be in
some extra file, thus removing the need to change the existing
fontencodings.
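
To make the proposal concrete, here is a minimal sketch of how such a
declaration could be collected and processed. It is only an
illustration under assumptions: the internal names
\fontenc@unicode@ranges, \fontenc@load@range and the switch
\iffontenc@utf@active are hypothetical, and the "loading" step merely
reports the range instead of defining any UTF-8 sequences.

```latex
\documentclass{article}
\makeatletter
% Hypothetical switch that a UTF-8 input encoding would set to true.
\newif\iffontenc@utf@active
% Hypothetical storage: comma-separated list of requested ranges.
\newcommand*\fontenc@unicode@ranges{}
% Append one range (e.g. "400-"4FF) to the list.
\newcommand*\FontencUnicodeRange[1]{%
  \ifx\fontenc@unicode@ranges\@empty
    \gdef\fontenc@unicode@ranges{#1}%
  \else
    \expandafter\gdef\expandafter\fontenc@unicode@ranges
      \expandafter{\fontenc@unicode@ranges,#1}%
  \fi}
% Placeholder loader: a real one would define the UTF-8 sequences
% for the range; this one only reports it.
\def\fontenc@load@range#1-#2\@nil{%
  \typeout{Would load UTF-8 sequences for #1 to #2}}
% Process the collected ranges only if UTF-8 input is in use.
\AtBeginDocument{%
  \iffontenc@utf@active
    \@for\@tempa:=\fontenc@unicode@ranges\do{%
      \expandafter\fontenc@load@range\@tempa\@nil}%
  \fi}
% Simulate \usepackage[utf8]{inputenc} and a line in t2aenc.def.
\fontenc@utf@activetrue
\FontencUnicodeRange{"400-"4FF}
\makeatother
\begin{document}
Test.
\end{document}
```

If the switch stays false, the collected ranges are simply never
looked at, which matches the "ignored when no UTF-8 is needed"
behaviour described above.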

> commands, eg instead of
> \DeclareInputText{164}{\textcurrency}
> we probably need something like
> [...]
> \DeclareUTFeightInputText{<whatever-number-or-identification>}{\textcurrency}

Code for this can be extracted from utf8.def, which comes with
ucs.sty. Interested people could have a look at the following macros
in this file (unfortunately still mostly undocumented):

\utf@viii@map{number} defines the macro \u8-n-BCD for the UTF-8
sequence of the given Unicode number, where n is the first character
of the sequence (as a decimal number) and BCD are the (one, two or
three) further characters (as characters). Here the macro's content
is just the number, but the code can easily be changed to define it
as anything given (e.g. \textcurrency).
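
As an illustration of the naming scheme, the following sketch computes
the two UTF-8 bytes for a Unicode number in the range "80 to "7FF. The
helper \showutfviiitwo is hypothetical and is not part of utf8.def; it
only shows the arithmetic behind the "n" and the first of the further
characters.

```latex
\documentclass{article}
\makeatletter
% Hypothetical helper: split a Unicode number ("80.."7FF) into the
% two bytes of its UTF-8 sequence and report them as decimal numbers.
\newcommand*\showutfviiitwo[1]{%
  \count@=#1\relax
  \@tempcnta=\count@ \divide\@tempcnta 64      % upper bits
  \@tempcntb=\@tempcnta \multiply\@tempcntb 64
  \advance\count@-\@tempcntb                   % lower six bits
  \advance\@tempcnta 192 \advance\count@ 128
  \typeout{first byte \the\@tempcnta, second byte \the\count@}}
\makeatother
% Unicode number 164 is the currency sign (\textcurrency).
\showutfviiitwo{164}
\begin{document}
Test.
\end{document}
```

For 164 this reports first byte 194 and second byte 164, so under the
scheme above the corresponding macro name would be u8-194- followed by
the character with code 164.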

\utf@viii@undef{number}{char}{char}{char} calculates the Unicode
number for some UTF-8 sequence (given again as number, char, char,
char, with \@nil in place of the missing chars for shorter sequences).
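
The calculation for a three-byte sequence can be sketched as follows.
This is not the code of \utf@viii@undef: the helper \showunicodethree
is hypothetical and, as a simplifying assumption, takes all three
bytes as numbers rather than as number, char, char.

```latex
\documentclass{article}
\makeatletter
% Hypothetical helper: Unicode number of a three-byte UTF-8 sequence,
% all three bytes given as numbers.
\newcommand*\showunicodethree[3]{%
  \count@=#1\relax \advance\count@-224 \multiply\count@ 4096
  \@tempcnta=#2\relax \advance\@tempcnta-128 \multiply\@tempcnta 64
  \advance\count@\@tempcnta
  \advance\count@#3\relax \advance\count@-128
  \typeout{Unicode number \the\count@}}
\makeatother
% "E3 "81 "82 = 227 129 130
\showunicodethree{227}{129}{130}
\begin{document}
Test.
\end{document}
```

For the bytes 227, 129, 130 this gives 3*4096 + 1*64 + 2 = 12354,
i.e. "3042.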

A UTF-8 sequence starter would then have to be defined approximately
as follows (here for the sequence starter "E3 = 227):

% assumes the byte ^^E3 is an active character, as under inputenc
\def^^E3#1#2{\expandafter\ifx\csname u8-227-#1#2\endcsname\relax
  \utf@viii@undef{227}#1#2\@nil\else
  \csname u8-227-#1#2\endcsname\fi}

\utf@viii@make does the job of defining such macros (which contain
some additional code).

Chris wrote:
> I tried to understand Dominique's approach and to compare it with
> David's but both, as on CTAN, consist of undocumented code ... so
> I gave up.  Have you looked at David's code?

My code is documented (though only partly). The comments can be found
in utf8.dtx, or in the files in the CVS archive (see
http://www.unruh.de/DniQ/latex/unicode/). I don't know David's code;
could you give me a CTAN location?

DniQ.
