LATEX-L Archives

Mailing list for the LaTeX3 project


Sender: Mailing list for the LaTeX3 project <[log in to unmask]>
From: Roozbeh Pournader <[log in to unmask]>
Date: Sun, 11 Feb 2001 19:47:44 +0330
In-Reply-To: <[log in to unmask]>
Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>
Parts/Attachments: text/plain (62 lines)
On Sun, 11 Feb 2001, Frank Mittelbach wrote:

> The LaTeX internal character representation is a 7bit representation not an
> 8bit one as UTF8. As such it is far less likely to be mangled by incorrect
> conversion if files are exchanged between different platforms. I have yet to
> see that UTF8 text (without taking precaution and externally announcing that a
> file is in UTF8) is really properly handled by any OS platform. Is it?

Windows 2000 autodetects them. I can't say precisely what "proper handling"
would mean on Linux; do you mean in a text editor?
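Such autodetection works because UTF-8's byte structure is largely
self-validating: legacy 8-bit text almost never happens to form valid
multi-byte sequences. A minimal Python sketch (the helper name is mine,
not from any particular editor):

```python
# Heuristic UTF-8 detection: bytes that decode cleanly as UTF-8 are very
# unlikely to be legacy 8-bit text, because every multi-byte sequence must
# follow UTF-8's strict lead-byte/continuation-byte pattern.

def looks_like_utf8(data: bytes) -> bool:
    """Return True if data is valid UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False

# "ä" in UTF-8 is the two-byte sequence 0xC3 0xA4 -- valid.
print(looks_like_utf8(b"\xc3\xa4"))           # True
# The same letter in Latin-1 is the lone byte 0xE4 -- invalid as UTF-8,
# since 0xE4 would have to be followed by a continuation byte.
print(looks_like_utf8(b"Gr\xe4fenberg"))      # False
```

Real editors refine this with statistics and a byte-order-mark check, but
the validity test above is the core of the trick.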

>  wouldn't it be better if the internal LaTeX representation would be Unicode
>  in one or the other flavor?

What about symbol fonts like TC? What about math characters that are
unified in Unicode (\rightarrow and \longrightarrow)? What about the
things that are not yet in Unicode?

>  - however, not clear is that the resulting names are easier to read, eg
>    \unicode{00e4} viz \"a.

They are worse than you may think; they are always hard to read. My real
work is related to the Unicode Arabic script, and after two years of full
dedication I can't recall more than a few codes. I always need a table at
hand. I have much less experience with the Knuthian names of math symbols,
but I'm sure I could recall the names of more than 95% of them without any
table at hand.

>  - with intermediate forms like data written to files this could be a pain and
>    people in Russia, for example, already have this problem when they see
>    something like \cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya.  In case
>    of unicode as the internal representation this would be true for all
>    languages (except English) while currently the Latin based ones are still
>    basically okay.

This is a place where UTF-8 helps a lot. People can use Unicode text
editors to view the files, or use widely available converters like
iconv to convert to practically any charset.
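A minimal Python sketch of what such a converter does, using Frank's
Russian example (\cyr\CYRA\cyrn\cyrn\cyro\cyrt\cyra\cyrc\cyri\cyrya,
i.e. "Аннотация"): decode the byte stream into abstract characters, then
re-encode in the target charset.

```python
# The word from the \cyr... example above, written as Unicode code points.
word = "\u0410\u043d\u043d\u043e\u0442\u0430\u0446\u0438\u044f"  # Аннотация

utf8 = word.encode("utf-8")    # UTF-8 on disk: 2 bytes per Cyrillic letter
koi8 = word.encode("koi8_r")   # KOI8-R (legacy Russian): 1 byte per letter

print(len(utf8), len(koi8))    # 18 9

# Round trip, equivalent to `iconv -f KOI8-R -t UTF-8` and back: the
# character content is charset-independent, only the byte encoding changes.
assert koi8.decode("koi8_r").encode("utf-8") == utf8
```

The point is that once the internal form is Unicode, the on-disk charset
becomes a lossless, mechanical transformation for any charset that can
represent the characters involved.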

>  - the current latex internal representation is richer than unicode for good
>    or worse, eg \" is defined individually as representation for accenting the
>    next char, which means that anything \"<base-char-in-the-internal-reps> is
>    automatically also a member of it, eg \"g.
>  - the latter point could be considered bad since it allows producing
>    characters not in unicode, but independently of what you feel about that,
>    the fact itself has consequences when defining mappings for font
>    encodings. right now, one specifies the accents, ie \DeclareTextAccent\"
>    and for those glyphs that exists as composites one also specifies the
>    composite, eg \DeclareTextComposite{\"}{T1}{a}{...}
>    With the unicode approach as the internal representation there would be
>    an atomic form ie  \unicode{00e4} describing umlaut-a so if that has no
>    representation in a font, eg in OT1, then one would need to define for each
>    combination the result.

Unicode also has the equivalent of \"; it just appears after the letter,
as a combining character. So an accented letter that has no precomposed
code point is not a real problem: such letters can still be composed in
Unicode. But I don't know what you are going to do with the combining
accent appearing after the letter.
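The precomposed-versus-combining distinction can be made concrete with
Python's standard unicodedata module (a sketch; the \"g case below is the
one from Frank's message):

```python
import unicodedata

# Precomposed form: U+00E4 LATIN SMALL LETTER A WITH DIAERESIS.
precomposed = "\u00e4"
# Combining form: "a" followed by U+0308 COMBINING DIAERESIS.
combining = "a\u0308"

# Different code point sequences...
print(precomposed == combining)                                # False
# ...but canonically equivalent: NFC recomposes, NFD decomposes.
print(unicodedata.normalize("NFC", combining) == precomposed)  # True
print(unicodedata.normalize("NFD", precomposed) == combining)  # True

# For a combination with no precomposed code point (like LaTeX's \"g),
# NFC simply leaves the combining mark after the base letter.
print([hex(ord(c)) for c in unicodedata.normalize("NFC", "g\u0308")])
# ['0x67', '0x308']
```

So a Unicode internal representation would not lose \"g-style combinations;
it would carry them as base-plus-combining-mark sequences, which is exactly
the "accent after the letter" order being asked about.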