LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender:	Mailing list for the LaTeX3 project <[log in to unmask]>
Subject:	Re: LaTeX's internal char representation (UTF8 or Unicode?)
From:	Frank Mittelbach <[log in to unmask]>
Date:	Sun, 11 Feb 2001 19:54:17 +0100
In-Reply-To:	<[log in to unmask]>
Reply-To:	Mailing list for the LaTeX3 project <[log in to unmask]>
Parts/Attachments:	text/plain (158 lines)

Marcel,

 >  > So... time for another attempt?
 >
 > Yes, yes, yes!

make that maybe, maybe, maybe, and i'm with you. right now i don't see the
arguments put forward pointing convincingly in one or the other direction (and
i hope it is clear that i'm not arguing for staying with the status quo; i'm
just trying to get the arguments pro and con onto the table)


 > I have yet to see that UTF8 text
 >  > (without taking precaution and externally announcing that a file is
 >  > in UTF8) is really properly handled by any OS platform. Is it?
 >
 > Not at the moment.  But there is a strong movement pushing for UTF8 as
 > _the_ encoding standard.  Support in bleeding edge versions of a lot
 > of software is actually quite good.  As far as I can see, UTF8 is the
 > only standard that has a reasonable chance of becoming the one that
 > "works without taking precaution".

i can see that. but it doesn't really mean that you are better off making the
LICR (LaTeX's internal character representation) unicode based, or rather UTF8
based, at this stage, since the current LICR is already unicode based in
disguise, if you like.

 >  > TeX is 7bit with a parser that accepts 8bit but doesn't by default
 >  > give it any meaning. On the other hand Omega is 16bit (or more
 >  > these days?) and could be viewed as internally using something like
 >  > Unicode for representation.
 >
 > This is good because UTF8 is a proper superset of what TeX is
 > currently taking as input.  Does anybody know about the state of
 > Omega?  16-bit Unicode is not the whole game, and also not particularly
 > attractive as an input encoding.

i'm really not that much concerned about the input encoding, i'm concerned
about the internal handling and the mapping to glyphs in the end.


okay, so now for your additional arguments:

 >  >  wouldn't it be better if the internal LaTeX representation would
 >  >  be Unicode in one or the other flavor?
 >
 > Yes, because:
 >
 > - A LaTeX specific naming scheme will be essentially unmaintainable.

a point not too far away from my statement that the naming scheme would get
more uniform.

 > - LaTeX could eventually be made to output diagnostics and log files
 >   in UTF8.  For example, UTF8-enabled Xterms exist now, and will
 >   likely come as default on Linux distributions long before LaTeX3.
 >   Same for text editors.  So the infrastructure is getting to a state
 >   where it's possible to pull this off.

that's indeed a good one (as soon as that would work)

problem is that the larger part of the game is messages coming from TeX
itself (like overfull boxes), and those would be 8bit without being UTF8 and
as a result would most likely kill the UTF8 interpretation of the whole file
or the terminal messages.

so without a different program underneath, this might just be a dream

 >  >  - however, not clear is that the resulting names are easier to
 >  >    read, eg \unicode{00e4} viz \"a.
 >
 > See remark about Xterms.

again that depends on the above. as long as most TeX installations write out
8bit chars as ^^xy (four 7bit chars) you are not even close. by which i mean
you can't even reliably write UTF8 in TeX (i can in mine, but only because it
uses a hardwired table which isn't transparent, so even that is wrong), and it
depends on the defaults used during installation, so right now no code can
depend on it being possible.

 >  >  - the current latex internal representation is richer than unicode
 >  >    for good or worse, eg \" is defined individually as
 >  >    representation for accenting the next char, which means that
 >  >    anything \"<base-char-in-the-internal-reps> is automatically
 >  >    also a member of it, eg \"g.
 >
 > This is not necessarily a problem (cf. Roozbeh's remark about math
 > symbols which are not properly defined in unicode).

i was playing with small little twines most of the afternoon so wasn't able to
remark on that yet.

 >  As long as there
 > are no hyphenation and nongeneric kerning issues involved (and those
 > seem only an issue for natural language scripts), one could still have
 > named symbols as they exist now, whether they are part of some special
 > font, a combination of glyphs, the drawing of some box, etc.

true

 >  >  - the latter point could be considered bad since it allows to
 >  >    produce characters not in unicode but independently of what you
 >  >    feel about that, the fact itself has consequences when defining
 >  >    mappings for font encodings. right now, one specifies the
 >  >    accents, ie \DeclareTextAccent\" and for those glyphs that
 >  >    exist as composites one also specifies the composite, eg
 >  >    \DeclareTextComposite{\"}{T1}{a}{...}  With the unicode approach
 >  >    as the internal representation there would be an atomic form ie
 >  >    \unicode{00e4} describing umlaut-a so if that has no
 >  >    representation in a font, eg in OT1, then one would need to
 >  >    define for each combination the result.
 >
 > This seems to be the logical thing to do?!

but it would automatically restrict the combinations possible to the
"allowable" set. ie there wouldn't really be any general accent command any
more, or only with a huge additional coding effort, as it wouldn't naturally
fit.
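
For reference, the two kinds of declaration discussed in the quoted point can
be seen at work in a small LaTeX2e document. This is only a sketch: the
declarations shown in the comments are what the standard T1 encoding file is
believed to contain (slot numbers 4 and 228), and the choice of \"g as the
"no composite" case is just an illustration.

```latex
\documentclass{article}
\usepackage[T1]{fontenc}

% Declarations of this form live in the T1 encoding definition file
% (shown as comments here; assumed to match the standard t1enc.def):
%   \DeclareTextAccent{\"}{T1}{4}          % the generic diaeresis accent
%   \DeclareTextComposite{\"}{T1}{a}{228}  % umlaut-a exists as one glyph

\begin{document}
\"a  % a composite is declared, so the single glyph in slot 228 is used

\"g  % no composite is declared, so the generic accent is placed over g
\end{document}
```

With an atomic internal form such as \unicode{00e4} there is no generic
fallback of the second kind, which is why each combination would need its own
definition for every font encoding that lacks the glyph.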


where was the remark concerning unicode a` handling of such things? thought it
was in this mail but ...

anyway, this is really the catch: a sequence like the above has no meaning in
TeX and can't easily be given one, since you can't apply something like
\accent to a char that has already been passed by. so to implement that
(in TeX!) you would need to actually parse manually from each char forward
(yes there is Active TeX and we have also built parsers like that in the past,
but ...)

to me this (unless proven wrong by a production usable implementation :-)
indicates that the LICR can't have a" as a member (ie small-latin-a followed
by combining-diaeresis) but should only have the equivalent of
small-latin-a-with-diaeresis, ie in its current implementation \"a.

in other words, in my opinion, the restrictions of TeX's processing model
suggest that something like

  small-latin-g combining-diaeresis

in the input source would need to be turned (via input parsing (or say input
encoding)) into

  small-latin-g-with-diaeresis

which is not a unicode character (i hope at least, for the sake of the example
:-). So the LICR needs to be more than a unicode stream anyway (or say,
something different).

Now since we clearly can't provide all such combinations, we would need to
effectively model those things as

  \" g  % (somehow at some stage and perhaps in some disguise)


have i lost everybody? hope not

frank
