LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Subject:	Re: Multilingual Encodings Summary
From:	Frank Mittelbach <[log in to unmask]>
Reply To:	Mailing list for the LaTeX3 project <[log in to unmask]>
Date:	Tue, 13 Feb 2001 23:51:05 +0100
Marcel,

 > Hi, the messages on the list over the last couple of days have been
 > pretty encouraging

have they? 10 people left the list :-)

 > The following is an extended summary of the discussion (clearly
 > biased).  I encourage everybody to review, change, and extend this

thanks. it is good to get things back into focus, even though I don't agree
with a lot of the statements. so here are my views, or rather comments, on the
individual arguments you put together.

 > It's important that we don't keep iterating over the same things,
 > but rather build a solid base of arguments and clarify the design
 > goals.

right


 > --------------------------------------------------------------------
 > 1. Input Encoding and User Interface:
 >
 > 1.1. Current State:
 >
 > Currently, it is difficult to enter non-English or multilingual
 > scripts.  Users can either provide an ASCII input file, or select an
 > input encoding.  While it is currently possible to produce high
 > quality print in many scripts, there are serious usability problems.

agreed, though it depends on the language. Most Latin based languages do not
have usability problems with respect to input encodings.

 > - Typing ASCII can be very tedious, and makes it hard to proofread the
 >   .tex file.

yes, for many scripts, no for most Latin based ones.

 >   Portability is good in theory, but since nothing works
 >   out of the box can be a pain in practice.

for those languages that LaTeX currently does support (i consider this set as
a subset of the languages supported by Babel) portability is not theory but
practice and for those it does work out of the box. so either be more precise
or remove the second half since "nothing" is clearly wrong

(heh British and American do work perfectly :-)

 > - Setting an input encoding may work well for some languages.

yes

 >   However, it's not a solution for multilingual work (unless, for
 >   example, UTF8 is the chosen input encoding), few scripts are well
 >   supported (even something as simple as ISO-8859-7 for Greek requires
 >   fishing on the net to make it work).

the latter is a bad argument because the fact that something is done or not
done and officially part of LaTeX or not has nothing to do with the question
of whether or not the underlying concepts of a system would be sufficient to
support something properly.

 > - In both cases diagnostic messages can be confusing to the point of
 >   being useless.

this has nothing really to do with input encodings.


 > 1.2. The Case for UTF8 as Default Input Encoding:
 >
 > There is a good summary at http://www.cl.cam.ac.uk/~mgk25/unicode.html
 > which does not need to be repeated in detail here.  Basic points:
 >
 > - All the ASCII characters have their usual position in UTF8.  In
 >   other words, current ASCII .tex files would continue to work without
 >   anybody noticing.

yes
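
The ASCII compatibility claimed in this point is easy to check. A minimal
sketch in Python (the sample line and the variable names are illustrative and
not taken from this thread): every ASCII character encodes to the same single
byte in UTF8, so a pure-ASCII .tex line is unchanged byte for byte.

```python
# Illustrative sample: a pure-ASCII line as it might appear in a .tex file.
ascii_line = r"\section{Schr\"odinger's equation}"

as_ascii = ascii_line.encode("ascii")
as_utf8 = ascii_line.encode("utf-8")

# ASCII text is byte-for-byte identical under both encodings.
same_bytes = as_ascii == as_utf8

# A non-ASCII character, by contrast, takes more than one byte in UTF-8,
# and every one of those bytes has its high bit set.
o_umlaut = "\u00f6".encode("utf-8")  # LATIN SMALL LETTER O WITH DIAERESIS
high_bits_set = all(b >= 0x80 for b in o_umlaut)
```

Because the bytes of a multi-byte sequence never fall in the ASCII range, an
8-bit program reading such a file still sees every backslash and brace where
it expects them.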

 > - UTF8 encodes Unicode which covers virtually all scripts.

including Klingonish (however that is spelled)


 > - On all major platforms, support for editing and displaying UTF8
 >   either exists or is currently moving into mass deployment.  Major
 >   programming languages have UTF8 libraries, so the basic
 >   infrastructure for UTF8 is or will be in place shortly.

remains to be seen. in the long term most likely yes, but how many of the
people on this list can easily (in their favorite editing system) edit or
generate a utf8 encoded file? hands up?

 > - Diagnostic messages could (although not with current TeX engine) be
 >   output in the correct script.  This would be a major improvement for
 >   users.  (Is actually more related to the internal encoding, see
 >   below.)

again, absolutely nothing to do with input encoding


 > 1.3. Existing Implementations:
 >
 > - There is an implementation for UTF8 input on a TeX engine (xmltex by
 >   David Carlisle) that also uses UTF8 as the internal representation.
 >
 > - There also exists a UTF8 option for the inputenc package (more
 >   info???).

        http://www.unruh.de/DniQ/latex/unicode/

 > - The "combining characters" of Unicode are difficult to handle with a
 >   TeX based parser.  (Does "difficult" mean "impossible to get
 >   right"???  What are the issues???)

David commented on this. it is technically difficult in the sense that, when
using TeX, none of the tokenisation methods of the TeX parser could be used
and a completely different module would be needed. Not at all impossible in
the abstract, but it means you have to replace 98% of the CTAN code related to
LaTeX by newly written code.

 > - TeX based parsers may not handle input errors gracefully (i.e. give
 >   meaningful error messages).  (Can someone confirm or correct
 >   this???)

yes and no. no clear cut here, depends a lot on the effort you put into
them. in reality, probably yes.

 > - Using UTF8 on TeX internally gives a performance hit too big to
 >   justify as a default.  (Does this apply to the UTF8 inputenc package
 >   as well???)

depends a lot on whether or not you mean full utf8 or only utf8 without
character combinations

 > - There is Omega as a native Unicode implementation of TeX.  More
 >   below.

which doesn't exist on all platforms, and not in most commercial implementations


 > 2. Internal Representation and Output Encoding:
 >
 > 2.1. Problems with Current TeX:
 >
 > It has been remarked that TeX does not really have an "internal
 > representation".  Rather, TeX keeps text as a string of ASCII
 > characters that are re-parsed through the one-and-only TeX parser
 > whenever something is to be done with it.  (TeX gurus: is this
 > simplistic statement essentially correct???)

no (not correct I mean) but I guess this has been discussed by now


 > This leads to a number of problems.
 >
 > - A sufficiently general internal multilingual representation may be
 >   impossible to maintain, unless it is Unicode in disguise.

the statement doesn't follow from the argument above (even if that one were
true or were replaced by a more precise statement of TeX's inner workings).

but essentially yes since the following statement seems to me true for *any*
system that tries to work with multilingual (or even monolingual) data:

any such system needs to be able to identify character data in a sufficiently
precise way, which eventually leads to some sort of indexing of character data.
So the more scripts/languages you want to be able to manipulate, the more you
have to be able to encode, which automatically leads to a system which has
the set size of Unicode and which automatically is something that can be
converted to and from Unicode without loss of information (in theory)
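
To illustrate conversion to and from Unicode without loss, here is a minimal
sketch in Python. The Greek sample word and the choice of ISO-8859-7 as the
legacy encoding are assumptions made for illustration; the check covers only
this one string and says nothing about harder cases.

```python
# Illustrative sample: the Greek word "Ellinika", written with escapes so the
# snippet itself stays pure ASCII.
greek = "\u0395\u03bb\u03bb\u03b7\u03bd\u03b9\u03ba\u03ac"

legacy_bytes = greek.encode("iso8859_7")         # 8-bit legacy encoding
via_unicode = legacy_bytes.decode("iso8859_7")   # legacy bytes -> Unicode
round_trip = via_unicode.encode("iso8859_7")     # Unicode -> legacy bytes

# Both directions reproduce their input exactly for this sample.
lossless = (via_unicode == greek) and (round_trip == legacy_bytes)
```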


 > - Hyphenation patterns are specified in terms of the output encoding.
 >   This means that every character appearing in the hyphenation rules
 >   must have a physical slot in the selected font.

only in the internal storage format for patterns used within TeX. On the
abstract level this is not at all true, even though the source format of
existing patterns tends to be written in this form as well.

 >   However, logically
 >   hyphenation should not depend on output encoding, and one should be
 >   able to mix fonts with different output encodings without losing
 >   correct hyphenation.

yes, and it is possible without technical problems (in theory)


 > - It is rather hard to make a new font available under LaTeX.
 >   Essentially one must create a virtual font which has all the
 >   character slots in the places where hyphenation expects them to be.

wrong.


 > - TeX diagnostic messages output the "internal representation", which
 >   can quickly become unreadable for scripts that are not essentially
 >   ASCII.

which diagnostics are we talking about here? some of them are in the font
encoding (which is not the LICR at all)

 > - The output encoding is limited to 8 bit fonts, which may not be
 >   enough to get correct kerning for some languages. (Can someone
 >   confirm or correct this???)

true in some cases.


 > 2.2. How Omega Separates Internal and Output Encoding:
 >
 > (The following is stolen from Javier Bezos)

what Javier describes is not how Omega does it but how he suggests it should
do it. I would like to discuss that separately as it is getting complex

 > 2.3. Further Issues:
 >
 > - Even with Unicode internally, one probably still needs what is
 >   currently used exclusively, namely to have named symbols and other
 >   complex objects.

yes and not only for some sort of backwards compatibility

 >   This may be fine as long as these don't need
 >   hyphenation and nongeneric kerning.

it is not clear to me that this needs to be a contradiction


 > - How are combining characters handled, in particular when they
 >   represent a glyph that has also its own Unicode slot?  The main
 >   issue is hyphenation.  How do Unicode capable word processors handle
 >   this?

if they have a unicode slot then they are supposed to represent the same
character. in that case, for the sake of processing ease, I would vote for
replacing them at the entrance stage with the single slot representation, to
ensure that the internal representation of whatever system has to deal with
only one possible form per character.
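
The replacement at entrance described here corresponds to what Unicode calls
normalization form NFC (the post itself does not name it). A minimal sketch in
Python, with illustrative variable names:

```python
import unicodedata

decomposed = "o\u0308"   # 'o' followed by COMBINING DIAERESIS (two code points)
precomposed = "\u00f6"   # LATIN SMALL LETTER O WITH DIAERESIS (one code point)

# NFC composes base + combining mark into the single-slot form where one exists.
normalized = unicodedata.normalize("NFC", decomposed)
```

After this step, later stages such as hyphenation see only one form of the
character. Sequences for which Unicode defines no single slot remain as base
plus combining marks and still need separate handling.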

 > - Unicode is still changing, especially with respect to math
 >   characters.  Does this prevent us from getting correct basic
 >   infrastructure in place?

no

 > - Requirements for non-European scripts that have not been adequately
 >   addressed?

who knows?


 > 3. Alternative Engines:
 >
 > As explained above, the TeX engine has limited capabilities for
 > multilingual typesetting and requires some rather awkward workarounds
 > for non-English languages.  Omega with its internal Unicode
 > representation is certainly an alternative.  What is the current state
 > of Omega, what are potential problems, and are there other
 > possibilities?
 >
 > - It appears that Omega uses a 16 bit internal representation.  Is
 >   this a restriction that may cause problems later, when someone finds
 >   needed glyphs are outside the 16 bit range?
 >
 > - What is the general state of Omega's TeX compatibility?  For
 >   example, would LaTeX and a majority of third party packages run
 >   unchanged on top of Omega (with or without full Unicode support)?

that depends very very much on whether the omega "high-level format
development" takes a path which diverges very much from the one that LaTeX
(based on TeX) takes. I'm not talking about internal implementations here but
rather fundamental changes, for example, in basics for language or font
support, which would then conflict with many packages.


 > - If the engine is under discussion, the new engine should be able to
 >   provide long-time stability comparable to TeX.  So is the basic
 >   infrastructure that Omega provides considered solid and general
 >   enough for its purpose?

depends on what you are targeting. For the needs of some people yes, and that
has been so for a long time.  with respect to the base of LaTeX users I fear
no; not yet at least.


 > - Would the decision to move beyond TeX cause a feature explosion in
 >   the engine that would be difficult to control?

feature changes or additions in the engine are dangerous to one part of
LaTeX usage, the exchangeability of documents. this will be as true for a
successor of LaTeX2e as it is now. A successor to LaTeX might be able to pull
off a change of engines, but it will rely on a stable system from then on and
it will need a system supported on all major (and many minor) platforms

 >   On the other hand,
 >   are there features in e-TeX, NTS and friends that are deemed
 >   essential or highly desirable, but are not provided by Omega?

some have been named, like pdf output (though that is already not supported by
TeX but by a variant)

 > 4. Impact on Mainstream Usage:
 >
 > What would be the impact of all this to Joe User who does nothing but
 > read and write English?
 >
 > - Joe User must install new executables in addition to class and style
 >   files when upgrading to LaTeX3.  It is likely that he (or she) won't
 >   notice as contemporary software packaging will hide this detail.

if you can get Joe User to switch, which is one (if not the) important
obstacle. It will also require that this contemporary software is in place and
this is not just the kernel + tools (on the contrary)


 > - Possibly a minor performance hit due to 16 or 32 bit internal
 >   characters.  On the other hand, current LaTeX font handling has some
 >   pretty noticeable overhead in places (\boldsymbol in amsmath, for
 >   example), so if those cases could be handled natively, there may
 >   actually be an overall performance improvement.

I wouldn't be concerned about that.

 > - Could type a math paper without saying Schr\"o\-din\-ger all the
 >   time.

you can do that now.

 > - Won't need to think when receiving a strange .tex file from a friend
 >   in China.

you still do since you wouldn't understand the printed version would you? :-)


 > - Availability of different fonts may increase as they would typically
 >   not need to be VF re-encoded.

i guess this is a red herring. you need to provide support for a font in any
encoding to make it accessible to a typesetting system. it might be easier,
perhaps but in any case it is a onetime effort.

 > 5. Stability Issues:
 >
 > 5.1. User Interface Stability:
 >
 > - Since UTF8 will work with plain ASCII, there should not be any
 >   upgrade problem.  Other font encodings could still be explicitly
 >   specified.

??? what are you talking about now? what do font encodings have to do with this?

 > - It is important to make sure that reasonable old LaTeX files run
 >   without problems (even if the output is not 100% pixel compatible)
 >   to enable users to upgrade easily.

a successor to LaTeX2e will have to make a cut in my opinion or else it will
not be much better. But this means you have to have really good selling
arguments so that people actually use it.

nevertheless one should try to provide some compatibility but probably less
than we tried while switching from 209 to 2e


 > 6. Multilingual Support vs. Other Design Goals of LaTeX3:
 >
 > LaTeX2e works pretty well as an authoring and communication tool for
 > technical and scientific writing.  Areas for intended improvement (very
 > sketchy right now...).
 >
 > - Better Class/Package designer interface.

that will hopefully come with the template design structures now

 > - Better font support???

anything in mind?

 > - Internationalization???

meaning?

 > 7. "Soft Arguments":
 >
 > - Leaving the well-known world of TeX causes fear and uncertainty.  In
 >   particular, it is not clear what precisely should come after TeX,
 >   and there is the danger of obsoleting a lot of past work.

not really, the danger is rather not getting the users in the first place.

 > - Judging from past release schedules, LaTeX will receive a major
 >   upgrade about once every 10 years.  So if we wait until 2014 to get
 >   state-of-the art international support, we may lose a lot of
 >   potential users.

why 2014, shouldn't that be 2002/3?


 > - Basing LaTeX on Omega poses a hen-and-egg problem that will not go
 >   away automagically.  Omega will only become completely stable if
 >   there is unequivocal support from the user interface community
 >   (i.e. the LaTeX people) and LaTeX needs the Omega backend to become
 >   a serious multilingual typesetting system.

true (more or less in my view) which is why I'm seriously concerned about a
divergence of the paths now (and try to prevent it). In my opinion it would
push Omega into a corner without enough users and wouldn't be good for
LaTeX on TeX either. But I think it would be possible to build on identical
principles even if some of the kernel is solved differently at the technical
level. if the outer and inner (the slightly higher) interfaces are
identical then you can keep both groups of users together until it is possible
to switch.

 > - Unicode is currently receiving a lot of attention and publicity.  So
 >   it may be advantageous to ride that wave, in particular as it seems
 >   technically sound.

technically it is, in my opinion, a mess, as it partly has to be, like all
standards that embrace legacy standards beneath them. but that doesn't mean it
isn't the best thing you can get, or that you should avoid it. on the contrary.


 > 8. Summary:
 >
 > - Unicode on TeX (default or optional): Too much of a mess, poor
 >   performance, and probably difficult to get completely right without
 >   invoking TeX as a Turing machine.

i'm not sure that this follows. i think that the LICR could not be unicode as
such (sensibly) but i think one should try to find an LICR that can be
incorporated in an Omega LICR in a way that would be transparent for those
parts of LaTeX which do not deal with features Omega provides alone.

this is what one should try at the current moment in my opinion to avoid too
much non-compatible development.


 > - Unicode on Omega (default but blindly compatible): Seems essentially
 >   the right thing to do (no strong argument against), but still lots
 >   of questions.

again, even if I repeat myself within half a page: I think LaTeX on Omega
should have an LICR which is identical to the LICR of LaTeX on TeX in most
respects. I will expand on that when I comment on Javier's description of how
LaTeX's LICR works and how he wants to see Omega's being built (but not
tonight ...)

good night
frank