LISTSERV mailing list manager LISTSERV 16.0

LATEX-L Archives

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

LATEX-L February 2001

Subject: Re: Multilingual Encodings Summary
From: Frank Mittelbach <[log in to unmask]>
Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>
Date: Tue, 13 Feb 2001 23:51:05 +0100
Content-Type: text/plain
Parts/Attachments: text/plain (452 lines)

Marcel,

 > Hi, the messages on the list over the last couple of days have been
 > pretty encouraging

have they? 10 people left the list :-)

 > The following is an extended summary of the discussion (clearly
 > biased). I encourage everybody to review, change, and extend this

thanks. it is good to get things back into focus even though I don't agree
with a lot of the statements, so here are my views, or rather my comments on
the individual arguments, put together.

 > It's important that we don't keep iterating over the same things,
 > but rather build a solid base of arguments and clarify the design
 > goals.

right


 > --------------------------------------------------------------------
 > 1. Input Encoding and User Interface:
 >
 > 1.1. Current State:
 >
 > Currently, it is difficult to enter non-English or multilingual
 > scripts. Users can either provide an ASCII input file, or select an
 > input encoding. While it is currently possible to produce high
 > quality print in many scripts, there are serious usability problems.

agreed, though it depends on the language. Most Latin-based languages do not
have usability problems with respect to input encodings.

 > - Typing ASCII can be very tedious, and makes it hard to proofread the
 > .tex file.

yes, for many scripts; no for most Latin-based ones.

 > Portability is good in theory, but since nothing works
 > out of the box, it can be a pain in practice.

for those languages that LaTeX currently does support (i consider this set a
subset of the languages supported by Babel) portability is not theory but
practice, and for those it does work out of the box. so either be more precise
or remove the second half, since "nothing" is clearly wrong

(heh British and American do work perfectly :-)

 > - Setting an input encoding may work well for some languages.

yes
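
For concreteness, selecting an input encoding in LaTeX2e is a one-line declaration (latin1 here is just an illustrative choice; the source file would then be saved in ISO 8859-1):

```latex
\documentclass{article}
% tell LaTeX the source file is saved in ISO 8859-1 (Latin-1);
% inputenc maps each 8-bit input character to the internal LICR form
\usepackage[latin1]{inputenc}
% T1 font encoding, so accented letters are single glyphs in the font
\usepackage[T1]{fontenc}
\begin{document}
na\"{\i}ve  % or, with the line above, typed directly as 8-bit text
\end{document}
```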

 > However, it's not a solution for multilingual work (unless, for
 > example, UTF8 is the chosen input encoding), few scripts are well
 > supported (even something as simple as ISO-8859-7 for Greek requires
 > fishing on the net to make it work).

the latter is a bad argument, because whether something has or has not been
done, or is or is not officially part of LaTeX, has nothing to do with the
question of whether the underlying concepts of a system would be sufficient to
support it properly.

 > - In both cases diagnostic messages can be confusing to the point of
 > being useless.

this has nothing really to do with input encodings.


 > 1.2. The Case for UTF8 as Default Input Encoding:
 >
 > There is a good summary at http://www.cl.cam.ac.uk/~mgk25/unicode.html
 > which does not need to be repeated in detail here. Basic points:
 >
 > - All the ASCII characters have their usual position in UTF8. In
 > other words, current ASCII .tex files would continue to work without
 > anybody noticing.

yes
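
For reference, the reason ASCII files survive unchanged is visible in the UTF8 byte layout (a factual sketch of the encoding scheme):

```
U+0000  - U+007F     0xxxxxxx                              <- ASCII, one byte, unchanged
U+0080  - U+07FF     110xxxxx 10xxxxxx
U+0800  - U+FFFF     1110xxxx 10xxxxxx 10xxxxxx
U+10000 - U+10FFFF   11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
```

every byte of a multi-byte sequence has its high bit set, so 7-bit ASCII bytes can never be mistaken for part of an encoded non-ASCII character.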

 > - UTF8 encodes Unicode which covers virtually all scripts.

including Klingon (however that is spelled)


 > - On all major platforms, support for editing and displaying UTF8
 > either exists or is currently moving into mass deployment. Major
 > programming languages have UTF8 libraries, so the basic
 > infrastructure for UTF8 is or will be in place shortly.

remains to be seen. in the long term most likely yes, but how many of the
people on this list can easily (in their favorite editing system) edit or
generate a utf8 encoded file? hands up?

 > - Diagnostic messages could (although not with current TeX engine) be
 > output in the correct script. This would be a major improvement for
 > users. (Is actually more related to the internal encoding, see
 > below.)

again, absolutely nothing to do with input encoding


 > 1.3. Existing Implementations:
 >
 > - There is an implementation for UTF8 input on a TeX engine (xmltex by
 > David Carlisle) that also uses UTF8 as the internal representation.
 >
 > - There also exists a UTF8 option for the inputenc package (more
 > info???).

        http://www.unruh.de/DniQ/latex/unicode/ ,
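
assuming such a package follows the usual inputenc option convention, usage would look roughly like the sketch below (the \DeclareUnicodeCharacter declaration is taken from that style of interface; the exact command name may differ between implementations):

```latex
% select UTF8 as the input encoding of the source file
\usepackage[utf8]{inputenc}
% map a Unicode code point to its LICR form, here U+00E9 ("é");
% the declaration command shown is illustrative of the interface style
\DeclareUnicodeCharacter{00E9}{\'e}
```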

 > - The "combining characters" of Unicode are difficult to handle with a
 > TeX based parser. (Does "difficult" mean "impossible to get
 > right"??? What are the issues???)

David commented on this. it is technically difficult in the sense that, when
using TeX, none of the tokenisation methods of the TeX parser can be used and
a completely different module would be needed. Not at all impossible in the
abstract, but it means you have to replace 98% of the CTAN code related to
LaTeX with newly written code.

 > - TeX based parsers may not handle input errors gracefully (i.e. give
 > meaningful error messages). (Can someone confirm or correct
 > this???)

yes and no. there is no clear cut here; it depends a lot on the effort you put
into them. in reality, probably yes.

 > - Using UTF8 on TeX internally gives a performance hit too big to
 > justify as a default. (Does this apply to the UTF8 inputenc package
 > as well???)

depends a lot on whether or not you mean full utf8 or only utf8 without
character combinations

 > - There is Omega as a native Unicode implementation of TeX. More
 > below.

which doesn't exist on all platforms, and not on most commercial implementations


 > 2. Internal Representation and Output Encoding:
 >
 > 2.1. Problems with Current TeX:
 >
 > It has been remarked that TeX does not really have an "internal
 > representation". Rather, TeX keeps text as a string of ASCII
 > characters that are re-parsed through the one-and-only TeX parser
 > whenever something is to be done with it. (TeX gurus: is this
 > simplistic statement essentially correct???)

no (not correct I mean) but I guess this has been discussed by now


 > This leads to a number of problems.
 >
 > - A sufficiently general internal multilingual representation may be
 > impossible to maintain, unless it is Unicode in disguise.

the statement doesn't follow from the argument above (even if that argument
were true, or were replaced by a more precise statement of TeX's inner
workings).

but essentially yes, since the following statement seems to me true for *any*
system that tries to work with multilingual (or even monolingual) data:

any such system needs to be able to identify character data in a sufficiently
precise way, which eventually leads to some sort of indexing of character data.
So the more scripts/languages you want to be able to manipulate, the more you
have to be able to encode, which automatically leads to a system that has the
set size of Unicode and which automatically is something that can be converted
to and from Unicode without loss of information (in theory)


 > - Hyphenation patterns are specified in terms of the output encoding.
 > This means that every character appearing in the hyphenation rules
 > must have a physical slot in the selected font.

only in the internal storage format for patterns used within TeX. On the
abstract level this is not at all true, even though the source format of
existing patterns tends to be written in this form as well.

 > However, logically
 > hyphenation should not depend on output encoding, and one should be
 > able to mix fonts with different output encodings without losing
 > correct hyphenation.

yes, and it is possible without technical problems (in theory)


 > - It is rather hard to make a new font available under LaTeX.
 > Essentially one must create a virtual font which has all the
 > character slots in the places where hyphenation expects them to be.

wrong.


 > - TeX diagnostic messages output the "internal representation", which
 > can quickly become unreadable for scripts that are not essentially
 > ASCII.

which diagnostics are we talking about here? some of them are in the font
encoding (which is not the LICR at all)

 > - The output encoding is limited to 8 bit fonts, which may not be
 > enough to get correct kerning for some languages. (Can someone
 > confirm or correct this???)

true in some cases.


 > 2.2. How Omega Separates Internal an Output Encoding:
 >
 > (The following is stolen from Javier Bezos)

what Javier describes is not how Omega does it but how he suggests it should
do it. I would like to discuss that separately as it is getting complex

 > 2.3. Further Issues:
 >
 > - Even with Unicode internally, one probably still needs what is
 > currently used exclusively, namely to have named symbols and other
 > complex objects.

yes and not only for some sort of backwards compatibility

 > This may be fine as long as these don't need
 > hyphenation and nongeneric kerning.

it is not clear to me that this needs to be a contradiction


 > - How are combining characters handled, in particular when they
 > represent a glyph that has also its own Unicode slot? The main
 > issue is hyphenation. How do Unicode capable word processors handle
 > this?

if they have a Unicode slot then they are supposed to represent the same
character; in that case, for the sake of processing ease, I would vote for
replacing them on input with the single-slot representation, to ensure that
the internal representation of whatever system only has to deal with one
possible form per character.

 > - Unicode is still changing, especially with respect to math
 > characters. Does this prevent us from getting correct basic
 > infrastructure in place?

no

 > - Requirements for non-European scripts that have not been adequately
 > addressed?

who knows?


 > 3. Alternative Engines:
 >
 > As explained above, the TeX engine has limited capabilities for
 > multilingual typesetting and requires some rather awkward workarounds
 > for non-English languages. Omega with its internal Unicode
 > representation is certainly an alternative. What is the current state
 > of Omega, what are potential problems, and are there other
 > possibilities?
 >
 > - It appears that Omega uses a 16 bit internal representation. Is
 > this a restriction that may cause problems later, when someone finds
 > needed glyphs are outside the 16 bit range?
 >
 > - What is the general state of Omega's TeX compatibility? For
 > example, would LaTeX and a majority of third party packages run
 > unchanged on top of Omega (with or without full Unicode support)?

that depends very very much on whether the omega "high-level format
development" takes a path which diverges very much from the one that LaTeX
(based on TeX) takes. I'm not talking about internal implementations here but
rather fundamental changes, for example, in basics for language or font
support which would then conflict with many packages.


 > - If the engine is under discussion, the new engine should be able to
 > provide long-time stability comparable to TeX. So is the basic
 > infrastructure that Omega provides considered solid and general
 > enough for its purpose?

depends on what you are targeting. For the needs of some people, yes, and that
already for a long time. with respect to the base of LaTeX users I fear no;
not yet at least.


 > - Would the decision to move beyond TeX cause a feature explosion in
 > the engine that would be difficult to control?

feature changes or additions in the engine are dangerous to one part of LaTeX
usage, the exchangeability of documents; this will be true for a successor of
LaTeX2e just as it is now. A successor to LaTeX might be able to pull off a
change of engines, but it will rely on a stable system from then on, and it
will need a system supported on all major (and many minor) platforms

 > On the other hand,
 > are there features in e-TEX, NTS and friends that are deemed
 > essential or highly desirable, but are not provided by Omega?

some have been named, like pdf output (though that is already not supported by
TeX itself but by a variant)

 > 4. Impact on Mainstream Usage:
 >
 > What would be the impact of all this to Joe User who does nothing but
 > read and write English?
 >
 > - Joe User must install new executables in addition to class and style
 > files when upgrading to LaTeX3. It is likely that he (or she) won't
 > notice as contemporary software packaging will hide this detail.

if you can get Joe User to switch, which is one (if not the) important
obstacle. It will also require that this contemporary software is in place,
and this is not just the kernel + tools (on the contrary)


 > - Possibly a minor performance hit due to 16 or 32 bit internal
 > characters. On the other hand, current LaTeX font handling has some
 > pretty noticeable overhead in places (\boldsymbol in amsmath, for
 > example), so if those cases could be handled natively, there may
 > actually be an overall performance improvement.

I wouldn't be concerned about that.

 > - Could type a math paper without saying Schr\"o\-din\-ger all the
 > time.

you can do that now.
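
indeed: with the T1 font encoding, \"o is a single glyph in the font, so TeX's hyphenation patterns apply across it and the explicit discretionary hyphens become unnecessary:

```latex
\usepackage[T1]{fontenc}
% ...
Schr\"odinger  % hyphenates automatically; no \- needed
% and with an 8-bit input encoding one can even type: Schrödinger
```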

 > - Won't need to think when receiving a strange .tex file from a friend
 > in China.

you still do, since you wouldn't understand the printed version, would you? :-)


 > - Availability of different fonts may increase as they would typically
 > not need to be VF re-encoded.

i guess this is a red herring. you need to provide support for a font in any
encoding to make it accessible to a typesetting system. it might be easier,
perhaps, but in any case it is a one-time effort.

 > 5. Stability Issues:
 >
 > 5.1. User Interface Stability:
 >
 > - Since UTF8 will work with plain ASCII, there should not be any
 > upgrade problem. Other font encodings could still be explicitly
 > specified.

??? what are you talking about now? what do font encodings have to do with this?

 > - It is important to make sure that reasonable old LaTeX files run
 > without problems (even if the output is not 100% pixel compatible)
 > to enable users to upgrade easily.

a successor to LaTeX2e will have to make a cut, in my opinion, or else it will
not be much better. But this means you have to have really good selling
arguments so that people actually use it.

nevertheless one should try to provide some compatibility, but probably less
than we tried while switching from 2.09 to 2e


 > 6. Multilingual Support vs. Other Design Goals of LaTeX3:
 >
 > LaTeX2e works pretty well as an authoring and communication tool for
 > technical and scientific writing. Areas for intended improvement (very
 > sketchy right now...).
 >
 > - Better Class/Package designer interface.

that will hopefully come with the template design structures now

 > - Better font support???

anything in mind?

 > - Internationalization???

meaning?

 > 7. "Soft Arguments":
 >
 > - Leaving the well-known world of TeX causes fear and uncertainty. In
 > particular, it is not clear what precisely should come after TeX,
 > and there is the danger of obsoleting a lot of past work.

no, really the problem is rather not getting the users in the first place.

 > - Judging from past release schedules, LaTeX will receive a major
 > upgrade about once every 10 years. So if we wait until 2014 to get
 > state-of-the art international support, we may lose a lot of
 > potential users.

why 2014? shouldn't that be 2002/3?


 > - Basing LaTeX on Omega poses a hen-and-egg problem that will not go
 > away automagically. Omega will only become completely stable if
 > there is unequivocal support from the user interface community
 > (i.e. the LaTeX people) and LaTeX needs the Omega backend to become
 > a serious multilingual typesetting system.

true (more or less, in my view), which is why I'm seriously concerned about a
divergence of the paths now (and try to prevent it). In my opinion it would
back Omega into a corner, not getting enough users, and wouldn't be good for
LaTeX on TeX either. But I think it would be possible to build on identical
principles even if some of the kernel is technically solved differently.
however, if the outer and inner (the slightly higher) interfaces are identical
then you can bring both groups of users together until it is possible to
switch.

 > - Unicode is currently receiving a lot of attention and publicity. So
 > it may be advantageous to ride that wave, in particular as it seems
 > technically sound.

it is, in my opinion, technically a mess, as with all standards that embrace
legacy standards below it partly has to be. but that doesn't mean it isn't the
best thing you can get, or that it is something you should avoid. on the
contrary.


 > 8. Summary:
 >
 > - Unicode on TeX (default or optional): Too much of a mess, poor
 > performance, and probably difficult to get completely right without
 > invoking TeX as a Turing machine.

i'm not sure that this follows. i think that the LICR could not (sensibly) be
unicode as such, but i think one should try to find an LICR that can be
incorporated into an Omega LICR in a way that would be transparent to those
parts of LaTeX which do not deal with features that only Omega provides.

this is what one should try at the moment, in my opinion, to avoid too much
incompatible development.


 > - Unicode on Omega (default but blindly compatible): Seems essentially
 > the right thing to do (no strong argument against), but still lots
 > of questions.

again, even if I repeat myself within half a page: I think LaTeX on Omega
should have an LICR which is identical to the LICR of LaTeX on TeX in most
respects. I will expand on that when I comment on Javier's description of how
LaTeX's LICR works and how he wants to see Omega's built (but not
tonight ...)

good night
frank
