LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Re: Multilingual Encodings Summary 2.2

Lars Hellström

Sun, 13 May 2001 23:46:03 +0200

At 20.55 +0200 2001-05-13, Javier Bezos wrote:
>>>And
>>>regarding font transformation, they should be handled by fonts, but
>>>the main problem is that metric information (ie, tfm) cannot be
>>>modified from within TeX, except a few parameters; I really wonder
>>>if allowing more changes, mainly ligatures, is feasible (that
>>>solution would be better than font ocp's and vf's, I think).
>>
>> I don't understand this. What kind of font transformations are you
>> referring to?
>
>For example, removing the fi ligature in Turkish. Or using an alternate
>orthography in languages with contextual analysis.

That doesn't seem like metric transformations to me, but more like
exchanging some sequences of slots with others. I thought Omega employed
OCPs (which could be selected by the document) for this?
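
To illustrate the "exchanging sequences of slots" view, here is a minimal
Python sketch. It is not how Omega's OCPs are implemented; the slot number
FI_LIGATURE, the two rule tables and the function exchange_sequences are all
hypothetical names made up for this example. The point is only that the
Turkish case can be modelled as selecting a different rule table, without
touching any metric information.

```python
# Hypothetical slot numbers, chosen only for illustration.
F, I = ord("f"), ord("i")
FI_LIGATURE = 0x1C  # assumed slot of the fi ligature in some font encoding

DEFAULT_RULES = {(F, I): (FI_LIGATURE,)}
TURKISH_RULES = {}  # the fi rule is simply left out for Turkish


def exchange_sequences(slots, rules):
    """Replace each matching slot sequence by its target, longest source first."""
    ordered = sorted(rules.items(), key=lambda rule: len(rule[0]), reverse=True)
    out = []
    pos = 0
    while pos < len(slots):
        for source, target in ordered:
            if source and tuple(slots[pos:pos + len(source)]) == source:
                out.extend(target)
                pos += len(source)
                break
        else:
            # No rule matched here: copy the slot through unchanged.
            out.append(slots[pos])
            pos += 1
    return out


word = [ord(c) for c in "fil"]
print(exchange_sequences(word, DEFAULT_RULES))  # f+i replaced by one slot
print(exchange_sequences(word, TURKISH_RULES))  # slots left as they were
```

Under these assumptions the document would pick the rule table by language,
which is roughly the role a document-selectable OCP would play.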

>>>Semantically or visually?
>>
>> I suspect Frank considers meaning to be a semantic concept, not a visual.
>
>I also suspect that, but then if we pick a char it will be
>undefined visually and its rendering (and TeX is essentially about
>rendering) will need _always_ additional information about the context
>(example: traditional ideograms in Japanese vs. simplified ones in
>Chinese).

I think the following quote from the Unicode standard (p. 261) answers that:

  There is some concern that unifying Han characters may lead to confusion
  because they are sometimes used differently by the various East Asian
  languages. Computationally, Han character unification presents no more
  difficulty than employing a single Latin character set that is used to
  write languages as different as English and French.

If they are not different in Unicode then there probably is no reason to
make them different in LaTeX either.

>Further, by doing so we are creating again a closed system
>using its own conventions with no links with external tools adapted
>to Unicode. I will be able to process a file and extract information
>from it with, say, Python very easily if they use a known representation
>(iso encodings or Unicode), but if we have to parse things like \japaneseai
>or similar, things become more difficult.

Agreed.
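
As a small illustration of the point about external tools, the Python sketch
below reports which scripts occur in a piece of text, using only the standard
unicodedata module. The function name scripts_used and the trick of taking the
first word of each character name as a rough script label are assumptions made
for this example. With a private notation such as \japaneseai the same tool
would see nothing but Latin letters unless it also carried a table of macro
names.

```python
import unicodedata


def scripts_used(text):
    """Return rough script labels for the letters in text.

    The label is the first word of the Unicode character name, which is
    only an approximation of the script, good enough for a sketch.
    """
    found = set()
    for ch in text:
        if ch.isalpha():
            found.add(unicodedata.name(ch, "UNKNOWN").split()[0])
    return found


print(scripts_used("ai \u3042\u3044"))       # Latin letters plus hiragana
print(scripts_used("\\japaneseai"))          # only Latin letters are visible
```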

>  I think it's a lot easier
>moving information with blocks of text and not with single chars.

Depends on what type of information it is. For information specifying the
language almost certainly yes. If you want to move around information
saying "the 8-bit characters in this piece of text should be interpreted
according to the following input encoding" then I would say no (amongst
other things because it would constitute a representation not known to
other programs).

>I don't understand why we cannot determine the current language
>context--either I'm missing something or I'm very optimistic about
>the capabilities of TeX. Please, could you give an example where
>the current language cannot be determined and/or moved?

It's not my area of expertise, so I may be wrong, but I suspect there are
well-known examples. The problem is mainly that the current context is a
rather fuzzy concept; there are other aspects of it than the language.
Thoroughly thinking things through might however well produce a model where
the current context is easy to determine and pass around.

>>>> But such characters (the Spanish as well as the Hebrew) aren't allowed in
>>>> names in LaTeX!
>>>
>>>But they should be allowed in the future if we want a true
>>>multilingual environment.
>>
>> Why? They are not part of any text, but part of the markup!
>
>Are you suggesting that Japanese, Chinese, Tibetans, Arabs,
>Persians, Greeks, Russians, etc. must use the Latin alphabet *always*?
>That's not truly multilingual--maybe of interest for Occidental
>scholars, but not for people actually using these scripts and
>keyboards with these scripts.

Good point! I hadn't thought of that.

>(Particularly messy is mixing
>right to left scripts with Latin.)

Because of limitations in the editors or because of something else?

>> Isn't the \char primitive in Omega able to produce arbitrary characters
>> (at least arbitrary characters in the basic multilingual plane)?
>
>Not exactly. The \char primitive is a char, but not intrinsically
>Unicode--ocp's are also applied to \char (and therefore they are
>transcoded).

Why should there exist characters which are not encoded using Unicode en
route from the mouth to the stomach, if we're anyway using Unicode for e.g.
hyphenation?

>> It looks quite reasonable to me, and it is certainly much better than the
>> processing depicted in the example. Does this mean that the example should
>> rather be
>>
>>     A     B        C          D        E
>>    \'e   \'e   e^^^^0301   ^^^^00e9   ^^e9
>
>As currently implemented, yes, it should.

Good, then we've straightened that out! Now what about the other example
line (explicit "82 from column A)?

>I'm still not sure if normalizing
>in this way is the best solution. However, I find the arguments in the
>Unicode book in favour of it quite convincing.

Exactly in what way normalization should be applied and when clearly needs
further study.
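
For reference, columns C, D and E of the example line can be reproduced with
Unicode normalization in Python. This is only a sketch of the character-level
steps, not of Omega's processing: the function names are hypothetical, and
column E is assumed to be an 8-bit encoding in which the e with acute accent
sits at position E9 (hexadecimal), as it does in Latin-1.

```python
import unicodedata


def compose(text):
    """Column C to column D: combine base letter and accent (NFC)."""
    return unicodedata.normalize("NFC", text)


def decompose(text):
    """Column D back to column C: split into base letter and accent (NFD)."""
    return unicodedata.normalize("NFD", text)


def to_8bit(text):
    """Column D to column E, assuming a Latin-1-like 8-bit encoding."""
    return text.encode("latin-1")


column_c = "e\u0301"           # e^^^^0301
column_d = compose(column_c)   # ^^^^00e9
column_e = to_8bit(column_d)   # ^^e9

print([hex(ord(ch)) for ch in column_c])
print([hex(ord(ch)) for ch in column_d])
print(column_e.hex())
```

Whether the composed or the decomposed form should be the internal one is
exactly the open question; the sketch only shows that both directions are
mechanical once the text is Unicode.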

Lars Hellström
