## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


Re: Unicode math

Frank Mittelbach, Wed, 21 May 2014 11:31:59 +0200

In my opinion the Unicode consortium has not screwed up (backspace backspace backspace ...) has not found the best possible solution for math, and there is no way to *properly* reconcile the two worlds.

Unicode started out as an attempt to codify the plain-text letters of all languages. One of the most important axioms in that respect was the idea that a "letter" is an abstract entity, e.g., Latin-small-a, and that the different glyphs in fonts all represent that single entity "a" regardless of the shape or form it takes. So attributes like bold or serif/sans etc. are all outside the scope of the Unicode encoding. That makes sense if you try to convey textual meaning, as "word" has a meaning regardless of being in italics or bold or both. (Of course such attributes extend the semantics, e.g., bold may indicate a heading or italic some emphasis, but underlying that, "word" still has a meaning of its own in a language.)

The problem with math, though, is that symbols in math are traditionally not just defined by an abstract shape; the mathematical community early on used additional attributes of glyphs to convey semantics. So bold lowercase Latin letters may denote vectors, and in one formula an integral symbol and a bold integral may have totally different semantics. On top of that, the semantics may change from field to field or even from paper to paper (so other than calling it a bold-integral there is no way to describe such symbols semantically).

The problem with this is that mathematicians have come up with using effectively any kind of symbol/letter to denote specific semantics, and long ago started to use all kinds of attributes (which Unicode, on the level of plain text, regards as irrelevant) to indicate semantics too.
The main point here then is that the moment that happens, the attributes become frozen and symbol+attribute combinations become relevant symbols in their own right. As a result, to express the language of mathematics Unicode would have needed to codify all kinds of letter/symbol+attribute(s) as individual code points, which is a difficult if not impossible task.

Nevertheless, they went for this approach to some extent by codifying mathematical alphabets (mainly digits + a-z + A-Z plus some Greek) and of course a large number of symbols. In the Unicode book it says:

> The alphabets in this block encode only semantic distinction, but not which font will be used to supply the actual plain, script, Fraktur [...]
>
> Characters from the Mathematical Alphanumeric Symbol block are not to be used for nonmathematical styled text.
>
> All mathematical alphanumeric symbols have compatibility decompositions to the base Latin and Greek letters. This does not imply that the use of these characters (I guess the base ones - Frank) is discouraged for mathematical use. Folding away such distinctions [..] is usually not desirable, however, as it loses the semantic distinction for which these characters are encoded.

That is all true and sensible, and explicitly encoding that something is a math calligraphic S and not just a Latin S (that happens to be set in some calligraphic font) is desirable when passing data from one application to the next, as the font information is likely to be lost and with it the semantics.

However, it by no means offers a full codification of mathematical semantics, so at the end of the day you may end up with a mixture of "properly" encoded material plus stuff that has lost the semantic distinction. The good part is that it covers a lot, but it is not comprehensive by any means and can't be, due to the approach chosen.

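To make the "folding away" remark concrete, here is a minimal Python sketch. It assumes only the standard `unicodedata` module; the helper name `fold_compat` and the choice of the script capital S as a sample are illustrative assumptions, not anything from the Unicode text quoted above.

```python
import unicodedata

# Sample characters (chosen for illustration only).
SCRIPT_S = "\U0001D4AE"  # MATHEMATICAL SCRIPT CAPITAL S
PLAIN_S = "S"            # LATIN CAPITAL LETTER S


def fold_compat(text):
    """Apply compatibility normalization (NFKC) to text.

    NFKC applies compatibility decompositions, so mathematical
    alphanumeric symbols fall back to their base letters.
    """
    return unicodedata.normalize("NFKC", text)


# The two characters are distinct code points with distinct names.
print(unicodedata.name(SCRIPT_S))
print(unicodedata.name(PLAIN_S))

# Canonical normalization (NFC) keeps the math symbol intact ...
print(unicodedata.normalize("NFC", SCRIPT_S) == SCRIPT_S)

# ... while compatibility folding turns it into the plain Latin letter,
# which is exactly the loss of semantic distinction discussed above.
print(fold_compat(SCRIPT_S) == PLAIN_S)
```

Any application along the way that normalizes with a compatibility form (for searching or indexing, say) will therefore silently turn the encoded calligraphic S back into an ordinary S.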
It reminds me a bit of a talk I heard recently where somebody was advocating the use of Unicode sub- and superscript digits to avoid having to type _2 or ^3, arguing that this is easier, nicer and more readable. Well, to me it isn't: the moment you get to real math it becomes inconsistent and you end up with mixed syntax.

For the same reason I believe that it would have been better to approach math alphabets differently in Unicode and, instead of codifying a few (with limited letter sets), acknowledge the fact that this "language" has a meta level where symbol+attribute encodes semantics and not just the symbol as such.

Anyway, this is neither here nor there, as this is what Unicode offers nowadays. So where does it fail?

  - in the case of attributed mathematical symbols, most prominently those using bold as offered by the bm package, which result in new symbols as far as the semantics are concerned
  - in the case of multi-letter symbols (which require a fixed font, i.e., frozen attributes, but with kerning for aesthetic reasons)
  - in the case of alphabets which have not been considered (like two distinct calligraphic alphabets used in parallel, or old German \neq Fraktur (as my Algebra prof did), or Cyrillic, or ...)
  - in not supporting diacritics for those alphabets (a minor case, though)

LaTeX2e's math support codified most of the needs of the language of mathematics, albeit only within its own domain (that is, within the LaTeX syntax), i.e., it wasn't supporting any Unicode code points for math (as they didn't exist).
So something like \mathbf was defining individual bold math letters (for which Unicode now has its own code points, as long as they are basic Latin), but it was also offering this for word-like symbols such as \mathbf{Set}.

So if one now maps that to a full-fledged text font that supports kerning, you lose the code-point semantic distinction outside LaTeX, and if you map it to the Unicode plane then you have to deal manually with kerning for multi-letter sequences (which is non-trivial and can't be perfect) or live with horrible spacing. Or you need to change the interface in LaTeX and offer different commands, or you change the internals and distinguish between single-letter and multi-letter arguments. Or ...

frank
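The second option above (mapping \mathbf{Set} to the Unicode plane) can be sketched at the code-point level as follows. This is a minimal Python illustration only: the function name `to_math_bold` is hypothetical, and it says nothing about how any TeX engine or package actually implements \mathbf.

```python
import unicodedata


def to_math_bold(word):
    """Map ASCII letters and digits to the Mathematical Bold code points.

    The bold alphabet is contiguous: capitals start at U+1D400,
    small letters at U+1D41A and digits at U+1D7CE.
    Anything else is passed through unchanged.
    """
    out = []
    for ch in word:
        if "A" <= ch <= "Z":
            out.append(chr(0x1D400 + ord(ch) - ord("A")))
        elif "a" <= ch <= "z":
            out.append(chr(0x1D41A + ord(ch) - ord("a")))
        elif "0" <= ch <= "9":
            out.append(chr(0x1D7CE + ord(ch) - ord("0")))
        else:
            out.append(ch)
    return "".join(out)


bold_set = to_math_bold("Set")

# The "word" is now three independent math symbols, one per letter.
for ch in bold_set:
    print(f"U+{ord(ch):05X} {unicodedata.name(ch)}")
```

After the mapping, "Set" is just a run of three separate mathematical symbols. The semantic distinction survives outside LaTeX, but the spacing between the letters then has to be handled by whoever typesets the sequence, which is the kerning problem described above.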