LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

From:     Frank Mittelbach <[log in to unmask]>
Reply To: Mailing list for the LaTeX3 project <[log in to unmask]>
Date:     Wed, 21 May 2014 11:31:59 +0200

In my opinion the Unicode consortium has not screwed up (backspace
backspace backspace ...) has not found the best possible solution for
math, and there is no way to *properly* reconcile the two worlds.

Unicode started out as an attempt to codify the plain-text letters of all
languages. One of the most important axioms in that respect was the idea
that a "letter" is an abstract entity, e.g., Latin-small-a, and that
different glyphs in fonts all represent that single entity "a" regardless
of the shape or form it takes. So attributes like bold or serif/sans etc.
are all outside the scope of Unicode encoding.

That makes sense if you try to convey textual meaning: a "word" has a
meaning regardless of whether it is set in italics or bold or both. (Of
course such attributes extend the semantics, e.g., bold may indicate a
heading or italic some emphasis, but underneath that "word" still has a
meaning of its own in the language.)

The problem with math, though, is that symbols in math are traditionally
not just defined by an abstracted shape: the mathematical community early
on used additional attributes of glyphs to convey semantics. So
bold-lowercase-latin-letters may denote vectors, and in one formula an
integral symbol and a bold-integral may have totally different semantics.
On top of that, the semantics may change from field to field or even from
paper to paper (so other than calling it a bold-integral there is no way
to describe such symbols semantically).
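
To make that concrete, here is a minimal LaTeX2e sketch (it uses the bm
package for the bold integral; the formulas themselves are of course only
illustrative):

  \documentclass{article}
  \usepackage{amsmath}
  \usepackage{bm}   % \bm emboldens arbitrary math material
  \begin{document}
  % bold lower-case Latin letters used as vectors: \mathbf{v} is,
  % semantically, a different symbol from the scalar v
  \[ \mathbf{v} = (v_1, v_2, v_3) \]
  % an integral and a bold integral may carry entirely different
  % meanings within one paper; typographically \bm just changes an
  % attribute, semantically it creates a new symbol
  \[ \int f \,d\mu \qquad \bm{\int} f \,d\mu \]
  \end{document}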

The problem with this is that mathematicians have ended up using
effectively any kind of symbol/letter to denote specific semantics, and
long ago started to use all kinds of attributes (which Unicode, on the
plain-text level, regards as irrelevant) to indicate semantics too. The
main point here is that the moment that happens, the attributes become
frozen and symbol+attribute becomes a relevant symbol in its own right.

As a result, to express the language of mathematics Unicode would have
needed to codify every kind of letter/symbol+attribute(s) combination as
an individual code point, which is a difficult if not impossible task.

Nevertheless, they went for this approach to some extent by codifying
mathematical alphabets (mainly digits+a-z+A-Z plus some Greek) and of
course a large number of symbols.
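
These alphabets live in the Mathematical Alphanumeric Symbols block
starting at U+1D400. A minimal sketch of how they can be addressed as
symbols in their own right (this assumes a Unicode engine and a current
unicode-math; the font choice is just an example):

  % a few of the encoded code points:
  %   U+1D400  MATHEMATICAL BOLD CAPITAL A
  %   U+1D49C  MATHEMATICAL SCRIPT CAPITAL A
  %   U+1D504  MATHEMATICAL FRAKTUR CAPITAL A
  \documentclass{article}
  \usepackage{unicode-math}        % requires XeLaTeX or LuaLaTeX
  \setmathfont{Latin Modern Math}
  \begin{document}
  % \symbf, \symscr, \symfrak produce those code points, i.e. encoded
  % symbols rather than just a font change
  \[ \symbf{A} \quad \symscr{A} \quad \symfrak{A} \]
  \end{document}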

In the Unicode book it says:

  The alphabets in this block encode only semantic distinction, but not
  which font will be used to supply the actual plain, script, Fraktur
  [...] Characters from the Mathematical Alphanumeric Symbol block are
  not to be used for nonmathematical styled text.

  All mathematical alphanumeric symbols have compatibility decompositions
  to the base Latin and Greek letters. This does not imply that the use
  of these characters (I guess the base ones - Frank) is discouraged for
  mathematical use. Folding away such distinctions [..] is usually not
  desirable, however, as it loses the semantic distinction for which
  these characters are encoded.

That is all true and sensible, and to explicitly encode that something is
a math-calligraphic S and not just a Latin-S (that happens to be set in
some calligraphic font) is desirable when passing data from one
application to the next, as the font information is likely to be lost and
with it the semantics.

However, it is by no means offering a full codification of mathematical
semantics, so at the end of the day you may end up with a mixture of
"properly" encoded material plus stuff that lost the semantic distinction.

The good part is that it covers a lot, but it is not comprehensive by any
means and can't be, due to the approach chosen.

It reminds me a bit of a talk I heard recently where somebody was
advocating the use of Unicode sub/superscript digits to avoid having to
type _2 or ^3, arguing that this is easier, nicer, and more readable.
Well, to me it isn't, the moment you get to real math, because then it
gets inconsistent and you end up with mixed syntax.
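
A sketch of what that mixture looks like (purely illustrative; Unicode
encodes only a handful of sub/superscript characters, e.g. U+2082
SUBSCRIPT TWO and U+00B3 SUPERSCRIPT THREE):

  H₂O, x³                 % fine while only single digits are involved
  x_{n+1}, e^{2\pi i t}   % anything beyond that falls back to TeX syntax,
                          % so a real document mixes both notations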

For the same reason I believe that it would have been better to approach
math alphabets differently in Unicode and, instead of codifying a few
(with limited letter sets), acknowledge the fact that this "language" has
a meta level where symbol+attribute encodes semantics, and not just the
symbol as such.

Anyway, this is neither here nor there, as this is what Unicode offers nowadays.

So where does it fail?

  - in the case of attributed mathematical symbols, most prominently
bold as offered by the bm package, resulting in new symbols as far as
the semantics are concerned

  - in the case of multi-letter symbols (which require a fixed font,
i.e., frozen attributes, but with kerning for aesthetic reasons)

  - in the case of alphabets which have not been considered (like two
distinct calligraphic alphabets in parallel, or old German script \neq
Fraktur (as my Algebra prof did), or Cyrillic, or ...); see the sketch
after this list

  - in the fact that it does not support diacritics for those alphabets
(a minor case though)
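
For the "two calligraphic alphabets" case, a LaTeX2e sketch (the rsfs
font is a real choice, Ralph Smith's Formal Script; the command name
\mathscrf is made up for the example):

  \documentclass{article}
  % a second script/calligraphic alphabet next to \mathcal
  \DeclareMathAlphabet{\mathscrf}{U}{rsfs}{m}{n}
  \begin{document}
  % two visually and semantically distinct "calligraphic" F's, but
  % Unicode offers only one script/calligraphic alphabet to map them to
  \[ \mathcal{F} \neq \mathscrf{F} \]
  \end{document}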

LaTeX2e's math support codified most of the needs of the mathematics
language, albeit only within its own domain (that is, within the LaTeX
syntax), i.e., it wasn't supporting any Unicode code points for math (as
they didn't exist). So something like \mathbf was defining individual
bold math letters (for which Unicode now has its own code points, as long
as they are basic Latin), but it was also offering this for word-like
symbols such as \mathbf{Set}.
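
In other words, the same LaTeX2e command serves two cases that map
differently to Unicode (fragment only, for illustration):

  \[ \mathbf{x} \cdot \mathbf{y} \]    % single letters: these now have
                                       % their own bold-alphabet code points
  \[ \mathbf{Set} \to \mathbf{Grp} \]  % word-like symbols: no dedicated
                                       % code points, and they need the
                                       % kerning of a real text font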

So if one now maps that to a full-fledged text font that supports
kerning, you lose the code point semantic distinction outside LaTeX; and
if you map it to the Unicode plane, then you have to deal manually with
kerning for multi-letter sequences (which is non-trivial and can't be
perfect) or live with horrible spacing.

Or you need to change the interface in LaTeX and offer different
commands, or you change the internals and distinguish between
single-letter and multi-letter arguments. Or ...
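
One possible shape of such an interface split (roughly what the
unicode-math package offers nowadays; shown only as an illustration, not
as the definitive answer):

  \documentclass{article}
  \usepackage{unicode-math}        % XeLaTeX or LuaLaTeX
  \setmathfont{Latin Modern Math}
  \begin{document}
  \[ \symbf{v} \]     % single-letter symbol: gets the bold math alphabet
                      % code point, so the distinction survives outside TeX
  \[ \mathbf{Set} \]  % word-like symbol: typeset in a kerned (bold) text font
  \end{document}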

frank
