LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

Javier Bezos writes:
 > If I say
 >
 >   \begin{mandarin}
 >     \newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
 >   \end{mandarin}
 >
 > how does TeX know that \foo\ was defined in a Mandarin context (including
 > perhaps input encoding information)? And what is expected by the user,
 > that the Chinese char should be considered "conceptual" (thus rendered
 > differently in Japanese and Mandarin) or that the Chinese char must be
 > rendered with the simplified ideogram (i.e., Mandarin vs. Japanese)?
 > What makes that different from, say,
 >   \newcommand{\foo}{\unichar{<Unicode code>}}
 > (without specifying the language)?

Oh, looks like I fell into the eurocentric mind-trap that
"character=glyph"...

So it looks like there are a couple of strategies:

1. Store the full language context with every character token sequence
   along the lines that Javier suggests.  In other words, treat the
   language context as part of the input encoding.  It would seem that
   if Frank's requirement for an ICR ("a single item must have a
   unique and well-defined meaning") is to be met, it would
   essentially mean that every character needs to be tagged for
   language context.

2. Treat input encoding as completely separate from language context.
   Input encoding just determines how to get from an arbitrary
   encoding to the Unicode(-like) ICR.  Thus, switches in the language
   context have to be tagged explicitly by the user.  So the example
   would become

     \begin{utf8-encoding}
       \newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
     \end{utf8-encoding}

   Now I have to say something like \mandarin{\foo} or
   \japanese{\foo}.  Of course, putting the language switch into the
   definition of \foo would be legal, too.

  The main restriction of this approach is that we cannot (easily) do
  something like

     \begin{mandarin}
       \section{<...>}
     \end{mandarin}
     \begin{japanese}
       \section{<...>}
     \end{japanese}

  and expect that the language context is properly preserved in the
  TOC.

  a) Is it reasonable and necessary at all for this example to work,
     i.e. that a TOC or index should mix languages "automatically"?

  b) If the "japanese" in the second example were "english", one
     could simply "stack" language contexts globally.  I.e., below the
     primary language we can have an arbitrary number of working
     languages, which only determine features that languages higher in
     the hierarchy have not explicitly defined (such as rendering of
     glyphs in certain Unicode regions).  So only where the choices
     conflict (japanese vs. mandarin, for example) do we need local
     mark-up:

     \section{\japanese{<...>}}

3. Extreme version of 2 (the only strategy that seems to be cleanly
   implementable on current Omega):

   We simply define the \InputTranslation to be fixed on a per-file
   basis.  In other words, we acknowledge that it does not make any
   sense in terms of usability to mix input encodings, as such files
   simply cannot (and should not) be displayed cleanly in any editor.
   So preparing multi-encoded text must follow one of these
   options:

   a) Split text into several files.  (Useful for blocks of original
      source that are not subject to frequent modification.)

   b) Use UTF-8 and rely on the editor for encoding translation during
      import.  (For example, the Emacs command insert-file-contents
      can do coding translation; we should also expect that
      drag-and-drop protocols of various windowing systems will
      eventually be able to do this properly.)

   c) For legacy source, the functionality of current inputenc could
      be provided independently of the particular ICR.
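
To make the TOC restriction in strategy 2 concrete, here is a minimal
sketch in plain LaTeX.  Every name in it is a hypothetical stand-in and
not part of any existing package: \currentlang holds the language
context, \ai plays the role of the Chinese character whose rendering
depends on that context, the mandarin environment switches the context
for a block, and \japanese tags a single argument.  It only prints the
context name in ASCII; it does not select real glyphs.

```latex
\documentclass{article}

% Hypothetical language context, readable by "characters" at typeset time.
\newcommand{\currentlang}{default}

% Stand-in for a Han character: it shows which context it is rendered in.
% Declared robust so it reaches the .toc file unexpanded.
\DeclareRobustCommand{\ai}{ai(\currentlang)}

% Local mark-up: a robust tag that travels with its argument.
\DeclareRobustCommand{\japanese}[1]{{\def\currentlang{japanese}#1}}

% Block mark-up: the context lives in the surrounding group only.
\newenvironment{mandarin}{\def\currentlang{mandarin}}{}

\begin{document}
\tableofcontents

% The tag is written to the .toc file together with the title,
% so the TOC entry is typeset in the japanese context again.
\section{\japanese{\ai}}

% Only the bare title is written to the .toc file; the TOC entry
% is typeset in whatever context \tableofcontents runs in.
\begin{mandarin}
\section{\ai}
\end{mandarin}

\end{document}
```

After two LaTeX runs, the second heading should read "ai(mandarin)" in
the body but "ai(default)" in the TOC, while the first one should keep
"ai(japanese)" in both places.  That is the reason for asking whether
local mark-up inside \section is an acceptable price.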

--Marcel