Date: Sun, 27 May 2001 13:12:21 +0200
Javier Bezos writes:
> If I say
>
> \begin{mandarin}
> \newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
> \end{mandarin}
>
> how does TeX know that \foo\ was defined in a Mandarin context (including
> perhaps input encoding information)? And what is expected by the user,
> that the Chinese char should be considered "conceptual" (thus rendered
> differently in Japanese and Mandarin) or that the Chinese char must be
> rendered with the simplified ideogram (i.e., Mandarin vs. Japanese)?
> What makes that different from, say,
> \newcommand{\foo}{\unichar{<Unicode code>}}
> (without specifying the language)?
Oh, it looks like I fell into the eurocentric mind-trap of assuming
that "character = glyph"...
So it looks like there are a couple of strategies:
1. Store the full language context with every character token
sequence, along the lines that Javier suggests. In other words,
treat the language context as part of the input encoding. If
Frank's requirement for an ICR ("a single item must have a unique
and well-defined meaning") is to be met, this essentially means
that every character needs to be tagged with its language context.
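To sketch what this would mean in practice (\langchar is pure
invention; \selectlanguage is babel's, \symbol is base LaTeX's, and
a "mandarin" language option is assumed to exist): every code point
would travel with the language tag that was active when it was
read, so the pair survives round-trips through the .aux and .toc
files.
\newcommand{\langchar}[2]{% #1 = language tag, #2 = code point
\begingroup\selectlanguage{#1}\symbol{#2}\endgroup}
% the ai character read inside \begin{mandarin} would then
% effectively tokenize as
\langchar{mandarin}{<Unicode code for Chinese ai>}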
2. Treat input encoding as completely separate from language context.
Input encoding just determines how to get from an arbitrary
encoding to the Unicode(-like) ICR. Thus, switches in the language
context have to be tagged explicitly by the user. So the example
would become
\begin{utf8-encoding}
\newcommand{\foo}{<Unicode char corresponding to Chinese ai>}
\end{utf8-encoding}
Now I have to say something like \mandarin{\foo} or
\japanese{\foo}. Of course, putting the language switch into the
definition of \foo would be legal, too.
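That is, either of the following would do:
\mandarin{\foo}                         % switch at the call site
\newcommand{\foo}{\mandarin{<...>}}     % or inside the definition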
The main restriction of this approach is that we cannot (easily) do
something like
\begin{mandarin}
\section{<...>}
\end{mandarin}
\begin{japanese}
\section{<...>}
\end{japanese}
and expect that the language context is properly preserved in the
TOC.
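(The reason: \section writes its title to the .toc file, but the
surrounding environment does not travel with it. Schematically, the
.toc ends up containing
\contentsline{section}{<...>}{1}
\contentsline{section}{<...>}{2}
and both lines are typeset in whatever language context happens to
be current when \tableofcontents reads the file back in.)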
a) Is it reasonable and necessary at all for this example to work,
i.e. that a TOC or index should mix languages "automatically"?
b) If the "japanese" in the second example were "english", one
could simply "stack" language contexts globally. I.e., below the
primary language we can have an arbitrary number of working
languages, which only determine features that languages higher in
the hierarchy have not explicitly defined (such as the rendering
of glyphs in certain Unicode ranges). So only where there are
conflicting choices (japanese vs. mandarin, for example) do we
need local mark-up:
\section{\japanese{<...>}}
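A sketch of how such stacking might be declared (both command
names are made up for illustration; nothing like them exists yet):
\setprimarylanguage{english}  % settles every feature it defines
\addworkinglanguage{japanese} % consulted only for decisions that
                              % english leaves open, e.g. glyph
                              % rendering in the CJK code range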
3. An extreme version of 2 (the only strategy that seems to be cleanly
implementable on current Omega):
We simply define the \InputTranslation to be fixed on a per-file
basis (a concrete sketch follows the list below). In other words,
we acknowledge that it makes no sense in terms of usability to mix
input encodings, as such files simply cannot (and should not) be
displayed cleanly in any editor. So preparing multi-encoded text
must proceed along one of the following options:
a) Split the text into several files. (Useful for blocks of original
source that are not subject to frequent modification.)
b) Use UTF-8 and rely on the editor for encoding translation during
import. (For example, the Emacs command insert-file-contents
can do coding translation; we should also expect that the
drag-and-drop protocols of various windowing systems will
eventually be able to do this properly.)
c) For legacy source, the functionality of the current inputenc could
be provided independently of the particular ICR.
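For concreteness, the per-file set-up of 3 might look like this in
Omega (if I remember correctly, inutf8.ocp is the stock UTF-8 input
OCP; the name \OCPutf is mine):
\ocp\OCPutf=inutf8             % load the UTF-8 input OCP
\InputTranslation currentfile \OCPutf
% From here to the end of *this* file, bytes are read as UTF-8 and
% mapped into the ICR; a file pulled in via \input can declare a
% different (e.g. legacy 8-bit) translation at its own top.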
--Marcel