## LATEX-L@LISTSERV.UNI-HEIDELBERG.DE


\iffalse

this is coming back to the discussions we had on this list early in
the year. what you see below is a paper which i gave in March in Japan
on a language model for LaTeX.

frank

ps the paper is more than 500 lines so it is split as this list
doesn't allow such long mails

\fi

%
% Copyright 1997 Frank Mittelbach, Chris Rowley
%

\documentclass[a4paper]{article}

\typeout{********************************************}
\typeout{** two pagebreaks hardcoded which might need removal}
\typeout{********************************************}
\flushbottom

\usepackage{shortvrb}  \MakeShortVerb{\|}
\usepackage{epic}

\begin{document}

\title{Language Information in
Structured Documents:\\
A Model for Mark-up and Rendering\thanks
{This paper was originally given at the
Multilingual Information Processing symposium, March 1997,
Tsukuba, Japan.}
}

\date{}

\maketitle

%\tableofcontents

\section{Introduction}

This paper discusses the structure and processing of multi-lingual
documents, both at a general level and in relation to a proposed
extension to the (no longer so new) standard \LaTeX.  Both in general
and in the particular case of this proposal, our work would be
impossible without the enormous support, both practical and moral, we
get from our fellow members of the \LaTeX3 project
team\footnote{Current \LaTeX3 project team members are Johannes Braams
(NL), David Carlisle (UK), Michael Downes (USA), Alan Jeffrey (UK) and
Rainer Sch\"opf (DE).} (who maintain and enhance \LaTeX) and from
people all over the world who contribute to the development of
\LaTeX{} with their suggestions and comments.

The paper starts by examining the language structure of documents and
from this a language tag model for \LaTeX{} is developed.  It then
discusses the relationship between language and document formatting
and the types of actions needed at a change of language.  This will
lead to a model that supports the specification of these actions and
of their association with the tag structure in the abstract document.

The model is then extended to provide the necessary support for
regions that have their own visual context or that receive content
from other parts of the document, thus breaking the basic tree
structure of an abstract document---this is in
section~\ref{sec:moving}.

Finally a high level summary of the required interfaces is given. A
full formal specification, to be used for a prototype implementation
in \LaTeX{}, is currently under development---a first public test
implementation is expected to exist for the 1997/12/01 release of
\LaTeX.

If you are interested in the issues raised in this paper or in other
aspects of our work to enhance \LaTeX, please join the project's electronic
discussion list. To do this, please send a message to:
\begin{quote}
\end{quote}
Containing this line:
\begin{quote}
\end{quote}

\newpage

\section{Language Structure of Documents}

Structured documents can be understood as being explicitly or implicitly
labeled with ``language tags'' denoting that ``a portion of the
document contains data written in a certain language''.

These tags have the following properties:
\begin{itemize}
\item They impose on the document a hierarchical tree structure that
may not be compatible with that document's other logical structure,
e.g., there might be a language change in the middle of a logical
element such as a list item.\footnote{However, for practical
purposes it is normally possible and acceptable to artificially
force the structure imposed by the language tags into the logical
hierarchy imposed by other tags.}
\item
At any one point in the document the ``current language'' can be
determined.
\end{itemize}

The term ``language'' in this context is somewhat vague and might
need further qualification; but for the purpose of the following
discussion it is sufficient to define it as a `label' whose value
affects certain aspects of formatting.

\subsection{Hierarchy of language tags}

The structure created by attaching such language tags to the text can
be considered to be of varying complexity. The simplest case would be
to regard this as a flat structure: for each point in the document
only a ``current'' language is defined, disregarding the fact that
certain language segments can be considered to be embedded within
others. This model of language within documents is, for example,
employed within the current Babel system where, by default, all
language changes are in this sense global.

In a more complex model each area has a ``current'' language but may
be embedded within a nest of larger areas, each in its own language.
In such a model, a change of language has a different quality, and
therefore may invoke different formatting changes, depending on the
level in the hierarchy at which it occurs.

Our investigations lead us to conclude that, to properly render a
document, one needs a combination of both models:
\begin{itemize}
\item the concept of a base language for very large portions of a text
(for most documents there will in fact be only one such language for
the full text): this has a flat structure, so there is only one base
language at any point in the text;
\item the concept of embedded language segments: these are nestable (to
any number of levels) and are used for relatively small-scale
insertions within a base language, such as quotations or names.
\end{itemize}

\pagebreak

\subsection{Language tag (visual) structure}
\label{sec:visual}

In addition to the nesting structure of language tags, there is a more
visual component that influences rendering of a document: the
paragraph structure.  To properly model this typographical treatment
it is necessary to classify the language tags according to whether a
language segment contains only complete paragraphs or is part of the
running text of a single paragraph.  A begin/end pair of tags is
called a ``block-level'' tag if its body consists of complete
paragraphs and a ``paragraph-level'' tag otherwise.  As later examples
will show, the typographical treatment for these two types is often
different.

\begin{figure}
\centering
\setlength\unitlength{10pt}
\frame{%
%\begin{picture}(32,12)(-2,-1)                   % use commented out
% lines if showing document-level
\begin{picture}(32,10)(-2,-1)

\newcommand\fragment{\begin{picture}(0,0)%
\drawline(0,0)(0,-3)%
\dottedline[$\bullet$]{1}(0,0)(0,-2)%
\dottedline{.2}(-0.2,-2.5)(-0.2,-2.8)%
\end{picture}%
}
\newcommand\fragmentlevels{\makebox(0,0)[rt]{\footnotesize\shortstack[l]{frag-\\ment\strut\\levels}}}

%\drawline(10,10)(10,0)
%\dottedline[$\bullet$]{2}(10,10)(10,0)
\drawline(10,8)(10,0)
\dottedline[$\bullet$]{2}(10,8)(10,0)
%\multiputlist(8,10)(0,-2)[r]{document level,base language level,first
\multiputlist(8,8)(0,-2)[r]{base language level,first
nesting level,second nesting level,\ldots,n\textsuperscript{th}
nesting level}

%\dottedline[$\bullet$]{2}(15,10)(15,10)      % just for a single
% bullet  lazy as i am :-)
%\drawline(15,10)(15,8)(18,2)
\drawline(15,8)(18,2)
\multiput(15,8)(1,-2){3}{\fragment}

% \put(17,10){\makebox(0,0)[l]{document level}}
\multiputlist(17,8)(1,-2)[l]{base language level, paragraph level,nested paragraph
level,\ldots}

\put(14.5,7.2){\fragmentlevels}
\put(15.5,3.3){\fragmentlevels}

\end{picture}%
}
\caption{The two hierarchies}\label{fig:twohs}
\end{figure}

\section{A Tag Model for \LaTeX{}}

To support the above model, including both nesting of language tags
and the differentiation between block- and paragraph-level tags, the
following tag structure for a system like \LaTeX{} is proposed:
\begin{itemize}
\item
A document language tag (implicit). This tag can be used to attach
language-related typographical actions that should not change even
if the document contains more than one base language.

\item
Base-language tags: used only at top-level, no nesting. These tags
denote the major language(s) within a document. In the case of
essentially mono-lingual documents the base language would be the
same as the document language.

\item
Language-block tags: contain complete paragraphs, nestable. These
denote larger embeddings either directly within the base language or
further down in the nesting hierarchy.

\item
Language-fragment tags: only within paragraphs, nestable. These
denote smaller embeddings but are otherwise identical to language
block tags.
\end{itemize}

Note that since, at least in the logical structure of a document,
paragraphs can occur within paragraphs, block tags can be nested
within fragment tags.

\subsection{Document interfaces}
\label{sec:newuser}

As \LaTeXe{} does not have built-in support for named attributes, its
support for language changes is best implemented by introducing
additional language tags (commands and environments). A concrete
syntax for these tags could include the following:
\begin{itemize}
\item
A preamble declaration for the document language (this is also the
base language in mono-lingual documents) with the language-label as
argument.

\item
A base-language change command with the language-label as
argument.  This command is declarative to highlight the flat
structure of base languages.

\item A language-environment with the language-label as argument and
text as body.  Such an environment starts a new paragraph so as to enforce
the block-level nature of the tag.

\item
A language-command with the language-label and text both as
arguments. In contrast to the environment, this command applies
language-related actions to its second argument, which cannot
directly contain full paragraphs.
\end{itemize}
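
The following sketch shows how such tags might appear in a source
file.  All four names used here (|\documentlanguage|, |\baselanguage|,
the |languageblock| environment and |\languagefragment|) are
hypothetical and chosen only for illustration; no concrete syntax is
fixed by this paper, so the fragment will not run unless these names
are defined.
\begin{verbatim}
% All language tag names below are hypothetical.
\documentclass{article}
\documentlanguage{english}     % preamble: document language

\begin{document}
\baselanguage{english}         % declarative and flat: no nesting

Goethe's last words are often given as
\languagefragment{german}{Mehr Licht!}   % paragraph-level tag

\begin{languageblock}{german}  % block-level tag: starts a paragraph
  Hier stehen vollst\"andige Abs\"atze, die wiederum ein
  \languagefragment{french}{fragment imbriqu\'e} enthalten k\"onnen.
\end{languageblock}
\end{document}
\end{verbatim}
Note how the fragment command takes its text as an argument, so it
cannot contain a paragraph break, while the environment may hold any
number of complete paragraphs.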

For \LaTeX3 we shall probably normalize this interface by supporting a
language attribute on appropriate tags. This would allow, for example,
a trivial translation of the language features currently being
proposed for HTML into \LaTeX{} for rendering purposes.  However, even
in that case generic tags for changing language are necessary as
typical documents contain language changes that do not coincide
with the tag boundaries of other logical tags.\footnote{It is proposed
that HTML\,3.2 supports a \texttt{<span>} tag for this purpose.}

\section{Language-dependent Processing}

Setting up the tags tells us only how to encode a multi-lingual
document.  We now need to specify how these tags affect the processing
of the document; how do we attach actions to them?  Before answering
this question we shall first discuss a number of representative
examples of the effects of language on this processing, classified
according to the categories input, transformation and formatting.

The actions shown below are all commonly related to a change of
language within a document.  Nevertheless, it is not the case that
each of them should necessarily be implemented by attaching them
firmly to language changes.  For some it might be more appropriate to
freeze them for the whole document or to attach them to areas within the
document that do not coincide with language boundaries.

\subsection{Input}

\paragraph{Input encodings}
Entering text in a certain language often requires special
input methods (this is especially true for languages with complex
scripts) but even in cases where direct keyboard entry is possible it
is necessary to know which input encoding is to be used, so as to
interpret the source characters correctly. At
present \LaTeX{} supports variable interpretation of the upper
half of the 8-bit plane, thus allowing source text to be 8-bit encoded
in one of the many keyboard encodings used world wide.
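
As an illustration, the standard |inputenc| package selects how the
upper half of the 8-bit plane is interpreted.  This is a minimal
sketch; the two encodings named (|latin1| and |cp850|) are arbitrary
choices for the example.
\begin{verbatim}
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[latin1]{inputenc}  % bytes 128-255 read as ISO Latin-1
\begin{document}
Text entered on a Latin-1 system.

\inputencoding{cp850}          % from here on: IBM code page 850
Text taken from a file written under DOS.
\end{document}
\end{verbatim}
The encoding change is tied to a region of the source rather than to a
language, which is one reason why it need not coincide with a language
tag.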

\paragraph{Short-refs}
With the development of language packages and the subsequent
development of the Babel system, it became common practice to extend
the mark-up language of \LaTeX{} using so-called ``short-refs'' as a
compact method for inputting certain commands. Short-refs are
character sequences that do not start with the escape character,
i.e.~usually `|\|', but nevertheless act like commands. That is, they
do not represent the equivalent glyph sequence but have either
additional effects (e.g.,~the punctuation marks in French typography,
which produce additional space) or even denote completely different
actions (e.g.,~|""| for a break point without a hyphen).

In addition to the above short-refs, some \TeX{} fonts implement
short-refs by using (or misusing) the ligature mechanism to implement
arbitrary input syntax, e.g., |``| generating `` or |---| generating
an em-dash.

Short-refs can be used for different purposes:
\begin{itemize}
\item
providing a compact input notation for commonly used textual
commands such as characters with diacritical marks;
\item
providing a compact and readable input notation for special
applications, e.g., |==>| for |\Longrightarrow|;
\item
providing typographical features not otherwise supported
(e.g.,~extra space in front of punctuation characters).
\end{itemize}
The first two items are related to input syntax and not directly
linked to the language of the current text although historically they
have been provided by language packages, e.g., |"a| as a short-ref for
|\"{a}| was implemented by |german.sty| and within Babel its meaning
gets deactivated within regions marked up as belonging to other
languages.

The third item is directly related to language since short-refs of
this type are used to implement a typographic style that is
characteristic of a language in such a way that the user is not
forced to use explicit mark-up in the document.
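
The fragment below illustrates these uses with the short-refs provided
by Babel's |german| option (|"a|, |"u|, |"s|, |"-| and |""|) together
with the font ligatures for quotation marks and dashes.  It is a
minimal sketch; the sample words are chosen only for illustration.
\begin{verbatim}
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[german]{babel}
\begin{document}
% "a, "u and "s: compact input for \"{a}, \"{u} and \ss
Das M"adchen geht "uber die Stra"se.

% "- marks an extra hyphenation point,
% "" a break point without a hyphen
Eine Eingabe-""Datei und ein Druck"-erzeugnis.

% font ligatures: `` and '' give quotation marks, --- an em-dash
``quoted text'' --- and more
\end{document}
\end{verbatim}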

\subsection{Transformations}

Here, `transformations' include only manipulations of the source text
that are independent of formatting information (i.e.~those that act
entirely on the logical document).  Usually such transformations
enrich the document content in one way or the other by using knowledge
stored outside the document source.

\paragraph{Generated text}

This is text that is not directly encoded in the source document but
is produced from tags therein.  Generated text can be classified into
two categories: content-related and structure-related.  Here
content-related text is that generated by tags that can appear
anywhere in the source text (a typical \LaTeX{} example would be the
|\today| command) while structure-related text refers to text that is
associated with a high level logical structure (e.g., the heading
produced for a bibliography or the fixed text used in a figure
caption).

While it is imaginable to keep structure-related text in one language
even though the surrounding language changes, content-related text
most likely will have to change at every language tag.
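
Both kinds of generated text can be seen in the current Babel system.
The sketch below assumes the |german| and |english| options are
installed; the replacement heading ``Works Cited'' is an arbitrary
example.
\begin{verbatim}
\documentclass{article}
\usepackage[german,english]{babel}  % last option: main language
% structure-related text: the bibliography heading used for English
\addto\captionsenglish{\renewcommand\refname{Works Cited}}
\begin{document}
Printed on \today.           % content-related text, English form

\selectlanguage{german}
Gedruckt am \today.          % same tag, now the German form
\end{document}
\end{verbatim}
Here both kinds of text follow every language change; keeping the
structure-related text fixed while |\today| changes is not directly
expressible in this flat model.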

\paragraph{Hyphenation}

The finding and marking of possible hyphenation points is, perhaps,
the most obvious language-related transformation.  Indeed, it is often
considered to be the defining characteristic of a `language'.

When using \TeX{} this relationship is unfortunately obscured by some
technical details of the implementation of hyphenation.  One of these
is that \TeX's hyphenation does not depend only on the `language' but
also on the current font encoding (which can differ within a single
language).  Another is \TeX's restriction that one can properly
hyphenate a whole multi-lingual paragraph only if the font encodings
used therein share a single lower-case table (and this is likely not
to be the case if more than one script is present).
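
The dependence of hyphenation on the current language can be observed
with |\showhyphens|, which writes the hyphenation points it finds to
the log file.  This is a minimal sketch, assuming that German
hyphenation patterns have been loaded into the format; the sample
words are arbitrary.
\begin{verbatim}
\documentclass{article}
\usepackage[T1]{fontenc}
\usepackage[german,english]{babel}
\begin{document}
\showhyphens{documentation}         % uses the English patterns
\foreignlanguage{german}{%
  \showhyphens{Dokumentation}}      % uses the German patterns
\end{document}
\end{verbatim}
Replacing |T1| by another font encoding may change the results, since
the patterns are applied to the character codes of the current font.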

\paragraph{Upper- and lower-case transformations}

The mapping between upper- and lower-case characters (for those
writing systems that make such a distinction) is language-dependent
(and not just script-dependent): for example, in Turkish \i{}$\to$I
and i$\to$\.I in contrast to the usual mapping i$\to$I used in most
other languages. There can also be a one-to-many mapping as for the
German \ss{} that maps to SS.
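
A minimal sketch of the current behaviour, assuming the |\MakeUppercase|
command of the \LaTeXe{} kernel with no language package loaded:
\begin{verbatim}
\documentclass{article}
\begin{document}
\MakeUppercase{Stra\ss e}  % one-to-many mapping: \ss becomes SS
\MakeUppercase{i}          % gives I: no language information is
                           % consulted, so the Turkish mapping to
                           % a dotted capital I is not available
\end{document}
\end{verbatim}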