LATEX-L  June 1997



A Language Model for LaTeX (1/2)


Frank Mittelbach <[log in to unmask]>


Mailing list for the LaTeX3 project <[log in to unmask]>


Fri, 20 Jun 1997 16:41:54 +0200







this is coming back to the discussions we had on this list early in
the year. what you see below is a paper which i gave in March in Japan
on a language model for LaTeX.

comments and thoughts welcome


ps the paper is more than 500 lines so it is split as this list
doesn't allow such long mails


% Copyright 1997 Frank Mittelbach, Chris Rowley


\typeout{** two pagebreaks hardcoded which might need removal}

\usepackage{shortvrb}  \MakeShortVerb{\|}


\title{Language Information in
       Structured Documents:\\
       A Model for Mark-up and Rendering\thanks
         {This paper was originally given at the
          Multilingual Information Processing symposium, March 1997,
          Tsukuba, Japan.}}

\author{Frank Mittelbach \\ \texttt{[log in to unmask]}
   \and Chris Rowley \\ \texttt{[log in to unmask]}}





This paper discusses the structure and processing of multi-lingual
documents, both at a general level and in relation to a proposed
extension to the (no longer so new) standard \LaTeX.  Both in general
and in the particular case of this proposal, our work would be
impossible without the enormous support, both practical and moral, we
get from our fellow members of the \LaTeX3 project
team\footnote{Current \LaTeX3 project team members are Johannes Braams
(NL), David Carlisle (UK), Michael Downes (USA), Alan Jeffrey (UK) and
Rainer Sch\"opf (DE).} (who maintain and enhance \LaTeX) and from
people all over the world who contribute to the development of
\LaTeX{} with their suggestions and comments.

The paper starts by examining the language structure of documents and
from this a language tag model for \LaTeX{} is developed.  It then
discusses the relationship between language and document formatting
and the types of actions needed at a change of language.  This will
lead to a model that supports the specification of these actions and
of their association with the tag structure in the abstract document.

The model is then extended to provide the necessary support for
regions that have their own visual context or that receive content
from other parts of the document, thus breaking the basic tree
structure of an abstract document---this is in

Finally a high level summary of the required interfaces is given. A
full formal specification, to be used for a prototype implementation
in \LaTeX{}, is currently under development---a first public test
implementation is expected to exist for the 1997/12/01 release of
\LaTeX.

If you are interested in the issues raised in this paper or in other
aspects of our work to enhance \LaTeX, please join the project's electronic
discussion list. To do this, please send a message to
\begin{quote}
  \texttt{[log in to unmask]}
\end{quote}
containing this line:
\begin{quote}
  \texttt{subscribe LATEX-L  \textit{your name}}
\end{quote}


\section{Language Structure of Documents}

Structured documents can be understood as being explicitly or implicitly
labeled with ``language tags'' denoting that a portion of the
document contains data written in a certain ``language''.

These tags have the following properties:
\begin{itemize}
\item They impose on the document a hierarchical tree structure that
  may not be compatible with that document's other logical structure,
  e.g., there might be a language change in the middle of a logical
  element such as a list item.\footnote{However, for practical
    purposes it is normally possible and acceptable to artificially
    force the structure imposed by the language tags into the logical
    hierarchy imposed by other tags.}
\item At any one point in the document the ``current language'' can be
  determined.
\end{itemize}

The term ``language'' in this context is somewhat vague and might
need further qualification; but for the purpose of the following
discussion it is sufficient to define it as a `label' whose value
affects certain aspects of formatting.

\subsection{Hierarchy of language tags}

The structure created by attaching such language tags to the text can
be considered to be of varying complexity. The simplest case would be
to regard this as a flat structure: for each point in the document
only a ``current'' language is defined, disregarding the fact that
certain language segments can be considered to be embedded within
others. This model of language within documents is, for example,
employed within the current Babel system where, by default, all
language changes are in this sense global.

In a more complex model each area has a ``current'' language but may
be embedded within a nest of larger areas, each in its own language.
In such a model, a change of language has a different quality, and
therefore may invoke different formatting changes, depending on the
level in the hierarchy at which it occurs.

Our investigations lead us to conclude that, to properly render a
document, one needs a combination of both models:
\begin{itemize}
\item the concept of a base language for very large portions of a text
  (for most documents there will in fact be only one such language for
  the full text): this has a flat structure, since there is only one
  base language at any point in the text;
\item the concept of embedded language segments: these are nestable (to
  any number of levels) and are used for relatively small-scale
  insertions within a base language, such as quotations or names.
\end{itemize}


\subsection{Language tag (visual) structure}

In addition to the nesting structure of language tags, there is a more
visual component that influences rendering of a document: the
paragraph structure.  To properly model this typographical treatment
it is necessary to classify the language tags according to whether a
language segment contains only complete paragraphs or is part of the
running text of a single paragraph.  A begin/end pair of tags is
called a ``block-level'' tag if its body consists of complete
paragraphs and a ``paragraph-level'' tag otherwise.  As later examples
will show, the typographical treatment for these two types is often
different.

\begin{figure}
%\begin{picture}(32,12)(-2,-1)                   % use commented out
                                                 % lines if showing document-level

%\multiputlist(8,10)(0,-2)[r]{document level,base language level,first
\multiputlist(8,8)(0,-2)[r]{base language level,first
  nesting level,second nesting level,\ldots,n\textsuperscript{th}
  nesting level}

%\dottedline[$\bullet$]{2}(15,10)(15,10)      % just for a single
                                              % bullet  lazy as i am :-)

% \put(17,10){\makebox(0,0)[l]{document level}}
\multiputlist(17,8)(1,-2)[l]{base language level, paragraph level,nested paragraph
  level}

\caption{The two hierarchies}\label{fig:twohs}
\end{figure}

\section{A Tag Model for \LaTeX{}}

To support the above model, including both nesting of language tags
and the differentiation between block- and paragraph-level tags, the
following tag structure for a system like \LaTeX{} is proposed:
\begin{itemize}
\item A document language tag (implicit). This tag can be used to attach
  language-related typographical actions that should not change even
  if the document contains more than one base language.

\item Base-language tags: used only at top-level, no nesting. These tags
  denote the major language(s) within a document. In the case of
  essentially mono-lingual documents the base language would be the
  same as the document language.

\item Language-block tags: contain complete paragraphs, nestable. These
  denote larger embeddings either directly within the base language or
  further down in the nesting hierarchy.

\item Language-fragment tags: only within paragraphs, nestable. These
  denote smaller embeddings but are otherwise identical to
  language-block tags.
\end{itemize}

Note that since, at least in the logical structure of a document,
paragraphs can occur within paragraphs, block tags can be nested
within fragment tags.

\subsection{Document interfaces}

As \LaTeXe{} does not have built in support for named attributes, its
support for language changes is best implemented by introducing
additional language tags (commands and environments). A concrete
syntax for these tags could include the following:
\begin{itemize}
\item A preamble declaration for the document language (this is also the
  base language in mono-lingual documents) with the language-label as
  argument.

\item A base-language change command with the language-label as
  argument.  This command is declarative to highlight the flat
  structure of base languages.

\item A language-environment with the language-label as argument and
  text as body.  Such an environment starts a new paragraph so as to enforce
  the block-level nature of the tag.

\item A language-command with the language-label and text both as
  arguments. In contrast to the environment, this command applies
  language-related actions to its second argument, which cannot
  directly contain full paragraphs.
\end{itemize}
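
A minimal sketch of what such a concrete syntax might look like is
given below.  All four names (|\DocumentLanguage|, |\BaseLanguage|,
|languageblock| and |\languagefragment|) are hypothetical and belong
to no released package; the definitions only record the language
label in |\currentlanguage|, so that the scope of each tag can be
observed in the output.
\begin{verbatim}
\documentclass{article}
\makeatletter
% Hypothetical names throughout; each tag only records its label.
\newcommand\currentlanguage{}
% Preamble declaration of the document language.
\newcommand\DocumentLanguage[1]{%
  \gdef\documentlanguage{#1}\gdef\currentlanguage{#1}}
\@onlypreamble\DocumentLanguage
% Declarative base-language change: global, hence flat.
\newcommand\BaseLanguage[1]{\par\gdef\currentlanguage{#1}}
% Block-level tag: starts a new paragraph, change is local.
\newenvironment{languageblock}[1]
  {\par\def\currentlanguage{#1}}
  {\par}
% Paragraph-level tag: not \long, so #2 cannot contain \par.
\newcommand\languagefragment[2]{{\def\currentlanguage{#1}#2}}
\makeatother

\DocumentLanguage{english}
\begin{document}
Base text in \currentlanguage.

\BaseLanguage{german}
Base text in \currentlanguage.
\begin{languageblock}{french}
  A block in \currentlanguage{} containing
  \languagefragment{latin}{a fragment in \currentlanguage}.
\end{languageblock}
Base text in \currentlanguage{} again.
\end{document}
\end{verbatim}
In this sketch the environment and the command restore the previous
label when they end, whereas the base-language command simply
overwrites it.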

For \LaTeX3 we shall probably normalize this interface by supporting a
language attribute on appropriate tags. This would allow, for example,
a trivial translation of the language features currently being
proposed for HTML into \LaTeX{} for rendering purposes.  However, even
in that case generic tags for changing language are necessary as
typical documents contain language changes that do not coincide
with the tag boundaries of other logical tags.\footnote{It is proposed
that HTML\,3.2 supports a \texttt{<span>} tag for this purpose.}

\section{Language-dependent Processing}

Setting up the tags tells us only how to encode a multi-lingual
document.  We now need to specify how these tags affect the processing
of the document; how do we attach actions to them?  Before answering
this question we shall first discuss a number of representative
examples of the effects of language on this processing, classified
according to the categories input, transformation and formatting.

The actions shown below are all commonly related to a change of
language within a document.  Nevertheless, it is not the case that
each of them should necessarily be implemented by attaching them
firmly to language changes.  For some it might be more appropriate to
freeze them for the whole document or to attach them to areas within the
document that do not coincide with language boundaries.


\paragraph{Input encodings}
Entering text in a certain language often requires special
input methods (this is especially true for languages with complex
scripts) but even in cases where direct keyboard entry is possible it
might be necessary to add information about the keyboard codepage that
is to be used, so as to interpret the source characters correctly. At
present \LaTeX{} supports variable interpretation of the upper
half of the 8-bit plane, thus allowing source text to be 8-bit encoded
in one of the many keyboard encodings used world wide.
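
As a minimal illustration of this variable interpretation, the
|inputenc| package lets a document declare the encoding of its source
and change it part-way through.  The two encodings chosen here
(|latin1| and |cp437|) are examples only, and each part of the file
must really be stored in the encoding that is declared for it.
\begin{verbatim}
\documentclass{article}
\usepackage[latin1]{inputenc} % 8-bit source read as ISO Latin-1
\begin{document}
Text typed on a Latin-1 keyboard.

\inputencoding{cp437}         % from here on: IBM code page 437
Text taken from a file written under DOS.
\end{document}
\end{verbatim}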

With the development of language packages and the subsequent
development of the Babel system, it became common practice to extend
the mark-up language of \LaTeX{} using so-called ``short-refs'' as a
compact method for inputting certain commands. Short-refs are
character sequences that do not start with \TeX's escape character,
i.e.~usually `|\|', but nevertheless act like commands. That is, they
do not represent the equivalent glyph sequence but have either
additional effects (e.g.,~the punctuation marks in French typography,
which produce additional space) or even denote completely different
actions (e.g.,~|""| for a break point without a hyphen).

In addition to the above short-refs, some \TeX{} fonts provide
short-refs by using (or misusing) the ligature mechanism to implement
arbitrary input syntax, e.g., |``| generating `` or |---| generating
an em-dash.

Short-refs can be used for different purposes:
\begin{enumerate}
\item providing a compact input notation for commonly used textual
  commands such as characters with diacritical marks;
\item providing a compact and readable input notation for special
  applications, e.g., |==>| for |\Longrightarrow|;
\item providing typographical features not otherwise supported
  (e.g.,~extra space in front of punctuation characters).
\end{enumerate}
The first two items are related to input syntax and not directly
linked to the language of the current text although historically they
have been provided by language packages, e.g., |"a| as a short-ref for
|\"{a}| was implemented by |german.sty| and within Babel its meaning
gets deactivated within regions marked up as belonging to other
languages.

The third item is directly related to language since short-refs of
this type are used to implement a typographic style that is
characteristic of a language in such a way that the user is not
forced to use explicit mark-up in the document.
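
The first kind of short-ref listed above can be sketched with an
active character.  The following is a simplified illustration only
and is not the actual code of |german.sty| or Babel; the two switches
|\shortrefson| and |\shortrefsoff| are hypothetical names.
\begin{verbatim}
\documentclass{article}
% Give " a meaning as an active character: "a acts like \"{a}.
\begingroup
  \catcode`\"=\active
  \gdef"#1{\"{#1}}
\endgroup
% Hypothetical switches that turn the short-ref on and off.
\newcommand\shortrefson{\catcode`\"=\active}
\newcommand\shortrefsoff{\catcode`\"=12 }
\begin{document}
\shortrefson
Sch"on, h"oren, "uber.

\shortrefsoff
Here "quoted" text is read as ordinary characters again.
\end{document}
\end{verbatim}
Because only the category code is switched, the short-ref can be
deactivated for a region of the document without losing its
definition.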


\subsection{Transformations}

Here, `transformations' include only manipulations of the source text
that are independent of formatting information (i.e.~those that act
entirely on the logical document).  Usually such transformations
enrich the document content in one way or the other by using knowledge
stored outside the document source.

\paragraph{Generated text}

This is text that is not directly encoded in the source document but
is produced from tags therein.  Generated text can be classified into
two categories: content-related and structure-related.  Here
content-related text is that generated by tags that can appear
anywhere in the source text (a typical \LaTeX{} example would be the
|\today| command) while structure-related text refers to text that is
associated with a high level logical structure (e.g., the heading
produced for a bibliography or the fixed text used in a figure
caption).

While it is imaginable to keep structure-related text in one language
even though the surrounding language changes, content-related text
most likely will have to change at every language tag.
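
The distinction can be sketched by keeping the two kinds of generated
text in separate hooks.  The hook names |\germanstructuretexts| and
|\germancontenttexts| are hypothetical; |\figurename|, |\refname| and
|\today| are the names used by the standard |article| class.
\begin{verbatim}
\documentclass{article}
% Hypothetical hooks, one per category of generated text.
\newcommand\germanstructuretexts{%
  \renewcommand\figurename{Abbildung}%
  \renewcommand\refname{Literatur}}
\newcommand\germancontenttexts{%
  \renewcommand\today{\number\day.~\ifcase\month\or
    Januar\or Februar\or M\"arz\or April\or Mai\or Juni\or
    Juli\or August\or September\or Oktober\or November\or
    Dezember\fi\space\number\year}}
\begin{document}
Written on \today.

% Only the content-related text follows the language change.
{\germancontenttexts Geschrieben am \today.}
\begin{figure}[h]
  \centering A figure body.
  \caption{The label of this caption is still ``Figure''}
\end{figure}
\end{document}
\end{verbatim}
Applying |\germanstructuretexts| as well would also change the label
of the figure caption; keeping the hooks apart is what allows the
structure-related text to stay fixed.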


\paragraph{Hyphenation}

The finding and marking of possible hyphenation points is, perhaps,
the most obvious language-related transformation.  Indeed, it is often
considered to be the defining characteristic of a `language'.

When using \TeX{} this relationship is unfortunately obscured by some
technical details of the implementation of hyphenation.  One of these
is that \TeX's hyphenation does not depend only on the `language' but
also on the current font encoding (which can differ within a single
language).  Another is \TeX's restriction that one can properly
hyphenate a whole multi-lingual paragraph only if the font encodings
used therein share a single lower-case table (and this is likely not
to be the case if more than one script is present).
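
The dependence of hyphenation on the current language can be observed
with a small test file.  This sketch assumes that the Babel package
is installed and that the format contains hyphenation patterns for
both languages; |\showhyphens| writes the break points it finds to
the terminal and the log file.
\begin{verbatim}
\documentclass{article}
% The last option, english, is the language active at the start.
\usepackage[german,english]{babel}
\begin{document}
\showhyphens{Silbentrennung} % uses the English patterns
\selectlanguage{german}
\showhyphens{Silbentrennung} % uses the German patterns
\end{document}
\end{verbatim}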

\paragraph{Upper- and lower-case transformations}

The mapping between upper- and lower-case characters (for those
writing systems that make such a distinction) is language-dependent
(and not just script-dependent): for example, in Turkish \i{}$\to$I
and i$\to$\.I in contrast to the usual mapping i$\to$I used in most
other languages. There can also be a one-to-many mapping as for the
German \ss{} that maps to SS.
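
The one-to-many case can be tried with the standard |\MakeUppercase|
command; this is a minimal sketch, and the input word is our own
example.
\begin{verbatim}
\documentclass{article}
\begin{document}
\MakeUppercase{Stra\ss e} % should typeset STRASSE
\end{document}
\end{verbatim}
The call itself names no language, so a mapping such as the Turkish
one has to be selected by other means.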
