\iffalse
this is coming back to the discussions we had on this list early in
the year. what you see below is a paper which i gave in March in Japan
on a language model for LaTeX.

comments and thoughts welcome
frank

ps the paper is more than 500 lines so it is split as this list
doesn't allow such long mails
\fi
%
% Copyright 1997 Frank Mittelbach, Chris Rowley
%
\documentclass[a4paper]{article}

\typeout{********************************************}
\typeout{** two pagebreaks hardcoded which might need removal}
\typeout{********************************************}

\flushbottom

\usepackage{shortvrb}
\MakeShortVerb{\|}

\usepackage{epic}

\begin{document}

\title{Language Information in Structured Documents:\\
       A Model for Mark-up and Rendering\thanks
       {This paper was originally given at the Multilingual
        Information Processing symposium, March 1997, Tsukuba, Japan.}
}

\author{Frank Mittelbach \and Chris Rowley}

\date{}

\maketitle

%\tableofcontents

\section{Introduction}

This paper discusses the structure and processing of multi-lingual
documents, both at a general level and in relation to a proposed
extension to the (no longer so new) standard \LaTeX.

Both in general and in the particular case of this proposal, our work
would be impossible without the enormous support, both practical and
moral, that we get from our fellow members of the \LaTeX3 project
team\footnote{Current \LaTeX3 project team members are Johannes Braams
(NL), David Carlisle (UK), Michael Downes (USA), Alan Jeffrey (UK) and
Rainer Sch\"opf (DE).} (who maintain and enhance \LaTeX) and from
people all over the world who contribute to the development of
\LaTeX{} with their suggestions and comments.

The paper starts by examining the language structure of documents, and
from this develops a language tag model for \LaTeX{}. It then
discusses the relationship between language and document formatting
and the types of actions needed at a change of language.
This leads to a model that supports the specification of these actions
and of their association with the tag structure in the abstract
document. The model is then extended to provide the necessary support
for regions that have their own visual context or that receive content
from other parts of the document, thus breaking the basic tree
structure of an abstract document (see section~\ref{sec:moving}).

Finally, a high-level summary of the required interfaces is given. A
full formal specification, to be used for a prototype implementation
in \LaTeX{}, is currently under development; a first public test
implementation is expected to exist for the 1997/12/01 release of
\LaTeX.

If you are interested in the issues raised in this paper or in other
aspects of our work to enhance \LaTeX, please join the project's
electronic discussion list. To do this, send a message to:
\begin{quote}
  \texttt{[address masked in this copy]}
\end{quote}
containing this line:
\begin{quote}
  \texttt{subscribe LATEX-L \textit{your name}}
\end{quote}

\newpage

\section{Language Structure of Documents}

Structured documents can be understood as being explicitly or
implicitly labeled with ``language tags'' denoting that a portion of
the document contains data written in a certain ``language''. These
tags have the following properties:
\begin{itemize}
\item They impose on the document a hierarchical tree structure that
  may not be compatible with that document's other logical structure,
  e.g., there might be a language change in the middle of a logical
  element such as a list item.\footnote{However, for practical
  purposes it is normally possible and acceptable to artificially
  force the structure imposed by the language tags into the logical
  hierarchy imposed by other tags.}
\item At any one point in the document the ``current language'' can be
  determined.
\end{itemize}
The term ``language'' in this context is somewhat vague and might need
further qualification; but for the purpose of the following discussion
it is sufficient to define it as a `label' whose value affects certain
aspects of formatting.

\subsection{Hierarchy of language tags}

The structure created by attaching such language tags to the text can
be considered to be of varying complexity.

The simplest case would be to regard this as a flat structure: for
each point in the document only a ``current'' language is defined,
disregarding the fact that certain language segments can be considered
to be embedded within others. This model of language within documents
is, for example, employed within the current Babel system where, by
default, all language changes are in this sense global.

In a more complex model each area has a ``current'' language but may
be embedded within a nest of larger areas, each in its own language.
In such a model, a change of language has a different quality, and
therefore may invoke different formatting changes, depending on the
level in the hierarchy at which it occurs.

Our investigations lead us to conclude that, to properly render a
document, one needs a combination of both models:
\begin{itemize}
\item the concept of a base language for very large portions of a text
  (for most documents there will in fact be only one such language for
  the full text): this has a flat structure, since there is only one
  base language at any point in the text;
\item the concept of embedded language segments: these are nestable
  (to any number of levels) and are used for relatively small-scale
  insertions within a base language, such as quotations or names.
\end{itemize}

\pagebreak

\subsection{Language tag (visual) structure}
\label{sec:visual}

In addition to the nesting structure of language tags, there is a more
visual component that influences rendering of a document: the
paragraph structure.
To properly model this typographical treatment it is necessary to
classify the language tags according to whether a language segment
contains only complete paragraphs or is part of the running text of a
single paragraph.

A begin/end pair of tags is called a ``block-level'' tag if its body
consists of complete paragraphs and a ``paragraph-level'' tag
otherwise. As later examples will show, the typographical treatment
for these two types is often different.

\begin{figure}
\centering
\setlength\unitlength{10pt}
\frame{%
%\begin{picture}(32,12)(-2,-1)   % use commented out
                                 % lines if showing document-level
\begin{picture}(32,10)(-2,-1)
\newcommand\fragment{\begin{picture}(0,0)%
   \drawline(0,0)(0,-3)%
   \dottedline[$\bullet$]{1}(0,0)(0,-2)%
   \dottedline{.2}(-0.2,-2.5)(-0.2,-2.8)%
   \end{picture}%
}
\newcommand\fragmentlevels{\makebox(0,0)[rt]{\footnotesize\shortstack[l]{frag-\\ment\strut\\levels}}}
%\drawline(10,10)(10,0)
%\dottedline[$\bullet$]{2}(10,10)(10,0)
\drawline(10,8)(10,0)
\dottedline[$\bullet$]{2}(10,8)(10,0)
%\multiputlist(8,10)(0,-2)[r]{document level,base language level,first
\multiputlist(8,8)(0,-2)[r]{base language level,first
   nesting level,second nesting level,\ldots,n\textsuperscript{th}
   nesting level}
%\dottedline[$\bullet$]{2}(15,10)(15,10)  % just for a single
                                          % bullet lazy as i am :-)
%\drawline(15,10)(15,8)(18,2)
\drawline(15,8)(18,2)
\multiput(15,8)(1,-2){3}{\fragment}
% \put(17,10){\makebox(0,0)[l]{document level}}
\multiputlist(17,8)(1,-2)[l]{base language level,
   paragraph level,nested paragraph level,\ldots}
\put(14.5,7.2){\fragmentlevels}
\put(15.5,3.3){\fragmentlevels}
\end{picture}%
}
\caption{The two hierarchies}\label{fig:twohs}
\end{figure}

\section{A Tag Model for \LaTeX{}}

To support the above model, including both nesting of language tags
and the differentiation between block- and paragraph-level tags, the
following tag structure for a system like \LaTeX{} is proposed:
\begin{itemize}
\item A document language tag (implicit).
  This tag can be used to attach language-related typographical
  actions that should not change even if the document contains more
  than one base language.
\item Base-language tags: used only at the top level, with no nesting.
  These tags denote the major language(s) within a document. In the
  case of essentially mono-lingual documents the base language would
  be the same as the document language.
\item Language-block tags: contain complete paragraphs, nestable.
  These denote larger embeddings either directly within the base
  language or further down in the nesting hierarchy.
\item Language-fragment tags: only within paragraphs, nestable. These
  denote smaller embeddings but are otherwise identical to
  language-block tags.
\end{itemize}
Note that since, at least in the logical structure of a document,
paragraphs can occur within paragraphs, block tags can be nested
within fragment tags.

\subsection{Document interfaces}
\label{sec:newuser}

As \LaTeXe{} does not have built-in support for named attributes, its
support for language changes is best implemented by introducing
additional language tags (commands and environments).

A concrete syntax for these tags could include the following:
\begin{itemize}
\item A preamble declaration for the document language (this is also
  the base language in mono-lingual documents) with the language-label
  as argument.
\item A base-language change command with the language-label as
  argument. This command is declarative to highlight the flat
  structure of base languages.
\item A language-environment with the language-label as argument and
  text as body. Such an environment starts a new paragraph so as to
  enforce the block-level nature of the tag.
\item A language-command with the language-label and text both as
  arguments. In contrast to the environment, this command applies
  language-related actions to its second argument, which cannot
  directly contain full paragraphs.
\end{itemize}
For \LaTeX3 we shall probably normalize this interface by supporting a
language attribute on appropriate tags. This would allow, for example,
a trivial translation of the language features currently being
proposed for HTML into \LaTeX{} for rendering purposes. However, even
in that case generic tags for changing language are necessary as
typical documents contain language changes that do not coincide with
the tag boundaries of other logical tags.\footnote{It is proposed that
HTML\,3.2 supports a \texttt{<span>} tag for this purpose.}

\section{Language-dependent Processing}

Setting up the tags tells us only how to encode a multi-lingual
document. We now need to specify how these tags affect the processing
of the document; how do we attach actions to them?

Before answering this question we shall first discuss a number of
representative examples of the effects of language on this processing,
classified according to the categories input, transformation and
formatting.

The actions shown below are all commonly related to a change of
language within a document. Nevertheless, not every one of them should
necessarily be implemented by attaching it firmly to language changes.
For some it might be more appropriate to freeze them for the whole
document or to attach them to areas within the document that do not
coincide with language boundaries.

\subsection{Input}

\paragraph{Input encodings}

Entering text in a certain language often requires special input
methods (this is especially true for languages with complex scripts)
but even in cases where direct keyboard entry is possible it might be
necessary to add information about the keyboard codepage that is to be
used, so as to interpret the source characters correctly.

At present \LaTeX{} supports variable interpretation of the upper half
of the 8-bit plane, thus allowing source text to be 8-bit encoded in
one of the many keyboard encodings used worldwide.
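
As a minimal sketch of this mechanism, assuming the standard
|inputenc| package and a source file typed in ISO Latin-1 (both
encodings named here are merely illustrative choices), the
interpretation of the upper half of the 8-bit plane can be declared in
the preamble and changed again inside the document:
\begin{verbatim}
\documentclass{article}
% Bytes 128-255 of the source are interpreted as ISO Latin-1.
\usepackage[latin1]{inputenc}
\begin{document}
Text typed with a Latin-1 keyboard encoding.

% Switch the interpretation for a region that was typed in a
% different code page; cp850 is only an assumed example.
\inputencoding{cp850}
Text typed with an IBM PC code page.
\end{document}
\end{verbatim}
Note that such a declaration follows the encoding in which a region of
the source was typed, which need not coincide with a language
boundary.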
\paragraph{Short-refs}

With the development of language packages and the subsequent
development of the Babel system, it became common practice to extend
the mark-up language of \LaTeX{} using so-called ``short-refs'' as a
compact method for inputting certain commands.

Short-refs are character sequences that do not start with \TeX's
escape character, i.e.~usually `|\|', but nevertheless act like
commands. That is, they do not represent the equivalent glyph sequence
but have either additional effects (e.g.,~the punctuation marks in
French typography, which produce additional space) or even denote
completely different actions (e.g.,~|""| for a break point without a
hyphen).

In addition to the above short-refs, some \TeX{} fonts implement
short-refs by using (or misusing) the ligature mechanism to implement
arbitrary input syntax, e.g., |``| generating `` or |---| generating
an em-dash.

Short-refs can be used for different purposes:
\begin{itemize}
\item providing a compact input notation for commonly used textual
  commands such as characters with diacritical marks;
\item providing a compact and readable input notation for special
  applications, e.g., |==>| for |\Longrightarrow|;
\item providing typographical features not otherwise supported
  (e.g.,~extra space in front of punctuation characters).
\end{itemize}
The first two items are related to input syntax and not directly
linked to the language of the current text, although historically they
have been provided by language packages. For example, |"a| as a
short-ref for |\"{a}| was implemented by |german.sty|, and within
Babel its meaning is deactivated within regions marked up as belonging
to other languages.

The third item is directly related to language since short-refs of
this type are used to implement a typographic style that is
characteristic of a language in such a way that the user is not forced
to use explicit mark-up in the document.
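
The short-refs mentioned above can be tried out with a small sketch,
assuming that the Babel package is loaded with its |german| option
(the sample words are invented for illustration only):
\begin{verbatim}
\documentclass{article}
\usepackage[german]{babel}
\begin{document}
% "a acts as a short-ref for \"{a}
M"adchen and M\"{a}dchen give the same output.

% "" marks a possible break point that produces no hyphen
Ein-/""Ausgabe may be broken after the slash.

% a ligature-based short-ref supplied by the font
``quoted'' text
\end{document}
\end{verbatim}
The first two short-refs only work while the language that defines
them is active, whereas the third is a property of the font and is
available regardless of the current language.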
\subsection{Transformations}

Here, `transformations' include only manipulations of the source text
that are independent of formatting information (i.e.~those that act
entirely on the logical document). Usually such transformations enrich
the document content in one way or another by using knowledge stored
outside the document source.

\paragraph{Generated text}

This is text that is not directly encoded in the source document but
is produced from tags therein. Generated text can be classified into
two categories: content-related and structure-related. Here
content-related text is that generated by tags that can appear
anywhere in the source text (a typical \LaTeX{} example would be the
|\today| command) while structure-related text refers to text that is
associated with a high-level logical structure (e.g., the heading
produced for a bibliography or the fixed text used in a figure
caption).

While it is conceivable to keep structure-related text in one language
even though the surrounding language changes, content-related text
most likely will have to change at every language tag.

\paragraph{Hyphenation}

The finding and marking of possible hyphenation points is, perhaps,
the most obvious language-related transformation. Indeed, it is often
considered to be the defining characteristic of a `language'.

When using \TeX{} this relationship is unfortunately obscured by some
technical details of the implementation of hyphenation. One of these
is that \TeX's hyphenation does not depend only on the `language' but
also on the current font encoding (which can differ within a single
language). Another is \TeX's restriction that one can properly
hyphenate a whole multi-lingual paragraph only if the font encodings
used therein share a single lower-case table (and this is likely not
to be the case if more than one script is present).
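
The dependence of the break points on the current language can be
inspected directly. The following is only a sketch: it assumes that
Babel is loaded with the |german| and |english| options and that
hyphenation patterns for both languages are available in the format;
the test word is arbitrary. The |\showhyphens| command writes the
break points that \TeX{} finds to the log file.
\begin{verbatim}
\documentclass{article}
\usepackage[T1]{fontenc}
% The last option becomes the language in force at the start.
\usepackage[german,english]{babel}
\begin{document}
% Break points found with the English patterns
\showhyphens{Sprachumschaltung}

% Break points found with the German patterns
\selectlanguage{german}
\showhyphens{Sprachumschaltung}
\end{document}
\end{verbatim}
Comparing the two entries in the log file shows how the same word is
treated under the two sets of patterns.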
\paragraph{Upper- and lower-case transformations} The mapping between upper- and lower-case characters (for those writing systems that make such a distinction) is language-dependent (and not just script-dependent): for example, in Turkish \i{}$\to$I and i$\to$\.I in contrast to the usual mapping i$\to$I used in most other languages. There can also be a one-to-many mapping as for the German \ss{} that maps to SS.
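
The one-to-many case can be tried directly with the standard \LaTeX{}
command |\MakeUppercase|; the sample word below is arbitrary. The
Turkish mappings are not part of this sketch, since whether they are
applied depends on the \LaTeX{} release and the language setup in use.
\begin{verbatim}
\documentclass{article}
\begin{document}
% \ss is mapped to SS, so one character becomes two:
% this typesets STRASSE
\MakeUppercase{Stra\ss e}
\end{document}
\end{verbatim}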