On Wednesday, 14 February 2001 00:42:35 +0100,
Hans Aberg writes:
> At 18:43 +0000 2001/02/13, David Carlisle wrote:
> >> What happens if a command in the middle of a line changes the catcodes
> >makes no difference: the notion of line for the input buffer is hardwired
> >into the implementation it is not changable via TeX commands and does
> >not depend on catcodes or the value of \endlinechar.
> >> or contains a macro that expands to a \input <filename>?
> >The rest of the line of the original file sits buffered in one of those
> >input streams until the input file finishes.
TeX's input line buffer holds the current line of every currently open
file, and at the beginning of each TeX run this line buffer is filled
with all command line arguments.
If TeX sees an \input with a filename as its argument, the new file
is opened and its lines are read, one at a time, into the buffer after
the input line where the \input occurred. Thus if \input-s are stacked,
one input line per file is kept in the input line buffer. Thus the
buffer looks like:
|..line, file1..|..line, file2..| ... |..line, file n..| ...
                                      ^                ^
The ^ marks delimit the part of the character buffer containing the
input line of the currently ``active'' file. If there is an \input, the
first mark is moved to the end of the used buffer and the first line of
the newly opened file is appended there. If there is no \input, the next
line read overwrites this buffer region.
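
To make the picture concrete, here is a toy model in Python. It is only
a sketch of the behaviour described above, not TeX's actual code; the
class and method names are invented for illustration, and the sample
lines are made up.

```python
class LineBuffer:
    """Toy model: one shared character buffer, one line per open file."""

    def __init__(self):
        self.buf = []      # the shared character buffer
        self.starts = []   # start index of each open file's current line

    def open_file(self, first_line):
        # \input: the new file's line goes after everything already in use.
        self.starts.append(len(self.buf))
        self.buf.extend(first_line)

    def next_line(self, line):
        # No \input: the innermost file's region is overwritten in place.
        del self.buf[self.starts[-1]:]
        self.buf.extend(line)

    def close_file(self):
        # End of file: drop its region; the outer file's line is still there.
        del self.buf[self.starts.pop():]

    def current(self):
        # The region between the two ^ marks.
        return "".join(self.buf[self.starts[-1]:])

    def layout(self):
        bounds = self.starts + [len(self.buf)]
        parts = ["".join(self.buf[a:b]) for a, b in zip(bounds, bounds[1:])]
        return "|" + "|".join(parts) + "|"


lb = LineBuffer()
lb.open_file("main")                  # command line arguments (hypothetical)
lb.open_file(r"a \input sub b")       # a line of main.tex (hypothetical)
lb.open_file("sub line 1")            # first line of the \input file
lb.next_line("sub line 2")            # replaces "sub line 1"
print(lb.layout())                    # |main|a \input sub b|sub line 2|
lb.close_file()
print(lb.current())                   # a \input sub b
```

After the inner file is closed, the rest of the outer line (here ` b`)
is still sitting in the buffer, which is the point David made above.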
> >Incidentally one reason why xmltex can not support utf16 is that
> >TeX buffers to ^J (or ^M) and throws away any bytes with value 32 that
> >occur at the end of this buffer, which might just be half of a 16bit
> >quantity that you'd rather keep. there's no way to control this
> >behaviour from within TeX.
> So TeX is a lot less sophisticated than it appears at first sight.
David has simplified it a lot. Instead of ``TeX buffers to ^J
(or ^M)'' it should read ``TeX buffers to the system-dependent and
file-type-dependent end-of-line marker''. Nowadays stream-oriented
files are common, where a special character (^J or ^M) or a special
combination of characters (^M^J) is used as the end-of-line marker. In
the past, and even nowadays, there exist other file types where no
end-of-line marker (i.e. no special character) is stored in the file,
e.g. files with a fixed-width record (aka line) length.
And if you have to deal with files using a fixed-width record length,
usually padded with blanks, it was (and still is?) a good idea to remove
these padding characters at an appropriate stage ... why not directly
after reading the line?
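
A small Python sketch of both sides of this, under the assumption of a
stream file with ^J as the line end and bytewise stripping of trailing
blanks (the function name is invented, and this is not TeX's code).
The padded record is cleaned up nicely; the UTF-16 line loses half of
its last character.

```python
def read_lines_bytewise(data: bytes) -> list[bytes]:
    """Split at byte 0x0A and drop trailing 0x20 bytes from each line."""
    return [line.rstrip(b" ") for line in data.split(b"\n")]


# A blank-padded record: stripping is exactly what one wants.
print(read_lines_bytewise(b"\\relax      "))

# U+2030 followed by a newline in little-endian UTF-16 is the byte
# sequence 30 20 0A 00. The 0x20 is the high byte of U+2030, not a blank.
raw = "\u2030\n".encode("utf-16-le")
print(raw.hex(" "))
print(read_lines_bytewise(raw))
```

The first call yields the record without its padding. In the second
case the line is cut at the 0A byte, the trailing 20 is thrown away,
and the 00 that belonged to the newline ends up at the start of the
next ``line''.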
> >> How can this be true?
> >By magic, or the will of Knuth, or something.
> Well, it's not magic, so it must be the other then.
\input is one of TeX's magic, \endinput is the other!
Back to utf16 et al. If one wants to make TeX read these files, the
additions to the input mechanism on the character level are simple and
straightforward: just extend TeX's input_ln() function to understand
utf16 and fill the input line buffer with characters with the
resulting codes. If all these codes are in the range 0..255, you are
done. The real work begins if the codes are not in this range,
because you have to extend TeX's representation of a token (for
example as it is done by Omega).
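
As a rough illustration of the easy half, here is a Python sketch. It
assumes the raw line has already been cut at the encoded line end, and
the function names and the byteorder parameter are invented for this
example; a real change would of course be made in TeX's own input_ln().

```python
def input_ln_utf16(raw_line: bytes, byteorder: str = "little") -> list[int]:
    """Decode one UTF-16 line and return the character codes for the buffer."""
    codec = "utf-16-le" if byteorder == "little" else "utf-16-be"
    text = raw_line.decode(codec)
    # Strip trailing blanks only after decoding, so no 16-bit unit is cut.
    return [ord(ch) for ch in text.rstrip(" ")]


def fits_8bit(codes: list[int]) -> bool:
    """True if every code fits an unmodified 8-bit TeX."""
    return all(code <= 255 for code in codes)


easy = input_ln_utf16("caf\u00e9 ".encode("utf-16-le"))
hard = input_ln_utf16("\u2030".encode("utf-16-le"))
print(easy, fits_8bit(easy))   # [99, 97, 102, 233] True
print(hard, fits_8bit(hard))   # [8240] False
```

The second line is where the real work mentioned above would start: a
code like 8240 does not fit into a token representation built for
0..255.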
> >At 14:44 -0500 2001/02/13, Michael John Downes wrote:
> >>Sorry, I didn't use the terminology very well. TeX input first goes into
> >>a string buffer, one line at a time. This string buffer is the only
> >>place where TeX deals with ASCII chars as input; all other "input
> >>streams" are streams of tokens. Tokenization occurs by scanning
> >>substrings from this string buffer and adding the corresponding token to
> >>the current input stream (which if we call it a "buffer", is a different
> >>buffer, not the one that contains simple 8-bit characters as first read
> >>from a file).
> >>If you get an error "TeX capacity exceeded: buffer size" it means
> >>that a line of the input file was too long to be read into the string
... to be read into the _rest_ of the _input line_ buffer.
The input line buffer can and will hold more than one line at one
time. And the string buffer or ``string pool'' is used to keep names
of control sequences, files etc.
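
The consequence for the ``buffer size'' error can be put as a bit of
arithmetic. The Python sketch below is only an assumption about the
bookkeeping (TeX's exact check may differ in detail) and the function
name and numbers are invented.

```python
def fits_in_rest(buf_size: int, pending_line_lengths: list[int],
                 new_line_length: int) -> bool:
    """Does a new line fit after the lines of all outer open files?"""
    used = sum(pending_line_lengths)
    return new_line_length <= buf_size - used


# Two outer files each hold a long line in the buffer; a 50-character
# line of the innermost file no longer fits, although 50 < 200.
print(fits_in_rest(200, [80, 90], 50))
print(fits_in_rest(200, [80, 90], 30))
```

So a line that is perfectly fine at the top level can overflow the
buffer when the same file is read from a deeply nested \input.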
> TeX really is a program from another age...
Yes! TeX was written between 1977 and 1982!