On Wednesday, 14 February 2001 00:42:35 +0100, Hans Aberg writes:
> At 18:43 +0000 2001/02/13, David Carlisle wrote:
> >> What happens if a command in the middle of a line changes the catcodes
> >
> >makes no difference: the notion of line for the input buffer is hardwired
> >into the implementation it is not changeable via TeX commands and does
> >not depend on catcodes or the value of \endlinechar.
> >
> >> or contains a macro that expands to a \input <filename>?
> >
> >The rest of the line of the original file sits buffered in one of those
> >input streams until the input file finishes.
>
> OK.

TeX's input line buffer contains the current line of each currently open file, and at the beginning of each TeX run this line buffer is filled with all command line arguments. If TeX sees an \input with a filename as its argument, the new file is opened and its lines are placed, one at a time, after the input line in which the \input occurred. Thus if \input-s are stacked, one input line per file is put into the input line buffer. The buffer then looks like this:

  |..line, file1..|..line, file2..| ... |..line, file n..| ...
                                        ^                ^

The two ^ marks delimit the part of the character buffer containing the input line of the currently ``active'' file. If there is an \input, the first mark will be moved to the end of the used buffer and the first line of the newly opened file will be appended. If there is no \input, the next line read will overwrite this buffer region.

> >Incidentally one reason why xmltex can not support utf16 is that
> >TeX buffers to ^J (or ^M) and throws away any bytes with value 32 that
> >occur at the end of this buffer, which might just be half of a 16bit
> >quantity that you'd rather keep. there's no way to control this
> >behaviour from within TeX.
>
> So TeX is a lot less sophisticated than it appears at first sight.

David has simplified it a lot.
Instead of saying ``TeX buffers to ^J (or ^M)'' it should read ``TeX buffers to the system-dependent and file-type-dependent end-of-line marker''. Nowadays stream-oriented files are common, where a special character (^J or ^M) or a special combination of characters (^M^J) is used as the end-of-line marker. In the past, and even nowadays, there exist other file types where the end-of-line marker is not part of the file (i.e. there is no special character), e.g. files with a fixed-width record (aka line) length. And if you have to deal with files using a fixed-width record length, usually padded by blanks, it was (and still is?) a good idea to remove these padding characters at an appropriate stage ... why not directly after reading the line?

> >> How can this be true?
> >
> >By magic, or the will of Knuth, or something.
>
> Well, it's not magic, so it must be the other then. ;-)))

\input is one piece of TeX's magic, \endinput is the other!

Back to utf16 et al. If one wants to make TeX read these files, the additions to the input mechanism on the character level are simple and straightforward: just extend TeX's input_ln() function to understand utf16 and fill the input line buffer with characters with the resulting codes. If all these codes are in the range 0..255, you are done. The real work begins if the codes are not in this range, because you have to extend TeX's representation of a token (for example as it is done by Omega).

> >At 14:44 -0500 2001/02/13, Michael John Downes wrote:
> >>Sorry, I didn't use the terminology very well. TeX input first goes into
> >>a string buffer, one line at a time. This string buffer is the only
> >>place where TeX deals with ASCII chars as input; all other "input
> >>streams" are streams of tokens.
> >>Tokenization occurs by scanning
> >>substrings from this string buffer and adding the corresponding token to
> >>the current input stream (which if we call it a "buffer", is a different
> >>buffer, not the one that contains simple 8-bit characters as first read
> >>from a file).
> >>
> >>If you get an error "TeX capacity exceeded: buffer size" it means
> >>that a line of the input file was too long to be read into the string
> >>buffer.

... to be read into the _rest_ of the _input line_ buffer. The input line buffer can and will hold more than one line at one time. And the string buffer or ``string pool'' is used to keep names of control sequences, files etc.

> TeX really is a program from another age...

Yes! TeX was written between 1977 and 1982!

Best wishes,
-bernd
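
The buffer layout described above can be illustrated with a small Python sketch. It is only a toy model under stated assumptions, not TeX's actual code (TeX's input_ln() is part of the Pascal/WEB source and differs in detail). The names InputLineBuffer, BUF_SIZE, open_file, close_file, read_line and regions are hypothetical, and the sketch assumes stream-oriented files with ^J or ^M^J as end-of-line marker. It shows the three points made in this message: one shared buffer holds one line per open file, a newly read line overwrites the region of the active file, and trailing blanks are dropped right after the line is read.

```python
import io

BUF_SIZE = 64  # hypothetical capacity, chosen only for this illustration


class InputLineBuffer:
    """Toy model: one shared character buffer holding one line per open file."""

    def __init__(self, size=BUF_SIZE):
        self.buf = bytearray(size)
        self.first = 0    # first unused position in the shared buffer
        self.stack = []   # one [file, start, limit] entry per open file

    def open_file(self, f):
        # Like an \input: the new file's region starts at the end of the used buffer.
        self.stack.append([f, self.first, self.first])

    def close_file(self):
        # End of the innermost file: its region becomes free again.
        _, start, _ = self.stack.pop()
        self.first = start

    def read_line(self):
        """Read the next line of the innermost file; return False at end of file."""
        entry = self.stack[-1]
        f, start, _ = entry
        raw = f.readline()
        if not raw:
            return False
        # Drop the end-of-line marker, then the trailing blanks (byte value 32).
        line = raw.rstrip(b"\r\n").rstrip(b" ")
        if start + len(line) > len(self.buf):
            # The line does not fit into the *rest* of the buffer.
            raise OverflowError("buffer size exceeded")
        self.buf[start:start + len(line)] = line  # overwrite the active region
        entry[2] = start + len(line)
        self.first = entry[2]
        return True

    def regions(self):
        # The |..line, file1..|..line, file2..| picture as a list of byte strings.
        return [bytes(self.buf[s:e]) for _, s, e in self.stack]
```

To try it, open an outer file with open_file(io.BytesIO(...)), call read_line(), then open an inner file and call read_line() again; regions() then lists the buffered outer line followed by the current inner line. Because blanks are stripped byte by byte, a utf16 code unit whose second byte is 32 would lose that byte at the end of a line, which is the problem David mentions.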