On 16/03/2020 17:01, Kelly Smith wrote:
> I’ve been thinking: since Lua is already involved in the build process,
> by way of l3build, wouldn’t it be reasonable to use a lua script
> to preprocess Unicode data into forms that are easily consumed by LaTeX
> during the format-building process?
It depends on the outcome you are after.
The original loading method for Unicode data in XeTeX was via a Perl
script. That created a .tex file containing (for example) catcode data.
To update the Unicode data, one had to run the Perl script, then send
the processed files to CTAN. There were two issues. First, any change
required active work, not only to get the data from Unicode but also to
manipulate it. Second, and more significant, that approach was *slower*
than just reading the raw files in TeX. (This only became apparent when
I wrote some test parsers.)
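
For illustration only, the kind of pre-processing step described above
might look roughly like the following Python sketch. It is not the
original Perl script. The function name catcode_lines and the sample
records are hypothetical, and treating only general categories starting
with "L" as letters (catcode 11) is a simplifying assumption rather than
the rule set any real loader uses. The input is assumed to follow the
semicolon-separated layout of UnicodeData.txt.

```python
def catcode_lines(unicode_data_text):
    """Turn UnicodeData.txt-style records into TeX catcode assignments.

    Hypothetical helper: a sketch of pre-digesting Unicode data into a
    .tex file, under the assumptions stated in the text above.
    """
    out = []
    for record in unicode_data_text.splitlines():
        fields = record.split(";")
        if len(fields) < 3:
            continue  # skip blank or malformed lines
        code_point, name, category = fields[0], fields[1], fields[2]
        if name.endswith(("First>", "Last>")):
            continue  # range markers are not expanded in this sketch
        if category.startswith("L"):
            # Assumption: every letter category maps to catcode 11.
            out.append(f'\\catcode"{code_point}=11 % {name}')
    return out


# Two records in the UnicodeData.txt layout: one letter, one digit.
sample = (
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;\n"
    "0031;DIGIT ONE;Nd;0;EN;;1;1;1;N;;;;;\n"
)

for line in catcode_lines(sample):
    print(line)
```

Only the letter record produces an assignment; the digit is skipped. The
generated lines would then have to be written to a .tex file and shipped
alongside the raw data, which is the extra maintenance step at issue.
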
Now, there is more data being loaded today than when I did that work,
and some of it is loaded by LuaTeX, so could be handled in Lua alone.
It's also possible that the Perl script was sub-optimal, or that as part
of a general 'install' function the time would not really show. However,
XeTeX needs the data too and cannot run Lua, so one is still looking at
an explicit pre-processing step in Lua. Moreover, most of the time taken
for format-building is not spent reading Unicode data. With LuaTeX,
pre-loading expl3 does cut out a slight 'stall' when loading everything
for case-changing, but maintaining separate LuaTeX and XeTeX paths is
not attractive.
The current set-up means that updating the Unicode files is just a
question of copy-pasting the raw .txt files into a form that CTAN can
accept. Pre-digesting still leaves us needing some way to co-ordinate
between packages (format, luaotfload, expl3, specialist stuff), plus
having to do the explicit extraction.
As format-building is all about saving time for 'normal' runs, I don't
see a massive need to speed up the process. I know there is
one engine in development that doesn't use format files, so that might
be a place to consider things, but I think we'd need a strong case to
alter the approach for XeTeX/LuaTeX (pdfTeX, ...).