LATEX-L Archives

Mailing list for the LaTeX3 project


Options: Use Forum View

Use Monospaced Font
Show Text Part by Default
Condense Mail Headers

Message: [<< First] [< Prev] [Next >] [Last >>]
Topic: [<< First] [< Prev] [Next >] [Last >>]
Author: [<< First] [< Prev] [Next >] [Last >>]

Print Reply
Mailing list for the LaTeX3 project <[log in to unmask]>
Tue, 11 Oct 2011 22:59:47 -0400
Mailing list for the LaTeX3 project <[log in to unmask]>
text/plain; charset=ISO-8859-1
Bruno Le Floch <[log in to unmask]>
text/plain (120 lines)
> I'm not really clear/keen on the 'save the regex' stuff. The result
> seems to be we have 'N' argument functions which need a pre-compiled
> regex, and 'n' ones which need a normal regex. I don't really like this,
> and am really not sure it's necessary to provide optimisation in this
> way. In the absence of use cases, I'm not sure about needing this type
> of additional complexity.

For short strings (e.g., matching \d\d\d\d-\d\d-\d\d on 2011-10-11),
one third of the time is spent on building the automaton from the
regular expression, and two thirds on running the automaton. I don't
know how important that is in practice. Two aspects:

  - providing it requires more code --- true

  - the N arguments may be confusing (e.g., some people may think that
it expects the regex as a string variable) --- not such a problem
because the variable is checked to indeed be a proper compiled regex.

If the feeling is that it should go, then I'll remove that this weekend.

>> - Newlines. Currently, "." matches every character; in perl and PCRE
> Treat them in a TeX way, and so have "." match everything. The likely
> use case for needing to avoid matching CR or LF is vanishingly small.

That seems to be the consensus.

>> - Same question for caseless matching, and for look-ahead/look-behind
>> assertions.
> As Frank said, what is 'case' here? A-Za-z only?

You have a good point, I wasn't thinking of Unicode. Do XeTeX or
LuaTeX provide access to Unicode properties of characters? Otherwise,
A-Z <=> a-z should be the only notion of case.

> My overall take is that what you have is very clever, but that we should
> wait for real use cases before pursuing adding more stuff which may be
> clever but may not be that useful in a TeX context. (I guess you will
> tackle "{m,n}" as this is reasonably standard.)

Right. I'll avoid bloating LaTeX3 :). I think the sty is around 1500
lines so far, plus 800 in l3str. Less than the current fp module.

On 10/11/11, Lars Hellström <[log in to unmask]> wrote:
> I don't tend to use it much myself, but I suspect some people will be very
> disappointed if it is missing (e.g. to cut 20111011 into 2011-10-11).

I'll implement {m,n}, but that requires quite a bit of thought, either
in the building phase or the running phase (not sure which would be
best yet).

>> I looked at that this afternoon. Would that be the right framework for
>> code pretty-printing similar to listings/minted (but hopefully more
>> powerful)?
> Strongly yes. As Ford mentions in his POPL paper on PEGs, the traditional
> context-free grammars suck at expressing things like "an identifier is a
> _longest_ sequence of consecutive alphanumeric characters", which is why
> there is typically a separate tokenising step before the parser proper.
> PEGs, on the other hand, have no problem with parsing from the character
> level and up in one go. For pretty-printing, I'd expect it to be a great
> convenience to not have to do it in two steps.

Do you have any idea on the natural resulting data structure? It's not
clear how trees should be implemented in TeX. Also, PEGs seem to lead
to endless trouble with left recursion. I'm a little bit afraid of
fighting an open problem :).

> since
>     \regex_split:nnN { (\w+) } { Hello,~world! } \l_result_seq
> gives
>     {} {Hello} {,~} {world} {!}
> if I understand you correctly. Maybe all that is needed is an example in the
> documentation.

I'll add \regex_extract_all:nnN this weekend.

> Rather than \regex_extract_once:nnN, I would probably call it
> \regex_extract_first:nnN.

Not sure about that. We already have \tl_replace_(once|all):Nnn, which
is what prompted me to use once there. @Joseph (and others), would
\tl_replace_first:Nnn make more sense than _once?

>> I don't know how to obtain an epsilon-free NFA form.
> For every state in the NFA, find out which states that can be reached
> [snip]
> A catch could be if the NFA representation you have is not very amenable to
> performing this operation on.

Thanks for the explanations!

I'll have to think about it. Getting to the epsilon-free form may help
a lot in implementing {m,n}, because then I don't need to keep track
of which states have been visited by epsilon transitions --- and with
{m,n} quantifiers, states are used multiple times. [Anyways, technical
details I'm not sure about yet.]

> OK, same technique as in Henry Spencer's regexp library then. Sounds
> feasible. The problem that (I think) leads to the state explosion when doing
> it in a pure automaton setting is that you in principle have to allow
> different invokations of the same assertion to overlap, so it's worse than
> merely doing an AND of two regexps (the main regexp and the assertion
> sub-regexp).

I haven't thought about doing it as a pure NFA. I think I'll go for
the most economical solution of running the assertion automaton on the
string, since that feature will probably not be used too much. I still
need to think of a good way of matching backwards, though (I believe
it is simply reversing all the transitions in the automaton, but the
representation I have makes that non-trivial).

I must postpone most changes to this weekend.