On Thu, Oct 24, 2013 at 10:01 AM, Lars Hellström
<[log in to unmask]> wrote:
> So unless you plan to run a single regexp against a very large amount of
> text, it's in all likelihood not worth it.
You may be right. I find it hard to be certain without testing, but
that would take quite some effort.
> An optimal approach for your setting would perhaps be a lazy powerset
> construction: don't carry it out for a particular state in the DFA until you
> actually reach that state, but then record the result; matching typically
> doesn't reach all the states of the DFA, since you don't match against
> arbitrary data. This, however, assumes that you have some efficient method
> of memoising the results from lazy computations. (I suppose a family of
> control sequence names could do the trick.)
Interesting, I haven't come across the idea of a lazy powerset
construction before. But I don't think I fully understand what you are
proposing. Using "a family of csnames" suggests to me the general
technique of maintaining an input-output table for a pure function.
But that would only memoize exact inputs; variations on the same
pattern wouldn't get any benefit, and it wouldn't work in my setting
anyway, where I get one token at a time. My current implementation
uses a "trie" data structure, which might be one step better than a
simple table, but still not good enough by far, I think.
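For reference, the trie is nothing fancy; it has roughly this shape (a Python sketch, not the actual TeX code; all names are made up):

```python
# Each node maps one token to a child node, so lookups advance a
# single token at a time, rather than requiring the whole input key
# up front as a plain input-output table would.

class TrieNode:
    def __init__(self):
        self.children = {}   # token -> TrieNode
        self.value = None    # payload if a stored sequence ends here

def trie_insert(root, tokens, value):
    node = root
    for tok in tokens:
        node = node.children.setdefault(tok, TrieNode())
    node.value = value

root = TrieNode()
trie_insert(root, ["a", "b", "c"], "match-abc")
node = root
for tok in ["a", "b", "c"]:
    node = node.children.get(tok)   # None would mean no continuation
# node.value == "match-abc"
```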
When I first read the term "lazy powerset construction" in your post,
I imagined restructuring the automaton on-the-fly somehow, which
sounded pretty cool. My second thought, though, is that the powerset
construction seems like an all-or-nothing deal: you can't get rid of
nondeterministic branches in the automaton until you're sure you've
captured all possible paths that lead from them into the DFA part, and
that can't be determined from a single input.
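If I read the proposal right, though, the laziness is per transition rather than per branch; in Python-ish pseudocode (not TeX; all names are my own invention) it would be something like:

```python
# Lazy powerset construction: a DFA state is a (frozen)set of NFA
# states, and each outgoing transition is computed and memoised only
# the first time it is needed.

def lazy_dfa_step(nfa_delta, memo, state, symbol):
    """Successor of `state` (a frozenset of NFA states) on `symbol`,
    computed on first use and cached in `memo` thereafter."""
    key = (state, symbol)
    if key not in memo:
        memo[key] = frozenset(
            t for s in state for t in nfa_delta.get((s, symbol), ())
        )
    return memo[key]

# Tiny NFA for (ab)*: 0 --a--> 1, 1 --b--> 0
nfa_delta = {(0, "a"): {1}, (1, "b"): {0}}
memo = {}
state = frozenset({0})
for c in "abab":
    state = lazy_dfa_step(nfa_delta, memo, state, c)
# Only the subset states actually reached ever get materialised.
```

The memo table here would be the "family of control sequence names" in the TeX version, I suppose.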
Ah, and about my feature request: It's not quite as simple as I stated
before (but not that far from it):
(1) It's not enough to go 'one step back' when a match is no longer
possible. I'd need to go back to the last time an accepting state was
reached. Moreover, I'd need to get back the tokens that weren't part
of the match, so I can feed them back into the input stream. (Although
this is something I already implemented on top of my "trie", and could
be easily transferred outside of the l3regex interface.)
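To make (1) concrete, the behaviour I want looks roughly like this (an illustrative Python sketch, not the l3regex interface):

```python
def match_longest(delta, accepting, start, tokens):
    """Feed tokens one at a time through DFA `delta`; remember the
    last point an accepting state was reached.  Return (match, rest),
    where `rest` are the tokens to feed back into the input stream."""
    state = start
    last_accept = 0          # length of the longest accepted prefix
    for i, tok in enumerate(tokens):
        state = delta.get((state, tok))
        if state is None:
            break            # no transition: stop and back up
        if state in accepting:
            last_accept = i + 1
    return tokens[:last_accept], tokens[last_accept:]

# DFA for ab+ : state 2 is accepting
delta = {(0, "a"): 1, (1, "b"): 2, (2, "b"): 2}
match, rest = match_longest(delta, {2}, 0, ["a", "b", "b", "c"])
# match == ["a", "b", "b"], rest == ["c"]
```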
(2) I would also need the ability to merge new (N|D)FAs into an
existing one and mark the corresponding accepting states with an
identifier. I could then query this identifier when a match was found,
so I could know which original pattern(s) matched.
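Point (2) is, I think, essentially what lexer generators do: take the union of the automata under a common start and tag each accepting state with the pattern it belongs to. A rough Python sketch (all names mine):

```python
def merge_automata(automata):
    """automata: list of (delta, accepting, start) triples with integer
    states.  States are renumbered to keep them disjoint.  Returns
    (delta, tags, starts): `tags` maps each accepting state to the
    index of the pattern it came from, and `starts` lists the renamed
    start states (to be joined by a common start, e.g. via epsilon)."""
    delta, tags, starts = {}, {}, []
    offset = 0
    for idx, (d, acc, start) in enumerate(automata):
        states = {s for (s, _) in d} | set(d.values()) | acc | {start}
        for (s, sym), t in d.items():
            delta[(s + offset, sym)] = t + offset
        for s in acc:
            tags[s + offset] = idx        # remember which pattern
        starts.append(start + offset)
        offset += max(states) + 1
    return delta, tags, starts

delta, tags, starts = merge_automata([
    ({(0, "a"): 1}, {1}, 0),   # pattern 0 matches "a"
    ({(0, "b"): 1}, {1}, 0),   # pattern 1 matches "b"
])
# tags == {1: 0, 3: 1}: reaching state 3 means pattern 1 matched.
```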
Hmm, I'm starting to suspect that a DFA is the only way to make this
at all efficient.