Michiel Helvensteijn skrev 2013-10-24 13.43:
> On Thu, Oct 24, 2013 at 10:01 AM, Lars Hellström
> <[log in to unmask]> wrote:
>> An optimal approach for your setting would perhaps be a lazy powerset
>> construction: don't carry it out for a particular state in the DFA until you
>> actually reach that state
during matching (should I perhaps have added for clarity)
>> , but then record the result; matching typically
>> doesn't reach all the states of the DFA, since you don't match against
>> arbitrary data. This, however, assumes that you have some efficient method
>> of memoising the results from lazy computations. (I suppose a family of
>> control sequence names could do the trick.)
> Interesting, I haven't come across the idea of a lazy powerset
> construction before. But I don't think I fully understand what you are
> proposing. Using "a family of csnames" suggests to me the general
> technique of maintaining an input-output table for a pure function.
I was thinking one csname per state of the DFA (subset of states of the
NFA), that in turn acted as a function from input character to new state of
DFA. This should work perfectly well in your one-token-at-a-time setting.
> But that would only memoize exact inputs; variations on the same
> pattern wouldn't get any benefit, and it wouldn't work in my setting
> anyway, where I get one token at a time. My current implementation
> uses a "trie" datastructure, which might be one step better than a
> simple table. But still not good enough by far, I think.
> When I first read the term "lazy powerset construction" in your post,
> I imagined restructuring the automaton on-the-fly somehow, which
> sounded pretty cool.
In my experience, most operations on automata tend to produce automata with
epsilon-transitions, which one then at some cost (not cool) has to get rid
of before executing them. Unless the operation does something even worse to
the state set... :-(
> (2) I would also need the ability to merge new (N|D)FA's into an
> existing one and mark the corresponding accepting states with an
> identifier. I could then query this identifier when a match was found,
> so I could know which original pattern(s) matched.
Doable, but whichever way you do it, you effectively end up executing all
the component automata in parallel. With DFAs, the new state space is the
cartesian product of the component state spaces. With NFAs, you have a set
of active branches for each component automaton. So if you're anyway feeding
data to your automaton one token at a time, then you might as well feed each
token to a sequence of automata instead.