Thu, 24 Oct 2013 10:01:20 +0200
text/plain; charset=ISO-8859-1; format=flowed
Michiel Helvensteijn skrev 2013-10-24 02.12:
> Secondly, an implementation question (which might lead to a
> discussion): You're now running a regex as a NFA (Nondeterministic
> Finite Automaton), keeping track of all active branches while matching
> against an input. Is there a particular reason you're not translating
> it into a DFA (Deterministic Finite Automaton) during the compilation
> phase, i.e., applying the powerset construction?
Not knowing the details of the implementation, I anyway hazard the guess
that this is because running the NFA tends to be faster in practice. Sure,
when executing an NFA you need to keep track of all active branches for this
particular input, but in carrying out the powerset construction you need to
consider any possible set of active branches for any possible input! You
need to have run the match against a rather long string before NFA matching
work starts to catch up with the work required by the powerset construction.
So unless you plan to run a single regexp against a very large amount of
text, it's in all likelyhood not worth it.
Even in cases such as yours, where you actually do mean to run the same
regexp against lots of text, it's not entirely clear that it's worth it. The
reason is that once the machinery is there, it is easy to underestimate how
large the NFA is, and the exponential computational complexity of the
powerset construction really penalises even small NFA size increases. At the
same time, it's easy to overestimate the amount of text a matching on
average gets applied to.
An optimal approach for your setting would perhaps be a lazy powerset
construction: don't carry it out for a particular state in the DFA until you
actually reach that state, but then record the result; matching typically
doesn't reach all the states of the DFA, since you don't match against
arbitrary data. This, however, assumes that you have some efficient method
of memoising the results from lazy computations. (I suppose a family of
control sequence names could do the trick.)