LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

	LISTSERV Archives
	LATEX-L Home

	Log In
	Register

	Subscribe or Unsubscribe

	Search Archives

Options:	Use Classic View Use Monospaced Font Show Text Part by Default Condense Mail Headers
Topic:	[<< First] [< Prev] [Next >] [Last >>]

Sender: Mailing list for the LaTeX3 project <[log in to unmask]>

Date: Thu, 24 Oct 2013 10:01:20 +0200

Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>

Message-ID: <[log in to unmask]>

Subject: Re: l3regex feature request + a question about the implementation

MIME-Version: 1.0

Content-Transfer-Encoding: 8bit

In-Reply-To: <[log in to unmask]>

Content-Type: text/plain; charset=ISO-8859-1; format=flowed

From: Lars Hellström <[log in to unmask]>

Parts/Attachments: text/plain (36 lines)

Michiel Helvensteijn skrev 2013-10-24 02.12:
> Secondly, an implementation question (which might lead to a
> discussion): You're now running a regex as a NFA (Nondeterministic
> Finite Automaton), keeping track of all active branches while matching
> against an input. Is there a particular reason you're not translating
> it into a DFA (Deterministic Finite Automaton) during the compilation
> phase, i.e., applying the powerset construction?

Not knowing the details of the implementation, I anyway hazard the guess 
that this is because running the NFA tends to be faster in practice. Sure, 
when executing an NFA you need to keep track of all active branches for this 
particular input, but in carrying out the powerset construction you need to 
consider any possible set of active branches for any possible input! You 
need to have run the match against a rather long string before NFA matching 
work starts to catch up with the work required by the powerset construction. 
So unless you plan to run a single regexp against a very large amount of 
text, it's in all likelyhood not worth it.

Even in cases such as yours, where you actually do mean to run the same 
regexp against lots of text, it's not entirely clear that it's worth it. The 
reason is that once the machinery is there, it is easy to underestimate how 
large the NFA is, and the exponential computational complexity of the 
powerset construction really penalises even small NFA size increases. At the 
same time, it's easy to overestimate the amount of text a matching on 
average gets applied to.

An optimal approach for your setting would perhaps be a lazy powerset 
construction: don't carry it out for a particular state in the DFA until you 
actually reach that state, but then record the result; matching typically 
doesn't reach all the states of the DFA, since you don't match against 
arbitrary data. This, however, assumes that you have some efficient method 
of memoising the results from lazy computations. (I suppose a family of 
control sequence names could do the trick.)

Lars Hellström

ATOM RSS1 RSS2

LISTSERV.UNI-HEIDELBERG.DE
Universität Heidelberg \| Impressum \| Datenschutzerklärung