LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

From: Lars Hellström <[log in to unmask]>
Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>
Date: Tue, 11 Oct 2011 17:25:45 +0200
Bruno Le Floch wrote 2011-10-11 03.23:
>> Does that mean you compile to a finite automaton? Then double wow!
>
> Thank you Lars for taking the time to look at the code. You may have
> noticed some pieces of code relating to the {m,n} quantifier. The
> implementation of that is not done yet. How important would that
> feature be?

I don't tend to use it much myself, but I suspect some people will be very 
disappointed if it is missing (e.g. to cut 20111011 into 2011-10-11).
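
For illustration, that use case would look something like this (a sketch
using Python's re with the usual Perl-style semantics for the {m,n}
counted quantifiers, not l3regex itself):

    import re
    # \d{4}, \d{2} are the counted quantifiers under discussion
    re.sub(r"(\d{4})(\d{2})(\d{2})", r"\1-\2-\3", "20111011")
    # -> '2011-10-11'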

>> Lars: For parsing text where balancing matters, I would suggest using Parsing
>> Expression Grammars (instead of mimicing Perlish extensions to regexps):
>> most of the expressive power of BNFs (and then some), none of the ambiguity,
>> and capable of doing it in linear time!
>
> I looked at that this afternoon. Would that be the right framework for
> code pretty-printing similar to listings/minted (but hopefully more
> powerful)?

Strongly yes. As Ford mentions in his POPL paper on PEGs, the traditional 
context-free grammars suck at expressing things like "an identifier is a 
_longest_ sequence of consecutive alphanumeric characters", which is why 
there is typically a separate tokenising step before the parser proper. 
PEGs, on the other hand, have no problem with parsing from the character 
level and up in one go. For pretty-printing, I'd expect it to be a great 
convenience to not have to do it in two steps.
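
As a toy illustration of why (a minimal Python sketch with hypothetical
helper names, not any particular PEG library): PEG repetition is greedy
and never backtracks, so "identifier = alnum+" already means the longest
run of alphanumerics, with no separate tokenising pass.

    def alnum(s, i):
        # Match one alphanumeric character; return the new position or None.
        return i + 1 if i < len(s) and s[i].isalnum() else None

    def plus(p):
        # PEG e+ : greedy, possessive repetition of the parser p.
        def parse(s, i):
            j = p(s, i)
            while j is not None:
                k = p(s, j)
                if k is None:
                    return j
                j = k
            return None
        return parse

    identifier = plus(alnum)
    s = "foo42+bar"
    print(s[:identifier(s, 0)])   # 'foo42', parsed straight from characters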

>>> The l3regex module allows for testing if a string matches a given
>>> regular expression, counting matches, extracting submatches, splitting
>>> at occurrences of a regular expression,
>>
>> Sometimes one goes the other way, and returns all matches of a string. I
>> wouldn't go so far as saying that this is more useful than splitting at
>> things that match, however.
>
> I followed perl here: capturing groups are left as items in the sequence:
>
> \regex_split:nnN { ( \W+ ) } { Hello,~world! } \l_result_seq
>
> gives five items: {Hello} {,~} {world} {!} {}. So it is possible to
> extract those matches (with the junk in between). If you think it's
> worth it, I'll add \regex_extract_all:nnN, and rename
> \regex_extract:nnN to \regex_extract_once:nnN.

A
    \regex_extract_all:nnN { \w+ } { Hello,~world! } \l_result_seq
would produce
    {Hello} {world}
The main difference is: does my regexp specify what a "part" looks like, or 
does it specify what the text between the parts looks like? But I've 
occasionally missed having easy access to what comes between them, so _split 
is definitely good. _all is something one can often do without, especially since
    \regex_split:nnN { (\w+) } { Hello,~world! } \l_result_seq
gives
    {} {Hello} {,~} {world} {!}
if I understand you correctly. Maybe all that is needed is an example in the 
documentation.
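
For comparison, Perl-style engines behave the same way; shown here with
Python's re:

    import re
    re.split(r"(\w+)", "Hello, world!")
    # ['', 'Hello', ', ', 'world', '!']   (like \regex_split with a capture)
    re.findall(r"\w+", "Hello, world!")
    # ['Hello', 'world']                  (what an extract_all would return)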

Rather than \regex_extract_once:nnN, I would probably call it 
\regex_extract_first:nnN.

>> Having browsed the code now, I see that you execute the automaton in general
>> NFA form. Have you considered executing it in epsilon-free NFA form instead?
>> Getting rid of epsilons does not increase the number of states (it can even
>> reduce the number of reachable states), but I'm not sure how it would
>> interact with match path priority...
>
> I don't know how to obtain an epsilon-free NFA form.

For every state in the NFA, find out which states can be reached from it 
using epsilon-transitions only (its epsilon-closure). Then build a new NFA, 
with the same states as the old one, and give every state every outgoing 
non-epsilon transition of the old NFA that starts at any of the states 
epsilon-reachable from it. In the new NFA, a state is final if a final state 
of the old NFA is epsilon-reachable from it.

States in the old NFA whose only incoming transitions were epsilons are 
unreachable in the new NFA, and can be dropped.
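
A minimal sketch of this construction (in Python, with a hypothetical
dict-of-sets NFA representation, not the one l3regex uses):

    def eps_closure(q, eps):
        # All states reachable from q by epsilon-transitions only.
        seen, stack = {q}, [q]
        while stack:
            for r in eps.get(stack.pop(), ()):
                if r not in seen:
                    seen.add(r)
                    stack.append(r)
        return seen

    def eliminate_epsilons(states, delta, eps, finals):
        # delta[(q, a)] = set of successors of q on character a;
        # eps[q] = set of epsilon-successors of q.
        new_delta, new_finals = {}, set()
        for q in states:
            closure = eps_closure(q, eps)
            if closure & finals:          # some final state is eps-reachable
                new_finals.add(q)
            for (p, a), targets in delta.items():
                if p in closure:          # copy transitions out of the closure
                    new_delta.setdefault((q, a), set()).update(targets)
        return new_delta, new_finals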

One catch could be that the NFA representation you have is not very amenable 
to this operation.

> And I am not sure
> whether that would improve performance: the main problem with epsilon
> transitions is that you need to keep track of which states have been
> attained by an epsilon transition to avoid infinite loops. This is
> done efficiently using one \dimen register per NFA state, and making
> sure that they don't need to be reinitialized at every character. Do
> you think that reaching an epsilon-free form (rather than simply
> reducing the number of epsilon transitions) would be better?

You're essentially doing at runtime what an epsilon-elimination would do at 
compile-time. Hence an epsilon-elimination could in principle be worse if 
you're running the regexp once on a small string, since the elimination does 
the work throughout the NFA, whereas the runtime processing only touches 
those parts of the NFA that actually get visited for this string.

> I definitely need some optimization on the number of states, though
> (currently quite wasteful for groups and alternation), because most of
> the running time is going around from state to state.

Epsilon-elimination would give you the property that the automaton must 
advance in the string whenever it changes state.

> The details are
> hard to get right, because you need to consider all the possibilities
> with quantifiers.
>
>>> - Newlines. Currently, "." matches every character; in perl and PCRE
>>> it should not match new lines.
>>
>> The way I'm used to regexps, you can control using "metasyntax" whether .
>> matches newline or not, and likewise how ^ behaves with respect to
>> beginning-of-line, so I wouldn't find it wrong if ^ and \A are synonyms.
>
> Then should it be controllable by the user? I have to implement (?i)
> for caseless, so putting other options, like (?s) for dotall and (?m)
> for making dot not match everything, etc. is not hard. Not sure how
> useful that would be, though.

If you bother to implement it, (?s) should be the default. It would be the 
one that does the least work, and since (as Frank remarked) there will 
almost never be any newlines in the string anyway, why spend time looking 
out for them?
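
(For reference, the Perl/PCRE behaviour these flags toggle, shown with
Python's re:)

    import re
    re.search(r"a.b", "a\nb")       # no match: '.' skips newline by default
    re.search(r"(?s)a.b", "a\nb")   # matches: (?s) lets '.' match anything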

>
>>> - Same question for caseless matching, and for look-ahead/look-behind
>>> assertions.
>>
>> Can you even do assertions without exploding the number of states? (Whenever
>> I've thought about it, I tend to end up with the conclusion that one cannot,
>> but I might be missing something.)
>
> My approach for look-ahead assertions is to build an automaton for
> that sub-regular expression and store it somewhere. When the assertion
> is met upon running the NFA, grab what remains of the string, enter a
> new TeX grouping level, setup the NFA for that assertion, and run it
> within the group, then close the group returning the result of the
> assertion.

OK, same technique as in Henry Spencer's regexp library then. Sounds 
feasible. The problem that (I think) leads to the state explosion when doing 
it in a pure automaton setting is that you in principle have to allow 
different invocations of the same assertion to overlap, so it's worse than 
merely doing an AND of two regexps (the main regexp and the assertion 
sub-regexp).
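
The strategy Bruno describes amounts to something like this (a Python
sketch; the anchored re.match stands in for running the assertion's
sub-automaton on what remains of the string):

    import re

    def lookahead_holds(sub_pattern, s, i):
        # (?=sub) at position i succeeds iff sub matches starting at i.
        return re.compile(sub_pattern).match(s, i) is not None

    lookahead_holds(r"\d", "a1", 1)   # True: a digit follows position 1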

Lars Hellström
