LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender: Mailing list for the LaTeX3 project <[log in to unmask]>
Date: Mon, 10 Oct 2011 21:23:39 -0400
Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>
MIME-Version: 1.0
Message-ID: <[log in to unmask]>
In-Reply-To: <[log in to unmask]>
Content-Type: text/plain; charset=ISO-8859-1
From: Bruno Le Floch <[log in to unmask]>
> Does that mean you compile to a finite automaton? Then double wow!

Thank you, Lars, for taking the time to look at the code. You may
have noticed some pieces of code relating to the {m,n} quantifier;
its implementation is not done yet. How important would that feature
be?
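For context, one common way to compile {m,n} in an automaton-based engine is to desugar it at construction time into m mandatory copies of the atom followed by n-m optional ones. A rough Python sketch (the helper name is made up; this is not the l3regex implementation):

```python
import re

def desugar_counted(atom, m, n):
    """Rewrite atom{m,n} as m mandatory copies followed by
    (n - m) optional ones, so the compiled automaton never
    needs a counter.  Note that this duplicates states,
    which matters for large n."""
    return "(?:" + atom * m + ("(?:" + atom + ")?") * (n - m) + ")"

# a{2,4} becomes (?:aa(?:a)?(?:a)?)
assert re.fullmatch(desugar_counted("a", 2, 4), "aaa")
assert not re.fullmatch(desugar_counted("a", 2, 4), "aaaaa")
```

This is cheap for small bounds but copies the sub-automaton n times, which ties into the state-count concerns discussed further down.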

>> - Some day I may add printf-like string formatting. Is that useful?
>
> Frank wrote: I would think so. Anything with input/output formatting
> benefits from this. It is not a primary typesetting function but ... ;-)

I'll do it in a few weeks (months?).

>> - I had the idea of providing # as a shorthand for .*?
> Frank wrote: "no"
> Lars wrote: "no"

Ok.

> Lars: For parsing text where balancing matters, I would suggest using Parsing
> Expression Grammars (instead of mimicking Perlish extensions to regexps):
> most of the expressive power of BNFs (and then some), none of the ambiguity,
> and capable of doing it in linear time!

I looked at that this afternoon. Would that be the right framework for
code pretty-printing similar to listings/minted (but hopefully more
powerful)?
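For a taste of what PEGs offer, here is a minimal Python sketch (illustrative only) of the grammar `Balanced <- ('{' Balanced '}' / [^{}])*`, which recognizes balanced braces in linear time; no regular expression can do this:

```python
def parse_balanced(s, i=0):
    """PEG-style recursive descent for
       Balanced <- ('{' Balanced '}' / [^{}])*
    Returns the index reached from position i, or -1 on failure."""
    while i < len(s):
        if s[i] == "{":
            j = parse_balanced(s, i + 1)
            if j == -1 or j >= len(s) or s[j] != "}":
                return -1        # unmatched opening brace
            i = j + 1
        elif s[i] == "}":
            return i             # let the enclosing call consume it
        else:
            i += 1
    return i

def is_balanced(s):
    return parse_balanced(s) == len(s)
```

Each PEG choice is ordered and deterministic, which is where the no-ambiguity and linear-time properties come from.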

>> - Same question for caseless matching, and for look-ahead/look-behind
>> assertions.
>
> personally (in other languages) I always find it useful to be able to
> match caseless (as an option). But then this requires an understanding
> on what is "case". Whether that is however important enough to go some
> length ... dunno

It isn't clear yet what the regex module would be used for in a LaTeX
document :), and that feature is not too hard to add. I'll do that
soon.

>> The l3regex module allows for testing if a string matches a given
>> regular expression, counting matches, extracting submatches, splitting
>> at occurrences of a regular expression,
>
> Sometimes one goes the other way, and returns all matches of a string. I
> wouldn't go so far as saying that this is more useful than splitting at
> things that match, however.

I followed Perl here: capturing groups are kept as items in the sequence:

\regex_split:nnN { ( \W+ ) } { Hello,~world! } \l_result_seq

gives five items: {Hello} {,~} {world} {!} {}. So it is possible to
extract those matches (with the junk in between). If you think it's
worth it, I'll add \regex_extract_all:nnN, and rename
\regex_extract:nnN to \regex_extract_once:nnN.
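Python's re.split follows the same convention, so it can serve as a cross-check (`~` in the expl3 example corresponds to an ordinary space here):

```python
import re

# Capturing groups in the pattern are kept as items in the
# result, interleaved with the text between matches.
parts = re.split(r"(\W+)", "Hello, world!")
# → ['Hello', ', ', 'world', '!', '']
```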

> Having browsed the code now, I see that you execute the automaton in general
> NFA form. Have you considered executing it in epsilon-free NFA form instead?
> Getting rid of epsilons does not increase the number of states (it can even
> reduce the number of reachable states), but I'm not sure how it would
> interact with match path priority...

I don't know how to obtain an epsilon-free NFA form. And I am not sure
whether that would improve performance: the main problem with epsilon
transitions is that you need to keep track of which states have been
attained by an epsilon transition to avoid infinite loops. This is
done efficiently using one \dimen register per NFA state, and making
sure that they don't need to be reinitialized at every character. Do
you think that reaching an epsilon-free form (rather than simply
reducing the number of epsilon transitions) would be better?
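The bookkeeping described above can be sketched in Python (a hand-rolled illustration of the stamp idea, not the actual TeX implementation):

```python
def epsilon_closure(states, eps, stamp, step):
    """Follow epsilon transitions from `states` without looping.
    `stamp[s]` plays the role of one \\dimen register per NFA
    state: a state counts as visited at this step iff
    stamp[s] == step, so `stamp` never needs reinitializing
    between characters."""
    stack = list(states)
    closure = []
    while stack:
        s = stack.pop()
        if stamp[s] == step:
            continue  # already reached at this step
        stamp[s] = step
        closure.append(s)
        stack.extend(eps.get(s, ()))
    return closure

# Even with the epsilon cycle 0 -> 1 -> 0, the closure
# terminates and visits each state once.
assert sorted(epsilon_closure([0], {0: [1, 2], 1: [0]}, [-1] * 3, 0)) == [0, 1, 2]
```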

I definitely need to optimize the number of states, though (the
construction is currently quite wasteful for groups and alternation),
because most of the running time is spent moving from state to state.
The details are hard to get right, because you need to consider all
the possibilities with quantifiers.

>> - Newlines. Currently, "." matches every character; in perl and PCRE
>> it should not match new lines.
>
> The way I'm used to regexps, you can control using "metasyntax" whether .
> matches newline or not, and likewise how ^ behaves with respect to
> beginning-of-line, so I wouldn't find it wrong if ^ and \A are synonyms.

Then should it be controllable by the user? I have to implement (?i)
for caseless matching anyway, so adding other options, like (?s) for
dotall (letting the dot match newlines too) and (?m) for multiline
anchors, is not hard. Not sure how useful that would be, though.
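For reference, this is how the PCRE-style inline options behave (exercised here through Python's re, which uses the same syntax):

```python
import re

# (?i): caseless matching
assert re.search(r"(?i)latex", "LaTeX")
# (?s): "dotall" -- the dot also matches newlines
assert re.search(r"(?s)a.b", "a\nb")
assert not re.search(r"a.b", "a\nb")
# (?m): ^ and $ also match at internal line boundaries
assert re.search(r"(?m)^world", "hello\nworld")
```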

>> - Same question for caseless matching, and for look-ahead/look-behind
>> assertions.
>
> Can you even do assertions without exploding the number of states? (Whenever
> I've thought about it, I tend to end up with the conclusion that one cannot,
> but I might be missing something.)

My approach for look-ahead assertions is to build an automaton for
that sub-expression and store it somewhere. When the assertion is met
while running the NFA, grab what remains of the string, enter a new
TeX grouping level, set up the NFA for that assertion, run it within
the group, then close the group, returning the result of the
assertion.

For look-behind assertions, you can in principle reverse the string,
and reverse the order of concatenation in the regular expression you
want to assert, then run the reversed assertion on the reversed
string; everything else is identical to a look-ahead assertion. The
automaton-based approach actually allows look-behind assertions of
arbitrary length (which Perl and PCRE can't do :D).
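A toy Python illustration of the reversal trick (the `lookbehind` helper is made up, and only handles plain concatenations of optionally-quantified literals, just enough to show the idea):

```python
import re

def lookbehind(pattern, text, pos):
    """Test an arbitrary-length look-behind at `pos`: reverse
    the text before `pos` and reverse the concatenation order
    of the pattern's atoms, then run a forward match."""
    # Split into atoms (a literal, optionally quantified),
    # then reverse the atom list, not the raw characters.
    atoms = re.findall(r"\\?\w[*+?]?", pattern)
    reversed_pattern = "".join(reversed(atoms))
    return re.match(reversed_pattern, text[:pos][::-1]) is not None

# (?<=ab+) holds at position 4 of "xabbc": the reversed prefix
# "bbax" is matched by the reversed pattern "b+a".
assert lookbehind("ab+", "xabbc", 4)
assert not lookbehind("ab+", "xabbc", 1)
```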

It is not a fast way of doing things: in principle, each assertion
test is linear in the size of the string, but the constant is small,
so I think it ends up being linear in the length of the part of the
string that has to be read before deciding the assertion.

Regards,
Bruno
