> Does that mean you compile to a finite automaton? Then double wow!

Thank you Lars for taking the time to look at the code. You may have
noticed some pieces of code relating to the {m,n} quantifier. The
implementation of that is not done yet. How important would that feature
be?

>> - Some day I may add printf-like string formatting. Is that useful?
>
> Frank wrote: I would think so. Anything with input/output formatting
> benefits from this. It is not a primary typesetting function but ... ;-)

I'll do it in a few weeks (months?).

>> - I had the idea of providing # as a shorthand for .*?
>
> Frank wrote: "no"
> Lars wrote: "no"

Ok.

> Lars: For parsing text where balancing matters, I would suggest using
> Parsing Expression Grammars (instead of mimicking Perlish extensions to
> regexps): most of the expressive power of BNFs (and then some), none of
> the ambiguity, and capable of doing it in linear time!

I looked at that this afternoon. Would that be the right framework for
code pretty-printing similar to listings/minted (but hopefully more
powerful)?

>> - Same question for caseless matching, and for look-ahead/look-behind
>> assertions.
>
> Personally (in other languages) I always find it useful to be able to
> match caselessly (as an option). But then this requires an understanding
> of what "case" is. Whether that is important enough to go to some
> length ... dunno.

It isn't clear yet what the regex module would be used for in a LaTeX
document :), and that feature is not too hard to add. I'll do that soon.

>> The l3regex module allows for testing if a string matches a given
>> regular expression, counting matches, extracting submatches, splitting
>> at occurrences of a regular expression,
>
> Sometimes one goes the other way, and returns all matches of a string. I
> wouldn't go so far as saying that this is more useful than splitting at
> things that match, however.
I followed Perl here: capturing groups are left as items in the sequence, so

  \regex_split:nnN { ( \W+ ) } { Hello,~world! } \l_result_seq

gives five items: {Hello} {,~} {world} {!} {}. So it is possible to
extract those matches (with the junk in between). If you think it's worth
it, I'll add \regex_extract_all:nnN, and rename \regex_extract:nnN to
\regex_extract_once:nnN.

> Having browsed the code now, I see that you execute the automaton in
> general NFA form. Have you considered executing it in epsilon-free NFA
> form instead? Getting rid of epsilons does not increase the number of
> states (it can even reduce the number of reachable states), but I'm not
> sure how it would interact with match path priority...

I don't know how to obtain an epsilon-free NFA form, and I am not sure
whether that would improve performance: the main problem with epsilon
transitions is that you need to keep track of which states have already
been reached through an epsilon transition, to avoid infinite loops. This
is done efficiently using one \dimen register per NFA state, and making
sure that the registers don't need to be reinitialized at every
character. Do you think that reaching an epsilon-free form (rather than
simply reducing the number of epsilon transitions) would be better?

I definitely need some optimization of the number of states, though (the
construction is currently quite wasteful for groups and alternation),
because most of the running time is spent going from state to state. The
details are hard to get right, because you need to consider all the
possibilities with quantifiers.

>> - Newlines. Currently, "." matches every character; in Perl and PCRE
>> it should not match newlines.
>
> The way I'm used to regexps, you can control using "metasyntax" whether .
> matches newline or not, and likewise how ^ behaves with respect to
> beginning-of-line, so I wouldn't find it wrong if ^ and \A are synonyms.

Then should it be controllable by the user?
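Incidentally, the splitting behaviour described earlier (capturing groups kept as items in the resulting sequence, with an empty piece when a separator ends the string) matches what Python's re module does with the same Perl convention, which may help as a cross-check of the intended semantics:

```python
import re

# Splitting on runs of non-word characters, with the separator in a
# capturing group, keeps the separators as items -- five pieces in all,
# including the empty piece after the final "!", just as in the
# \regex_split:nnN example.
pieces = re.split(r"(\W+)", "Hello, world!")
print(pieces)  # ['Hello', ', ', 'world', '!', '']
```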
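For reference, this is how the Perl-style inline options under discussion behave in Python's re module (a sketch of the usual semantics, not of l3regex itself):

```python
import re

# By default "." does not match a newline; (?s) (dotall) makes it match.
assert re.findall(r".", "a\nb") == ["a", "b"]
assert re.findall(r"(?s).", "a\nb") == ["a", "\n", "b"]

# (?i) turns on caseless matching.
assert re.fullmatch(r"(?i)hello", "HeLLo") is not None

# (?m) makes ^ and $ also match at line boundaries; it does not affect ".".
assert re.findall(r"(?m)^\w+", "one\ntwo") == ["one", "two"]
```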
I have to implement (?i) for caseless matching, so adding other options,
such as (?s) for making "." match everything (dotall) and (?m) for making
^ and $ match at line boundaries, is not hard. Not sure how useful that
would be, though.

>> - Same question for caseless matching, and for look-ahead/look-behind
>> assertions.
>
> Can you even do assertions without exploding the number of states?
> (Whenever I've thought about it, I tend to end up with the conclusion
> that one cannot, but I might be missing something.)

My approach for look-ahead assertions is to build an automaton for that
sub-regular-expression and store it somewhere. When the assertion is met
while running the NFA, grab what remains of the string, enter a new TeX
grouping level, set up the NFA for that assertion, run it within the
group, then close the group, returning the result of the assertion.

For look-behind assertions, you can in principle reverse the string and
reverse concatenation in the regular expression which you want to assert,
then run the reversed assertion on the reversed string. Everything is
then identical to a look-ahead assertion. The automaton-based approach
actually allows look-behind assertions of arbitrary length (which Perl
and PCRE can't do :D). It is not a fast way of doing things: in principle
it is linear in the size of the string each time an assertion is tested,
but the constant is small, so I think it ends up being linear in the
length of the part of the string that had to be read before deciding the
assertion.

Regards,

Bruno