LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Date: Tue, 11 Oct 2011 16:15:03 +0100

Subject: Re: Strings, and regular expressions

From: Joseph Wright

On 10/10/2011 16:07, Bruno Le Floch wrote:
> - Most encoding functions (as opposed to decoding) could be made
> expandable. Is it worth it?
> - There may be need to support other encodings. Which ones?
> - It is not clear how utf-8 input/output should be treated in pdftex.

What is needed for PDF bookmark support? This seems to be the main use
case for these things.

> - Some day I may add printf-like string formatting. Is that useful?

I agree with Frank here: there are clearly uses.

> The l3regex module allows for testing if a string matches a given
> regular expression, counting matches, extracting submatches, splitting
> at occurrences of a regular expression, and doing replacement (see
> documentation for function names).
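
For readers unfamiliar with the five operations listed, here is a rough
analogue using Python's re module. This is not the l3regex interface: the
sample string and patterns are made up, and the Python function names say
nothing about what the l3regex functions are called.

```python
import re

# Hypothetical sample data; Python's re stands in for l3regex here.
text = "a1,b22,c333"
pattern = r"([a-z])(\d+)"

matches = re.search(pattern, text) is not None   # test for a match
count = len(re.findall(pattern, text))           # count matches
first = re.search(pattern, text).groups()        # extract submatches
pieces = re.split(r",", text)                    # split at occurrences
replaced = re.sub(r"\d+", "#", text)             # replace every match
```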

[snip]

> If a regular expression is used a lot, a precompiled version of that
> expression can be stored in a token list. In principle, that can be
> written to a file, but it is not a very compact notation, so if that
> turns out to be useful, we can improve it.

I'm neither clear about nor keen on the 'save the regex' stuff. The result
seems to be that we have 'N'-argument functions which need a pre-compiled
regex, and 'n'-argument ones which need a normal regex. I don't really like
this split, and I'm not sure it's necessary to provide optimisation in this
way. In the absence of use cases, I'm not convinced we need this type
of additional complexity.
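
One way to avoid a pre-compiled versus normal split, sketched in Python
rather than expl3: a single entry point that accepts either form.
count_matches is a hypothetical helper, and nothing here shows whether the
same trick is practical at the TeX level.

```python
import re

def count_matches(regex, text):
    """Count matches; `regex` may be a pattern string or a compiled pattern."""
    # re.compile hands back an already-compiled pattern unchanged,
    # so callers do not need two differently named functions.
    compiled = re.compile(regex)
    return sum(1 for _ in compiled.finditer(text))

word = re.compile(r"\w+")  # compiled once, reusable many times
n_inline = count_matches(r"\w+", "one two three")
n_stored = count_matches(word, "one two three")
```

Both calls give the same count; only the second skips recompiling the
pattern.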

> - Newlines. Currently, "." matches every character; in perl and PCRE
> it should not match new lines. As you know, the situation with new
> lines in TeX is a little bit odd, since they are converted to the
> \endlinechar upon reading, and normally not tokenized, simply giving
> rise to a space or a \par. Should we still decide that "." does not
> match the CR or LF characters? Or should it simply not match the
> \endlinechar?

Treat them in a TeX way, and so have "." match everything. The chance of
needing to avoid matching CR or LF is vanishingly small.
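
For comparison, Python's re module follows the Perl/PCRE default and makes
"match everything" an opt-in flag. The snippet only illustrates the two
behaviours under discussion; the sample string is made up, and it says
nothing about how l3regex treats \endlinechar.

```python
import re

text = "a\nb"

# Default (Perl/PCRE-like): "." does not match a newline.
default_hit = re.search(r"a.b", text)

# With DOTALL, "." matches every character, newline included.
dotall_hit = re.search(r"a.b", text, flags=re.DOTALL)
```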

> - I had the idea of providing # as a shorthand for .*? (arbitrary
> sequence of characters, lazy), mimicking what TeX does when finding a
> macro parameter. Is it useful?

Don't really like it.
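
For reference, the lazy ".*?" that "#" would abbreviate behaves as below,
shown with Python's re module on a made-up input. The lazy form stops at the
first possible closing delimiter, much as TeX does for a delimited macro
argument, while the greedy form runs on to the last one.

```python
import re

text = "\\foo{one}{two}"

# Lazy: stop at the first closing brace.
lazy = re.search(r"\{(.*?)\}", text).group(1)

# Greedy: run on to the last closing brace.
greedy = re.search(r"\{(.*)\}", text).group(1)
```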

> - Same question for caseless matching, and for look-ahead/look-behind
> assertions.

As Frank said, what is 'case' here? A-Za-z only?
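
The question is a real one in other regex engines too. As an illustration
only (Python's re module, made-up inputs), the same caseless pattern gives
different answers depending on whether case folding is restricted to ASCII:

```python
import re

# ASCII letters fold as expected.
ascii_hit = re.fullmatch(r"tex", "TeX", flags=re.IGNORECASE)

# "\u00e9" is e-acute, "\u00c9" is its capital form.
# They fold together only when Unicode case rules are in effect.
unicode_hit = re.fullmatch("\u00e9", "\u00c9", flags=re.IGNORECASE)
ascii_only = re.fullmatch("\u00e9", "\u00c9", flags=re.IGNORECASE | re.ASCII)
```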

> - A facility for matching a balanced group (e.g., as xparse does for
> optional arguments)? That is non-regular, and is difficult to
> implement, so I will only look at it if it is really needed.

Again, I'm not keen.
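
On the "non-regular" point: matching nested brackets needs a depth counter
with no fixed upper bound, which a finite automaton cannot keep. A
hand-written scanner is short, though. The Python below is only a sketch;
balanced_group and its parameters are hypothetical names, and it is not a
description of how xparse does it.

```python
def balanced_group(text, start, open_ch="[", close_ch="]"):
    """Return the content of the balanced group opening at text[start].

    Returns None if text[start] is not open_ch or the group never closes.
    """
    if start >= len(text) or text[start] != open_ch:
        return None
    depth = 0
    for i in range(start, len(text)):
        if text[i] == open_ch:
            depth += 1          # one level deeper
        elif text[i] == close_ch:
            depth -= 1          # one level shallower
            if depth == 0:      # matching close for the opening bracket
                return text[start + 1:i]
    return None                 # unbalanced input

nested = balanced_group("[a[b]c]d", 0)
unclosed = balanced_group("[a[b]c", 0)
```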

My overall take is that what you have is very clever, but that we should
wait for real use cases before adding more features which may be clever
but not that useful in a TeX context. (I guess you will tackle "{m,n}",
as this is reasonably standard.)
--
Joseph Wright
