LISTSERV - LATEX-L Archives - LISTSERV.UNI-HEIDELBERG.DE

LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Sender: Mailing list for the LaTeX3 project <[log in to unmask]>

Date: Thu, 13 Oct 2011 12:54:04 +0200

Reply-To: Mailing list for the LaTeX3 project <[log in to unmask]>

Subject: Re: Strings, and regular expressions

From: Heiko Oberdiek <[log in to unmask]>

On Thu, Oct 13, 2011 at 05:56:14AM -0400, Bruno Le Floch wrote:

> > I wouldn't do it manually. There are mappings files for Unicode:
> >   http://unicode.org/Public/MAPPINGS/
> > In project I am using these mappings together with a perl script
> > to generate the .def files.
> 
> Thank you for the link. It seems that the simplest would be to
> directly use the tables provided there as the .def files. Simply
> \catcode`\#=14, and set a few other default catcodes, then input the
> file, looping over the lines. Are all of the lines of the form
> 
> 0xHH    0xHHHH    # comment

Not all. dec-mcs.txt uses a different line format and has no comments:

  sprintf('=%02X     U+%04X  %s\n', <code>, <unicode>, <text>)
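
A minimal sketch of reading both line shapes, written in Python rather
than TeX purely for illustration. The function name
`parse_mapping_line` and the two regular expressions are assumptions
derived from the formats quoted above, not taken from any existing
script. Lines that fit neither shape are skipped rather than guessed
at.

```python
import re

# Shape asked about in the quoted message: "0xHH  0xHHHH  # comment"
HEX_LINE = re.compile(r"^0x([0-9A-Fa-f]+)\s+0x([0-9A-Fa-f]+)\s*(?:#\s*(.*))?$")
# Shape described for dec-mcs.txt: "=HH  U+HHHH  text" (no comment marker)
DEC_MCS_LINE = re.compile(r"^=([0-9A-Fa-f]{2})\s+U\+([0-9A-Fa-f]{4,6})\s+(.*)$")


def parse_mapping_line(line):
    """Return (code, unicode, text), or None for blank, comment-only,
    or unrecognised lines."""
    line = line.strip()
    if not line or line.startswith("#"):
        return None
    for pattern in (HEX_LINE, DEC_MCS_LINE):
        match = pattern.match(line)
        if match:
            code = int(match.group(1), 16)
            unicode_point = int(match.group(2), 16)
            text = (match.group(3) or "").strip()
            return code, unicode_point, text
    return None
```

Trying both patterns in turn keeps a single loop over the lines usable
for every file, at the cost of silently ignoring any line format not
anticipated here.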

> (or comment lines), with H = some hexadecimal digit? In other words,
> are all those encodings 8-bit only, and with only Unicode points
> <65536?

The MAPPINGS directory also contains encodings wider than 8 bits.
A quick look does not reveal any Unicode code points above U+FFFF.

> > It could be made even expandable in linear time
> > with a large lookup table (256).
> 
> Right. I was thinking in terms of UTF-8 for some reason, and the
> lookup table would be too big.

In practice the table would have to be larger than 256 (16 x 16)
entries to support both lowercase and uppercase hexadecimal digits
([0-9a-fA-F]).
The size would be 484 = (10 + 2 x 6) x (10 + 2 x 6).
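
The entry count can be checked with a small sketch, again in Python
for illustration only. It assumes the table maps each ordered pair of
hexadecimal digit characters to one byte value, which is what the
16 x 16 figure suggests. The names `HEX_PAIR_TABLE` and
`decode_hex_string` are hypothetical.

```python
# 10 decimal digits plus 6 lowercase and 6 uppercase letters: 22 characters
HEX_DIGITS = "0123456789abcdefABCDEF"

# One entry per ordered pair of digit characters: 22 * 22 = 484 entries
HEX_PAIR_TABLE = {
    high + low: int(high + low, 16)
    for high in HEX_DIGITS
    for low in HEX_DIGITS
}


def decode_hex_string(text):
    """Decode a string of hex digit pairs into bytes, using exactly one
    table lookup per pair, so the work is linear in the input length."""
    if len(text) % 2:
        raise ValueError("odd number of hex digits")
    return bytes(HEX_PAIR_TABLE[text[i:i + 2]] for i in range(0, len(text), 2))
```

For example, `len(HEX_PAIR_TABLE)` is 484 and
`decode_hex_string("4a4B")` gives `b"JK"`. Storing mixed-case pairs
such as `aF` directly avoids a separate case-normalisation pass over
the input.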

Yours sincerely
  Heiko Oberdiek
