LATEX-L Archives

Mailing list for the LaTeX3 project

LATEX-L@LISTSERV.UNI-HEIDELBERG.DE

Joseph Wright
Mon, 30 Jun 2014 09:22:39 +0100

Hello all,

To support case-changing operations in expl3, the team some time ago
added an experimental pair
\tl_expandable_uppercase:n/\tl_expandable_lowercase:n as alternatives to
\tl_to_uppercase:n/\tl_to_lowercase:n. While the expandable operations
are useful, there are issues both in terms of naming (solvable) and
functionality (more complex). In particular, they cover only the ASCII
range and do not offer some of the context-sensitive case changing that
is required for languages other than English.

In order to address this, we have now added a new set of experimental
functions to l3candidates:

 - \tl_upper_case:n(n)
 - \tl_lower_case:n(n)
 - \tl_mixed_case:n(n)

These are x-type expandable, so they can be used inside, for example,
\tl_set:Nx, and when used with XeTeX or LuaTeX they offer full UTF-8
character coverage. What we are hoping for is some feedback on the
interfaces, naming, etc.: we believe that the ideas are useful, and hope
in the longer term to use these to replace
\tl_expandable_(upper|lower)case:n and \tl_to_(upper|lower)case:n for
case changing. (The latter will still be required for generating
non-standard catcodes: we will provide a better interface for that
process at a later date.)

(We note that while expandability is not absolutely required in this
area, there are advantages to being able to simply set a tl to the
case-changed version of text. We therefore feel that expandability is
desirable here and the approach we have taken to some technical issues
reflects this.)
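
As a minimal sketch of what x-type expandability allows, the following
stores a case-changed copy of some text in a token list variable. It
assumes an expl3 release that provides the experimental \tl_upper_case:n
described above; the variable name \l_demo_text_tl is hypothetical and
used only for illustration.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Hypothetical variable, for illustration only.
\tl_new:N \l_demo_text_tl
% x-type expansion runs the case changer before the assignment,
% so the variable holds the result rather than the function call.
\tl_set:Nx \l_demo_text_tl { \tl_upper_case:n { hello } }
% Write the stored text to the terminal and log.
\iow_term:x { \tl_use:N \l_demo_text_tl }
\ExplSyntaxOff
\begin{document}
\end{document}
```

A non-expandable case changer could not be used this way: the assignment
would need a separate step to perform the change first.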

The versions with one argument do a relatively simple
language-insensitive mapping:

  \tl_lower_case:n { HELLO } => "hello"
  \tl_upper_case:n { hello } => "HELLO"
  \tl_mixed_case:n { HELLO } => "Hello"

while the two-argument versions can do language-dependent changes, such
as dotted/dotless-i/I handling in Turkish:

  \tl_upper_case:nn { tr } { i } => "İ"
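
The one- and two-argument forms can be compared side by side with a
small test file. This is only a sketch, assuming an expl3 release that
provides these experimental functions and an engine (XeTeX or LuaTeX)
able to handle the non-ASCII result; \iow_term:x is used simply to write
each result to the terminal.

```latex
% Compile with XeLaTeX or LuaLaTeX so that non-ASCII results are available.
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% One argument: language-insensitive mapping.
\iow_term:x { \tl_upper_case:n { i } }
% Two arguments: the first is the language, here Turkish.
\iow_term:x { \tl_upper_case:nn { tr } { i } }
% Mixed case: first letter upper case, remainder lower case.
\iow_term:x { \tl_mixed_case:n { HELLO } }
\ExplSyntaxOff
\begin{document}
\end{document}
```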

The 'mixed' case variant is the low-level command needed to implement
'sentence' or 'title' case (the Unicode Consortium refer to both the
low-level and higher-level mapping as title casing): here, there is no
attempt to pick up on 'words' in a 'sentence'. (Once discussion on these
lower level functions is complete, we will look to see how best to
provide higher-level code for title/sentence casing: these operations
clearly apply to 'text' not 'token lists'.)

Some of what is required here is clear from the Unicode docs.
Implementing some of the requirements in TeX, particularly in an
expandable form, requires some modification of the described algorithms.
Thus areas where feedback is particularly welcome include:

 - Brace groups/escaping: the current version takes an approach similar
   to BibTeX, treating all brace groups as 'preserved'. This is a
   clear rule but leaves open questions on how (if at all) to handle
   commands in 'text'. Notably, these functions are intended for 'text
   like' input, so this may not be an issue. Notice that math mode
   is given no special treatment but can be protected from case
   changing by bracing.

 - Category code treatment: should case operations apply to chars on
   a string-like basis (current approach) or only to 'letters'? Again,
   as these functions seem to target 'text', category codes may not be
   as important here as in some other context.

 - Chars to skip at the start of 'text' when doing 'mixed' casing
   (what counts as the first 'letter').

 - The 'final sigma' rule in Greek: trying to handle all cases here is
   challenging in an expandable TeX system, and so we have implemented
   a more limited approach which counts a sigma as 'final' if followed
   by a small set of chars (currently a space or one of "!'),.:;?]}").

 - The 'dot above' rule in Lithuanian: we have again implemented this
   using a more restricted approach than the Unicode docs describe,
   focussing only on chars/accents which are (we understand) used in
   Lithuanian.

 - Whether 'mixed' case is a clear description of the idea of
   (informally) upper casing the first letter in 'text' and then
   lower casing the remainder.
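
To make the brace-group point concrete, here is a minimal sketch of
protecting material by bracing. It assumes an expl3 release that
provides the experimental \tl_upper_case:n, and it deliberately does not
show expected output, since the exact form of the result (for example
whether the braces themselves are retained) is not stated above.

```latex
\documentclass{article}
\usepackage{expl3}
\ExplSyntaxOn
% Unbraced text is case changed; the braced group is described as preserved.
\iow_term:x { \tl_upper_case:n { hello~{world} } }
% Math mode gets no special treatment, so it is braced here to protect it.
\iow_term:x { \tl_upper_case:n { the~value~of~{$x$} } }
\ExplSyntaxOff
\begin{document}
\end{document}
```

Comparing the second line with an unbraced $x$ is a quick way to see
what the lack of special math-mode treatment means in practice.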

The code has not gone to CTAN yet but is available on the GitHub mirror.
See in particular
https://github.com/latex3/svn-mirror/blob/master/l3kernel/l3candidates.dtx
and
https://github.com/latex3/svn-mirror/blob/master/l3kernel/l3unicode-data.def
(the latter is needed as it contains the data used for the transformations).

Feedback on all of this is very welcome: we hope to provide a
high-quality interface for case changing such that it can readily be
applied to a range of situations.
-- 
Joseph Wright
