On 08/02/2015 12:36, Joseph Wright wrote:
> Hello all,
>
> A few months ago now we added various expandable case changing functions
> to expl3 with clearly 'experimental' status. I've recently had some
> useful feedback on aspects of the behaviour and have revised some of the
> code. I've now got some more questions, so thought it would be useful to
> raise those here. (Note: I've updated the SVN code but this has yet to
> go to CTAN. I can arrange a release if people want to test but not grab
> via GitHub.)
>
> *Background*
>
> The current implementation has six functions
>
>   \tl_upper_case:n
>   \tl_lower_case:n
>   \tl_mixed_case:n
>   \tl_upper_case:nn
>   \tl_lower_case:nn
>   \tl_mixed_case:nn
>
> where the two-argument versions deal with language-specific case
> changing. The functions are x-type expandable. 'Letters' can be case
> changed from the full Unicode range when using XeTeX/LuaTeX and the
> mappings do not have to be 1-1 (cf. \uppercase/\lowercase).
>
> There is also \str_fold_case:n which does folding for programmatic
> applications. That function has a different set of use cases and is not
> considered further here.
>
> *Escaping from case changing*
>
> The current implementation follows a BibTeX-like convention for
> preventing case changing: braced content is not changed. In the original
> approach there was no mechanism to do case changing inside the argument
> to a command as a result. I have now altered this to include a list of
> commands where case changing should be applied, so for example it would
> be possible to arrange that
>
>   \tl_upper_case:n { Hello~\emph{world} }
>
> will case change the argument to \emph. At present, this functionality
> is designed to work with commands taking one argument (i.e. a second or
> subsequent argument will be unaffected).
>
> The alternative to such an approach is to case change everything and
> provide an escape mechanism (cf. the textcase package and
> \NoChangeCase).
> As a user, I can see advantages to both approaches.
>
> One thing that is not currently covered is dealing automatically with
> math mode content. That is doable but would require some consistent
> interface. In particular, while dealing with "$ ... $" and "\( ... \)"
> is straight-forward (single-token delimiters), it would be more
> challenging to cover "\begin{math} ... \end{math}" or similar. Some of
> this has a relationship to expandability: see the next area.
>
> *Expandability*
>
> The current implementation is expandable as this allows the 'natural' usage
>
>   \tl_set:Nx \l_tmpa_tl
>     { \tl_upper_case:n { foo } }
>   \tl_show:N \l_tmpa_tl % => "FOO"
>
> Expandability imposes some restrictions on the code and does have a
> performance knock-on. The need to deal with changes that are not 1-1 or
> have other context-dependence means that the performance aspect is not
> so important: a full solution using \uppercase/\lowercase would still
> require a mapping or similar to deal with all of the possibilities.
>
> One area that is more tricky in this regard is input which is not fully
> expanded. For example
>
>   \def\myname{Joseph Wright}
>   \MakeUppercase{Written by \myname}
>
> will yield "WRITTEN BY JOSEPH WRIGHT" as there is an \edef inside the
> LaTeX2e command before case changing. In contrast, the expl3 functions
> currently do no expansion so
>
>   \tl_upper_case:n { Written~by~\myname }
>
> gives "WRITTEN BY Joseph Wright". Notably, if used in setting a token
> list the content would be "WRITTEN BY \myname", i.e. further expansion
> is inhibited.
>
> It is not clear to me what the 'expected' outcome might be. It would be
> possible to use f-type expansion to deal with stored tokens before case
> changing, but for input such as
>
>   \tl_upper_case:n { Written~by \\ Joseph~Wright }
>
> that could break outcomes with LaTeX2e: \\ would be 'lost' and this
> could be problematic if the text was used later in, for example, a
> center environment.
> A non-expandable implementation could use the same
> logic as \MakeUppercase but at the cost that case changing for storage
> would then need dedicated functions, for example
>
>   \tl_set_upper_case:Nn
>   \tl_set_lower_case:Nnn
>
> This loses the 'natural' approach to case changing inside a tl setting
> and requires separate 'set a tl with case changing' and 'typeset case
> changed text' functions.
>
> *LICR/Non-native input*
>
> The original implementation for the expl3 functions only case changes
> letters. Adding an 'escape' to cover e.g. \emph also allows coverage of
> things like "\'{e}" and so it was natural to consider LICR input. I have
> therefore extended the code to allow coverage of everything handled by
> \MakeUppercase when T1/T2A/T2B/T2C/T4/T5/LGR encodings are in use. There
> is of course a performance hit, but this should be comparable to that
> for processing letters.
>
> That then leaves the question of input outside of the ASCII range when
> using pdfTeX. It would I think be possible to do this using an approach
> detecting inputenc active chars, but I am reluctant to go this way (in
> the longer term it will be increasingly hard to justify using an 8-bit
> program as the world standardises on Unicode). With inputenc loaded case
> changing does work if the input goes via LICR
>
>   \documentclass{article}
>   \usepackage[utf8]{inputenc}
>   \usepackage{expl3}
>   \makeatletter
>   \ExplSyntaxOn
>   \cs_generate_variant:Nn \tl_upper_case:n { V }
>   \cs_new_protected:Npn \MakeExplUpperCase #1
>     {
>       \group_begin:
>         \protected@edef \l_tmpa_tl {#1}
>         \tl_upper_case:V \l_tmpa_tl
>       \group_end:
>     }
>   \ExplSyntaxOff
>   \makeatother
>   \begin{document}
>   \MakeExplUpperCase{Héllo}
>   \end{document}
>
> Again, this has a link to expandability.
>
> *Naming*
>
> As noted in previous mails on this topic, the naming here (\tl_...) at
> least in part reflects the fact this code is difficult to name. Any better
> naming schemes welcome!
>
> *Conclusions*
>
> The current code works but there are open questions. What I am hoping
> for is feedback on the ideas and in particular what issues come up with
> real use cases. Ideas about all or any of the above, or indeed other
> aspects, most welcome.

I've had some feedback via other channels and will summarise here 'for
the record'. (Sources: transcript
http://chat.stackexchange.com/transcript/message/19958526#19958526
onward and direct mail.)

*Escaping from case changing*

David Carlisle points out that using the BibTeX-like approach leaves a
problem with ligatures. Whilst input such as {Text} rather than {T}ext
does help, the alternative route taken by textcase, \NoChangeCase{Text},
allows for the 'escape' mechanism to be entirely transparent at the
typesetting stage (as the appropriate commands can be equivalent to
\use:n). Barbara Beeton provides a useful example where a brace group is
'trapped' inside a word with the BibTeX-like scheme: for example,
MacArthur => MacARTHUR requires input M{ac}Arthur with the current set
up, and this cannot be done in a way that avoids a ligature break.

I am therefore minded to alter the approach in this area to follow
textcase: such a change will, if done, include adding a sensible set of
standard commands to the 'ignore list' (\label, \ref, ...).

Adopting a textcase-like approach also suggests that automatically
handling math mode might be desirable: a first pass for that might well
be based on matching single-token delimiters ($...$/\(...\) as standard
settings) with logic that more complex arrangements will be best covered
by the \NoChangeCase concept.

*Expandability*

One approach suggested (again by David C.) to this area is to start with
an assumption of e-TeX (\robustify from the etoolbox package, for
example, can be used to make existing commands e-TeX protected). With
that assumption, it is relatively straight-forward to expand
'variable-like' macros and leave 'command-like' ones alone.
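
As a minimal sketch of that distinction (plain LaTeX2e with etoolbox
rather than the expl3 code itself; \myname, \myemph and \mytext are
made-up names used only for illustration), an \edef expands the
unprotected macro but leaves the \robustify'd one alone:

    \documentclass{article}
    \usepackage{etoolbox}
    % \myname, \myemph and \mytext are hypothetical names for illustration
    \newcommand*\myname{Joseph Wright}   % 'variable-like': just stores text
    \newcommand*\myemph[1]{\emph{#1}}    % 'command-like': acts on its argument
    \robustify\myemph                    % make \myemph e-TeX protected
    % Inside \edef, \myname is replaced by its text; \myemph is not expanded
    \edef\mytext{Written by \myname, with \myemph{care}}
    \typeout{\meaning\mytext}            % write the stored definition to the log
    \begin{document}
    \mytext
    \end{document}

A case changer working on the result would then see the stored text as
letters while \myemph is still a single token it can either skip or
look up in a list.
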
(I already have code that does much the same in siunitx.)

Retaining an expandable approach does seem sensible as it allows what
many other languages do: case changing in a 'functional' sense (or
rather as a macro language in an x-type expansion sense). As already
noted, the need for contextual case mappings means that using the TeX
primitives directly still requires a separate mapping phase and so
performance issues are not so significant.

*LICR/Non-native input*

As the code here is being developed primarily for use to support future
work, and that will increasingly mean Unicode-native engines, comments
here suggest sticking to the 'ASCII/Unicode' line taken to date. As
such, pdfTeX use with non-ASCII input will need pre-processing via
\protected@edef as suggested to produce LICR data which can be handled
correctly.

Depending on other feedback, I will likely implement the above changes
over the coming days and then look to update the release code.
--
Joseph Wright