> Such use of the byte sequence "EF BB BF" is a hack.  It has
> probability $2^{-24}$ as the initial three byte sequence in a stream
> of random bytes.

The equivalent in UTF-16 isn't a hack, as UTF-16 mandates that a BOM
appearing as the first two bytes is always an encoding indicator.
(If you actually want to start with that character as data, you have to
prepend a byte order mark to the file.) I've a feeling that UTF-8 has
recently been changed to similarly say that it isn't legal UTF-8 to have
those bytes (as character data) at the start of the file.
This makes it a lot safer to "recognise" a UTF-8 BOM as a BOM rather
than character data.
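
As a rough illustration of "recognising" a BOM, here is a minimal Python
sketch. The function name sniff_bom is made up for this example, and it
only looks for the UTF-8 and UTF-16 marks discussed here (it assumes the
input is not UTF-32, whose little-endian BOM starts with the same two
bytes as the UTF-16 one).

```python
import codecs

# Only the marks discussed above; UTF-32 is deliberately left out (assumption).
_BOMS = [
    (codecs.BOM_UTF8, "utf-8"),          # EF BB BF
    (codecs.BOM_UTF16_LE, "utf-16-le"),  # FF FE
    (codecs.BOM_UTF16_BE, "utf-16-be"),  # FE FF
]


def sniff_bom(data: bytes):
    """Return (encoding, bom_length), or (None, 0) if no known BOM is found."""
    for bom, name in _BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


if __name__ == "__main__":
    raw = b"\xef\xbb\xbfhello"
    enc, skip = sniff_bom(raw)
    # Strip the mark before decoding so it does not show up as character data.
    print(enc, raw[skip:].decode(enc or "utf-8"))
```

Skipping the mark before decoding is the point: the three bytes are
treated as an indicator, not as text.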

> Under the rules non-conforming XHTML
If you know it's XML then you know the encoding anyway: it is UTF-8
unless you find an encoding declaration giving a different encoding
(or a UTF-16 byte order mark). (This ignores complications like the
fact that the encoding can be specified in the transport, e.g. MIME
headers, rather than in the file.)
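
That rule can be sketched in a few lines of Python. This is only an
illustration: xml_encoding and transport_charset are names invented
here, the regular expression is a simplification rather than a real XML
parser, and it assumes the declaration itself is readable as ASCII. It
also assumes a charset given by the transport takes priority.

```python
import re

# Simplified match for an encoding declaration at the very start of the file.
_DECL = re.compile(
    rb"""^<\?xml[^>]*?encoding\s*=\s*["']([A-Za-z][A-Za-z0-9._-]*)["']"""
)


def xml_encoding(head: bytes, transport_charset=None) -> str:
    """Guess the encoding of an XML document from its first bytes."""
    if transport_charset:
        # Assumption: a charset from e.g. MIME headers overrides the file.
        return transport_charset
    if head.startswith((b"\xfe\xff", b"\xff\xfe")):
        return "utf-16"  # UTF-16 byte order mark
    if head.startswith(b"\xef\xbb\xbf"):
        head = head[3:]  # skip a UTF-8 BOM before looking for the declaration
    m = _DECL.match(head)
    if m:
        return m.group(1).decode("ascii")
    return "utf-8"  # the XML default


if __name__ == "__main__":
    print(xml_encoding(b'<?xml version="1.0"?><doc/>'))
    print(xml_encoding(b'<?xml version="1.0" encoding="ISO-8859-1"?><doc/>'))
```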

> <?xml ... encoding="utf-8"?>
As I say, UTF-8 is the default encoding so that isn't necessary.
Also probably worth noting that XML (unlike LaTeX) does not require that
encodings have ASCII characters in ASCII positions, so it may be that
the above line will not be recognised at all (in cases where it is in a
non-standard encoding rather than UTF-8). An XML system might have to
read byte by byte to see if it recognises the byte stream as the characters
<?xml
in any encoding that it knows about.
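
A sketch of that byte-by-byte check in Python, under stated assumptions:
the candidate list (including cp037 as a sample EBCDIC code page) and
the name guess_family are chosen purely for illustration, not taken from
any particular XML system.

```python
# Encodings this hypothetical system "knows about"; cp037 is an EBCDIC
# code page in which "<?xml" does not sit in the ASCII positions.
CANDIDATES = ("utf-8", "utf-16-be", "utf-16-le", "cp037")


def guess_family(head: bytes):
    """Return the first candidate in which head starts with '<?xml', else None."""
    for enc in CANDIDATES:
        if head.startswith("<?xml".encode(enc)):
            return enc
    return None


if __name__ == "__main__":
    ebcdic = '<?xml version="1.0" encoding="ebcdic-cp-us"?>'.encode("cp037")
    print(guess_family(ebcdic))
    print(guess_family(b"<doc/>"))
```

A hit on "utf-8" here only says the bytes are ASCII-compatible; the
encoding declaration, once it can be read, still has to give the actual
encoding.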

> or (some day) \usepackage[utf8]{inputenc}
Typing utf8 inputenc latex into http://www.google.com indicates that you
can do that today, if you want. (Not tried it though.)

David