Roozbeh Pournader writes:

> Also, many applications shipped with Windows 2000 attach a signature
> to the start of file (U+FEFF, Zero-Width No-Break Space) when they
> want to save the file, so that will make the autodetection much
> easier. The Unicode Standard accepts this as an autodetection
> mechanism, and says that this sequence (EF BB BF in UTF-8) is really
> improbable anywhere other than a UTF-8 file.

Such use of the byte sequence "EF BB BF" is a hack.  As the initial
three bytes of a stream of random bytes it has probability $2^{-24}$.
In many locales these bytes are even printable and displayable on
screen (in ISO 8859-1, for instance, they read "ï»¿"), and who knows
what they represent in someone else's locale, now or in the future.
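
To make the two points above concrete, here is a minimal Python sketch.
The helper names (bom_probability, bom_as_latin1, starts_with_bom) are
made up for illustration; only codecs.BOM_UTF8 comes from the standard
library.  It shows the chance of three uniformly random bytes matching
the signature, and how the same bytes look when read as ISO 8859-1.

```python
import codecs

# The UTF-8 encoding of U+FEFF: the three bytes EF BB BF.
BOM = codecs.BOM_UTF8


def bom_probability() -> float:
    """Chance that three uniformly random bytes equal EF BB BF."""
    # Each byte matches with probability 1/256, so (1/256)**3 = 2**-24.
    return 2.0 ** -24


def bom_as_latin1() -> str:
    """The same three bytes interpreted as ISO 8859-1 text."""
    return BOM.decode("latin-1")


def starts_with_bom(data: bytes) -> bool:
    """Heuristic check only: a match does not prove the data is UTF-8."""
    return data.startswith(BOM)


if __name__ == "__main__":
    print(bom_probability())
    print(bom_as_latin1())
    print(starts_with_bom(b"\xef\xbb\xbfhello"))
```

A file saved in ISO 8859-1 that legitimately begins with those three
characters would be misdetected by starts_with_bom, which is the
objection being made here.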

> Although, I do not have a good experience with that, I don't like my
> HTML files becoming non-conformant according to Unix checkers I have.

Under the rules, non-conforming XHTML (the next generation of HTML) is
supposed to be rejected by a conforming XML processor.  Invalid XHTML
has a high probability of failing to convey the author's intent
correctly.

The correct way to indicate utf-8 encoding is with something like

<?xml ... encoding="utf-8"?>

or in another context

Content-type: text/plain; charset="utf-8"

or (some day) \usepackage[utf8]{inputenc}

or ...

as appropriate in the context.
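
As a rough sketch of what honoring an explicit label looks like on the
receiving side, the Python below reads the charset parameter from a
Content-type value and decodes the body with it, rather than guessing
from the leading bytes.  The function names and the us-ascii fallback
are assumptions made for this example; Message.get_content_charset is
from the standard email package.

```python
from email.message import Message


def charset_from_content_type(header_value: str,
                              default: str = "us-ascii") -> str:
    """Return the declared charset, or the assumed default if absent."""
    msg = Message()
    msg["Content-Type"] = header_value
    # get_content_charset() lowercases the value and strips the quotes;
    # it returns None when no charset parameter is present.
    return msg.get_content_charset() or default


def decode_body(raw: bytes, header_value: str) -> str:
    """Decode raw bytes using the charset named in the header."""
    return raw.decode(charset_from_content_type(header_value))


if __name__ == "__main__":
    header = 'text/plain; charset="utf-8"'
    print(charset_from_content_type(header))
    print(decode_body("caf\u00e9".encode("utf-8"), header))
```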

                                     -- Bill