Date: Sun, 11 Feb 2001 16:46:57 -0500
Content-Type: text/plain

Roozbeh Pournader writes:
> Also, many applications shipped with Windows 2000 attach a signature
> to the start of file (U+FEFF, Zero-Width No-Break Space) when they
> want to save the file, so that will make the autodetection much
> easier. The Unicode Standard accepts this as an autodetection
> mechanism, and says that this sequence (EF BB BF in UTF-8) is really
> improbable anywhere other than a UTF-8 file.
Such use of the byte sequence "EF BB BF" is a hack. As the initial
three-byte sequence in a stream of random bytes it has probability
$2^{-24}$. In many locales it is even printable and displayable on
screen (in ISO 8859-1 it reads as the three characters "ï»¿"), and who
knows what it represents in someone else's locale, now or in the future.
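
To make the point concrete, here is a small Python sketch (my own
illustration, not anything from the Unicode Standard; the function names
are hypothetical). It shows the three signature bytes, the chance of
their turning up at the start of uniformly random bytes, and what the
same bytes look like when read under a different 8-bit encoding.

```python
import codecs

# The UTF-8 encoding of U+FEFF, i.e. the bytes EF BB BF.
SIGNATURE = codecs.BOM_UTF8


def signature_hex() -> str:
    """Return the signature as space-separated upper-case hex bytes."""
    return " ".join(f"{b:02X}" for b in SIGNATURE)


def chance_in_random_bytes() -> float:
    """Probability that three uniformly random bytes equal the signature."""
    return 2.0 ** -24  # (1/256) ** 3


def seen_in_encoding(encoding: str) -> str:
    """Decode the signature bytes as if they were text in `encoding`."""
    return SIGNATURE.decode(encoding)


if __name__ == "__main__":
    print(signature_hex())
    print(chance_in_random_bytes())
    # Under ISO 8859-1 the bytes are three ordinary printable characters.
    print(seen_in_encoding("latin-1"))
```

The probability is small but not zero, and the last call shows that the
"signature" is perfectly legitimate text in at least one widely used
encoding.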
> Although, I do not have a good experience with that, I don't like my
> HTML files becoming non-conformant according to Unix checkers I have.
Under the rules, non-conforming XHTML (next-generation HTML) is supposed
to be rejected by a conforming XML processor. Invalid XHTML has a high
probability of failing to convey the author's intent correctly.
The correct way to indicate UTF-8 encoding is with something like
    <?xml ... encoding="utf-8"?>
or, in another context,
    Content-type: text/plain; charset="utf-8"
or (some day)
    \usepackage[utf8]{inputenc}
or ...
as appropriate in the context.
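
As a rough sketch of what reading such an explicit label might look like
(Python, standard library only; both helper names are made up for this
example, and the regular expression is a simplification rather than a
full XML parser), a program can ask the document what it claims to be
instead of guessing from its first bytes:

```python
import re
from email.message import Message
from typing import Optional

# Simplified pattern for the encoding pseudo-attribute of an XML
# declaration at the very start of the data (an assumption of this
# sketch: no signature or whitespace precedes the declaration).
_XML_DECL = re.compile(
    rb'^<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z][A-Za-z0-9._-]*)["\']'
)


def charset_from_content_type(value: str) -> Optional[str]:
    """Return the charset parameter of a Content-Type value, if any."""
    msg = Message()
    msg["Content-Type"] = value
    return msg.get_content_charset()


def encoding_from_xml_declaration(data: bytes) -> Optional[str]:
    """Return the encoding named in a leading XML declaration, if any."""
    match = _XML_DECL.match(data)
    return match.group(1).decode("ascii") if match else None


if __name__ == "__main__":
    print(charset_from_content_type('text/plain; charset="utf-8"'))
    print(encoding_from_xml_declaration(
        b'<?xml version="1.0" encoding="utf-8"?><doc/>'))
```

Either helper returns None when no label is present, which leaves the
caller free to fall back on whatever default its context defines.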
-- Bill