072
TK: Byte Order Mark and Unicode signature
16/12: Investigate whether the spec's position on UTF-16/32 BOM is implementable


The implementation team have carried out tests on the Java and C implementations of ICU.  The results are:

Java ICU libraries:
Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    no
UTF-32-LE    <BOM>AAA    no
UTF-32-BE    <BOM>AAA    no


C ICU libraries:
Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    yes
UTF-32-LE    <BOM>AAA    yes
UTF-32-BE    <BOM>AAA    yes


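I suspect that the UTF-32 anomaly (Java ICU leaves the BOM out of the decoded string for all three UTF-32 encodings, whereas C ICU includes it) is a defect in ICU. I tried to confirm this using Google, but I didn't find any reference to it online.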
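
For anyone who wants to repeat this kind of check against an independent decoder, here is a minimal sketch. It uses Python's built-in codecs, not ICU, so its results say nothing about ICU itself and need not match the tables above. The codec names, the CASES table and the helper names are my own assumptions for illustration; the unmarked UTF-16 and UTF-32 cases are fed little-endian input.

```python
import codecs

TEXT = "AAA"

# (label, Python codec used to decode, BOM bytes, codec used to build the payload)
# These are Python's codec names, assumed here as stand-ins for the ICU converters.
CASES = [
    ("UTF-8",     "utf-8",     codecs.BOM_UTF8,     "utf-8"),
    ("UTF-16",    "utf-16",    codecs.BOM_UTF16_LE, "utf-16-le"),
    ("UTF-16-LE", "utf-16-le", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("UTF-16-BE", "utf-16-be", codecs.BOM_UTF16_BE, "utf-16-be"),
    ("UTF-32",    "utf-32",    codecs.BOM_UTF32_LE, "utf-32-le"),
    ("UTF-32-LE", "utf-32-le", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("UTF-32-BE", "utf-32-be", codecs.BOM_UTF32_BE, "utf-32-be"),
]


def bom_in_decoded(decode_codec, bom, payload_codec):
    """Return True if U+FEFF is the first character of the decoded string."""
    data = bom + TEXT.encode(payload_codec)  # <BOM>AAA as raw bytes
    return data.decode(decode_codec).startswith("\ufeff")


def run_cases():
    """Decode <BOM>AAA under every encoding and record whether the BOM survived."""
    return {label: bom_in_decoded(dec, bom, enc) for label, dec, bom, enc in CASES}


if __name__ == "__main__":
    for label, kept in run_cases().items():
        print(f"{label:<12} <BOM>AAA    {'yes' if kept else 'no'}")
```

In CPython the unmarked utf-16 and utf-32 codecs consume the BOM, while utf-8 and the explicit LE/BE codecs return it as U+FEFF, so a third decoder can disagree with both ICU tables.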

Before we conclude that the spec is OK as it stands, we should consider whether it is correct to treat a BOM as a character. The Unicode standard makes a clear distinction between characters and BOMs:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf section 2.13 (don't skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1
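
To make the distinction concrete, here is a small sketch of how I read the FAQ: for UTF-8 and for the unmarked UTF-16 and UTF-32 encodings a leading U+FEFF is a signature and not part of the text, whereas for the explicit LE/BE encodings a leading U+FEFF is an ordinary character (ZERO WIDTH NO-BREAK SPACE). The function name, the set name and the encoding labels are hypothetical; this is not what the spec or ICU currently does.

```python
# Encodings in which a leading U+FEFF is treated as a signature rather than
# as a character (an assumption based on my reading of the Unicode FAQ).
SIGNATURE_ENCODINGS = {"UTF-8", "UTF-16", "UTF-32"}


def strip_signature(text, encoding):
    """Drop a leading U+FEFF only where it counts as a signature."""
    if encoding.upper() in SIGNATURE_ENCODINGS and text.startswith("\ufeff"):
        return text[1:]
    # UTF-16-LE, UTF-16-BE, UTF-32-LE, UTF-32-BE: U+FEFF stays in the text.
    return text
```

For example, strip_signature("\ufeffAAA", "UTF-16") gives "AAA", while the same input labelled "UTF-16-LE" is returned unchanged.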

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742





