072

TK: Byte Order Mark and Unicode signature
16/12: Investigate whether the spec's position on UTF-16/32 BOM is implementable

The implementation team have carried out tests on the Java and C implementations of ICU. The results are:

Java ICU libraries:

Encoding	Input	BOM included in decoded string?
UTF-8	<BOM>AAA	yes
UTF-16	<BOM>AAA	yes
UTF-16-LE	<BOM>AAA	yes
UTF-16-BE	<BOM>AAA	yes
UTF-32	<BOM>AAA	no
UTF-32-LE	<BOM>AAA	no
UTF-32-BE	<BOM>AAA	no

C ICU libraries:

Encoding	Input	BOM included in decoded string?
UTF-8	<BOM>AAA	yes
UTF-16	<BOM>AAA	yes
UTF-16-LE	<BOM>AAA	yes
UTF-16-BE	<BOM>AAA	yes
UTF-32	<BOM>AAA	yes
UTF-32-LE	<BOM>AAA	yes
UTF-32-BE	<BOM>AAA	yes
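
For anyone who wants to repeat this kind of check without an ICU installation, here is a minimal sketch using Python's built-in codecs. It is not the test the implementation team ran, and its results describe Python's decoders, not ICU's. The names CASES, bom_survives and run_cases are hypothetical, and little-endian input is assumed for the unmarked UTF-16 and UTF-32 cases.

```python
import codecs

# (encoding used to decode, BOM bytes, codec used to encode the payload "AAA")
CASES = [
    ("utf-8", codecs.BOM_UTF8, "utf-8"),
    ("utf-16", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("utf-16-le", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("utf-16-be", codecs.BOM_UTF16_BE, "utf-16-be"),
    ("utf-32", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("utf-32-le", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("utf-32-be", codecs.BOM_UTF32_BE, "utf-32-be"),
]


def bom_survives(encoding, bom, payload_codec):
    """Return True if decoding <BOM>AAA leaves U+FEFF at the start of the string."""
    data = bom + "AAA".encode(payload_codec)
    decoded = data.decode(encoding)
    return decoded.startswith("\ufeff")


def run_cases():
    """Decode every case and map each encoding name to the yes/no outcome."""
    return {enc: bom_survives(enc, bom, codec) for enc, bom, codec in CASES}


if __name__ == "__main__":
    # Print one row per encoding in the same layout as the tables above.
    for enc, kept in run_cases().items():
        print(f"{enc}\t<BOM>AAA\t{'yes' if kept else 'no'}")
```

With CPython's codecs, the unmarked UTF-16 and UTF-32 decoders consume the BOM, while plain UTF-8 and the explicit LE/BE decoders return it as U+FEFF. That differs from both ICU tables, which shows how much this behaviour varies between libraries.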

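I suspect that the UTF-32 anomaly (the Java libraries drop the BOM from the decoded string for all three UTF-32 encodings, whereas the C libraries keep it) is a defect in ICU. I tried to confirm this by searching online, but I did not find any reference to it.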

Before we conclude that the spec is OK as it stands, we should consider whether it is correct to treat a BOM as a character. The Unicode standard makes a clear distinction between characters and BOMs:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf section 2.13 (don't skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1
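
One reading of those references is that a leading U+FEFF is a signature only when the encoding name leaves the byte order open (UTF-8, UTF-16, UTF-32). Under an explicit LE or BE label it is an ordinary ZERO WIDTH NO-BREAK SPACE character and belongs to the data. The sketch below shows a post-decode step along those lines. It assumes the decoder has left U+FEFF in the string, and the names UNMARKED and strip_signature are hypothetical, not part of ICU or of the spec.

```python
# Encoding labels that do not fix the byte order, so a leading U+FEFF
# is read as a signature rather than as character data (an assumption
# based on the Unicode references above).
UNMARKED = {"UTF-8", "UTF-16", "UTF-32"}


def strip_signature(decoded, encoding):
    """Drop a leading U+FEFF only when the encoding label is unmarked."""
    if encoding.upper() in UNMARKED and decoded.startswith("\ufeff"):
        return decoded[1:]
    # Explicit LE/BE labels: keep U+FEFF, it is part of the text.
    return decoded
```

For example, strip_signature("\ufeffAAA", "UTF-16") returns "AAA", whereas the same string labelled "UTF-16-LE" comes back unchanged.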

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
