072 | TK: Byte Order Mark and Unicode signature
16/12: Investigate whether the spec's position on UTF-16/32 BOM is implementable |
The implementation team have tested how the Java and C implementations of
ICU decode input that begins with a BOM. The results are:
Java ICU libraries:
Encoding  | Input    | BOM included in decoded string?
UTF-8     | <BOM>AAA | yes
UTF-16    | <BOM>AAA | yes
UTF-16-LE | <BOM>AAA | yes
UTF-16-BE | <BOM>AAA | yes
UTF-32    | <BOM>AAA | no
UTF-32-LE | <BOM>AAA | no
UTF-32-BE | <BOM>AAA | no
C ICU libraries:
Encoding  | Input    | BOM included in decoded string?
UTF-8     | <BOM>AAA | yes
UTF-16    | <BOM>AAA | yes
UTF-16-LE | <BOM>AAA | yes
UTF-16-BE | <BOM>AAA | yes
UTF-32    | <BOM>AAA | yes
UTF-32-LE | <BOM>AAA | yes
UTF-32-BE | <BOM>AAA | yes
I suspect that the UTF-32 anomaly (the Java libraries drop the BOM for all
three UTF-32 encodings, whereas the C libraries keep it) is a defect in ICU.
I searched online to confirm this, but found no reference to it.
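For anyone who wants to repeat this kind of check, here is a minimal sketch of the test shape: decode "<BOM>AAA" under each encoding label and see whether U+FEFF survives at the start of the result. It assumes Python's built-in codecs as a stand-in, not ICU, so its results need not match either table above. The names CASES, bom_retained and run_all are illustrative only.

```python
# Sketch only: uses Python's built-in codecs, NOT the ICU libraries.
BOM = "\ufeff"

# Label under test -> explicit-endian codec used to build the input bytes.
# (The generic "utf-16"/"utf-32" labels need input in some concrete byte
# order; big-endian is an arbitrary choice made for this sketch.)
CASES = {
    "utf-8": "utf-8",
    "utf-16": "utf-16-be",
    "utf-16-le": "utf-16-le",
    "utf-16-be": "utf-16-be",
    "utf-32": "utf-32-be",
    "utf-32-le": "utf-32-le",
    "utf-32-be": "utf-32-be",
}


def bom_retained(label: str, builder: str) -> bool:
    """Decode <BOM>AAA with `label`; report whether U+FEFF is still there."""
    raw = (BOM + "AAA").encode(builder)  # explicit-endian codecs add no extra BOM
    return raw.decode(label).startswith(BOM)


def run_all() -> dict:
    """Run every case and return {label: BOM retained?}."""
    return {label: bom_retained(label, builder) for label, builder in CASES.items()}


if __name__ == "__main__":
    for label, kept in run_all().items():
        print(f"{label:10} | <BOM>AAA | {'yes' if kept else 'no'}")
```

With CPython's codecs, the generic utf-16 and utf-32 labels consume the BOM while the explicit-endian labels and utf-8 keep it, which is a third behaviour again and a reminder that decoders differ on this point.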
Before we conclude that the spec is OK as it stands, we should consider
whether it is correct to treat a BOM as a character. The Unicode standard
makes a clear distinction between characters and BOMs:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf
section 2.13 (don't skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1
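To make the alternative concrete: if a BOM is treated as a signature and not as a character, one option is to detect and remove it from the byte stream before decoding. The sketch below assumes that reading and is not taken from the spec. It uses the BOM constants in Python's codecs module, and the function names are hypothetical.

```python
import codecs

# Check UTF-32 before UTF-16: the UTF-32LE signature (FF FE 00 00) begins
# with the UTF-16LE signature (FF FE). UTF-16LE data whose first character
# after the BOM is U+0000 is therefore ambiguous; this sketch picks UTF-32LE.
SIGNATURES = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def split_signature(raw: bytes):
    """Return (encoding implied by the signature, or None; remaining bytes)."""
    for signature, encoding in SIGNATURES:
        if raw.startswith(signature):
            return encoding, raw[len(signature):]
    return None, raw


def decode_without_signature(raw: bytes, default: str = "utf-8") -> str:
    """Decode `raw`, treating any leading BOM as a signature, not as data."""
    encoding, payload = split_signature(raw)
    # The default encoding is an assumption of this sketch, used only when
    # no signature is present.
    return payload.decode(encoding or default)
```

For example, decode_without_signature(codecs.BOM_UTF16_LE + "AAA".encode("utf-16-le")) returns "AAA" with no leading U+FEFF.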
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742