Re: [DFDL-WG] dfdl-wg Digest, Vol 40, Issue 17 : Action 072 resolved

18 Dec 2009

      072
TK: Byte Order Mark and Unicode signature 
16/12: Investigate whether the spec's position on UTF-16/32 BOM is 
implementable

The implementation team have carried out tests on the Java and C 
implementations of ICU.  The results are:

Java ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    no
UTF-32-LE    <BOM>AAA    no
UTF-32-BE    <BOM>AAA    no

C ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    yes
UTF-32-LE    <BOM>AAA    yes
UTF-32-BE    <BOM>AAA    yes
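
For anyone who wants to repeat this kind of check against another decoder, here is a minimal sketch of the test. It is not the harness the implementation team used. It relies on Python's built-in codecs rather than ICU, so its answers describe Python only and need not match either table above. The names BOM, INPUT_CODEC and bom_in_decoded are made up for illustration.

```python
# Sketch only: uses Python's standard codecs, NOT ICU.
BOM = "\ufeff"  # U+FEFF, the code point used as the byte order mark

# Decoder under test -> explicit byte-order codec used to build the input.
# Assumption: the generic UTF-16 / UTF-32 decoders are fed big-endian input.
INPUT_CODEC = {
    "utf-8": "utf-8",
    "utf-16": "utf-16-be",
    "utf-16-le": "utf-16-le",
    "utf-16-be": "utf-16-be",
    "utf-32": "utf-32-be",
    "utf-32-le": "utf-32-le",
    "utf-32-be": "utf-32-be",
}


def bom_in_decoded(decoder: str) -> bool:
    """Decode <BOM>AAA with `decoder`; report whether U+FEFF is still there."""
    raw = (BOM + "AAA").encode(INPUT_CODEC[decoder])
    return raw.decode(decoder).startswith(BOM)


if __name__ == "__main__":
    for name in INPUT_CODEC:
        print(f"{name:10} {'yes' if bom_in_decoded(name) else 'no'}")
```

The explicit-endian codecs are used to build the input so that the bytes always begin with exactly one BOM in a known byte order, whichever decoder is then applied.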

I suspect that the UTF-32 anomaly (Java ICU drops the BOM for all three 
UTF-32 encodings, whereas C ICU keeps it) is a defect in ICU. I tried to 
confirm this using Google, but I didn't find any reference to it online.

Before we conclude that the spec is OK as it stands, we should consider 
whether it is correct to treat a BOM as a character. The Unicode standard 
makes a clear distinction between characters and BOMs:
http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf section 2.13 (don't 
skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1
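
To make the alternative concrete: if the BOM is treated as a signature rather than a character, it would be recognised and removed at the byte level before any decoding happens. The sketch below illustrates that idea only. It is not what the spec currently says and not how ICU works, and SIGNATURES and split_signature are hypothetical names.

```python
import codecs

# Sketch only. UTF-32 signatures are checked before UTF-16 because the
# UTF-32LE signature (FF FE 00 00) begins with the UTF-16LE one (FF FE).
# Caveat: UTF-16LE data whose first character after the BOM is U+0000
# would be misread as UTF-32LE by this simple check.
SIGNATURES = [
    (codecs.BOM_UTF32_LE, "UTF-32LE"),
    (codecs.BOM_UTF32_BE, "UTF-32BE"),
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UTF-16LE"),
    (codecs.BOM_UTF16_BE, "UTF-16BE"),
]


def split_signature(data: bytes):
    """Return (encoding form or None, data with any leading BOM removed)."""
    for signature, name in SIGNATURES:
        if data.startswith(signature):
            return name, data[len(signature):]
    return None, data
```

With this approach the decoder never sees the BOM, so the question of whether it appears in the decoded string does not arise.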

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
