Re: [DFDL-WG] dfdl-wg Digest, Vol 40, Issue 17 : Action 072 resolved

072 TK: Byte Order Mark and Unicode signature
16/12: Investigate whether the spec's position on UTF-16/32 BOM is implementable

The implementation team have carried out tests on the Java and C implementations of ICU. The results are:

Java ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    no
UTF-32-LE    <BOM>AAA    no
UTF-32-BE    <BOM>AAA    no

C ICU libraries:

Encoding     Input       BOM included in decoded string?
UTF-8        <BOM>AAA    yes
UTF-16       <BOM>AAA    yes
UTF-16-LE    <BOM>AAA    yes
UTF-16-BE    <BOM>AAA    yes
UTF-32       <BOM>AAA    yes
UTF-32-LE    <BOM>AAA    yes
UTF-32-BE    <BOM>AAA    yes

I suspect that the UTF-32 anomaly is a defect in ICU. I tried to confirm this using Google, but I didn't find any reference to it online.

Before we conclude that the spec is OK as it stands, we should consider whether it is correct to treat a BOM as a character. The Unicode standard makes a clear distinction between characters and BOMs:

http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf section 2.13 (don't skip the first couple of paragraphs)
http://unicode.org/faq/utf_bom.html#bom1

regards,

Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
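
For anyone who wants to repeat this kind of check against another decoder, here is a minimal sketch of the test described above. It is not the harness the implementation team used, and it exercises Python's built-in codecs rather than ICU, so its results need not match the tables. The names CASES, bom_retained and run_matrix are hypothetical, and the choice of a little-endian BOM for the generic UTF-16 and UTF-32 cases is an assumption.

```python
import codecs

# Hypothetical test matrix: (decoder name, BOM bytes, encoding used to build the payload).
# The payload is always encoded with an explicit byte order so that no BOM
# is added implicitly; the BOM under test is prepended by hand.
CASES = [
    ("utf-8", codecs.BOM_UTF8, "utf-8"),
    ("utf-16", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("utf-16-le", codecs.BOM_UTF16_LE, "utf-16-le"),
    ("utf-16-be", codecs.BOM_UTF16_BE, "utf-16-be"),
    ("utf-32", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("utf-32-le", codecs.BOM_UTF32_LE, "utf-32-le"),
    ("utf-32-be", codecs.BOM_UTF32_BE, "utf-32-be"),
]


def bom_retained(decoder, bom, payload_encoding, text="AAA"):
    """Decode <BOM> + text and report whether U+FEFF survives in the result."""
    data = bom + text.encode(payload_encoding)
    decoded = data.decode(decoder)
    return decoded.startswith("\ufeff")


def run_matrix():
    """Return {decoder name: True if the BOM appears in the decoded string}."""
    return {decoder: bom_retained(decoder, bom, enc) for decoder, bom, enc in CASES}


if __name__ == "__main__":
    for name, kept in run_matrix().items():
        print(f"{name:10} BOM in decoded string: {'yes' if kept else 'no'}")
```

Pointing the same matrix at the ICU converters would mean replacing the decode call in bom_retained with the corresponding ICU call. That call is left out here because I have not verified its exact signature.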