
Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 12/11/2013 13:55 -----

From: Steve Hanson/UK/IBM
To: Alex Wood1/UK/IBM@IBMGB
Date: 12/11/2013 12:19
Subject: Re: decoding UTF-16 sequence with an unpaired surrogate in ICU.

Thanks Alex. So we can control what ICU does in this scenario using dfdl:encodingErrorPolicy in the expected way, as the DFDL spec says.

Regards
Steve Hanson

From: Alex Wood1/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 12/11/2013 12:12
Subject: decoding UTF-16 sequence with an unpaired surrogate in ICU.

I wrote a Java program to test this in ICU4J. When decoding, ICU appears to class an unpaired UTF-16 surrogate as malformed input. The ICU API lets the programmer specify the behaviour for malformed input: ignore, replace, or report the offending code point. The default is to report it, so the decode would fail with an error. The ICU4C API has similar options available.
Test program:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.spi.CharsetProvider;

    import com.ibm.icu.charset.CharsetProviderICU;

    public class test1 {
        public static void main(String[] args) {
            // Read as big-endian: surrogate pair D834 DD1E, then an unpaired high surrogate D834.
            final byte[] byteArray = { (byte) 0xD8, 0x34, (byte) 0xDD, 0x1E, (byte) 0xD8, 0x34 };

            CharsetProvider cp = new CharsetProviderICU();
            CharsetDecoder decoder = cp.charsetForName("UTF-16").newDecoder();
            // Use REPLACE or REPORT here to try the other behaviours.
            decoder.onMalformedInput(CodingErrorAction.IGNORE);
            decoder.reset();

            ByteBuffer bb = ByteBuffer.wrap(byteArray, 0, 6);
            CharBuffer cb = CharBuffer.allocate(6);
            CoderResult decodeResult = decoder.decode(bb, cb, true);
            if (decodeResult.isMalformed() || decodeResult.isUnmappable()) {
                System.out.println("Error at " + bb.position());
            }
            // Flip so that toString() returns the decoded characters, not the unused tail.
            cb.flip();
            System.out.println("Result: " + cb.toString());
        }
    }

Kind Regards,
- Alex

Alex Wood - Software Engineer - WebSphere Message Broker Development
DFDL Development
MP 211, IBM UK Labs, Hursley Park, Winchester, Hants. SO21 2JN.
Tel: Internal 246272, External 01962 816272
Notes: Alex Wood1/UK/IBM@IBMGB
e-mail: wooda@uk.ibm.com

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
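
To relate the malformed-input options in this thread to dfdl:encodingErrorPolicy, here is a minimal sketch of how the two policy values ('replace' and 'error') might map onto a decoder's malformed-input action. It makes several assumptions: it uses the JDK's built-in UTF-16BE decoder rather than ICU4J so that it runs without extra jars, and the `Policy` enum, the `decode` helper, and the class name are hypothetical names invented for illustration. It is not how any DFDL processor actually implements the property.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class EncodingErrorPolicyDemo {

    // Hypothetical stand-in for the two dfdl:encodingErrorPolicy values.
    enum Policy { REPLACE, ERROR }

    // Decode big-endian UTF-16 bytes, applying the policy to malformed input.
    // Assumption: the JDK decoder is used here in place of ICU4J.
    static String decode(byte[] bytes, Policy policy) throws CharacterCodingException {
        CodingErrorAction action = (policy == Policy.REPLACE)
                ? CodingErrorAction.REPLACE   // substitute the decoder's replacement string
                : CodingErrorAction.REPORT;   // surface the problem as an error
        CharsetDecoder decoder = StandardCharsets.UTF_16BE.newDecoder()
                .onMalformedInput(action)
                .onUnmappableCharacter(action);
        // decode(ByteBuffer) performs a complete decode and throws
        // CharacterCodingException when the action is REPORT.
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        // Same bytes as the test program: a surrogate pair, then a lone high surrogate.
        byte[] bytes = { (byte) 0xD8, 0x34, (byte) 0xDD, 0x1E, (byte) 0xD8, 0x34 };
        for (Policy policy : Policy.values()) {
            try {
                String text = decode(bytes, policy);
                System.out.println(policy + ": decoded " + text.length() + " UTF-16 code units");
            } catch (CharacterCodingException e) {
                System.out.println(policy + ": decode failed with " + e);
            }
        }
    }
}
```

With the JDK decoder, REPLACE should substitute U+FFFD for the trailing unpaired surrogate and ERROR should make the decode fail. ICU4J exposes the same `CodingErrorAction` settings through `CharsetDecoder`, but its exact output for this input should be confirmed by running the ICU test program above.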