
Section 11.1 was where BOMs were discussed, and it said:

UTF-16. If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

UTF-32. If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

Same for unparsing.

So it looks like we threw the baby out with the bath water when the section was removed!

Regards

Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 23/04/2020 20:23
Subject: [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>

Since we dropped the Unicode byte order mark functionality from DFDL v1.0, the issue arises of what byte order is used when dfdl:encoding="utf-16" or dfdl:encoding="utf-32".

We are clear that encodings define their own byte and bit order, the dfdl:byteOrder property is not used.

There are these options:

1) explicitly disallow these encoding names because they do not specify a byte order. Require utf-16BE or utf-16LE, utf-32BE or utf-32LE.
2) specify that these are synonyms for the BE versions
3) specify that these are synonyms for the LE versions

This comes up in the definition of the dfdl:byteOrder property where the text currently says:

This property is never used to establish the byte order for text/strings with Unicode fixed-width encodings that do not specify the byte order (UTF-16 and UTF-32).

Having removed the unicode byte order mark feature, this statement leaves us without a stipulation of how UTF-16 and UTF-32 byte order would be determined.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy
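
For illustration only, the rule from the removed Section 11.1 quoted at the top of this thread (use the BOM's byte order if a BOM is found, otherwise assume big-endian) might be sketched in Python roughly as below. This is not taken from the DFDL specification or from any DFDL implementation: the function names resolve_byte_order and decode_text are hypothetical, the sketch only looks for a BOM at the start of the bytes it is given, and it does not model the [unicodeByteOrderMark] information item or unparsing.

```python
import codecs

# BOM byte sequences taken from the Python standard library.
# Keys are the byte-order-neutral encoding names discussed in this thread.
_BOMS = {
    "utf-16": ((codecs.BOM_UTF16_BE, "big"), (codecs.BOM_UTF16_LE, "little")),
    "utf-32": ((codecs.BOM_UTF32_BE, "big"), (codecs.BOM_UTF32_LE, "little")),
}


def resolve_byte_order(stream_start: bytes, encoding: str):
    """Return (byte_order, bom_length) for 'utf-16' or 'utf-32'.

    Hypothetical helper: if a BOM is present it implies the byte order;
    if no BOM is found, big-endian is assumed (the old Section 11.1 default).
    """
    for bom, order in _BOMS[encoding.lower()]:
        if stream_start.startswith(bom):
            return order, len(bom)
    return "big", 0


def decode_text(data: bytes, encoding: str) -> str:
    """Decode data using the resolved byte order, skipping any leading BOM."""
    order, skip = resolve_byte_order(data, encoding)
    suffix = "be" if order == "big" else "le"
    # Python's explicit-order codecs are named utf-16-be, utf-16-le, etc.
    return data[skip:].decode(f"{encoding.lower()}-{suffix}")
```

Under option 2 in Mike's list the BOM lookup would simply go away and the function would always return big-endian; under option 1 the names "utf-16" and "utf-32" would be rejected before reaching this point.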