Section 11.1 was where BOMs were discussed, and it said:
UTF-16. If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.
UTF-32. If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.
Same for unparsing.
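To make the removed rule concrete, here is a minimal Python sketch of the parse-side behaviour as quoted above: look for a BOM at the start of the stream, use the byte order it implies, and otherwise fall back to big-endian. The function names resolve_byte_order and decode_stream are hypothetical, chosen only for illustration; this is not taken from any DFDL implementation.

```python
import codecs

# BOM byte sequences per encoding family, keyed by the byte order they imply.
_BOMS = {
    "utf-16": {"be": codecs.BOM_UTF16_BE, "le": codecs.BOM_UTF16_LE},
    "utf-32": {"be": codecs.BOM_UTF32_BE, "le": codecs.BOM_UTF32_LE},
}


def resolve_byte_order(encoding, stream_start):
    """Return (concrete_codec_name, bom_length) for 'UTF-16' or 'UTF-32'.

    Hypothetical helper: if the stream starts with a BOM, the BOM decides
    the byte order; if not, big-endian is assumed, as in the quoted text.
    """
    family = encoding.lower()
    for order, bom in _BOMS[family].items():
        if stream_start.startswith(bom):
            return f"{family}-{order}", len(bom)
    # No BOM found: default to big-endian.
    return f"{family}-be", 0


def decode_stream(encoding, data):
    """Decode data, skipping the BOM so it need not be modelled explicitly."""
    codec, skip = resolve_byte_order(encoding, data)
    return data[skip:].decode(codec)


if __name__ == "__main__":
    print(decode_stream("UTF-16", codecs.BOM_UTF16_LE + "hi".encode("utf-16-le")))
    print(decode_stream("UTF-16", "hi".encode("utf-16-be")))  # no BOM, so BE
```

Because the encoding family is known up front from dfdl:encoding, the overlap between the UTF-16 LE BOM (FF FE) and the first two bytes of the UTF-32 LE BOM (FF FE 00 00) causes no ambiguity in this sketch.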
So it looks like we threw the baby out with the bath water when the section was removed!
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 23/04/2020 20:23
Subject: [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
Since we dropped the Unicode byte order mark functionality from DFDL v1.0, the issue arises of what byte order is used when dfdl:encoding="utf-16" or dfdl:encoding="utf-32".
We are clear that encodings define their own byte and bit order; the dfdl:byteOrder property is not used.
There are these options:
1) Explicitly disallow these encoding names because they do not specify a byte order, and require utf-16BE or utf-16LE, utf-32BE or utf-32LE instead.
2) Specify that these are synonyms for the BE versions.
3) Specify that these are synonyms for the LE versions.
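As a rough illustration of how the three options differ in practice, the following Python sketch normalizes an encoding name under each one. The Policy enum and normalize_encoding function are hypothetical names invented for this example, not part of the DFDL specification or any implementation.

```python
from enum import Enum


class Policy(Enum):
    DISALLOW = 1   # option 1: reject names that carry no byte order
    ASSUME_BE = 2  # option 2: treat them as synonyms for the BE versions
    ASSUME_LE = 3  # option 3: treat them as synonyms for the LE versions


# Encoding names that do not themselves specify a byte order.
_AMBIGUOUS = {"UTF-16", "UTF-32"}


def normalize_encoding(name, policy):
    """Map a dfdl:encoding value to a name with an explicit byte order."""
    upper = name.upper()
    if upper not in _AMBIGUOUS:
        return upper  # e.g. UTF-16LE already states its byte order
    if policy is Policy.DISALLOW:
        raise ValueError(
            f"{name!r} does not specify a byte order; use {upper}BE or {upper}LE"
        )
    return upper + ("BE" if policy is Policy.ASSUME_BE else "LE")


if __name__ == "__main__":
    for policy in (Policy.ASSUME_BE, Policy.ASSUME_LE):
        print(policy.name, normalize_encoding("utf-16", policy))
```

Under option 1 a schema author gets an error early; under options 2 and 3 the same schema is accepted, and the only difference is which explicit name it resolves to.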
This comes up in the definition of the dfdl:byteOrder property where the text currently says:
This property is never used to establish the byte order for text/strings
with Unicode fixed-width encodings that do not specify the byte order
(UTF-16 and UTF-32).
Having removed the Unicode byte order mark feature, this statement leaves us without a stipulation of how UTF-16 and UTF-32 byte order would be determined.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com