Section 11.1 was where BOMs were discussed,
and it said:
UTF-16. If a BOM is found then this
is used to set the document information item [unicodeByteOrderMark]
member, and all data with dfdl:encoding UTF-16 throughout the rest of the
stream are assumed to have the implied byte order. If
no BOM is found then all data with dfdl:encoding UTF-16 throughout the
rest of the stream are assumed to have big-endian byte order.
There is no need to model the BOM explicitly.
UTF-32. If a BOM is found then this
is used to set the document information item [unicodeByteOrderMark]
member, and all data with dfdl:encoding UTF-32 throughout the rest of the
stream are assumed to have the implied byte order. If
no BOM is found then all data with dfdl:encoding UTF-32 throughout the
rest of the stream are assumed to have big-endian byte order.
There is no need to model the BOM explicitly.
Same for unparsing.
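For anyone who wants the removed parse-side rule in concrete form, here is a rough Python sketch. It is only an illustration of the text quoted above, not code from any DFDL implementation. The function name detect_byte_order, the returned strings and the assumption that the BOM sits at the very start of the supplied bytes are all mine.

```python
import codecs
from typing import Tuple

def detect_byte_order(data: bytes, encoding: str) -> Tuple[str, int]:
    """Return (byte_order, bom_length) for data declared as UTF-16 or UTF-32.

    Hypothetical helper: a BOM at the start implies the byte order,
    otherwise big-endian is assumed, as in the old section 11.1 text.
    """
    enc = encoding.upper()
    if enc == "UTF-16":
        boms = {codecs.BOM_UTF16_BE: "bigEndian",
                codecs.BOM_UTF16_LE: "littleEndian"}
    elif enc == "UTF-32":
        boms = {codecs.BOM_UTF32_BE: "bigEndian",
                codecs.BOM_UTF32_LE: "littleEndian"}
    else:
        raise ValueError("only UTF-16 and UTF-32 are handled here: " + encoding)

    # Only BOMs for the declared encoding are checked, so the UTF-16 LE BOM
    # (FF FE) is never confused with the UTF-32 LE BOM (FF FE 00 00).
    for bom, order in boms.items():
        if data.startswith(bom):
            return order, len(bom)

    # No BOM found: fall back to big-endian.
    return "bigEndian", 0
```

The second element of the result is the number of bytes to skip, which reflects the point that the BOM does not need to be modelled explicitly.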
So it looks like we threw the baby out with
the bath water when the section was removed!
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 23/04/2020 20:23
Subject: [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
Since we dropped the Unicode byte order mark functionality
from DFDL v1.0, the issue arises of what byte order is used when dfdl:encoding="utf-16"
or dfdl:encoding="utf-32".
We are clear that encodings define their own byte and
bit order, so the dfdl:byteOrder property is not used for them.
There are these options:
1) Explicitly disallow these encoding names because they
do not specify a byte order. Require utf-16BE or utf-16LE, utf-32BE or
utf-32LE.
2) Specify that these are synonyms for the BE versions.
3) Specify that these are synonyms for the LE versions.
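To make the three options easier to compare, here is a small Python sketch of how an implementation might resolve the encoding name under each one. The names Policy and resolve_encoding are invented for this illustration and do not come from the spec or from any DFDL processor.

```python
from enum import Enum

class Policy(Enum):
    """The three options listed above (names are hypothetical)."""
    DISALLOW = 1    # option 1: reject utf-16 / utf-32
    ASSUME_BE = 2   # option 2: treat them as the BE variants
    ASSUME_LE = 3   # option 3: treat them as the LE variants

def resolve_encoding(name: str, policy: Policy) -> str:
    """Map a dfdl:encoding value to a name with an explicit byte order."""
    canonical = name.upper()
    if canonical not in ("UTF-16", "UTF-32"):
        # Every other encoding name is passed through unchanged.
        return canonical
    if policy is Policy.DISALLOW:
        raise ValueError(
            canonical + " does not specify a byte order; use "
            + canonical + "BE or " + canonical + "LE")
    suffix = "BE" if policy is Policy.ASSUME_BE else "LE"
    return canonical + suffix
```

Under option 1 a schema author gets an error at this point; under options 2 and 3 the ambiguous name is silently rewritten to one of the explicit variants.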
This comes up in the definition of the dfdl:byteOrder
property where the text currently says:
This property is never used to establish the byte order for text/strings
with Unicode fixed-width encodings that do not specify the byte order
(UTF-16 and UTF-32).
Now that the Unicode byte order mark feature has been removed, this
statement leaves us without a stipulation of how UTF-16 and UTF-32 byte
order would be determined.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy