I tested IBM DFDL on Linux & Windows, and the behaviour is option 1 from my earlier list:

1. assumes big-endian

This matches what the spec said post-erratum for the case where there was no BOM (the sentences I highlighted in blue below).

So I think we have Plan D - allow UTF-16 & UTF-32 without byte order specifiers and always treat as BE.  Anything else is not compatible with IBM DFDL.

If there is a BOM in the data then it must be modelled, otherwise it will end up being treated as part of a value or delimiter.
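
To make that last point concrete, here is a minimal Python sketch of what "always treat as BE" implies. It is not IBM DFDL code; the helper name decode_unadorned_utf16 is made up for illustration, and Python's "utf-16-be" codec stands in for a decoder that never looks for a BOM.

```python
def decode_unadorned_utf16(data: bytes) -> str:
    """Hypothetical helper: unadorned UTF-16 is always read as big-endian.

    No BOM sniffing is done, so a BOM in the data is decoded like any
    other character.
    """
    return data.decode("utf-16-be")


# No BOM: the bytes are simply big-endian text.
print(repr(decode_unadorned_utf16(bytes.fromhex("00410042"))))

# Unmodelled big-endian BOM: U+FEFF ends up at the front of the value.
print(repr(decode_unadorned_utf16(bytes.fromhex("feff00410042"))))

# Little-endian data is still read as BE, so every code unit is byte-swapped.
print(repr(decode_unadorned_utf16(bytes.fromhex("fffe41004200"))))
```

The second call is the case to watch: the value comes back with a leading U+FEFF unless the schema models the BOM as its own element.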

I've attached a zip that does this, which I created back in 2018 as proof we could drop the BOM support.  The BOM element could be wrapped in a hidden group, but that can't be done with IBM DFDL.



Regards
 
Steve Hanson

IBM Hybrid Integration, Hursley, UK
Architect,
IBM DFDL
Co-Chair,
OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday




From:        Steve Hanson/UK/IBM
To:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc:        DFDL-WG <dfdl-wg@ogf.org>
Date:        28/04/2020 19:18
Subject:        Re: [EXTERNAL] Re: [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order




Plan A is appealing, but IBM DFDL allows unadorned UTF-16 / UTF-32, and always has done. The problem is I can't remember what it does with them. IBM DFDL never implemented erratum 3.7, so there are four possibilities:

1. assumes big-endian
2. assumes little-endian
3. platform-dependent (no!)
4. uses dfdl:byteOrder (as per GFD.174)

I need to do some tests to ascertain the behaviour. I'll try and do that by next WG call.

Regards
 
Steve Hanson






From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:        Steve Hanson <smh@uk.ibm.com>
Cc:        DFDL-WG <dfdl-wg@ogf.org>
Date:        28/04/2020 16:00
Subject:        [EXTERNAL] Re: [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order






So the important use case is this:

The data is dfdl:representation='text', the encoding is utf-16 but a BOM tells us the byte order - but there is exactly one BOM, and it is at the start of the file.
That BOM wants to be read, to stick, and to guide recreation of this data in any associated unparse.

That special case is why we had this unicode byte order mark feature on the document level.

What we failed to appreciate is that the byte order of this data will not vary day to day, but will nearly always be constant. Data coming from big-endian systems will be big-endian, and data from little-endian systems will be little-endian.

So the use case for a schema that needs to adapt to either is a rarer case. That's why the features we used to have were overkill: in the vast majority of cases, the user's data will always have the same byte order, because it is coming from one system.

Where we are today: we have already modified the DFDL spec draft to remove everything about byte order marks EXCEPT, we didn't remove support for UTF-16 or UTF-32 where a BOM might come in handy.

I think to fix this it's either plan A or plan B.

Plan (A) - Keep it simple - Just disallow utf-16 and utf-32 without byte-order specifiers - make people use the more specific encodings that specify byte order. If they in fact have data which varies in byte order from instance to instance, they have to model that... just as they would for binary data with that behavior.  (We can supply this as sample code.)
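
As a rough illustration of "they have to model that", here is a minimal Python sketch, not a DFDL schema. It assumes the data starts with a mandatory two-byte BOM field that is read as an ordinary binary value and then used to choose between the byte-order-specific encodings. The function name parse_with_modelled_bom is hypothetical.

```python
import codecs


def parse_with_modelled_bom(data: bytes):
    """Hypothetical parser: the first two bytes are an explicit BOM field.

    Returns the encoding that was selected and the decoded text that
    follows the BOM. The BOM itself is not part of the text.
    """
    bom, body = data[:2], data[2:]
    if bom == codecs.BOM_UTF16_BE:      # FE FF
        encoding = "utf-16-be"
    elif bom == codecs.BOM_UTF16_LE:    # FF FE
        encoding = "utf-16-le"
    else:
        raise ValueError("expected a UTF-16 byte order mark")
    return encoding, body.decode(encoding)


print(parse_with_modelled_bom(bytes.fromhex("feff00410042")))
print(parse_with_modelled_bom(bytes.fromhex("fffe41004200")))
```

The point of the sketch is that the choice of byte order is made by the model, from a field it has read, rather than by the encoding name.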

Plan (B) go back to what we had before. All of it. Even though nobody implemented it nor wants to.

My preference is plan (A). I think this is entirely sufficient for DFDL v1.0.

There's one other option, Plan (C), which would be to document that unadorned UTF-16 means this: accept the BOM, keep it as a character in the string, and use it on parse to interpret the rest of the characters. It would also preserve a BOM character if present for unparse, but unparse would always be big-endian: the BOM (written only if the character is present at the start of the string) will be written as a big-endian BOM. If not present, none is added. The other characters are always written big-endian.  This is the "converts to BE" model. It's what the Java UTF-16 encoders/decoders do if you do nothing special to force them to behave any particular way.
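
A minimal Python sketch of the Plan (C) behaviour as described in this paragraph may help when comparing it with the other plans. It follows the description here rather than any particular library's codec, and the names parse_plan_c and unparse_plan_c are made up for illustration.

```python
import codecs


def parse_plan_c(data: bytes) -> str:
    """Hypothetical Plan (C) parse: a leading BOM selects the byte order
    and is kept in the result as the character U+FEFF."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return data.decode("utf-16-le")
    # Big-endian BOM, or no BOM at all: read as big-endian.
    return data.decode("utf-16-be")


def unparse_plan_c(text: str) -> bytes:
    """Hypothetical Plan (C) unparse: always big-endian. A BOM is written
    only if U+FEFF is already in the string; none is added."""
    return text.encode("utf-16-be")


little_endian_input = bytes.fromhex("fffe41004200")
text = parse_plan_c(little_endian_input)
print(repr(text))
print(unparse_plan_c(text).hex())
```

Note that little-endian input does not round-trip byte for byte: it is re-emitted as big-endian with a big-endian BOM, which is the "converts to BE" behaviour.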

Thoughts?

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy



On Tue, Apr 28, 2020 at 4:13 AM Steve Hanson <smh@uk.ibm.com> wrote:
Section 11.1 was where BOMs were discussed, and it said:
UTF-16.  If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

UTF-32.  If a BOM is found then this is used to set the document information item [unicodeByteOrderMark] member, and all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.


Same for unparsing.
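
For comparison with the plans discussed above, the parse side of the quoted UTF-32 rule can be sketched in Python roughly as follows. This is only an illustration: the function name is hypothetical, and the "BE" / "LE" / None values standing in for the [unicodeByteOrderMark] member are assumptions, not values taken from the spec.

```python
import codecs


def parse_utf32_section_11_1(data: bytes):
    """Hypothetical sketch of the quoted rule for UTF-32.

    Returns (byte_order_mark, text). The BOM is consumed and recorded
    rather than modelled, and big-endian is assumed when it is absent.
    """
    if data.startswith(codecs.BOM_UTF32_BE):    # 00 00 FE FF
        return "BE", data[4:].decode("utf-32-be")
    if data.startswith(codecs.BOM_UTF32_LE):    # FF FE 00 00
        return "LE", data[4:].decode("utf-32-le")
    return None, data.decode("utf-32-be")


print(parse_utf32_section_11_1(bytes.fromhex("fffe000041000000")))
print(parse_utf32_section_11_1(bytes.fromhex("00000041")))
```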


So it looks like we threw the baby out with the bath water when the section was removed!


Regards
 
Steve Hanson





From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:        DFDL-WG <dfdl-wg@ogf.org>
Date:        23/04/2020 20:23
Subject:        [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order
Sent by:        "dfdl-wg" <dfdl-wg-bounces@ogf.org>





Since we dropped the Unicode byte order mark functionality from DFDL v1.0, the issue arises of what byte order is used when dfdl:encoding="utf-16" or dfdl:encoding="utf-32".

We are clear that encodings define their own byte and bit order; the dfdl:byteOrder property is not used.

There are these options:
1) explicitly disallow these encoding names because they do not specify a byte order. Require utf-16BE or utf-16LE, utf-32BE or utf-32LE.
2) specify that these are synonyms for the BE versions
3) specify that these are synonyms for the LE versions

This comes up in the definition of the dfdl:byteOrder property where the text currently says:

This property is never used to establish the byte order for text /strings
with Unicode fixed-width encodings that do not specify the byte order
(UTF-16 and UTF-32).

Having removed the unicode byte order mark feature, this statement leaves us without a stipulation of how UTF-16 and UTF-32 byte order would be determined.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy
--
 dfdl-wg mailing list
 dfdl-wg@ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg


Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU


