Re: [DFDL-WG] Clarification on UTF-16 and UTF-32 encoding byte order

29 Apr 2020

Plan A is appealing, but IBM DFDL allows unadorned UTF-16 / UTF-32, and 
always has done. Problem is I can't remember what that does. IBM DFDL never 
implemented erratum 3.7 so there are four possibilities:

1. assumes big-endian
2. assumes little-endian
3. platform-dependent (no!)
4. uses dfdl:byteOrder (as per GFD.174)

I need to do some tests to ascertain the behaviour. I'll try and do that 
by next WG call.

Regards

Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 

From:   Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:     Steve Hanson <smh@uk.ibm.com>
Cc:     DFDL-WG <dfdl-wg@ogf.org>
Date:   28/04/2020 16:00
Subject:        [EXTERNAL] Re: [DFDL-WG] Clarification on UTF-16 and 
UTF-32 encoding byte order

So the important use case is this:

The data is dfdl:representation='text', the encoding is utf-16 but a BOM 
tells us the byte order - but there is exactly one BOM, and it is at the 
start of the file. 
That BOM wants to be read, to stick, and to guide recreation of this data 
in any associated unparse. 

That special case is why we had this unicode byte order mark feature on 
the document level.

What we failed to appreciate is that the byte order of this data will not 
vary day to day, but will nearly always be constant. Data coming from 
big-endian systems will be big-endian, and data from little-endian systems 
will be little-endian. 

So a schema that needs to adapt to either byte order is the rarer case. 
That's why the features we used to have were overkill: in the vast 
majority of cases, the user's data will always have the same byte order, 
because it is coming from one system. 

Where we are today: we have already modified the DFDL spec draft to remove 
everything about byte order marks, except that we didn't remove support for 
UTF-16 or UTF-32, where a BOM might come in handy.

I think to fix this it's either plan A or plan B. 

Plan (A) - Keep it simple - Just disallow utf-16 and utf-32 without 
byte-order specifiers - make people use the more specific encodings that 
specify byte order. If they in fact have data which varies in byte order 
from instance to instance, they have to model that... just as they would 
for binary data with that behavior.  (We can supply this as sample code.)
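
For illustration only, the Plan (A) rule could be sketched in Python as 
below. The function name check_encoding, the case-insensitive matching 
and the error message are all hypothetical; none of this is taken from 
the DFDL spec or from any DFDL implementation.

```python
# Hypothetical helper illustrating Plan (A): reject encoding names that
# do not state a byte order. Names and behaviour here are assumptions.
UNADORNED = {"utf-16", "utf-32"}


def check_encoding(name: str) -> str:
    """Return the normalised encoding name, or raise if it is unadorned."""
    key = name.strip().lower()
    if key in UNADORNED:
        raise ValueError(
            f"{name!r} does not specify a byte order; "
            f"use {key}BE or {key}LE"
        )
    return key
```

With this check, "UTF-16BE" and "UTF-32LE" pass through unchanged apart 
from case, while "UTF-16" and "UTF-32" are rejected up front.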

Plan (B) go back to what we had before. All of it. Even though nobody 
implemented it nor wants to. 

My preference is plan (A). I think this is entirely sufficient for DFDL 
v1.0. 

There's one other option, Plan (C), which would be to document that 
unadorned UTF-16 means this: accept the BOM, keep it as a character in the 
string, and use it on parse to interpret the rest of the characters. It 
would also preserve a BOM character, if present, for unparse, but unparse 
would always be big-endian: the BOM (written only if the character is 
present at the start of the string) will be written as a big-endian BOM. 
If not present, none is added. The other characters are always written 
big-endian. This is the "converts to BE" model. It's what the Java UTF-16 
encoders/decoders do if you do nothing special to force them to behave any 
particular way. 
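
A minimal Python sketch of the "converts to BE" model described above, 
for illustration only. The function names are hypothetical, and decoding 
as big-endian when no BOM is present is an assumption, since the 
description does not say what parse does in that case.

```python
import codecs


def parse_plan_c(data: bytes) -> str:
    """Decode UTF-16 data, keeping any leading BOM as U+FEFF in the string."""
    if data.startswith(codecs.BOM_UTF16_LE):
        # The explicit-endian codec does not strip the BOM, so it is kept.
        return data.decode("utf-16-le")
    # Big-endian BOM, or no BOM at all (assumed big-endian).
    return data.decode("utf-16-be")


def unparse_plan_c(text: str) -> bytes:
    """Always write big-endian; a BOM is written only if U+FEFF is present."""
    return text.encode("utf-16-be")
```

Under these assumptions, little-endian input that starts with a BOM comes 
back out as big-endian data with a big-endian BOM, and input without a 
BOM is written without one.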

Thoughts?

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | 
www.owlcyberdefense.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy

On Tue, Apr 28, 2020 at 4:13 AM Steve Hanson <smh@uk.ibm.com> wrote:
Section 11.1 was where BOMs were discussed, and it said: 
UTF-16.  If a BOM is found then this is used to set the document 
information item [unicodeByteOrderMark] member, and all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly. 
UTF-32.  If a BOM is found then this is used to set the document 
information item [unicodeByteOrderMark] member, and all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly. 

Same for unparsing. 
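
The UTF-16 rule quoted above can be sketched roughly as follows in 
Python. This is only an illustration: the function names are 
hypothetical, the strings "bigEndian" / "littleEndian" stand in for 
whatever the [unicodeByteOrderMark] member actually holds, and reading 
"same for unparsing" as "re-emit the BOM in the remembered byte order" is 
an assumption.

```python
import codecs


def parse_utf16(data: bytes):
    """Return (text, bom_order); bom_order is None when no BOM was found."""
    if data.startswith(codecs.BOM_UTF16_BE):
        return data[2:].decode("utf-16-be"), "bigEndian"
    if data.startswith(codecs.BOM_UTF16_LE):
        return data[2:].decode("utf-16-le"), "littleEndian"
    # No BOM: the quoted rule assumes big-endian byte order.
    return data.decode("utf-16-be"), None


def unparse_utf16(text: str, bom_order):
    """Write text back, re-emitting the BOM in the remembered byte order."""
    if bom_order == "littleEndian":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    if bom_order == "bigEndian":
        return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
    # No BOM was seen on parse, so none is written; data is big-endian.
    return text.encode("utf-16-be")
```

Here the BOM is not part of the parsed text; it is carried separately, 
which matches the statement that there is no need to model it explicitly.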

So it looks like we threw the baby out with the bath water when the 
section was removed! 

Regards

Steve Hanson 

From:        Mike Beckerle <mbeckerle.dfdl@gmail.com> 
To:        DFDL-WG <dfdl-wg@ogf.org> 
Date:        23/04/2020 20:23 
Subject:        [EXTERNAL] [DFDL-WG] Clarification on UTF-16 and UTF-32 
encoding byte order 
Sent by:        "dfdl-wg" <dfdl-wg-bounces@ogf.org> 

Since we dropped the Unicode byte order mark functionality from DFDL v1.0, 
the issue arises of what byte order is used when dfdl:encoding="utf-16" or 
dfdl:encoding="utf-32". 

We are clear that encodings define their own byte and bit order; the 
dfdl:byteOrder property is not used. 

There are these options: 
1) explicitly disallow these encoding names because they do not specify a 
byte order. Require utf-16BE or utf-16LE, utf-32BE or utf-32LE. 
2) specify that these are synonyms for the BE versions 
3) specify that these are synonyms for the LE versions 

This comes up in the definition of the dfdl:byteOrder property where the 
text currently says: 

This property is never used to establish the byte order for text/strings 
with Unicode fixed-width encodings that do not specify the byte order 
(UTF-16 and UTF-32). 

Having removed the unicode byte order mark feature, this statement leaves 
us without a stipulation of how UTF-16 and UTF-32 byte order would be 
determined. 

Mike Beckerle