For discussion on today's call....

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 23/08/2011 14:53 -----

From:	Tim Kimber/UK/IBM
To:	Steve Hanson/UK/IBM@IBMGB
Cc:	"Mike Beckerle" <mbeckerle.dfdl@gmail.com>, "'Stephanie Fetzer'" <sfetzer@us.ibm.com>
Date:	23/08/2011 12:34
Subject:	RE: Fw: BOM disposal

I'd like to summarize where we have got to on this:
- The most common scenarios are the ones where the document starts with a BOM / Unicode Signature. Users will expect DFDL to handle these scenarios simply and easily.
- BOMs / Unicode Signatures at the start of an element/group will be less common, but they will sometimes crop up when an application writes a UTF-encoded stream directly into another document without removing the BOM. DFDL must be able to cope with this, but it doesn't have to be simple and elegant. On that basis, I suggest that we follow Steve's proposal, and we should include toleration of UTF-8 BOMs at the start of a document.

I have a small concern about round-tripping, though. In cases where the input BOM is genuinely providing missing information to the DFDL processor, I don't see how the application writer can ensure that the infoset is written using the same byte order as when it was parsed. One solution would be:
- add a new property 'characterEncoding' to the Document Information Item. This would be set to the encoding that was in force at the start of the root element. But it would not indicate byte order....
- add a new property 'characterByteOrder' to the Document Information Item. For UTF encodings only, this would be set from the implied byte order of the encoding, or from the BOM if there is no implied byte order. For non-UTF encodings it would have no meaning.
Without the characterByteOrder property, I don't see how the application writer can determine this information. The characterEncoding property is included because
a) it's probably useful in its own right and
b) the characterByteOrder property would look a little strange on its own
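
A minimal Python sketch of the round-tripping concern, for UTF-16 only. The DocumentInfo class and the parse_utf16 / unparse_utf16 helpers are hypothetical stand-ins for the proposed properties, not part of any DFDL processor. As a simplification, the sketch always writes a BOM on output.

```python
import codecs
from dataclasses import dataclass


@dataclass
class DocumentInfo:
    # Hypothetical stand-ins for the two proposed properties.
    character_encoding: str
    character_byte_order: str  # "bigEndian" or "littleEndian"


def parse_utf16(data: bytes, default_order: str = "bigEndian"):
    """Decode UTF-16 text and record the byte order that was actually used."""
    if data.startswith(codecs.BOM_UTF16_LE):
        order, body = "littleEndian", data[2:]
    elif data.startswith(codecs.BOM_UTF16_BE):
        order, body = "bigEndian", data[2:]
    else:
        order, body = default_order, data
    codec = "utf-16-le" if order == "littleEndian" else "utf-16-be"
    return body.decode(codec), DocumentInfo("UTF-16", order)


def unparse_utf16(text: str, info: DocumentInfo) -> bytes:
    """Re-encode using the recorded byte order, with a matching BOM."""
    if info.character_byte_order == "littleEndian":
        return codecs.BOM_UTF16_LE + text.encode("utf-16-le")
    return codecs.BOM_UTF16_BE + text.encode("utf-16-be")
```

Without the recorded byte order, the unparse step would have nothing but the schema default to go on, which is the gap described above.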

For BOMs that occur at the start of a fixed-length element, the user is almost certainly going to want the BOM to be a ZWNBSP in the info set.
For BOMs that occur at the start of a variable-length element, I can envisage a standard pattern for suppressing the BOM if that's what the modeller wants.
- Create a hidden group containing one element
- set minOccurs to zero on the hidden element.
- set the element's initiator to the bytes of the BOM using two, three or four %#r entities.
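
To illustrate the last step, the BOM bytes for each encoding can be turned into a run of raw byte entities of the form %#rXX;. This is only a sketch: the helper name bom_initiator and the dictionary keys are made up for illustration, and the byte values come from Python's codecs module.

```python
import codecs

# BOM byte sequences, keyed by encoding and the byte order the BOM signals.
BOMS = {
    "UTF-8": codecs.BOM_UTF8,
    "UTF-16BE": codecs.BOM_UTF16_BE,
    "UTF-16LE": codecs.BOM_UTF16_LE,
    "UTF-32BE": codecs.BOM_UTF32_BE,
    "UTF-32LE": codecs.BOM_UTF32_LE,
}


def bom_initiator(encoding: str) -> str:
    """Return the BOM for `encoding` as a string of raw byte entities."""
    return "".join(f"%#r{byte:02X};" for byte in BOMS[encoding])
```

This gives two entities for UTF-16, three for UTF-8 and four for UTF-32, for example bom_initiator("UTF-8") yields %#rEF;%#rBB;%#rBF;.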

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Steve Hanson/UK/IBM
To: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
Cc: "'Stephanie Fetzer'" <sfetzer@us.ibm.com>, Tim Kimber/UK/IBM@IBMGB
Date: 23/08/2011 10:03
Subject: RE: Fw: BOM disposal

We have documentFinalTerminatorCanBeMissing so that modellers don't have the headache of explicitly modeling an 'optional' <CR><LF> at the end of a document. I don't see why we shouldn't assist Unicode modellers in a similar way. But only at document level.

UTF-16/32. I think when U+FEFF is encountered at any place other than the start of a DFDL described document, then it should be interpreted as ZWNBS. This is in keeping with the intent of the Unicode standard (as quoted by you in the other e-mail you forwarded).

UTF-8. I can go either way on this. Although not strictly a byte order control, it is something that may or may not appear at the start of a UTF-8 document and I can see Tim's argument for handling it seamlessly.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:	"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To:	Tim Kimber/UK/IBM@IBMGB, Steve Hanson/UK/IBM@IBMGB
Cc:	"'Stephanie Fetzer'" <sfetzer@us.ibm.com>
Date:	23/08/2011 03:29
Subject:	RE: Fw: BOM disposal

Ok, so that’s interesting.

So we’re down to the issue of whether, in a fixed-length string context, a BOM counts toward the fixed number of characters or not.

IMHO, I think BOM/ZWNBS should just be treated as another codepoint to us, and we shouldn’t be removing them, or treating them as “non-characters”.

As to whether to generate them, I think we should not. That is to say, regardless of whether we interpret them to determine the encoding, they should still be codepoints that appear in the infoset, both when parsing, and when unparsing. This means that they interact badly with things like initiators and padding. Hence, they’re very likely to be modeled as separate string elements containing only the BOM.

When encoding is UTF-16 or UTF-32, there is the question of whether one must have a BOM for every single string, or whether one must compute the byteOrder property from data, or if there is some “magic sticky behavior” where some prior string can have a BOM, and have this respected by subsequent string elements.

I suggest the following definition of “has a BOM to specify the byte order”
(1) The element, of type string, begins with the BOM codepoint. This changes the DFDL grammar. The Byte-order-mark field in the grammar would appear before the initiator of the element, and before any pad characters.
(2) An enclosing sequence has a nearest (greatest index) prior sibling which recursively “has a BOM to specify the byte order”

Inductively, this means the first element of a sequence can have a BOM, and all subsequent elements in that sequence as direct children or within subsequences/choices, and sub-elements generally, will all pick up their byteOrder from that same BOM.

It also lets you concatenate two representations, one of which is BOM big-endian, the other BOM little-endian.

This does add some overhead. In every case if encoding is utf-16, then for every string, you must check for a BOM.
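
A rough Python sketch of the "sticky" rule, under simplifying assumptions: the sequence is flattened to a plain list of UTF-16 string fields, and the function name decode_sequence is hypothetical. The BOM is left in the decoded string, since it is meant to stay a codepoint in the infoset.

```python
import codecs


def decode_sequence(fields, default_order="bigEndian"):
    """Decode UTF-16 fields; each one inherits byte order from the nearest prior BOM."""
    order = default_order
    decoded = []
    for raw in fields:
        # This per-string check is the overhead mentioned above.
        if raw.startswith(codecs.BOM_UTF16_BE):
            order = "bigEndian"
        elif raw.startswith(codecs.BOM_UTF16_LE):
            order = "littleEndian"
        codec = "utf-16-be" if order == "bigEndian" else "utf-16-le"
        # The BOM bytes are not stripped, so U+FEFF appears in the result.
        decoded.append(raw.decode(codec))
    return decoded
```

Concatenating a big-endian representation and a little-endian one works here because the second BOM simply resets the byte order for the fields that follow it.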

From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Monday, August 22, 2011 5:32 AM
To: Steve Hanson
Cc: mbeckerle.dfdl@gmail.com; Stephanie Fetzer
Subject: Re: Fw: BOM disposal

I disagree.

The term 'Byte Order Mark' is potentially misleading. It does not only indicate byte order; it also indicates the encoding of the stream. A BOM can legally be used at the start of a UTF-8 document, where it is more properly called a 'Unicode Signature'. Some text editors mark all their UTF-8 documents in this way (including Eclipse on Linux, apparently).

The Unicode standard 6.0 (http://www.unicode.org/versions/Unicode6.0.0/UnicodeStandard-6.0.pdf) says:

Unicode Signature. An initial BOM may also serve as an implicit marker to identify a file as
containing Unicode text. For UTF-16, the sequence FE₁₆ FF₁₆ (or its byte-reversed counterpart,
FF₁₆ FE₁₆) is exceedingly rare at the outset of text files that use other character
encodings. The corresponding UTF-8 BOM sequence, EF₁₆ BB₁₆ BF₁₆, is also exceedingly
rare. In either case, it is therefore unlikely to be confused with real text data. The same is
true for both single-byte and multibyte encodings.
Data streams (or files) that begin with the U+FEFF byte order mark are likely to contain
Unicode characters. It is recommended that applications sending or receiving untyped data
streams of coded characters use this signature. If other signaling methods are used, signatures
should not be employed.
Conformance to the Unicode Standard does not require the use of the BOM as such a signature.
See Section 16.8, Specials, for more information on the byte order mark and its use
as an encoding signature.

This paragraph could be taken to imply that UTF-8 with a BOM is rare, but that does not appear to be the case in the real world:

While there is obviously no need for a byte order signature when using UTF-8,
there are occasions when processes convert UTF-16 or UTF-32 data containing
a byte order mark into UTF-8. When represented in UTF-8, the byte order
mark turns into the byte sequence <EF BB BF>. Its usage at the beginning of a
UTF-8 data stream is neither required nor recommended by the Unicode Standard,
but its presence does not affect conformance to the UTF-8 encoding
scheme. Identification of the <EF BB BF> byte sequence at the beginning of a
data stream can, however, be taken as a near-certain indication that the data
stream is using the UTF-8 encoding scheme.
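
The byte sequences quoted above can be checked with a few lines of Python. This is just an illustration using the standard codecs; the helper name signature_bytes is made up here.

```python
ZWNBSP = "\ufeff"  # U+FEFF, the code point that doubles as the BOM


def signature_bytes(codec: str) -> bytes:
    """Encode U+FEFF with an explicit-order codec to see its byte sequence."""
    return ZWNBSP.encode(codec)


data = signature_bytes("utf-8") + "abc".encode("utf-8")
kept = data.decode("utf-8")         # plain UTF-8 keeps U+FEFF as a character
dropped = data.decode("utf-8-sig")  # utf-8-sig removes one leading signature
```

The two decode calls show the choice under discussion: the same bytes give either a leading ZWNBSP in the result or no trace of the signature at all.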

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Steve Hanson/UK/IBM
To: mbeckerle.dfdl@gmail.com
Cc: Tim Kimber/UK/IBM@IBMGB, Stephanie Fetzer/Charlotte/IBM@IBMUS
Date: 18/08/2011 09:03
Subject: Fw: BOM disposal

Hi Mike

I've re-read the BOM and UTF-8 material and I agree with you. Explicit modelling of a ZWNBS character suffices for UTF-8.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 18/08/2011 08:58 -----

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 17/08/2011 23:00
Subject: RE: BOM disposal

I think it is OK to add BOM control, but I think the reference to UTF-8 and BOMs is wrong. We should never encode a BOM into UTF-8, and if a ZWNBS is encoded in UTF-8, even as the first codepoint, it should never be considered a BOM and should always go into the infoset.

----- Forwarded by Steve Hanson/UK/IBM on 18/08/2011 08:58 -----

From: Steve Hanson/UK/IBM
To: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
Cc: "'Stephanie Fetzer'" <sfetzer@us.ibm.com>, Tim Kimber/UK/IBM@IBMGB
Date: 17/08/2011 17:57
Subject: RE: BOM disposal

Hi Mike

I've read below and also the historical e-mail that you forwarded.

I am happy that when U+FEFF is encountered at any place other than the start of a DFDL described document, then it is interpreted as ZWNBS.

But I am concerned that we are making life harder than it need be for modellers who have to handle Unicode documents that start with a BOM.

Take the simple example of wanting to read in a file in one encoding, look at the DFDL infoset in order to make some routing decision, and then send it on in a different encoding. As the spec stands, for all encodings except those with a BOM the modeller can create a single DFDL model that uses external variable $encoding to control the output. But once you make one of the document's encodings Unicode with the possibility of a BOM, then the model has to change to accommodate this in a non-trivial way. That's not very usable, and further I don't think it is in the spirit of another paragraph in RFC 2781...

All applications that process text with the "UTF-16" charset label
MUST be able to read at least the first two octets of the text and be
able to process those octets in order to determine the serialization
order of the text. Applications that process text with the "UTF-16"
charset label MUST NOT assume the serialization without first
checking the first two octets to see if they are a big-endian BOM, a
little-endian BOM, or not a BOM. All applications that process text
with the "UTF-16" charset label MUST be able to interpret both big-
endian and little-endian text.

Proposal:

On parsing: If encoding is set when starting to process the model, and is UTF-8, UTF-16 or UTF-32 (including BE/LE variants), then the DFDL parser looks for a BOM.
If a BOM is found at the very start of the document then it is not added to the infoset, and:
- UTF-16, UTF-32: The DFDL byteOrder property is ignored for text data of those encodings throughout the rest of the document and the BOM implies the byte order
- UTF-8: The BOM is ignored as byte order is not used anyway.
- LE/BE variants: Processing error, as this contravenes the Unicode standard.
If there is no BOM then byteOrder property behaves as currently stated for UTF-16 and UTF-32.
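
A minimal Python sketch of the parsing rules just listed, assuming the parser is handed the document bytes together with the encoding and byteOrder in force at the start. The function name start_of_document and its return shape are hypothetical.

```python
import codecs

# (BOM bytes, encoding family, byte order implied by the BOM)
_BOMS = [
    (codecs.BOM_UTF8, "UTF-8", None),
    (codecs.BOM_UTF16_BE, "UTF-16", "bigEndian"),
    (codecs.BOM_UTF16_LE, "UTF-16", "littleEndian"),
    (codecs.BOM_UTF32_BE, "UTF-32", "bigEndian"),
    (codecs.BOM_UTF32_LE, "UTF-32", "littleEndian"),
]


def start_of_document(data: bytes, encoding: str, byte_order: str):
    """Return (remaining bytes, byte order to use for text of this encoding)."""
    enc = encoding.upper()
    family = enc[:-2] if enc.endswith(("BE", "LE")) else enc
    if family not in ("UTF-8", "UTF-16", "UTF-32"):
        return data, byte_order  # not a UTF encoding: no BOM processing
    for bom, bom_family, bom_order in _BOMS:
        if bom_family == family and data.startswith(bom):
            if enc != family:
                # BE/LE variant with a BOM: processing error
                raise ValueError("BOM found with a BE/LE encoding variant")
            # The BOM is consumed here, so it never reaches the infoset.
            return data[len(bom):], bom_order or byte_order
    return data, byte_order  # no BOM: the byteOrder property applies
```

For UTF-8 the BOM is skipped and the byte order is passed through unchanged, since it is not used for that encoding anyway.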

On unparsing: If encoding is set when starting to process the model, and is UTF-8, UTF-16 or UTF-32 (excluding BE/LE variants), then the DFDL unparser optionally outputs a BOM, under the control of a new document-level property **, documentOutputBOM = yes/no. The BOM that is output depends on the setting of byteOrder.
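
The unparsing rule could be sketched in Python as follows. The function name document_prefix is hypothetical, and the proposed documentOutputBOM property is modelled as a plain boolean.

```python
import codecs


def document_prefix(encoding: str, byte_order: str, document_output_bom: bool) -> bytes:
    """Return the BOM bytes, if any, to write at the very start of the document."""
    enc = encoding.upper()
    if not document_output_bom or enc not in ("UTF-8", "UTF-16", "UTF-32"):
        # BE/LE variants and non-UTF encodings never get a BOM.
        return b""
    if enc == "UTF-8":
        return codecs.BOM_UTF8
    big = byte_order == "bigEndian"
    if enc == "UTF-16":
        return codecs.BOM_UTF16_BE if big else codecs.BOM_UTF16_LE
    return codecs.BOM_UTF32_BE if big else codecs.BOM_UTF32_LE
```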

There is one issue with this. I deliberately used the phrase 'if encoding is set when starting to process the model'. We have to define what this means. DFDL encoding applies to all text elements and all objects that have text delimiters. One option is to say that BOM processing only takes place if encoding is actually to be used by the first element in the model. So if I started my data with binary data that did not have an initiator then no BOM processing would take place. Another option is to say that BOM processing only takes place if there is a default dfdl:format in the xsd with encoding set (then you can imagine the BOM as an implicit hidden optional element that gets encoding from scope).

** (We already have document-level properties, e.g. documentFinalTerminatorCanBeMissing.)

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB, "'Stephanie Fetzer'" <sfetzer@us.ibm.com>
Date: 15/08/2011 21:54
Subject: RE: BOM disposal

I stand corrected on the BOM character. This ZWNBS stuff means it *is* a character, regardless of the Unicode folks having deprecated it (see http://en.wikipedia.org/wiki/Zero-width_non-breaking_space), or their goal of BOMs somehow being non-characters.

Though my guess is that it mostly would come up because UTF-16 with BOM was converted to UTF-8, with the BOM at the front converted to the UTF-8 encoding of a BOM. Concatenate some of these, and you’ll have ZWNBS characters embedded in the string.

I think there are a flock more cases beyond the ones Tim enumerated, having to do with whether you remove the BOM or it takes up space in the string. E.g., if I have fixed length data with properties that say there is an optional BOM, is that data now variable length? I’d rather not go there. If I ask the length in characters of a string, do I count BOMs or not?

Either way, the point is that there is good reason to just treat these BOM/ZWNBS as characters, and to just fix the language in the spec about UTF-8 BOMs, which is just fixing a turn of phrase.

Stripping these characters out, that’s a calculation an application can easily do. (I could be talked into an XPath function in DFDL to do exactly this.)

The 2nd paragraph about BOMs in the spec mentions they can be modeled. I believe the BOM-based behaviors described in Tim’s mail can all be modeled relatively easily as separate elements. The value of the byteOrder property can then be computed with an expression that references those elements. (I am assuming we allow byteOrder to be computed...) To be concrete about it:

E.g.,

<!-- 254 = 0xFE, 255 = 0xFF. unsignedByte is used because these values do not fit in a signed byte. -->
<sequence>
  <element name="bom1" type="unsignedByte" dfdl:representation="binary"
           dfdl:outputValueCalc="{ 254 }"/>
  <element name="bom2" type="unsignedByte" dfdl:representation="binary"
           dfdl:outputValueCalc="{ 255 }"/>
  <element name="data" type="string" dfdl:encoding="utf-16"
           dfdl:byteOrder="{ if (../bom1 = 254 and ../bom2 = 255) then 'bigEndian'
                             else if (../bom1 = 255 and ../bom2 = 254) then 'littleEndian'
                             else error('no BOM found') }"/>
</sequence>

One could even create a situation where BOMs are accepted and tolerated:

<choice>
  <!-- the above sequence is one arm of the choice -->
  <element name="data" type="string" dfdl:encoding="utf-16be"/>
</choice>

This would cause a BOM to be accepted and used if present, and default to bigEndian otherwise. Output would always be bigEndian.

With some clever use of variables and type definitions, I suspect this can even be made reasonably compact.

These things are clumsy, but the alternative is more properties, and of all the cases Tim enumerated, we’re not even sure we have them all, or if anyone will use them.

Some much earlier DFDL draft had a unicodeByteOrderMarkPolicy property. I believe it was dropped for lack of clarity on exactly what the use cases needed to be. Its values were something like ‘prohibited’, ‘tolerated’, ‘required’, ‘ignored’, ‘generated’, or some enumeration like that.

…mikeb

From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Wednesday, July 27, 2011 4:32 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson; Stephanie Fetzer
Subject: BOM disposal

Key points about BOMs are:
- For all Unicode encodings, the "Zero Width Non-breaking Space" character corresponds to the byte sequence of a BOM, but...
- a BOM is not considered to be a part of the data

My own assumptions about BOMs are:
- some input documents will have a BOM by accident, just because the application that wrote it did not explicitly tell the encoder to omit the BOM.
- some users will expect a BOM at the start of an input document to be honoured
- most users will be surprised if they get a ZWNBSP in the info set. Some may even get a little annoyed if they find that they cannot prevent it, because the Unicode specification is pretty clear that BOMs are not data.

I think we need to modify the DFDL rules about handling of BOMs. I don't have all the answers, but I do think the following scenarios are likely to crop up:
Parsing:
a) there is a BOM at the start of the input document.{1} The user wants the DFDL parser to act as though the dfdl:encoding external variable had been set to the encoding implied by the BOM.
b) there is sometimes a BOM at the start of the input document. The character encoding is defined by the schema so the BOM is redundant. The user doesn't care whether it is there or not, and would like DFDL to completely ignore it.
c) at some point within the document ( not at the start ) there is a BOM at the beginning of an element. The user wants the BOM to be ignored.
d) at some point within the document ( not at the start ) there is a BOM at the beginning of an element. The user wants the encoding of the element to be defined by the BOM
e) the user wants a BOM to be treated exactly like an ordinary character ( probably with the aim of ensuring that the document round-trips without losing BOMs ).

Serializing
f) the user always wants the output document to start with a BOM when the encoding is one of the Unicode encodings
g) the user wants an element within the document to start with a BOM that signals its encoding

Feel free to come up with other scenarios if you think I've missed any.

{1} I think I've done quite well to avoid any Monty Python 'Life of Brian' references so far...

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

----- Forwarded by Tim Kimber/UK/IBM on 27/07/2011 20:59 -----

From: Steve Hanson/UK/IBM
To: mbeckerle.dfdl@gmail.com
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 27/07/2011 19:15
Subject: OGF DFDL WG Call Agenda 2011-08-09

Hi Mike

I've posted a draft agenda on GridForge below for 9th Aug call.

The last of the spec issues you raised concerned section 12.3.7.1.3 about BOMs. I know that Tim is not happy with this either, and has done some thinking in this area. However he is on vacation 9th Aug. It might be worth you two getting together before then and discussing?

-----------------------------------------------------------------------------------------------------------

Please find agenda for the above call on GridForge at:

http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/docman.root.current_0.calls/doc16305/1

As per action 144 an erratum to the spec has been created here: http://forge.gridforum.org/sf/go/doc16280?nav=1

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
