Please find a proposal from the WG to simplify modeling Unicode documents that start with a BOM.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

-----------------------------------------------------------------------------------------------------------------------
Infoset

New property added to the document information item:

[unicodeByteOrderMark] String. When the encoding of the root element of the document is exactly UTF-8, UTF-16, or UTF-32 (or CCSID equivalent), the value indicates whether the document starts with a byte order mark (BOM). If there is a BOM then for UTF-8 encoding the value is 'UTF-8'; for UTF-16 encoding the value is 'UTF-16LE' or 'UTF-16BE'; for UTF-32 encoding the value is 'UTF-32LE' or 'UTF-32BE'. If there is no BOM then the value is empty. When the encoding of the root element of the document is any other encoding, the value is empty.
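
For reference, the five non-empty values of the property correspond to the standard Unicode BOM byte sequences. The following Python sketch simply lists them; the name BOM_BYTES is made up for illustration and is not part of the proposal.

```python
import codecs

# Non-empty values of [unicodeByteOrderMark] and the BOM bytes each one denotes.
# BOM_BYTES is a hypothetical name used only in this sketch.
BOM_BYTES = {
    "UTF-8": codecs.BOM_UTF8,         # EF BB BF
    "UTF-16LE": codecs.BOM_UTF16_LE,  # FF FE
    "UTF-16BE": codecs.BOM_UTF16_BE,  # FE FF
    "UTF-32LE": codecs.BOM_UTF32_LE,  # FF FE 00 00
    "UTF-32BE": codecs.BOM_UTF32_BE,  # 00 00 FE FF
}

for value, bom in BOM_BYTES.items():
    print(f"{value:9} {bom.hex(' ').upper()}")
```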

Grammar

The grammar production for the overall document changes to:

Document = UnicodeByteOrderMark Element

Parsing

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-8, UTF-16, or UTF-32 (or CCSID equivalents), then a DFDL parser will look for the appropriate byte order mark (BOM) as the very first bytes in the data stream.

UTF-8. If a BOM is found then this is used to set the document information item 'unicodeByteOrderMark' property. If no BOM is found the parser takes no action. There is no need to model the BOM explicitly.

UTF-16. If a BOM is found then this is used to set the document information item 'unicodeByteOrderMark' property value, and all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.

UTF-32. If a BOM is found then this is used to set the document information item 'unicodeByteOrderMark' property value, and all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have the implied byte order. If no BOM is found then all data with dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have big-endian byte order. There is no need to model the BOM explicitly.
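
The three cases above can be sketched in Python as follows. This is only an illustration of the proposed rules, not how any DFDL processor is implemented; the function name detect_bom, its return shape, and the use of None to mean "no byte order established by a BOM" are assumptions of the sketch.

```python
import codecs

# Candidate BOMs per root encoding: (infoset value, BOM bytes, implied byte order).
# All names here are hypothetical and exist only for this sketch.
_CANDIDATES = {
    "UTF-8": [("UTF-8", codecs.BOM_UTF8, None)],
    "UTF-16": [("UTF-16LE", codecs.BOM_UTF16_LE, "little"),
               ("UTF-16BE", codecs.BOM_UTF16_BE, "big")],
    "UTF-32": [("UTF-32LE", codecs.BOM_UTF32_LE, "little"),
               ("UTF-32BE", codecs.BOM_UTF32_BE, "big")],
}


def detect_bom(root_encoding, data):
    """Return (unicodeByteOrderMark, byte_order, bytes_consumed)."""
    candidates = _CANDIDATES.get(root_encoding)
    if candidates is None:
        # Any other encoding (including UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE):
        # the parser does not look for a BOM at all.
        return "", None, 0
    for value, bom, order in candidates:
        if data.startswith(bom):
            return value, order, len(bom)
    # No BOM found: UTF-16 and UTF-32 default to big-endian; UTF-8 has no byte order.
    default_order = None if root_encoding == "UTF-8" else "big"
    return "", default_order, 0


print(detect_bom("UTF-16", b"\xff\xfeA\x00"))  # BOM present, little-endian
print(detect_bom("UTF-16", b"\x00A"))          # no BOM, big-endian assumed
```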

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID equivalents), then a DFDL parser will not look for a BOM. The byte order to use is implicit in the encoding. If a BOM does appear at the start of the data stream, then it must be explicitly modelled as such, otherwise if parsed as part of an xs:string it will be interpreted as a zero-width non-breaking space (ZWNBS) character.

The dfdl:byteOrder property is never used to establish the byte order for Unicode encodings.

The parser never looks for a BOM at any other point in the data stream. If a BOM does appear at any other point, then it must be explicitly modelled, otherwise if parsed as part of an xs:string it will be interpreted as a ZWNBS character.
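
Python's built-in codecs happen to show the same distinction, which may help when reasoning about an unmodelled BOM. The snippet below is only an analogy and says nothing about DFDL processors: decoding with an explicit byte order leaves U+FEFF in the string, whereas the generic UTF-16 codec consumes the BOM.

```python
data = b"\xfe\xff\x00A"  # UTF-16 big-endian BOM followed by the letter 'A'

# Explicit byte order: the BOM is not treated specially and decodes as U+FEFF.
as_utf16be = data.decode("utf-16-be")

# Generic UTF-16: the codec consumes the BOM and uses it to pick the byte order.
as_utf16 = data.decode("utf-16")

print(repr(as_utf16be))  # '\ufeffA'
print(repr(as_utf16))    # 'A'
```

One difference from the proposal: when no BOM is present, Python's generic "utf-16" codec falls back to the platform's native byte order, whereas the rule proposed here is big-endian.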

Unparsing

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-8, UTF-16 or UTF-32 (or CCSID equivalents), then a DFDL unparser will look in the infoset document information item for a byte order mark (BOM).

UTF-8. If the document information item 'unicodeByteOrderMark' property value is 'UTF-8', the UTF-8 BOM is output as the very first bytes in the data stream. If the property is empty then no BOM is output. If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.

UTF-16. If the document information item 'unicodeByteOrderMark' property is 'UTF-16LE' or 'UTF-16BE', the corresponding UTF-16 BOM is output as the very first bytes in the data stream, and all data with dfdl:encoding UTF-16 throughout the rest of the document will be output with the implied byte order. If the property is empty then no BOM is output, and all data with dfdl:encoding UTF-16 throughout the rest of the document are output with big-endian byte order. If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.

UTF-32. If the document information item 'unicodeByteOrderMark' property is 'UTF-32LE' or 'UTF-32BE', the corresponding UTF-32 BOM is output as the very first bytes in the data stream, and all data with dfdl:encoding UTF-32 throughout the rest of the document will be output with the implied byte order. If the property is empty then no BOM is output, and all data with dfdl:encoding UTF-32 throughout the rest of the document are output with big-endian byte order. If the property has any other value, it is a processing error. There is no need to model the BOM explicitly.
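
The unparsing rules above can be sketched in the same way. Again this is an illustration under assumptions rather than a definitive implementation: bom_to_output, ProcessingError, and the (bom_bytes, byte_order) return shape are hypothetical names chosen for the sketch.

```python
import codecs


class ProcessingError(Exception):
    """Stands in for a DFDL processing error (hypothetical class for this sketch)."""


# Per root encoding: permitted property values -> (BOM bytes, implied byte order).
_ALLOWED = {
    "UTF-8": {"UTF-8": (codecs.BOM_UTF8, None)},
    "UTF-16": {"UTF-16LE": (codecs.BOM_UTF16_LE, "little"),
               "UTF-16BE": (codecs.BOM_UTF16_BE, "big")},
    "UTF-32": {"UTF-32LE": (codecs.BOM_UTF32_LE, "little"),
               "UTF-32BE": (codecs.BOM_UTF32_BE, "big")},
}


def bom_to_output(root_encoding, bom_property):
    """Return (bom_bytes, byte_order) for the start of the output stream."""
    allowed = _ALLOWED.get(root_encoding)
    if allowed is None:
        # Any other encoding: the infoset property is not consulted, no BOM is output.
        return b"", None
    if bom_property == "":
        # Empty property: no BOM; UTF-16 and UTF-32 are output big-endian.
        return b"", (None if root_encoding == "UTF-8" else "big")
    if bom_property not in allowed:
        raise ProcessingError(
            f"unicodeByteOrderMark {bom_property!r} is not valid for {root_encoding}")
    return allowed[bom_property]


print(bom_to_output("UTF-16", "UTF-16LE"))  # BOM FF FE, little-endian output
print(bom_to_output("UTF-16", ""))          # no BOM, big-endian output
```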

When the dfdl:encoding property of the root element is specified, and is exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID equivalents), then a DFDL unparser will not look in the infoset document information item for a BOM and will not output a BOM. The byte order to use is implicit in the encoding. If a BOM does need to be output at the start of the data stream, then it must be explicitly modelled as such.

The dfdl:byteOrder property is never used to establish the byte order for Unicode encodings.

The unparser never outputs a BOM at any other point in the data stream. If a BOM needs to appear, then it must be explicitly modelled as such.

-----------------------------------------------------------------------------

Explicit modelling of a BOM.

Below is a natural way to model a Unicode data stream with an optional BOM. The parser starts with root element 'doc' and finds that the (scoped) dfdl:encoding is 'UTF-16BE' from the variable's default, so it does not look for a BOM automatically. Then it parses element 'bom' and sets the variable according to the value. If there is no BOM in the data then the assert fails and the parser backtracks (note use of minOccurs="0"). Subsequent parsing of element 'data' uses the variable value.

The model would also work if the variable's default had been 'UTF-16'. The parser automatically looks for the BOM at the start of the stream, finds it, and sets the byte order. Then it parses element 'bom', but the assert fails and the parser backtracks. Subsequent parsing of element 'data' uses the variable value, which will be the default of 'UTF-16', with byte order as per the BOM.

(Some DFDL annotation syntax removed for clarity)

<xs:schema ...>

  <dfdl:format encoding="{$myEncoding}" ... />

  <dfdl:defineVariable name="myEncoding" type="xs:string" default="UTF-16BE"/>

  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <!-- Optional 2-byte BOM; if the assert fails the parser backtracks -->
        <xs:element name="bom" type="xs:hexBinary" minOccurs="0" dfdl:lengthKind="explicit" dfdl:length="2">
          <dfdl:assert test="{. eq x'FFFE' or . eq x'FEFF'}" />
          <!-- FF FE is the little-endian BOM, FE FF the big-endian BOM -->
          <dfdl:setVariable ref="myEncoding" value="{if (. eq x'FFFE') then 'UTF-16LE' else 'UTF-16BE'}" />
        </xs:element>
        <xs:element name="data">
          <xs:complexType>
            <xs:sequence>
              ...
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

</xs:schema>
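
The control flow of this explicit model (with the default of 'UTF-16BE') can be mimicked in a few lines of Python. This is a rough sketch only: parse_doc is a hypothetical function, backtracking is reduced to "do not consume the two bytes", mapping a FE FF BOM to 'UTF-16BE' is an assumption of the sketch, and element 'data' is treated as a single string purely for illustration.

```python
# Map the DFDL-style encoding names used in the model to Python codec names.
_PY_CODEC = {"UTF-16LE": "utf-16-le", "UTF-16BE": "utf-16-be"}


def parse_doc(data, default_encoding="UTF-16BE"):
    """Return (bom, my_encoding, text), mimicking elements 'bom' and 'data'."""
    my_encoding = default_encoding       # dfdl:defineVariable default
    candidate = data[:2]                 # element 'bom': 2 bytes of xs:hexBinary
    if candidate in (b"\xff\xfe", b"\xfe\xff"):
        # The assert succeeds: keep 'bom' and set the variable from its value.
        bom = candidate
        my_encoding = "UTF-16LE" if candidate == b"\xff\xfe" else "UTF-16BE"
        rest = data[2:]
    else:
        # The assert fails: backtrack, so 'bom' is absent and no bytes are consumed.
        bom = None
        rest = data
    text = rest.decode(_PY_CODEC[my_encoding])  # element 'data'
    return bom, my_encoding, text


print(parse_doc(b"\xff\xfeH\x00i\x00"))  # little-endian BOM present
print(parse_doc(b"\x00H\x00i"))          # no BOM, default encoding used
```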

