[DFDL-WG] Action 151 - BOM handling

5 Sep 2011

Please find below a proposal from the WG to simplify modeling Unicode
documents that start with a BOM.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

-----------------------------------------------------------------------------------------------------------------------
Infoset
New property added to the document information item:

[unicodeByteOrderMark] String. When the encoding of the root element of 
the document is exactly UTF-8, UTF-16, or UTF-32 (or CCSID equivalent), 
the value indicates whether the document starts with a byte order mark 
(BOM). If there is a BOM then for UTF-8 encoding the value is 'UTF-8'; for 
UTF-16 encoding the value is 'UTF-16LE' or 'UTF-16BE'; for UTF-32 the 
value is 'UTF-32LE' or 'UTF-32BE'. If there is no BOM then the value is 
empty. When the encoding of the root element of the document is any other 
encoding, the value is empty. 
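
As an illustration only, the rule for the property value can be sketched 
in Python. The function name and table below are hypothetical and not part 
of the proposal, and CCSID equivalents are not handled.

```python
# BOM byte sequences for each encoding that triggers the check, paired with
# the [unicodeByteOrderMark] value each one produces.
BOMS = {
    "UTF-8": [(b"\xef\xbb\xbf", "UTF-8")],
    "UTF-16": [(b"\xfe\xff", "UTF-16BE"), (b"\xff\xfe", "UTF-16LE")],
    "UTF-32": [(b"\x00\x00\xfe\xff", "UTF-32BE"),
               (b"\xff\xfe\x00\x00", "UTF-32LE")],
}

def unicode_byte_order_mark(root_encoding, data):
    """Return the [unicodeByteOrderMark] value, or '' when there is none.

    Hypothetical helper: any encoding other than exactly UTF-8, UTF-16 or
    UTF-32 (including UTF-16BE and friends) always yields the empty value.
    """
    for bom, value in BOMS.get(root_encoding.upper(), []):
        if data.startswith(bom):
            return value
    return ""
```

Because only the BOMs of the root element's own encoding are tested, the 
UTF-32LE BOM (FF FE 00 00) is never mistaken for the UTF-16LE BOM (FF FE).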

Grammar
The grammar production for the overall document changes to:

        Document = UnicodeByteOrderMark  Element 

Parsing
When the dfdl:encoding property of the root element is specified, and is 
exactly one of UTF-8, UTF-16, or UTF-32 (or CCSID equivalents), then a 
DFDL parser will look for the appropriate byte order mark (BOM) as the 
very first bytes in the data stream. 

UTF-8.  If a BOM is found then this is used to set the document 
information item 'unicodeByteOrderMark' property. If no BOM is found the 
parser takes no action. There is no need to model the BOM explicitly. 

UTF-16.  If a BOM is found then this is used to set the document 
information item 'unicodeByteOrderMark' property value, and all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly. 

UTF-32.  If a BOM is found then this is used to set the document 
information item 'unicodeByteOrderMark' property value, and all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
the implied byte order. If no BOM is found then all data with 
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have 
big-endian byte order. There is no need to model the BOM explicitly. 
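
A minimal sketch of the UTF-16 case, assuming a hypothetical helper that 
is handed the start of the data stream. It uses Python's codec names in 
place of a real DFDL parser's state; UTF-32 would follow the same pattern 
with 4-byte BOMs.

```python
import codecs

def resolve_utf16(data):
    """Return (unicodeByteOrderMark value, codec name, remaining bytes).

    Hypothetical helper: the codec name stands for the byte order applied
    to all later data whose dfdl:encoding is UTF-16.
    """
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "UTF-16BE", "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "UTF-16LE", "utf-16-le", data[2:]
    # No BOM found: the proposal assumes big-endian byte order.
    return "", "utf-16-be", data

mark, codec, rest = resolve_utf16(b"\xff\xfeA\x00")
decoded = rest.decode(codec)
```

The BOM is consumed by the parser, so it never reaches the modelled 
elements and does not need to appear in the schema.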

When the dfdl:encoding property of the root element is specified, and is 
exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID 
equivalents), then a DFDL parser will not look for a BOM. The 
byte order to use is implicit in the encoding. If a BOM does appear at the 
start of the data stream, then it must be explicitly modelled as such, 
otherwise if parsed as part of an xs:string it will be interpreted as a 
zero-width non-breaking space (ZWNBS) character.

The dfdl:byteOrder property is never used to establish the byte order for 
Unicode encodings. 

The parser never looks for a BOM at any other point in the data stream. If 
a BOM does appear at any other point, then it must be explicitly modelled, 
otherwise if parsed as part of an xs:string it will be interpreted as a 
ZWNBS character.
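
The ZWNBS behaviour can be seen with any decoder that is told the byte 
order up front. The snippet below uses Python's 'utf-16-be' codec purely 
as a stand-in for a DFDL parser reading an xs:string with encoding 
UTF-16BE; the variable names are illustrative.

```python
# Bytes FE FF (a big-endian BOM) followed by the letter 'A'.
raw = bytes.fromhex("feff0041")

# With an explicit byte order the decoder does not treat FE FF specially,
# so it surfaces as the character U+FEFF at the start of the string.
text = raw.decode("utf-16-be")
starts_with_zwnbs = text[0] == "\ufeff"
```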

Unparsing
When the dfdl:encoding property of the root element is specified, and is 
exactly one of UTF-8, UTF-16 or UTF-32 (or CCSID equivalents), then a DFDL 
unparser will look in the infoset document information item for a byte 
order mark (BOM). 

UTF-8.  If the document information item 'unicodeByteOrderMark' property 
value is 'UTF-8', the UTF-8 BOM is output as the very first bytes in the 
data stream. If the property is empty then no BOM is output.  If the 
property has any other value, it is a processing error. There is no need to 
model the BOM explicitly. 

UTF-16.  If the document information item 'unicodeByteOrderMark' property 
is 'UTF-16LE' or 'UTF-16BE', the corresponding UTF-16 BOM is output as the 
very first bytes in the data stream, and all data with dfdl:encoding 
UTF-16 throughout the rest of the document will be output with the implied 
byte order. If the property is empty then no BOM is output, and all data 
with dfdl:encoding UTF-16 throughout the rest of the document are output 
with big-endian byte order. If the property has any other value, it is 
a processing error. There is no need to model the BOM explicitly. 

UTF-32.  If the document information item 'unicodeByteOrderMark' property 
is 'UTF-32LE' or 'UTF-32BE', the corresponding UTF-32 BOM is output as the 
very first bytes in the data stream, and all data with dfdl:encoding 
UTF-32 throughout the rest of the document will be output with the implied 
byte order. If the property is empty then no BOM is output, and all data 
with dfdl:encoding UTF-32 throughout the rest of the document are output 
with big-endian byte order. If the property has any other value, it is 
a processing error. There is no need to model the BOM explicitly.
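
The three unparsing cases above can be sketched as one lookup. This is an 
illustration under assumptions: the function, the table and the 
ProcessingError class are hypothetical names, and only the root encodings 
UTF-8, UTF-16 and UTF-32 are covered.

```python
import codecs

# BOM bytes permitted for each root-element encoding, keyed by the value of
# the 'unicodeByteOrderMark' property.
ALLOWED = {
    "UTF-8": {"UTF-8": codecs.BOM_UTF8},
    "UTF-16": {"UTF-16BE": codecs.BOM_UTF16_BE,
               "UTF-16LE": codecs.BOM_UTF16_LE},
    "UTF-32": {"UTF-32BE": codecs.BOM_UTF32_BE,
               "UTF-32LE": codecs.BOM_UTF32_LE},
}

class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""

def bom_for_output(root_encoding, bom_property):
    """Return (bytes to write first, byte order for later data).

    The byte order is None for UTF-8, where it has no meaning.
    """
    allowed = ALLOWED[root_encoding]
    if bom_property and bom_property not in allowed:
        raise ProcessingError(
            f"{bom_property!r} is not valid for {root_encoding}")
    if root_encoding == "UTF-8":
        order = None
    elif bom_property.endswith("LE"):
        order = "little"
    else:
        order = "big"   # a 'BE' value, or an empty property
    return allowed.get(bom_property, b""), order
```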

When the dfdl:encoding property of the root element is specified, and is 
exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID 
equivalents), then a DFDL unparser will not look in the infoset document 
information item for a BOM and will not output a BOM. The byte order to 
use is implicit in the encoding. If a BOM does need to be output at the 
start of the data stream, then it must be explicitly modelled as such. 

The dfdl:byteOrder property is never used to establish the byte order for 
Unicode encodings. 

The unparser never outputs a BOM at any other point in the data stream. If 
a BOM needs to appear, then it must be explicitly modelled as such. 

-----------------------------------------------------------------------------

Explicit modelling of a BOM.

Below is a natural way to model a Unicode data stream with an optional 
BOM. The parser starts with root element 'doc' and finds that the (scoped) 
dfdl:encoding is 'UTF-16BE', from the variable's default, so it does not 
look for a BOM automatically. It then parses element 'bom' and sets the 
variable according to the value. If there is no BOM in the data then the 
assert fails and the parser backtracks (note the use of minOccurs="0"). 
Subsequent parsing of element 'data' uses the variable value.

The model would also work if the variable's default had been 'UTF-16'. The 
parser looks for the BOM automatically at the start of the stream, finds 
it, and sets the byte order. Then it parses element 'bom' but the assert 
fails and the parser backtracks. Subsequent parsing of element 'data' uses 
the variable value, which will be the default of 'UTF-16', and the byte 
order as per the BOM. 

(Some DFDL annotation syntax removed for clarity)

<xs:schema ...>

        <dfdl:format encoding="{$myEncoding}" ... />

        <dfdl:defineVariable name="myEncoding" type="xs:string" default="UTF-16BE"/>

        <xs:element name="doc">
                <xs:complexType>
                        <xs:sequence>
                                <xs:element name="bom" type="xs:hexBinary" minOccurs="0" dfdl:lengthKind="explicit" dfdl:length="2">
                                        <dfdl:assert test="{. eq x'FFFE' or . eq x'FEFF'}" />
                                        <dfdl:setVariable ref="myEncoding" value="{if (. eq x'FFFE') then 'UTF-16LE' else 'UTF-16BE'}" />
                                </xs:element>
                                <xs:element name="data">
                                        <xs:complexType>
                                                <xs:sequence>
                                                        ...
                                                </xs:sequence>
                                        </xs:complexType>
                                </xs:element>
                        </xs:sequence>
                </xs:complexType>
        </xs:element>

</xs:schema>
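
The control flow of this model, with the variable defaulting to 
'UTF-16BE', can be imitated in a few lines of Python. This is only a 
sketch under assumptions: parse_doc is a hypothetical name, and element 
'data' is simplified to a single UTF-16 string, which the schema above 
does not state.

```python
def parse_doc(stream):
    """Imitate the explicit-BOM model on a byte string."""
    encoding = "utf-16-be"      # default of the myEncoding variable
    pos = 0

    # Element 'bom': two bytes of hexBinary, accepted only if the assert
    # holds, that is, only if they are FF FE or FE FF.
    candidate = stream[:2]
    if candidate in (b"\xff\xfe", b"\xfe\xff"):
        pos = 2
        if candidate == b"\xff\xfe":
            encoding = "utf-16-le"      # setVariable to 'UTF-16LE'
    # Otherwise the assert fails, the parser backtracks, and 'bom' is
    # treated as absent (minOccurs="0"); pos stays at 0.

    # Element 'data': parsed with whatever encoding the variable now holds.
    return stream[pos:].decode(encoding)
```

With this sketch, little-endian input with a BOM, big-endian input with a 
BOM, and big-endian input without a BOM all decode to the same text.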
