Please find below a proposal from the WG to simplify the modeling of
Unicode documents that start with a byte order mark (BOM).
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
-----------------------------------------------------------------------------------------------------------------------
Infoset
New property added to the document information item:
[unicodeByteOrderMark] String.
When the encoding of the root element of the document is exactly UTF-8,
UTF-16, or UTF-32 (or CCSID equivalent), the value indicates whether the
document starts with a byte order mark (BOM). If there is a BOM then for
UTF-8 encoding the value is 'UTF-8'; for UTF-16 encoding the value is 'UTF-16LE'
or 'UTF-16BE'; for UTF-32 encoding the value is 'UTF-32LE' or 'UTF-32BE'. If there
is no BOM then the value is empty. When the encoding of the root element
of the document is any other encoding, the value is empty.
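To make the value rules concrete, here is a minimal Python sketch of how the property value could be derived from the first bytes of a document. The function and table names are hypothetical and are not part of the proposal; the encoding name is matched exactly as written, and CCSID equivalents are not handled.

```python
import codecs

# Hypothetical lookup: root-element encoding -> (BOM bytes, property value) pairs.
BOM_VALUES = {
    "UTF-8": [(codecs.BOM_UTF8, "UTF-8")],
    "UTF-16": [(codecs.BOM_UTF16_LE, "UTF-16LE"), (codecs.BOM_UTF16_BE, "UTF-16BE")],
    "UTF-32": [(codecs.BOM_UTF32_LE, "UTF-32LE"), (codecs.BOM_UTF32_BE, "UTF-32BE")],
}


def unicode_byte_order_mark(root_encoding: str, data: bytes) -> str:
    """Return the [unicodeByteOrderMark] value, or '' when the value is empty."""
    for bom, value in BOM_VALUES.get(root_encoding, []):
        if data.startswith(bom):
            return value
    # No BOM found, or the root encoding is not exactly UTF-8, UTF-16 or UTF-32.
    return ""
```

Note that a root encoding of 'UTF-16BE' gives an empty value even when the data starts with FE FF, because only the three unqualified encodings are examined.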
Grammar
The grammar production for the overall document changes to:
Document = UnicodeByteOrderMark Element
Parsing
When the dfdl:encoding property of
the root element is specified, and is exactly one of UTF-8, UTF-16, or
UTF-32 (or CCSID equivalents), then a
DFDL parser will look for the appropriate byte order mark (BOM) as the
very first bytes in the data stream.
- UTF-8. If a BOM is found then
this is used to set the document information item 'unicodeByteOrderMark'
property. If no BOM is found the parser takes no action. There is no need
to model the BOM explicitly.
- UTF-16. If a BOM is found then
this is used to set the document information item 'unicodeByteOrderMark'
property value, and all data with dfdl:encoding UTF-16 throughout the rest
of the stream are assumed to have the implied byte order. If no BOM is
found then all data with dfdl:encoding UTF-16 throughout the rest of the
stream are assumed to have big-endian byte order. There is no need to model
the BOM explicitly.
- UTF-32. If a BOM is found then
this is used to set the document information item 'unicodeByteOrderMark'
property value, and all data with dfdl:encoding UTF-32 throughout the rest
of the stream are assumed to have the implied byte order. If no BOM is
found then all data with dfdl:encoding UTF-32 throughout the rest of the
stream are assumed to have big-endian byte order. There is no need to model
the BOM explicitly.
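The three cases above can be sketched in Python as follows. This is only an illustration under stated assumptions: the names are hypothetical, the BOM bytes are assumed to be consumed by the parser (since they need not be modelled), and Python codec names stand in for dfdl:encoding values.

```python
# Hypothetical table: root encoding -> {BOM bytes: encoding implied by that BOM}.
_BOMS = {
    "UTF-8": {b"\xef\xbb\xbf": "UTF-8"},
    "UTF-16": {b"\xff\xfe": "UTF-16LE", b"\xfe\xff": "UTF-16BE"},
    "UTF-32": {b"\xff\xfe\x00\x00": "UTF-32LE", b"\x00\x00\xfe\xff": "UTF-32BE"},
}
# Byte order assumed when no BOM is found (big-endian for UTF-16 and UTF-32).
_DEFAULTS = {"UTF-8": "UTF-8", "UTF-16": "UTF-16BE", "UTF-32": "UTF-32BE"}


def start_parse(root_encoding, data):
    """Return (unicodeByteOrderMark value, codec for the stream, bytes consumed)."""
    for bom, implied in _BOMS.get(root_encoding, {}).items():
        if data.startswith(bom):
            return implied, implied, len(bom)
    # No BOM found: the property stays empty and the default byte order applies.
    return "", _DEFAULTS.get(root_encoding, root_encoding), 0


def decode_document(root_encoding, data):
    """Decode the whole stream as text, skipping any BOM found at the start."""
    prop, codec, offset = start_parse(root_encoding, data)
    return prop, data[offset:].decode(codec)
```

For example, decode_document("UTF-16", b"\xff\xfeA\x00") yields the property value 'UTF-16LE' and the text 'A', while the BOM-less stream b"\x00A" yields an empty property value and is read as big-endian.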
When the dfdl:encoding property of
the root element is specified, and is exactly one of UTF-16LE, UTF-16BE,
UTF-32LE or UTF-32BE (or CCSID equivalents), then a
DFDL parser will not look for a BOM. The byte order to
use is implicit in the encoding. If a BOM does appear at the start of the
data stream, then it must be explicitly modelled as such, otherwise if
parsed as part of an xs:string it will be interpreted as a zero-width non-breaking
space (ZWNBS) character.
The dfdl:byteOrder property is never
used to establish the byte order for Unicode encodings.
The parser never looks for a BOM at
any other point in the data stream. If a BOM does appear at any other point,
then it must be explicitly modelled, otherwise if parsed as part of an
xs:string it will be interpreted as a ZWNBS character.
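The ZWNBS behaviour described above is easy to observe outside DFDL. The short Python illustration below uses Python's own 'utf-16-be' codec as a stand-in for a DFDL parser reading an xs:string with an explicit byte-order encoding; it is an analogy only, not DFDL processor behaviour.

```python
# UTF-16BE BOM bytes (FE FF) followed by 'A', decoded with an explicit byte order.
at_start = b"\xfe\xff\x00A".decode("utf-16-be")

# The same two bytes appearing between 'A' and 'B' later in the stream.
mid_stream = b"\x00A\xfe\xff\x00B".decode("utf-16-be")

# In both cases the bytes survive as the character U+FEFF inside the string.
print(at_start == "\ufeffA")     # True
print(mid_stream == "A\ufeffB")  # True
```

Modelling the BOM explicitly, for example as a 2-byte xs:hexBinary element, keeps that character out of the string data.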
Unparsing
When the dfdl:encoding property of
the root element is specified, and is exactly one of UTF-8, UTF-16 or UTF-32
(or CCSID equivalents), then a DFDL
unparser will look in the infoset document information item for a byte
order mark (BOM).
- UTF-8. If the document information
item 'unicodeByteOrderMark' property value is 'UTF-8', the UTF-8 BOM is
output as the very first bytes in the data stream. If the property is empty
then no BOM is output. If the property has any other value, it is
a processing error. There is no need to model the BOM explicitly.
- UTF-16. If the document information
item 'unicodeByteOrderMark' property is 'UTF-16LE' or 'UTF-16BE', the corresponding
UTF-16 BOM is output as the very first bytes in the data stream, and all
data with dfdl:encoding UTF-16 throughout the rest of the document will
be output with the implied byte order. If the property is empty then no
BOM is output, and all data with dfdl:encoding UTF-16 throughout the rest
of the document will be output with big-endian byte order. If the property
has any other value, it is a processing error. There is no need to model
the BOM explicitly.
- UTF-32. If the document information
item 'unicodeByteOrderMark' property is 'UTF-32LE' or 'UTF-32BE', the corresponding
UTF-32 BOM is output as the very first bytes in the data stream, and all
data with dfdl:encoding UTF-32 throughout the rest of the document will
be output with the implied byte order. If the property is empty then no
BOM is output, and all data with dfdl:encoding UTF-32 throughout the rest
of the document will be output with big-endian byte order. If the property
has any other value, it is a processing error. There is no need to model
the BOM explicitly.
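The unparsing rules above can be sketched in the same way. Again this is a hypothetical illustration, not a definitive implementation: the names, the ProcessingError class and the use of Python codec names are all assumptions made for the example.

```python
# Hypothetical table: root encoding -> {allowed property value: BOM bytes}.
_BOM_BYTES = {
    "UTF-8": {"UTF-8": b"\xef\xbb\xbf"},
    "UTF-16": {"UTF-16LE": b"\xff\xfe", "UTF-16BE": b"\xfe\xff"},
    "UTF-32": {"UTF-32LE": b"\xff\xfe\x00\x00", "UTF-32BE": b"\x00\x00\xfe\xff"},
}
# Byte order used when the property is empty (big-endian for UTF-16 and UTF-32).
_DEFAULTS = {"UTF-8": "UTF-8", "UTF-16": "UTF-16BE", "UTF-32": "UTF-32BE"}


class ProcessingError(Exception):
    """Stand-in for a DFDL processing error."""


def start_unparse(root_encoding, bom_property):
    """Return (bytes to output first, codec to use for the rest of the document)."""
    allowed = _BOM_BYTES.get(root_encoding)
    if allowed is None:
        # e.g. UTF-16LE: the property is not consulted and no BOM is output.
        return b"", root_encoding
    if bom_property == "":
        return b"", _DEFAULTS[root_encoding]
    if bom_property not in allowed:
        raise ProcessingError(
            f"unicodeByteOrderMark {bom_property!r} is not valid for {root_encoding}"
        )
    return allowed[bom_property], bom_property


def unparse_text(root_encoding, bom_property, text):
    """Encode text for output, preceded by a BOM when the property calls for one."""
    prefix, codec = start_unparse(root_encoding, bom_property)
    return prefix + text.encode(codec)
```

For example, unparse_text("UTF-16", "UTF-16LE", "A") produces the bytes FF FE 41 00, and a property value of 'UTF-8' with a root encoding of UTF-16 raises the error.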
When the dfdl:encoding property of
the root element is specified, and is exactly one of UTF-16LE, UTF-16BE,
UTF-32LE or UTF-32BE (or CCSID equivalents), then a
DFDL unparser will not look in the infoset document information
item for a BOM and will not output a BOM. The byte order to use
is implicit in the encoding. If a BOM does need to be output at the start
of the data stream, then it must be explicitly modelled as such.
The dfdl:byteOrder property is never
used to establish the byte order for Unicode encodings.
The unparser never outputs a BOM at
any other point in the data stream. If a BOM needs to appear, then it must
be explicitly modelled as such.
-----------------------------------------------------------------------------
Explicit modelling of a BOM
Below is a natural way to model a Unicode data stream with an optional
BOM. The parser starts with root element 'doc' and finds that the (scoped)
dfdl:encoding is 'UTF-16BE', from the variable's default, so it does not
look for a BOM automatically. It then parses element 'bom' and sets the
variable according to the value. If there is no BOM in the data then the
assert fails and the parser backtracks (note the use of minOccurs="0").
Subsequent parsing of element 'data' uses the variable value.
The model would also work if the variable's default had been 'UTF-16'.
The parser looks for the BOM automatically at the start of the stream,
finds it, and sets the byte order. It then parses element 'bom', but the
assert fails and the parser backtracks. Subsequent parsing of element
'data' uses the variable value, which will be the default of 'UTF-16',
with byte order as per the BOM.
(Some DFDL annotation syntax removed for clarity)
<xs:schema ...>
  <dfdl:format encoding="{$myEncoding}" ... />
  <dfdl:defineVariable name="myEncoding" type="xs:string" default="UTF-16BE"/>
  <xs:element name="doc">
    <xs:complexType>
      <xs:sequence>
        <!-- Optional 2-byte BOM; if the assert fails the parser backtracks -->
        <xs:element name="bom" type="xs:hexBinary" minOccurs="0"
                    dfdl:lengthKind="explicit" dfdl:length="2">
          <dfdl:assert test="{. eq x'FFFE' or . eq x'FEFF'}" />
          <!-- FF FE is the little-endian BOM, FE FF the big-endian BOM -->
          <dfdl:setVariable ref="myEncoding"
                            value="{if (. eq x'FFFE') then 'UTF-16LE' else 'UTF-16BE'}" />
        </xs:element>
        <xs:element name="data">
          <xs:complexType>
            <xs:sequence>
              ...
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>