Action 151 - BOM handling
Please find a proposal from the WG to simplify modeling Unicode documents
that start with a BOM.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
-----------------------------------------------------------------------------------------------------------------------
Infoset
New property added to the document information item:
[unicodeByteOrderMark] String. When the encoding of the root element of
the document is exactly UTF-8, UTF-16, or UTF-32 (or CCSID equivalent),
the value indicates whether the document starts with a byte order mark
(BOM). If there is a BOM then for UTF-8 encoding the value is 'UTF-8'; for
UTF-16 encoding the value is 'UTF-16LE' or 'UTF-16BE'; for UTF-32 the
value is 'UTF-32LE' or 'UTF-32BE'. If there is no BOM then the value is
empty. When the encoding of the root element of the document is any other
encoding, the value is empty.
Grammar
The grammar production for the overall document changes to:
Document = UnicodeByteOrderMark Element
Parsing
When the dfdl:encoding property of the root element is specified, and is
exactly one of UTF-8, UTF-16, or UTF-32 (or CCSID equivalents), then a
DFDL parser will look for the appropriate byte order mark (BOM) as the
very first bytes in the data stream.
UTF-8. If a BOM is found then this is used to set the document
information item 'unicodeByteOrderMark' property. If no BOM is found the
parser takes no action. There is no need to model the BOM explicitly.
UTF-16. If a BOM is found then this is used to set the document
information item 'unicodeByteOrderMark' property value, and all data with
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have
the byte order implied by the BOM. If no BOM is found then all data with
dfdl:encoding UTF-16 throughout the rest of the stream are assumed to have
big-endian byte order. There is no need to model the BOM explicitly.
UTF-32. If a BOM is found then this is used to set the document
information item 'unicodeByteOrderMark' property value, and all data with
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have
the byte order implied by the BOM. If no BOM is found then all data with
dfdl:encoding UTF-32 throughout the rest of the stream are assumed to have
big-endian byte order. There is no need to model the BOM explicitly.
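The three cases above can be sketched in Python. This is only an
illustration of the proposed rules, not code from any DFDL processor: the
function name detect_bom, the table BOMS and the byte-order labels
'bigEndian' / 'littleEndian' are assumptions made for the example.

```python
# Hypothetical sketch of the proposed parser rule; all names are illustrative.
# Candidate BOMs are keyed by the root element's dfdl:encoding.
BOMS = {
    "UTF-8": [(b"\xef\xbb\xbf", "UTF-8")],
    "UTF-16": [(b"\xfe\xff", "UTF-16BE"), (b"\xff\xfe", "UTF-16LE")],
    "UTF-32": [(b"\x00\x00\xfe\xff", "UTF-32BE"),
               (b"\xff\xfe\x00\x00", "UTF-32LE")],
}


def detect_bom(data: bytes, root_encoding: str):
    """Return (unicodeByteOrderMark, byte_order, bytes_consumed).

    byte_order is None when the BOM rule does not determine it
    (UTF-8, or any encoding the parser does not inspect).
    """
    name = root_encoding.upper()
    candidates = BOMS.get(name)
    if candidates is None:
        # UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE or a non-Unicode encoding:
        # the parser does not look for a BOM at all.
        return "", None, 0
    for bom, value in candidates:
        if data.startswith(bom):
            if value == "UTF-8":
                return value, None, len(bom)
            order = "littleEndian" if value.endswith("LE") else "bigEndian"
            return value, order, len(bom)
    # No BOM found: UTF-16 and UTF-32 default to big-endian.
    return "", (None if name == "UTF-8" else "bigEndian"), 0
```

Because the lookup is keyed by the root encoding, the UTF-16LE BOM
(FF FE) is never confused with the UTF-32LE BOM (FF FE 00 00).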
When the dfdl:encoding property of the root element is specified, and is
exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID
equivalents), then a DFDL parser will not look for a BOM. The
byte order to use is implicit in the encoding. If a BOM does appear at the
start of the data stream, then it must be explicitly modelled as such,
otherwise if parsed as part of an xs:string it will be interpreted as a
zero-width non-breaking space (ZWNBS) character.
The dfdl:byteOrder property is never used to establish the byte order for
Unicode encodings.
The parser never looks for a BOM at any other point in the data stream. If
a BOM does appear at any other point, then it must be explicitly modelled,
otherwise if parsed as part of an xs:string it will be interpreted as a
ZWNBS character.
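The effect of an unmodelled BOM can be seen with Python's standard
codecs, used here only as an analogy for the behaviour described above,
not as a statement about any DFDL implementation. Decoding with an
encoding that fixes the byte order keeps the BOM as the character U+FEFF,
whereas the generic 'utf-16' codec consumes it.

```python
# Analogy using Python's built-in codecs; the sample bytes are illustrative.
data = b"\xfe\xff\x00A"  # big-endian BOM followed by the letter 'A'

# Endianness-specific codec: the BOM is not stripped and becomes U+FEFF.
explicit = data.decode("utf-16-be")
assert explicit == "\ufeffA"
assert len(explicit) == 2

# Generic codec: the BOM selects the byte order and is consumed.
generic = data.decode("utf-16")
assert generic == "A"
```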
Unparsing
When the dfdl:encoding property of the root element is specified, and is
exactly one of UTF-8, UTF-16 or UTF-32 (or CCSID equivalents), then a DFDL
unparser will look in the infoset document information item for a byte
order mark (BOM).
UTF-8. If the document information item 'unicodeByteOrderMark' property
value is 'UTF-8', the UTF-8 BOM is output as the very first bytes in the
data stream. If the property is empty then no BOM is output. If the
property has any other value, it is a processing error. There is no need to
model the BOM explicitly.
UTF-16. If the document information item 'unicodeByteOrderMark' property
is 'UTF-16LE' or 'UTF-16BE', the corresponding UTF-16 BOM is output as the
very first bytes in the data stream, and all data with dfdl:encoding
UTF-16 throughout the rest of the document will be output with the implied
byte order. If the property is empty then no BOM is output, and all data
with dfdl:encoding UTF-16 throughout the rest of the document are output
with big-endian byte order. If the property has any other value, it is
a processing error. There is no need to model the BOM explicitly.
UTF-32. If the document information item 'unicodeByteOrderMark' property
is 'UTF-32LE' or 'UTF-32BE', the corresponding UTF-32 BOM is output as the
very first bytes in the data stream, and all data with dfdl:encoding
UTF-32 throughout the rest of the document will be output with the implied
byte order. If the property is empty then no BOM is output, and all data
with dfdl:encoding UTF-32 throughout the rest of the document are output
with big-endian byte order. If the property has any other value, it is
a processing error. There is no need to model the BOM explicitly.
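The unparsing cases above can be sketched the same way. Again this is a
minimal illustration under stated assumptions: the names bom_to_output,
ProcessingError, BOM_BYTES and ALLOWED, and the byte-order labels, are
invented for the example and do not come from the proposal.

```python
# Hypothetical sketch of the proposed unparser rule; all names are illustrative.
BOM_BYTES = {
    "UTF-8": b"\xef\xbb\xbf",
    "UTF-16BE": b"\xfe\xff",
    "UTF-16LE": b"\xff\xfe",
    "UTF-32BE": b"\x00\x00\xfe\xff",
    "UTF-32LE": b"\xff\xfe\x00\x00",
}

# Infoset values that are valid for each root dfdl:encoding.
ALLOWED = {
    "UTF-8": {"UTF-8"},
    "UTF-16": {"UTF-16BE", "UTF-16LE"},
    "UTF-32": {"UTF-32BE", "UTF-32LE"},
}


class ProcessingError(Exception):
    """Stands in for a DFDL processing error."""


def bom_to_output(root_encoding: str, unicode_byte_order_mark: str):
    """Return (bom_bytes, byte_order) for the start of the output stream."""
    name = root_encoding.upper()
    allowed = ALLOWED.get(name)
    if allowed is None:
        # UTF-16LE, UTF-16BE, etc.: the infoset property is not consulted
        # and no BOM is written.
        return b"", None
    if unicode_byte_order_mark == "":
        # Empty property: no BOM; UTF-16 and UTF-32 default to big-endian.
        return b"", (None if name == "UTF-8" else "bigEndian")
    if unicode_byte_order_mark not in allowed:
        raise ProcessingError(
            f"{unicode_byte_order_mark!r} is not valid for {root_encoding}"
        )
    if name == "UTF-8":
        return BOM_BYTES["UTF-8"], None
    order = "littleEndian" if unicode_byte_order_mark.endswith("LE") else "bigEndian"
    return BOM_BYTES[unicode_byte_order_mark], order
```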
When the dfdl:encoding property of the root element is specified, and is
exactly one of UTF-16LE, UTF-16BE, UTF-32LE or UTF-32BE (or CCSID
equivalents), then a DFDL unparser will not look in the infoset document
information item for a BOM and will not output a BOM. The byte order to
use is implicit in the encoding. If a BOM does need to be output at the
start of the data stream, then it must be explicitly modelled as such.
The dfdl:byteOrder property is never used to establish the byte order for
Unicode encodings.
The unparser never outputs a BOM at any other point in the data stream. If
a BOM needs to appear, then it must be explicitly modelled as such.
-----------------------------------------------------------------------------
Explicit modelling of a BOM.
Below is a natural way to model a Unicode data stream with an optional
BOM. The parser starts with root element 'doc' and finds (scoped)
dfdl:encoding is 'UTF-16BE' from the variable's default so it does not
look for a BOM automatically. Then it parses element 'bom' and sets the
variable according to the value. If there is no BOM in the data then the
assert fails and the parser backtracks (note use of minOccurs="0").
Subsequent parsing of element 'data' uses the variable value.
The model would also work if the variable's default had been 'UTF-16'. The
parser looks for the BOM automatically at the start of the stream, finds it, and sets the
byte order. Then it parses element 'bom' but the assert fails and the
parser backtracks. Subsequent parsing of element 'data' uses the variable
value which will be the default of 'UTF-16', and byte order as per the
BOM.
(Some DFDL annotation syntax removed for clarity)
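As a rough illustration of the first scenario only (variable default
'UTF-16BE'), the control flow might look like the Python sketch below. It
is not the schema itself: the function parse_doc, its parameters and the
assumption that element 'bom' is matched against the two UTF-16 BOM byte
sequences are all hypothetical.

```python
# Hypothetical sketch of the explicit-model flow; not the DFDL schema.
def parse_doc(data: bytes, default_encoding: str = "UTF-16BE"):
    """Return (encoding_used, decoded_text) for element 'data'."""
    encoding = default_encoding  # plays the role of the DFDL variable
    pos = 0

    # Optional element 'bom' (minOccurs="0"): try to match a BOM.
    candidate = data[:2]
    if candidate == b"\xfe\xff":
        encoding, pos = "UTF-16BE", 2
    elif candidate == b"\xff\xfe":
        encoding, pos = "UTF-16LE", 2
    # Otherwise the assert fails and the parser backtracks: pos stays 0
    # and the variable keeps its default.

    # Element 'data' is parsed using the variable's value.
    return encoding, data[pos:].decode(encoding)
```

For example, parse_doc(b"\xff\xfeh\x00i\x00") yields ('UTF-16LE', 'hi'),
while parse_doc(b"\x00h\x00i") falls back to the default and yields
('UTF-16BE', 'hi').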