a) When parsing, we should also check for a BOM when the
encoding is UTF16/32-LE/BE, and give a processing error if one
is found (this is stated by the Unicode standard).
b) "The
UnicodeSignature field is optional. It is only of non-zero size when the
encoding is exactly one of UTF-8, UTF-16, or UTF-32, and the following
Element begins with textual data (meaning it is of type string, or any
other type with representation=”text”, or when complex, its first child
is always textual data inductively)."
We have to be very careful with
these words. It does not make sense to say "when the encoding is xxx"
outside of the context of a DFDL schema object, and also encoding applies
to delimiters and not just representation. I think the sentence needs to
be simplified to:
"The
UnicodeSignature field is optional. It is only of non-zero size when the
dfdl:encoding property of the root element is specified, and is exactly
one of UTF-8, UTF-16, or UTF-32"
This also allows the modeller to be
flexible, for example, he can model the first section of a Unicode document
as a BLOB if he wants to view the content as-is, while correctly parsing
subsequent sections.
c) "exactly
one of UTF-8, UTF-16, or UTF-32".
We must also allow the CCSID equivalents, so that's 1200, 1208, and ???
d) "If
dfdl:byteOrder is not defined, then bigEndian is assumed throughout the
document." A BOM trumps
byteOrder and makes the encoding effectively UTF16/32-LE/BE throughout the
document - fine. If there is no BOM then we use byteOrder - but note that this is
on a per object basis. I therefore don't think that absence
of byteOrder means bigEndian throughout the document; it must be handled on
a per object basis.
Tim's suggestion
I am actually uncomfortable with absence
of byteOrder defaulting to big-endian. I know it's what the Unicode standard
says, but properties magically defaulting is something DFDL tries to avoid.
I discussed this point with Tim, and he suggested that for UTF16/32
we never look at byteOrder property. This implies the following:
- If there is no BOM then UTF16/32-BE
is used as per standard.
- To model embedded UTF16/32 littleEndian
strings you must explicitly set UTF16/32-LE on those objects
- To model an entire BOM-less UTF16/32
littleEndian document you must explicitly set UTF16/32-LE in scope
The net is that byteOrder only affects
simple elements with binary representation.
This is ok when parsing but what about
unparsing? We need some way of knowing what byte order to generate.
This can be solved by adding a BOM to
the Document infoset item, which also fixes Tim's issue with BOM preservation
below.
In terms of behaviour, the output infoset
BOM is symmetric to the input data BOM.
It also means we don't need a new property
to control BOM on output, that can be inferred if the infoset BOM is set.
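As a rough illustration of the suggestion above (not an agreed design), the Python sketch below records the BOM on a Document item at parse time and replays it at unparse time. The Document class, its bom field, and both function names are hypothetical. Only UTF-16 is covered, and byteOrder is deliberately never consulted.

```python
from dataclasses import dataclass
from typing import Optional

# Python codec name -> BOM byte sequence for that byte order
BOMS = {"utf-16-be": b"\xfe\xff", "utf-16-le": b"\xff\xfe"}


@dataclass
class Document:
    """Hypothetical Document infoset item that remembers the input BOM."""
    text: str
    bom: Optional[str] = None  # "utf-16-be", "utf-16-le", or None


def parse_utf16(data: bytes) -> Document:
    # A leading BOM decides the byte order and is kept out of the text.
    for codec, signature in BOMS.items():
        if data.startswith(signature):
            return Document(data[len(signature):].decode(codec), bom=codec)
    # No BOM: big-endian is used; the byteOrder property is not looked at.
    return Document(data.decode("utf-16-be"), bom=None)


def unparse_utf16(doc: Document) -> bytes:
    # Output is symmetric to input: same BOM (or none), same byte order.
    if doc.bom is None:
        return doc.text.encode("utf-16-be")
    return BOMS[doc.bom] + doc.text.encode(doc.bom)
```

With this shape a little-endian document that arrived with a BOM is written back byte for byte, and a BOM-less document is written back without one.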
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: | Tim Kimber/UK/IBM@IBMGB |
To: | "Mike Beckerle" <mbeckerle.dfdl@gmail.com> |
Cc: | dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org |
Date: | 25/08/2011 09:27 |
Subject: | Re: [DFDL-WG] Action 151 - BOM disposal |
Sent by: | dfdl-wg-bounces@ogf.org |
Bytes | Encoding Form |
00 00 FE FF | UTF-32, big-endian |
FF FE 00 00 | UTF-32, little-endian |
FE FF | UTF-16, big-endian |
FF FE | UTF-16, little-endian |
EF BB BF | UTF-8 |
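A minimal sketch of how the table above could be applied to the first bytes of a document, written in Python. The function name sniff_bom and the returned labels are illustrative assumptions; the byte sequences come from the standard codecs module and match the table.

```python
import codecs

# Signatures from the table. The 4-byte UTF-32 entries come first so that
# FF FE 00 00 is not reported as the 2-byte UTF-16 little-endian BOM.
BOM_TABLE = [
    (codecs.BOM_UTF32_BE, "UTF-32BE"),  # 00 00 FE FF
    (codecs.BOM_UTF32_LE, "UTF-32LE"),  # FF FE 00 00
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
]


def sniff_bom(data: bytes):
    """Return (encoding_form, bom_length), or (None, 0) when no BOM is found."""
    for signature, form in BOM_TABLE:
        if data.startswith(signature):
            return form, len(signature)
    return None, 0
```

Note that FF FE 00 00 is inherently ambiguous when the encoding is not known in advance: it could also be a UTF-16 little-endian BOM followed by U+0000. A parser that already knows the encoding family from the schema would only test the signatures for that family.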
From: | "Mike Beckerle" <mbeckerle.dfdl@gmail.com> |
To: | Tim Kimber/UK/IBM@IBMGB, Steve Hanson/UK/IBM@IBMGB |
Cc: | "'Stephanie Fetzer'" <sfetzer@us.ibm.com> |
Date: | 23/08/2011 03:29 |
Subject: | RE: Fw: BOM disposal |
From: | Mike Beckerle <mbeckerle.dfdl@gmail.com> |
To: | Steve Hanson/UK/IBM@IBMGB |
Date: | 17/08/2011 23:00 |
Subject: | RE: BOM disposal |
I think it is OK to add BOM control but I think the reference to utf8 and
BOMs is wrong. We should never encode a BOM into utf8 and if a zwnbs is
encoded in utf8 even as the first codepoint it should not ever be considered
to be a BOM and should always go into the infoset.
----- Forwarded by Steve Hanson/UK/IBM on 18/08/2011 08:58 -----
From: | Steve Hanson/UK/IBM |
To: | "Mike Beckerle" <mbeckerle.dfdl@gmail.com> |
Cc: | "'Stephanie Fetzer'" <sfetzer@us.ibm.com>, Tim Kimber/UK/IBM@IBMGB |
Date: | 17/08/2011 17:57 |
Subject: | RE: BOM disposal |
Hi Mike
I've read below and also the historical e-mail that you forwarded.
I am happy that when U+FEFF is encountered at any place other than the
start of a DFDL described document, then it is interpreted as ZWNBS.
But I am concerned that we are making life harder than it need be for modellers
who have to handle Unicode documents that start with a BOM.
Take the simple example of wanting to read in a file in one encoding, look
at the DFDL infoset in order to make some routing decision, and then send
it on in a different encoding. As the spec stands, for all encodings except
those with a BOM the modeller can create a single DFDL model that uses
external variable $encoding to control the output. But once you make
one of the document's encodings Unicode with the possibility of a BOM then
the model has to change to accommodate this in a non-trivial way. That's
not very usable, and further I don't think it is in the spirit of another
paragraph in RFC 2781...
All applications that process text with the "UTF-16" charset label
MUST be able to read at least the first two octets of the text and be
able to process those octets in order to determine the serialization
order of the text. Applications that process text with the "UTF-16"
charset label MUST NOT assume the serialization without first
checking the first two octets to see if they are a big-endian BOM, a
little-endian BOM, or not a BOM. All applications that process text
with the "UTF-16" charset label MUST be able to interpret both
big-endian and little-endian text.
Proposal:
On parsing: If encoding is set when starting
to process the model, and is UTF-8, UTF-16, UTF-32 (including BE/LE variants)
then the DFDL parser looks for a BOM.
If a BOM is found at the very start of the document then it is not
added to the infoset, and:
- UTF-16, UTF-32: The DFDL byteOrder property is ignored for text data
of those encodings throughout the rest of the document and the BOM implies
the byte order
- UTF-8: The BOM is ignored as byte order is not used anyway.
- LE/BE variants: Processing error, as this contravenes the Unicode standard.
If there is no BOM then byteOrder property behaves as currently stated
for UTF-16 and UTF-32.
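The parsing rules above could be sketched in Python roughly as follows. This is only an illustration of the proposal, not spec text: the names start_of_document and ProcessingError are hypothetical, and for the LE/BE variants it assumes that "a BOM" means the signature of the matching byte order.

```python
import codecs


class ProcessingError(Exception):
    """Stand-in for a DFDL processing error."""


# Generic encodings: BOM signature -> byte order it implies (None for UTF-8).
GENERIC = {
    "UTF-8": {codecs.BOM_UTF8: None},
    "UTF-16": {codecs.BOM_UTF16_BE: "bigEndian",
               codecs.BOM_UTF16_LE: "littleEndian"},
    "UTF-32": {codecs.BOM_UTF32_BE: "bigEndian",
               codecs.BOM_UTF32_LE: "littleEndian"},
}
# LE/BE variants: the signature that must not appear, and the fixed byte order.
EXPLICIT = {
    "UTF-16BE": (codecs.BOM_UTF16_BE, "bigEndian"),
    "UTF-16LE": (codecs.BOM_UTF16_LE, "littleEndian"),
    "UTF-32BE": (codecs.BOM_UTF32_BE, "bigEndian"),
    "UTF-32LE": (codecs.BOM_UTF32_LE, "littleEndian"),
}


def start_of_document(data: bytes, encoding: str, byte_order_property: str):
    """Return (bytes_to_skip, byte order for text of this encoding)."""
    if encoding in EXPLICIT:
        signature, order = EXPLICIT[encoding]
        if data.startswith(signature):
            raise ProcessingError("BOM found with " + encoding)
        return 0, order
    for signature, order in GENERIC.get(encoding, {}).items():
        if data.startswith(signature):
            # The BOM is skipped (not added to the infoset) and, for
            # UTF-16/UTF-32, overrides the byteOrder property.
            return len(signature), order
    if encoding in ("UTF-16", "UTF-32"):
        return 0, byte_order_property  # no BOM: byteOrder applies as today
    return 0, None  # UTF-8 or a non-Unicode encoding: byte order is unused
```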
On unparsing: If encoding is set when starting to process the model, and
is UTF-8, UTF-16 or UTF-32 (excluding BE/LE variants), then the DFDL unparser
optionally outputs a BOM, under the control of a new document-level property **,
documentOutputBOM = yes/no. The BOM that is output depends on the setting
of byteOrder.
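To make the unparsing rule concrete, here is a small Python sketch of how the BOM bytes might be chosen. documentOutputBOM is the property proposed above; the function name output_bom and the table are hypothetical, and the behaviour for unrecognised encodings (no BOM) is an assumption.

```python
import codecs

# (generic encoding, byteOrder) -> BOM bytes to write
BOM_FOR = {
    ("UTF-16", "bigEndian"): codecs.BOM_UTF16_BE,
    ("UTF-16", "littleEndian"): codecs.BOM_UTF16_LE,
    ("UTF-32", "bigEndian"): codecs.BOM_UTF32_BE,
    ("UTF-32", "littleEndian"): codecs.BOM_UTF32_LE,
}


def output_bom(encoding: str, byte_order: str, document_output_bom: str) -> bytes:
    """Return the bytes to write before the first text of the document."""
    if document_output_bom != "yes":
        return b""
    if encoding == "UTF-8":
        return codecs.BOM_UTF8  # byteOrder is irrelevant for UTF-8
    # LE/BE variants and non-Unicode encodings are not in the table,
    # so they never get a BOM.
    return BOM_FOR.get((encoding, byte_order), b"")
```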
There is one issue with this. I deliberately used the phrase 'if encoding
is set when starting to process the model'. We have to define what this
means. DFDL encoding applies to all text elements and all objects that
have text delimiters. One option is to say that BOM processing only takes
place if encoding is actually to be used by the first element in the model.
So if I started my data with binary data that did not have an initiator
then no BOM processing would take place. Another option is to say that
BOM processing only takes place if there is a default dfdl:format in the
xsd with encoding set (then you can imagine the BOM as an implicit hidden
optional element that gets encoding from scope).
** (We already have document-level properties - documentFinalTerminatorCanBeMissing).
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: | "Mike Beckerle" <mbeckerle.dfdl@gmail.com> |
To: | Tim Kimber/UK/IBM@IBMGB |
Cc: | Steve Hanson/UK/IBM@IBMGB, "'Stephanie Fetzer'" <sfetzer@us.ibm.com> |
Date: | 15/08/2011 21:54 |
Subject: | RE: BOM disposal |
I stand corrected on the BOM character. This ZWNBS stuff means it *is*
a character regardless of the Unicode folks having deprecated it (see
http://en.wikipedia.org/wiki/Zero-width_non-breaking_space), or their goal
of BOMs somehow being non-characters.
Though my guess is that it mostly would come up because UTF-16 with BOM
was converted to UTF-8, with the BOM at the front converted to the UTF-8
encoding of a BOM. Concatenate some of these, and you’ll have ZWNBS characters
embedded in the string.
I think there is a flock more cases beyond the ones Tim enumerated having
to do with whether you remove the BOM or it takes up space in the string.
E.g., if I have fixed length data with properties that say there is an
optional BOM, is that data now variable length? I’d rather not go there.
If I ask the length in characters of a string, do I count BOMs or not?
Either way, the point is that there is good reason to just treat these
BOM/ZWNBS as characters, and to just fix the language in the spec about
UTF-8 BOMs, which is just fixing a turn of phrase.
Stripping these characters out, that’s a calculation an application can
easily do. (I could be talked into an XPath function in DFDL to do exactly
this.)
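For what it's worth, here is a small Python illustration of both points: how concatenating BOM-prefixed UTF-8 ends up with embedded U+FEFF characters, and how little work it is for an application to strip them afterwards. The helper name strip_zwnbsp is made up for the example; "utf-8-sig" is Python's standard codec that writes a leading EF BB BF.

```python
ZWNBSP = "\ufeff"


def strip_zwnbsp(s: str) -> str:
    """Remove every U+FEFF from a string, wherever it occurs."""
    return s.replace(ZWNBSP, "")


# Two UTF-8 byte strings that each start with an encoded BOM, concatenated.
part = "ab".encode("utf-8-sig")   # EF BB BF 61 62
joined = part + part

# Plain UTF-8 decoding treats every U+FEFF as an ordinary character.
text = joined.decode("utf-8")

# Decoding with "utf-8-sig" drops only the leading one.
text_sig = joined.decode("utf-8-sig")

cleaned = strip_zwnbsp(text)
```

Here len(text) counts the two U+FEFF characters and len(cleaned) does not, which is exactly the length question raised above.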
The 2nd paragraph about BOMs in the spec mentions they can be
modeled. I believe the BOM-based behaviors described in Tim’s mail can
all be modeled relatively easily as separate elements. They can then compute
the value of the byteOrder property with an expression that references
the elements. (I am assuming we allow byteOrder to be computed…. ). To
be concrete about it:
E.g.,
<sequence>
  <!-- two leading bytes capture the BOM; on output FE FF is always written -->
  <element name="bom1" type="byte" dfdl:representation="binary"
           dfdl:outputValueCalc="{0xFE}"/>
  <element name="bom2" type="byte" dfdl:representation="binary"
           dfdl:outputValueCalc="{0xFF}"/>
  <element name="data" type="string" dfdl:encoding="utf-16"
           dfdl:byteOrder="{ if (../bom1 = 0xFE and ../bom2 = 0xFF) then 'bigEndian'
                             else if (../bom1 = 0xFF and ../bom2 = 0xFE) then 'littleEndian'
                             else error('no BOM found')
                           }"
  />
</sequence>
One could even create a situation where BOMs are accepted and tolerated:
<choice>
  <!-- the above sequence is one arm of the choice -->
  <element name="data" type="string" dfdl:encoding="utf-16be"/>
</choice>
This would cause a BOM to be accepted and used if present, and default
to bigEndian otherwise. Output would always be bigEndian.
With some clever use of variables and type definitions, I suspect this
can even be made reasonably compact.
These things are clumsy, but the alternative is more properties, and of
all the cases Tim enumerated, we’re not even sure we have them all, or
if anyone will use them.
Some much earlier DFDL draft had a unicodeByteOrderMarkPolicy property.
I believe it was dropped for lack of clarity on exactly what the use cases
needed to be. It was like ‘prohibited’ ‘tolerated’ ‘required’ ‘ignored’
‘generated’ or some enumeration like that.
…mikeb
From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Wednesday, July 27, 2011 4:32 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson; Stephanie Fetzer
Subject: BOM disposal
Key points about BOMs are:
- For all Unicode encodings, the "Zero Width Non-breaking Space"
character corresponds to the byte sequence of a BOM, but...
- a BOM is not considered to be a part of the data
My own assumptions about BOMs are:
- some input documents will have a BOM by accident, just because the application
that wrote it did not explicitly tell the encoder to omit the BOM.
- some users will expect a BOM at the start of an input document to be
honoured
- most users will be surprised if they get a ZWNBSP in the info set. Some
may even get a little annoyed if they find that they cannot prevent it,
because the Unicode specification is pretty clear that BOMs are not data.
I think we need to modify the DFDL rules about handling of BOMs. I don't
have all the answers, but I do think the following scenarios are likely
to crop up:
Parsing:
a) there is a BOM at the start of the input document.{1} The user
wants the DFDL parser to act as though the dfdl:encoding external variable
had been set to the encoding implied by the BOM.
b) there is sometimes a BOM at the start of the input document. The character
encoding is defined by the schema so the BOM is redundant. The user doesn't
care whether it is there or not, and would like DFDL to completely ignore
it.
c) at some point within the document ( not at the start ) there is a BOM
at the beginning of an element. The user wants the BOM to be ignored.
d) at some point within the document ( not at the start ) there is a BOM
at the beginning of an element. The user wants the encoding of the element
to be defined by the BOM
e) the user wants a BOM to be treated exactly like an ordinary character
( probably with the aim of ensuring that the document round-trips without
losing BOMs ).
Serializing
f) the user always wants the output document to start with a BOM when the
encoding is one of the Unicode encodings
g) the user wants an element within the document to start with a BOM that
signals its encoding
Feel free to come up with other scenarios if you think I've missed any.
{1} I think I've done quite well to avoid any Monty Python 'Life of
Brian' references so far...
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
----- Forwarded by Tim Kimber/UK/IBM on 27/07/2011 20:59 -----
From: Steve Hanson/UK/IBM
To: mbeckerle.dfdl@gmail.com
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 27/07/2011 19:15
Subject: OGF DFDL WG Call Agenda 2011-08-09
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg