Agreed that BOM support would be dropped
from DFDL 1.0 via erratum.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 03/04/2019 10:44
Subject: Re: Byte-order-marks - was: Re: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate implementation interoperability
Hi Mike
Found the thread on BOMs, so ignore my earlier email.
I am leaning towards deprecation on the following grounds:
- Only one customer that I know of ever requested BOM processing for non-XML data (in 2010, for MRM, before IBM DFDL was available).
- BOM processing only applies to the message as a whole, not to any embedded Unicode fragments, so support is selective anyway.
- It is possible to model an optional BOM and use it to set a user-defined encoding variable, which is then used by the rest of the schema.
I have a schema that models the UTF-16 BOM, and it successfully parses and unparses all 3 variants (no BOM present, BOM for BE present, BOM for LE present).
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 16/10/2018 14:57
Subject: Byte-order-marks - was: Re: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate implementation interoperability
To avoid changing behavior, then, we will probably need a property that turns the BOM behavior (stripping on parse, generating on unparse) on or off. None of the DFDL implementations that support text (IBM DFDL and Daffodil) currently treats BOMs specially; they neither strip them nor generate them.
I suggest we need byteOrderMarkPolicy="use/ignore", with "ignore" meaning that the BOM is just treated as a character. Implementation of byteOrderMarkPolicy="use" would be an optional feature of DFDL. That way both IBM DFDL and Daffodil can be compliant without implementing this.
(I'd like everything we've collectively been able to live without thus far, that isn't needed for interoperability testing, to ultimately get onto the optional features list.)
Or we can simply deprecate the functionality in the spec, say that BOMs must be modeled, strike the material from the next draft, and post an example of how to model BOMs. It sounds heavy-handed, but nobody has asked for this feature (on the Daffodil project). It was put into the DFDL spec way back in the early days, when we expected BOMs to be popular, but they never caught on.
However, I'll acknowledge that IBM would be in a better position to decide whether this feature is needed, given that it has more users in Asia and other places where UTF-16 may be more popular.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy
On Wed, Oct 10, 2018 at 5:24 AM Steve Hanson <smh@uk.ibm.com>
wrote:
I think the main thing with BOMs is
that when support is added by an implementation, it should not break existing
behaviour for a document that starts with a BOM. So if a user had
a schema that explicitly modelled the BOM, or was treating BOM as a character
so it appeared in the infoset, then a BOM aware implementation should not
suddenly change that.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 09/10/2018 14:35
Subject: Re: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate implementation interoperability
Very helpful, Steve H., thanks.
re: UTF-8 and BOM: for UTF-8, the BOM can be viewed as "just a character", same as it is in UTF-16BE and UTF-16LE.
Only unadorned utf-16 has to actually look at the BOM and, in theory, strip it if found. Nobody is implementing this, and it's not clear it matters much.
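As an illustration of the distinction drawn above, the following Python sketch uses Python's own standard codecs, which happen to behave this way. It says nothing about how IBM DFDL or Daffodil behave. With an explicit encoding the BOM bytes decode to U+FEFF, an ordinary character; only the unadorned "utf-16" codec inspects the BOM and strips it.

```python
import codecs

BOM_CHAR = "\ufeff"  # U+FEFF, the character that the BOM bytes encode

utf8_data = codecs.BOM_UTF8 + "A".encode("utf-8")
utf16be_data = codecs.BOM_UTF16_BE + "A".encode("utf-16-be")

# Explicit encodings: the BOM is "just a character" and stays in the result.
as_utf8 = utf8_data.decode("utf-8")
as_utf16be = utf16be_data.decode("utf-16-be")

# Unadorned utf-16: the codec reads the BOM to choose the byte order, then strips it.
as_utf16 = utf16be_data.decode("utf-16")

print(repr(as_utf8), repr(as_utf16be), repr(as_utf16))
```

The first two results keep U+FEFF in front of the "A", so a schema that treats the BOM as a character or models it explicitly would still see it; the third result contains only "A".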
Today I know that Daffodil just treats UTF-16 as meaning UTF-16BE.
Hence, I suggest we consider making BOM processing optional in DFDL, and also making utf-16 (unadorned) optional, which removes one small obstacle to being "standard compliant". This leaves the question of what unadorned "utf-16" does. I think the answer is supposed to be guided by the BOM, but if that is unimplemented then the behavior is "implementation defined", i.e., non-portable.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
On Tue, Oct 9, 2018 at 6:25 AM Steve Hanson <smh@uk.ibm.com>
wrote:
Mike, responses in-line
below.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 03/10/2018 23:00
Subject: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate implementation interoperability
I'm going to reply to this in a few parts.
With respect to:
- dfdl:binaryBooleanTrueRep with value empty string
- dfdl:assert on global element and simple type
- dfdl:discriminator on global element and simple type
- Multiple xs:appinfo elements within each xs:annotation element
I think these are minor non-compliances with the DFDL spec, and for interoperability
testing we can just revise schemas under test to not use these constructs.
SMH: Agree.
With respect to:
- When parsing, the distinction between an element being 'missing', having an 'empty representation' and having an 'absent representation', is not in accordance with the specification.
I think time will tell here, that is, there's nothing we can anticipate
having to do because of this as yet. If this non-compliance does not cause
interoperability problems for realistic and published DFDL schemas then
I wouldn't worry about it. Like IBM DFDL, Daffodil does not implement default
values during parsing, and that's a likely area where this issue of missing/empty/absent
has effect on behavior. It is quite possible that despite this lack of
conformance to the DFDL spec., interoperability testing would be successful.
SMH: IBM DFDL gives a runtime SDE when parsing if a zero-length representation is found for an occurrence AND the element has a default value. That prevents a behaviour change when support for default values when parsing is implemented. Suggest Daffodil does the same if it does not do so already. With that in place, I think we are ok.
With respect to:
- When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
Daffodil also does not implement byte-order-mark processing. We can dodge
this issue entirely if we make the UTF-16 charset (specifically UTF-16
without the BE or LE suffix) encoding an optional DFDL feature. That effectively
makes byte-order-mark processing also an optional feature, and then both
IBM DFDL and Daffodil would be compliant and interoperable.
SMH: UTF8 can also have a BOM so that does not solve the problem entirely.
Needs some more thought.
With respect to:
- dfdl:encodingErrorPolicy "replace"
This one is harder. Daffodil doesn't implement encodingErrorPolicy='error', so we have no common ground here for interoperability testing.
Making the entire encodingErrorPolicy property optional, meaning that behavior in the presence of encoding errors is implementation specified, is super undesirable to me.
I suspect that implementing encodingErrorPolicy 'error' will be necessary for Daffodil. If we do that, then IBM DFDL can continue to document 'replace' as a missing required feature, or we can make 'replace' optional in the spec, or IBM could implement 'replace'.
SMH: This is top of the list of missing features for IBM DFDL. I have asked
in the past if this could be added as it's technically a regression when
compared to IIB's older text/binary parser (MRM). I will ask again.
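To make the two policy values concrete, here is a small Python analogy. It is only a sketch: `decode_with_policy` is a hypothetical helper, and it maps the DFDL values onto Python's `errors="strict"` and `errors="replace"` decoding modes, whose exact replacement granularity is not guaranteed to match what the DFDL spec requires.

```python
def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Decode data, handling malformed bytes according to an encodingErrorPolicy-like value."""
    if policy == "error":
        # Malformed input raises UnicodeDecodeError; the caller decides what that means.
        return data.decode(encoding, errors="strict")
    if policy == "replace":
        # Malformed input is substituted with U+FFFD, the Unicode replacement character.
        return data.decode(encoding, errors="replace")
    raise ValueError(f"unknown policy: {policy!r}")


bad = b"ab\xffcd"  # 0xFF can never appear in well-formed UTF-8
print(repr(decode_with_policy(bad, "utf-8", "replace")))
try:
    decode_with_policy(bad, "utf-8", "error")
except UnicodeDecodeError as exc:
    print("decode failed:", exc.reason)
```

The same malformed input yields a string under one policy and a failure under the other, which is why two implementations that each support only one of the values have no common ground for testing data with encoding errors.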
Additional Non-portable/Problematic Required Features
I did an analysis of all DFDL properties to find those that must be implemented to meet the minimum, non-optional functionality of a DFDL implementation per Section 21 of the spec.
Starting from a list of all DFDL properties, I eliminated any specific to unparsing, and then any that aren't relevant given something optional in Section 21.
Here are the remaining properties I found. Where a property's full functionality is considered optional, the restriction on its values is noted:
- length - integer values only
- lengthKind - explicit, implicit only
- lengthUnits - bytes or characters only
- representation - binary only
- byteOrder
- alignment - number or 'implicit'
- alignmentUnits - bytes only
- fillByte
- leadingSkip
- trailingSkip
- encoding - 'UTF-8', 'UTF-16', 'UTF-16BE', 'UTF-16LE', 'ASCII', and 'ISO-8859-1'
- encodingErrorPolicy - (Already discussed above, so not further discussed in this section)
- utf16Width - because UTF-16 is allowed for encoding, 'variable' is problematic.
- textPadKind
- textTrimKind
- textStringJustification
- textStringPadCharacter
- binaryNumberRep - binary only
- binaryFloatRep - ieee only
- binaryBooleanTrueRep
- binaryBooleanFalseRep - IBM DFDL doesn't allow empty string for this. (Minor.)
- binaryCalendarRep - binarySeconds, binaryMilliseconds only
- binaryCalendarEpoch
- occursCountKind - fixed only
- occursCount - integer only
Looking at this list, it raises only 1 additional portability/interoperability issue today, given what I know about the Daffodil and IBM implementations.
Issue: utf16Width='variable'
This issue can be addressed with a minor change to the DFDL specification.
When the type is xs:string and lengthUnits is 'characters', the length in characters should treat each surrogate pair found in the UTF-16 data as occupying 1 character.
This utf16Width='variable' feature of DFDL should be optional, as Java JVM-based implementations will find it extremely difficult to support: JVM standard string representations cannot represent an individual character with a code point greater than 0xFFFF as occupying 1 location in a string.
Daffodil does not implement this 'variable' behavior, and we have no good pathway to do so. Hence, I prefer to change the DFDL spec to make 'variable' optional. Only 'fixed' would be required. I could even support deprecating the whole property.
SMH: This is already captured by action 290, which is waiting for me to
do some tests with IBM DFDL which claims to have implemented this.
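The difference between the two utf16Width settings comes down to what is counted as a character. A short Python sketch of the two counts follows; the function names are made up for illustration, and the sketch relies on Python strings being indexed by code point, which is exactly the property JVM strings lack.

```python
def length_fixed(data: bytes) -> int:
    """Length as utf16Width='fixed' would count it: every 2-byte code unit is 1 character."""
    return len(data) // 2


def length_variable(data: bytes) -> int:
    """Length as utf16Width='variable' would count it: a surrogate pair is 1 character."""
    # Decoding combines each surrogate pair into a single code point.
    return len(data.decode("utf-16-be"))


# U+1F600 lies above 0xFFFF, so UTF-16 encodes it as a surrogate pair (2 code units).
data = "a\U0001F600b".encode("utf-16-be")
print(length_fixed(data), length_variable(data))
```

For this input the two counts differ by one, because the surrogate pair contributes 2 to the fixed count and 1 to the variable count. For data containing only characters at or below 0xFFFF the two counts agree.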
Issue: lengthUnits='characters' and variable-width charset encodings
I believe this is required behavior. I also believe the lack of support
for this is missing from IBM's list of non-compliances. I recall discussion
that IBM DFDL requires a fixed width encoding in this situation where lengthUnits
is 'characters'. (Please correct me if I am wrong.)
I suggest that making this combination an optional feature of the DFDL spec would resolve the issue.
This complex feature was added to support naive data format conversions
where data originally had ascii encoding and lengthUnits 'bytes' is changed
to 'utf-8' with lengthUnits 'characters'. This is a rational way
to modernize a data format adding internationalization capability. It however
requires a significant change in runtime behavior because utf-8 characters
occupy between 1 and 4 bytes per character.
SMH: IBM DFDL certainly supports lengthUnits="characters" and
encoding="UTF-8", which is an example of this.
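The runtime cost described above can be sketched in Python. This is an illustrative helper, not how either implementation works: `take_characters` is a hypothetical name, and it assumes the field starts at the beginning of `data`. With a fixed-width encoding the byte length is simply a multiple of the character count; with UTF-8 the processor has to inspect each lead byte to find where the Nth character ends.

```python
from typing import Tuple


def take_characters(data: bytes, count: int) -> Tuple[str, int]:
    """Decode exactly `count` UTF-8 characters; return (text, bytes_consumed)."""
    pos = 0
    for _ in range(count):
        if pos >= len(data):
            raise ValueError("not enough data for the requested length")
        lead = data[pos]
        # The lead byte determines how many bytes the character occupies.
        if lead < 0x80:
            width = 1
        elif lead >> 5 == 0b110:
            width = 2
        elif lead >> 4 == 0b1110:
            width = 3
        elif lead >> 3 == 0b11110:
            width = 4
        else:
            raise ValueError(f"invalid UTF-8 lead byte at offset {pos}")
        pos += width
    if pos > len(data):
        raise ValueError("data ends in the middle of a character")
    # Strict decoding also validates the continuation bytes.
    return data[:pos].decode("utf-8"), pos


sample = "h\u00e9\u20ac\U0001F600!".encode("utf-8")  # widths 1, 2, 3, 4, 1
print(take_characters(sample, 4))
```

Here a length of 4 characters consumes 10 bytes, whereas the same length over pure ASCII data would consume 4, so the byte extent of the field cannot be known without scanning the data.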
Optional Features that are Partially Implemented
The bigger set of concerns for interoperability is the behavior of a DFDL processor for features that are optional by strict interpretation of Section 21 but that a specific DFDL implementation has only partially implemented. This is the subject of other email messages, however.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology
| www.tresys.com
On Tue, Sep 11, 2018 at 11:33 AM Steve Hanson <smh@uk.ibm.com>
wrote:
Action 307 was raised recently and first task is for implementations to
identify which core spec behaviour is not implemented.
IBM DFDL
The following is the list of DFDL 1.0 spec core features that IBM DFDL
does not yet implement.
- dfdl:encodingErrorPolicy "replace"
- dfdl:binaryBooleanTrueRep with value empty string
- dfdl:assert on global element and simple type
- dfdl:discriminator on global element and simple type
- Multiple xs:appinfo elements within each xs:annotation element
- When parsing, the distinction between an element being 'missing', having an 'empty representation' and having an 'absent representation', is not in accordance with the specification.
- When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
The above lists are derived from information at https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/df00150_.htm
and are those that apply to core spec features.
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg