IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 03/10/2018 23:00
Subject: Part 1 - Re: [DFDL-WG] Action 307 - Demonstrate implementation interoperability
This issue can be addressed with a minor change to the
DFDL specification.
When the type is xs:string and lengthUnits is
'characters', the length in characters should take account of surrogate pairs
found in the UTF-16 data and count each pair as occupying 1 character.
This utf16Width='variable' feature of DFDL
should be optional, as JVM-based implementations will find it extremely
difficult to support: standard JVM string representations cannot
represent a character with a code point greater than 0xFFFF as occupying
1 location in a string.
Daffodil does not implement this 'variable' behavior,
and we have no good pathway to do so. Hence, I would prefer to change the DFDL
spec to make 'variable' optional. Only 'fixed' would be required.
I could even support deprecating the whole property.
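To illustrate the difference between the two utf16Width settings discussed above, here is a minimal Python sketch. The function names and the sample data are hypothetical, chosen only for illustration; this is not taken from Daffodil or IBM DFDL. It assumes well-formed big-endian UTF-16 input with no byte order mark.

```python
def length_fixed(data: bytes) -> int:
    """Count UTF-16 code units: each half of a surrogate pair counts as 1.

    This approximates the counting implied by utf16Width='fixed'.
    """
    return len(data) // 2


def length_variable(data: bytes) -> int:
    """Count code points: a surrogate pair counts as 1 character.

    This approximates the counting implied by utf16Width='variable'.
    Python strings are sequences of code points, so len() gives this directly.
    """
    return len(data.decode("utf-16-be"))


# 'a', U+1D11E (MUSICAL SYMBOL G CLEF, needs a surrogate pair), 'b'
sample = "a\U0001D11Eb".encode("utf-16-be")

print(length_fixed(sample))     # 4 code units
print(length_variable(sample))  # 3 characters
```

The two counts differ only when the data contains code points above 0xFFFF, which is exactly the case that is awkward for string representations built on 16-bit code units.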
SMH: This is already captured
by action 290, which is waiting for me to do some tests with IBM DFDL, which
claims to have implemented this.
Issue: lengthUnits='characters' and variable-width charset encodings
I believe this is required behavior. I also believe the
lack of support for this is missing from IBM's list of non-compliances.
I recall discussion that IBM DFDL requires a fixed-width encoding
when lengthUnits is 'characters'. (Please correct me if
I am wrong.)
I suggest that making this combination an optional feature
of the DFDL spec would resolve the issue.
This complex feature was added to support naive data format
conversions where data that originally had ASCII encoding and lengthUnits 'bytes'
is changed to 'utf-8' with lengthUnits 'characters'. This is a rational
way to modernize a data format by adding internationalization capability.
However, it requires a significant change in runtime behavior because UTF-8
characters occupy between 1 and 4 bytes each.
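The runtime cost mentioned above can be sketched as follows: with lengthUnits 'characters' and a variable-width encoding, a parser cannot compute the byte extent of a field by multiplication and has to scan the data. This is a minimal Python sketch under stated assumptions; the function name `utf8_byte_extent` is hypothetical, and the code inspects only lead bytes without validating continuation bytes, which a real processor would also have to handle.

```python
def utf8_byte_extent(data: bytes, n_chars: int) -> int:
    """Return how many bytes the first n_chars UTF-8 characters occupy."""
    pos = 0
    for _ in range(n_chars):
        if pos >= len(data):
            raise ValueError("not enough data for the requested length")
        lead = data[pos]
        if lead < 0x80:            # 0xxxxxxx: 1-byte character (ASCII)
            width = 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2-byte character
            width = 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3-byte character
            width = 3
        elif lead >> 3 == 0b11110: # 11110xxx: 4-byte character
            width = 4
        else:
            raise ValueError("invalid UTF-8 lead byte")
        pos += width
        if pos > len(data):
            raise ValueError("truncated UTF-8 character")
    return pos


# Characters of width 1, 2, 3, 4 and 1 bytes respectively.
field = "n\u00e9\u20ac\U0001F600x".encode("utf-8")

print(utf8_byte_extent(field, 2))  # 3 bytes for the first 2 characters
print(utf8_byte_extent(field, 4))  # 10 bytes for the first 4 characters
```

With ASCII and lengthUnits 'bytes', the same field boundary would be a constant offset; here it depends on the content of every preceding character.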
SMH: IBM DFDL certainly supports
lengthUnits="characters" and encoding="UTF-8", which
is an example of this.
Optional Features that are Partially Implemented
The bigger set of concerns for interoperability
is the behavior of a DFDL processor for features that are optional by strict
interpretation of Section 21 and that a specific DFDL implementation
implements only partially. This is the subject of other email messages,
however.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email
discussions are subject to the OGF Intellectual Property Policy
On Tue, Sep 11, 2018 at 11:33 AM Steve Hanson <smh@uk.ibm.com> wrote:
Action 307 was raised recently, and the first
task is for implementations to identify which core spec behaviour is not
implemented.
IBM DFDL
The following is the list of DFDL 1.0 spec core features that IBM DFDL
does not yet implement.
- dfdl:encodingErrorPolicy "replace"
- dfdl:binaryBooleanTrueRep with value empty string
- dfdl:assert on global element and simple type
- dfdl:discriminator on global element and simple type
- Multiple xs:appinfo elements within each xs:annotation element
- When parsing, the distinction between an element being 'missing', having an
  'empty representation' and having an 'absent representation', is not in
  accordance with the specification.
- When encoding is 'UTF-8' or 'UTF-16', byte order marks are not processed
The above lists are derived from information at https://www.ibm.com/support/knowledgecenter/en/SSMKHH_10.0.0/com.ibm.etools.mft.doc/df00150_.htm
and are those that apply to core spec features.
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg