Hi Mike,
I've been looking into IBM DFDL's treatment of the property
dfdl:utf16Width. While we claim to support 'variable' and have a few tests
that use it, there are fewer tests than I would expect for full coverage
of the property. The intent to support 'variable' is clear in the code,
though; for example, when parsing we check whether each char is part of a
surrogate pair and adjust the length accordingly. The code uses
java.nio.charset for its encoders and decoders, which we wrap in our own
class that records whether utf16 is fixed or variable, but this
information is not passed to the encoder/decoder as there is no way to
do so. Hmm. We will add some more tests and see whether everything behaves
correctly.
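As a minimal sketch of the surrogate-pair check described above (this is
not IBM DFDL's actual code; the class and method names are hypothetical),
the length of decoded text under each dfdl:utf16Width setting could be
counted like this, assuming the text is already held as Java chars:

```java
/** Hypothetical helper, for illustration only. */
class Utf16Length {

    /**
     * Counts the length of decoded text in characters.
     * fixedWidth == true  : every 16-bit code unit is one character,
     *                       so a surrogate pair counts as two.
     * fixedWidth == false : a well-formed surrogate pair is one character;
     *                       a lone surrogate still counts as one.
     */
    static int lengthInCharacters(CharSequence text, boolean fixedWidth) {
        if (fixedWidth) {
            return text.length();
        }
        int count = 0;
        int i = 0;
        while (i < text.length()) {
            boolean pair = Character.isHighSurrogate(text.charAt(i))
                    && i + 1 < text.length()
                    && Character.isLowSurrogate(text.charAt(i + 1));
            i += pair ? 2 : 1; // skip both halves of a pair together
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String s = "a\uD83D\uDE00b"; // 'a', U+1F600 as a surrogate pair, 'b'
        System.out.println(lengthInCharacters(s, true));  // prints 4
        System.out.println(lengthInCharacters(s, false)); // prints 3
    }
}
```

The java.nio.charset decoder produces the same chars in both cases, which
is why the fixed/variable distinction has to live in the wrapper.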
Back to your original question: should 'variable' be an optional feature
of the spec? I have discussed this with implementation team members and we
think that is a sensible thing to do. Handling surrogates does require
extra code to be written, and a minimal implementation should not need
to do that.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 04/04/2017 11:56
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL - dfdl:utf16Width='variable' and other corner cases
Some light on action 291 - see the last
sentence of this extract from
the original errata document (experience doc 1):
3.9. Section 12.3.5, 7.3.1, 7.3.2. The spec originally allows lengthKind
‘pattern’ to be used when the representation of the current element, or
of a child element, is binary, but imposes restrictions on the encoding
that can be in force. Clarify that the encoding property must be defined
for the element (else schema definition error), and that a decoding
processing error is possible if the match of the regex encounters data
that does not decode in that encoding, dependent on the setting of
encodingErrorPolicy. Remove section 12.3.5.1.
Same clarifications needed for testKind ‘pattern’ property for asserts
and discriminators.
For consistency, the restriction that a complex element of specified
length and lengthUnits ‘characters’ must have children that are all text
and that have the same encoding as the complex element, is dropped.
So that explains how IBM DFDL's error message CTDV1524E came about: it was
policing a restriction in the original GFD.174 spec, a restriction which
no longer exists. It wasn't an extra IBM restriction; IBM DFDL has simply
not yet implemented the erratum.
Regards
Steve Hanson
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 14/09/2016 08:44
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL - dfdl:utf16Width='variable' and other corner cases
Actions 290 and 291 raised to investigate
further - see minutes.
Regards
Steve Hanson
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 13/09/2016 13:14
Subject: Re: [DFDL-WG] Suggest should be optional feature of DFDL - dfdl:utf16Width='variable' and other corner cases
Mike
I am assuming that the processing for utf-16
'fixed' or 'variable' is entirely handled by ICU so there should be no
coding overhead.
IBM DFDL works ok for dfdl:lengthKind='explicit' on an element of complex
type with dfdl:lengthUnits='characters' and dfdl:encoding="utf-8".
But there are conditions that the content of the complex type must
satisfy, otherwise an SDE results, such as:
CTDV1524E : For a complex element, when
'lengthKind' is 'explicit' or 'prefixed', and 'lengthUnits' is characters,
all simple child elements must have text representation, 'lengthUnits'
set to 'characters' and the same encoding.
So we insist that the properties of the children
are consistent with the properties of the parent. If you recall,
IBM DFDL does all these kinds of validation checks in a pre-processing
phase.
That seems a pretty sensible rule but I am
not sure if the rule appears in the spec as such - I just had a quick look
but didn't spot anything.
So I guess I don't see a need for these things
to be optional features?
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 10/08/2016 18:57
Subject: [DFDL-WG] Suggest should be optional feature of DFDL - dfdl:utf16Width='variable' and other corner cases
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
Given the limited set of required encodings for a conforming
DFDL processor, I believe dfdl:utf16Width='variable' should be an optional
feature.
That's just consistency with what is optional already.
But it is also quite hard to implement.
There are other situations that are very hard to implement and probably
never used by real users, yet which are non-optional in the spec:
I would suggest that dfdl:lengthKind='explicit' for elements
of complex type, with dfdl:lengthUnits='characters' and a variable-width
encoding like utf-8 is very problematic to implement. I am pretty sure
IBM DFDL has no implementation of this per email threads, and I know I
don't want to implement this in Daffodil even though we're trying to be
very comprehensive in the implementation eventually.
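To illustrate why this case is awkward, here is a minimal sketch (the
class and method names are hypothetical, and it is not taken from
Daffodil or IBM DFDL). It assumes well-formed UTF-8 and assumes each
encoded code point counts as one character. The point is that the byte
extent of a complex element that is N characters long cannot be computed
up front; the bytes have to be scanned, and that scan only makes sense
if all the content really is text in that one encoding.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical helper, for illustration only. */
class Utf8CharLength {

    /**
     * Returns the number of bytes occupied by the first `chars` characters
     * of UTF-8 data beginning at `start`, or -1 if the data ends first.
     * Assumes well-formed UTF-8; each code point counts as one character.
     */
    static int byteLengthOfChars(byte[] data, int start, int chars) {
        int pos = start;
        int seen = 0;
        while (seen < chars) {
            if (pos >= data.length) {
                return -1; // ran out of data before reaching the length
            }
            pos++; // consume the lead byte
            // consume continuation bytes, which have the bit pattern 10xxxxxx
            while (pos < data.length && (data[pos] & 0xC0) == 0x80) {
                pos++;
            }
            seen++;
        }
        return pos - start;
    }

    public static void main(String[] args) {
        // 'a' is 1 byte, U+00E9 is 2 bytes, U+20AC is 3 bytes, 'z' is 1 byte
        byte[] data = "a\u00E9\u20ACz".getBytes(StandardCharsets.UTF_8);
        System.out.println(byteLengthOfChars(data, 0, 3)); // prints 6
    }
}
```

With a fixed-width encoding the same answer is just N times the character
width, which is why the variable-width case is the one that costs
implementation effort.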
I think implementations should be free to just not implement
this. These sorts of cases often exist just because we're trying
to preserve some orthogonality of composition in the language. So it's
possible to do quite a few things that probably aren't ever needed by anyone,
that reflect ill-defined data formats, etc.
I'd rather not document a bunch of "non-conformances"
for Daffodil or other implementations for these sorts of things. I'd like
to say we don't implement them, but they're optional, and so that's allowed.
Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology
| www.tresys.com
Please note: Contributions to the DFDL Workgroup's email
discussions are subject to the OGF
Intellectual Property Policy
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU