From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 11/04/2014 14:04
Subject: Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording
Sent by: dfdl-wg-bounces@ogf.org
Revised Action 242 proposed changes word doc attached.
I have incorporated the discussion in this thread (I hope.) Please review.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
This language is consistent with what we say for lengthKind
pattern in section 12.3.5:
"When unparsing, the dfdl:valueLength of a complex
type element when the length units is 'characters' is computed as if the
entire structure was unparsed into a temporary data stream beginning at
position 1, and then this data stream is considered to be text in the character
set encoding specified by the dfdl:encoding property, regardless of the
actual representation of the complex type element or the elements contained
within it. The number of characters in this temporary data stream is the
value length of the complex type."
The behavior of the IBM DFDL implementation for valueLength
is consistent with the above, except that it will not
detect a decode error, and it gives an SDE (?) if the encoding is not fixed
width.
Since we have decided not to require that a complex type
element is recursively all text all the way down, I believe we have to
tolerate implementations having different behaviors in the potentially
meaningless cases where there is binary data or encoding changes in the
complex type. So I would add to the above suggested language this:
"However, if creation of this data stream would cause
an encoding error, or parsing of this data stream as characters would cause
a decoding error, then the behavior and return value of dfdl:valueLength
are implementation dependent."
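A minimal Python sketch of the proposed wording, in case it helps review. It
assumes the complex element has already been unparsed into a temporary byte
buffer; the function name and the choice to return None for the
implementation-dependent case are hypothetical, not anything from the spec.

```python
from typing import Optional


def value_length_in_chars(unparsed: bytes, encoding: str) -> Optional[int]:
    """Count characters in a temporary unparsed stream.

    `unparsed` stands for the bytes produced by unparsing the whole complex
    element into a scratch buffer (assumed to exist already).
    Returns None when the bytes do not decode, which is the case the proposed
    wording leaves implementation dependent (hypothetical choice).
    """
    try:
        text = unparsed.decode(encoding, errors="strict")
    except UnicodeDecodeError:
        return None
    return len(text)


# Six bytes of UTF-8, but only five characters.
data = "h\u00e9llo".encode("utf-8")
print(len(data), value_length_in_chars(data, "utf-8"))  # 6 5
# Binary content that is not valid UTF-8.
print(value_length_in_chars(b"\xff\xfe", "utf-8"))      # None
```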
Looking at the DFDL spec, I am concerned that we never
really say what we mean by the "length of the ComplexContent region."
(Last sentence before Table 7 in section 12.3.7) Section 12.3.7.3 doesn't
do it. The dfdl:valueLength function may be the first place where we have
to actually say how the various sub-regions contribute to the ComplexContent
region's length.
I believe this is the obvious "sum of length of all contained regions",
but keep in mind that alignment region lengths will vary depending on the
starting alignment, so the length is, in general, dependent on the position
within the bit stream.
Hence when unparsing we have to specify that the dfdl:valueLength is measured
as if the ComplexContent region started at position 1 (as I did above)
so that internal alignment regions can be given meaningful lengths.
The general clarification should be added to 12.3.7.3, or to section 12.3.7
immediately before section 12.3.7.1. Something like this:
"The length of the ComplexContent region is the sum
of the lengths of the contained regions. However, note that alignment regions
inside the ComplexContent may be of different lengths depending on the
ComplexContent's starting position alignment."
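To illustrate why the starting position matters, here is a small Python
sketch. The Child record and its two fields are a hypothetical simplification
(each child has only an alignment and a content length, both in bits), and
offsets are 0-based, so the spec's "position 1" corresponds to start=0.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Child:
    alignment: int  # required alignment in bits (must be >= 1)
    length: int     # length of the child's own content in bits


def complex_content_length(children: List[Child], start: int = 0) -> int:
    """Sum child lengths plus alignment fill for content starting at `start`."""
    pos = start
    for child in children:
        # Bits skipped to reach the next offset that satisfies the alignment.
        fill = (-pos) % child.alignment
        pos += fill + child.length
    return pos - start


children = [Child(alignment=8, length=8), Child(alignment=32, length=32)]
print(complex_content_length(children, start=0))  # 64
print(complex_content_length(children, start=8))  # 56
```

The same children give 64 bits from offset 0 but 56 bits from offset 8,
because the internal alignment fill shrinks from 24 bits to 16.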
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
On Mon, Mar 24, 2014 at 11:34 AM, Andrew Edwards <andy.edwards@uk.ibm.com>
wrote:
Steve (et al) - Resending as the last
one bounced.
I'll usurp Tim and respond :)
Currently the IBM implementation insists on using a fixed-length encoding
and returns an "unsupported" error message for a variable width
encoding. With a fixed width encoding, we "do the maths"
using the bytes-per-character and the bytes written by this complex element.
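For reference, "doing the maths" amounts to something like the Python sketch
below. This is not the IBM code; the function name is hypothetical, and
raising an error when the byte count is not a whole number of characters is
an assumption made only for this example.

```python
def chars_fixed_width(byte_count: int, bytes_per_char: int) -> int:
    """Characters in `byte_count` bytes of a fixed-width encoding."""
    if bytes_per_char <= 0:
        raise ValueError("bytes_per_char must be positive")
    chars, remainder = divmod(byte_count, bytes_per_char)
    if remainder:
        # Assumed behaviour: the thread does not say what happens here.
        raise ValueError("byte count is not a whole number of characters")
    return chars


print(chars_fixed_width(12, 2))  # 6, e.g. 12 bytes of a 2-byte encoding
```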
Re: [DFDL-WG] Action 242 - valueLength and contentLength function wording
Note errata 3.9, my bolding:
"3.9. Section 12.3.5, 7.3.1, 7.3.2. The spec originally
allows lengthKind ‘pattern’ to be used when the representation of the
current element, or of a child element, is binary, but imposes restrictions
on the encoding that can be in force.
Clarify that the encoding property must be defined for the element (else
schema definition error), and that a decoding processing error is possible
if the match of the regex encounters data that does not decode in that
encoding, dependent on the setting of encodingErrorPolicy. Remove section
12.3.5.1.
Same clarifications needed for testKind ‘pattern’ property for asserts
and discriminators.
For consistency, the restriction that a complex element of specified length
and lengthUnits ‘characters’ must have children that are all text and
that have the same encoding as the complex element, is dropped."
That's the restriction that I was referring to in my comment below. I
can see why it was dropped - basically the parser now just tries to decode
n characters using the complex element's encoding (and encodingErrorPolicy).
We could apply the same principle for dfdl:valueLength & dfdl:contentLength
- you build the stream from the bottom up, and then decode it using the
complex element's encoding (and encodingErrorPolicy ?) to get the length
in characters.
Note that's how unparsing for lengthKind 'prefixed' with lengthUnits 'characters'
would work as well - the spec just says "For
a complex element, the length is that of the ComplexContent region"
which is not sufficient (12.3.4). Similar deal for lengthKind 'explicit'
- in order to know the size in chars of ElementUnused the unparser
needs to know the size in chars of the data first (12.3.7.3).
(Of course, for a fixed width encoding, you don't need to decode, you can
just do the maths, but for the general case you need to decode. Also just
doing the maths does not take encodingErrorPolicy into account).
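A small Python sketch of that last point. It maps the two
encodingErrorPolicy values ('error' and 'replace') onto Python's 'strict' and
'replace' decoder modes, which is only an approximation of the DFDL
behaviour; the function name is hypothetical.

```python
def length_in_chars(data: bytes, encoding: str, policy: str) -> int:
    """Decode `data` and count characters under an encodingErrorPolicy."""
    # Approximate mapping from DFDL policy names to Python error handlers.
    errors = {"error": "strict", "replace": "replace"}[policy]
    return len(data.decode(encoding, errors=errors))


data = b"ab\xffcd"  # 0xFF is not a valid US-ASCII byte

# Byte arithmetic for a 1-byte encoding says 5 and never notices a problem.
print(len(data) // 1)                               # 5

# 'replace' substitutes the bad byte and still counts 5 characters.
print(length_in_chars(data, "ascii", "replace"))    # 5

# 'error' surfaces the decode failure that the arithmetic hides.
try:
    length_in_chars(data, "ascii", "error")
except UnicodeDecodeError:
    print("decode error")
```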
23.5.3.1. Value length is only a function of the dfdl:encoding property
if the element has a text representation. Not sure this needs to be (re)stated
here.
23.5.3.1. "The value length is computed from the DFDL infoset value, ignoring
the dfdl:length or dfdl:textOutputMinLength property. Other DFDL properties
which affect the length of a text or binary representation are respected, it is
only an explicit length which is ignored."
Last sentence is too imprecise - should be phrased in terms of the grammar.
23.5.3.1. "If the second argument is 'characters' then the element must have
text representation and it is a schema definition error otherwise".
Yes but only for a simple type, so should be qualified.
23.5.3.1. "If the second argument, giving the length units, is 'characters',
then recursively, this complex type element must have text representation
throughout all its contained elements and framing, all of which must also use a
uniform character set encoding."
I can't see that restriction elsewhere in the spec when it talks about length
of ComplexContent and lengthUnits 'characters' - I was expecting it to be in
section 12.3.4 or 12.3.7.3 which face the same issue - but it isn't. Did we
decide not to have this restriction? Without such a restriction, how does the
unparser come up with a meaningful length (unless it re-parses)? (Tim - what
does IBM DFDL do here?) What about delimiters and padding of children that use
%#r entities?
23.5.3.2. The points in 23.5.3.1 about escape characters, length as a function
of encoding, and bottom up for complex elements, apply equally to 23.5.3.2.
It might be easier just to say in 23.5.3.2 that dfdl:contentLength
for complex elements is same as dfdl:valueLength, and for simple elements
differs only by the additional inclusion of LeftPadding and RightPadOrFill
regions.
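For a simple text element the difference can be sketched in Python as below.
This assumes a left-justified value padded on the right up to a minimum
length, ignores escaping, and uses a hypothetical function name; it is only
meant to show which regions each function would count.

```python
from typing import Tuple


def text_lengths(value: str, min_length: int, pad_char: str = " ") -> Tuple[int, int]:
    """Return (value length, content length) in characters."""
    # Left-justified value: any padding lands in the RightPadOrFill region.
    padded = value.ljust(min_length, pad_char)
    return len(value), len(padded)


print(text_lengths("abc", 8))         # (3, 8)
print(text_lengths("abcdefghij", 8))  # (10, 10), no padding needed
```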
Also noted in passing:
Specified length - An item has specified length when dfdl:lengthKind
is "implicit", "explicit", or "prefixed".
should be
Specified length - An element has specified length when dfdl:lengthKind
is "implicit" (simple type only), "explicit", or "prefixed".
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU