Mike,

I believe this all falls under deferred action 242. It sounds like we should undefer the action, as it is affecting the work on the Daffodil serializer.

I have the last email exchanges for action 242, from April 2014. I can re-send them.

Regards

Steve Hanson
IBM Integration Bus, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 19/05/2016 17:25
Subject: [DFDL-WG] clarifications needed?: dfdl:contentLength function and dfdl:valueLength function on empty and literal nil representations, and escaping
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>

The dfdl:contentLength function is defined in terms of the SimpleContent or ComplexContent regions of the grammar.

Let's just look at Simple types for a sec.

We do not specify what the dfdl:contentLength is for an element of SimpleType which has the SimpleLiteralNilElementRep or SimpleEmptyElementRep.

I suggest the value should be zero for SimpleEmptyElementRep. When parsing, an empty element by definition has no content. The fact that a default value might be inserted because of the empty representation should not change the fact that there was no content. When unparsing, SimpleEmptyElementRep can occur if an empty string is the value of a string-valued element, or an empty byte array is the value of a hexBinary element. The grammar is just stipulating the different treatment of initiator/terminator for these special cases of empty things. The content is length zero.

But consider the round-trip scenario. We parse data to the infoset. During parsing the dfdl:contentLength of an element having SimpleEmptyElementRep is zero. A default value is inserted. Now we unparse this same infoset. The default value's representation very well may be SimpleNormalRep, with non-zero dfdl:contentLength.

I claim this is ok. This is just another case where some data formats don't round trip unchanged. It does add an implementation headache: if the contentLength is cached on the infoset item, separate cache locations are needed for parsing and for unparsing.
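
To make the caching point concrete, here is a minimal sketch in Python. The class and field names (InfosetElement, parsed_content_length, unparsed_content_length) and the default value "N/A" are hypothetical and only illustrate the idea; they are not taken from Daffodil or any other implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class InfosetElement:
    """Hypothetical infoset item; the field names are illustrative only."""
    name: str
    value: str
    parsed_content_length: Optional[int] = None    # filled in while parsing
    unparsed_content_length: Optional[int] = None  # filled in while unparsing


# Parsing: the element had the empty representation, so its content length
# is 0, and an assumed default value "N/A" was placed in the infoset.
elem = InfosetElement(name="status", value="N/A")
elem.parsed_content_length = 0

# Unparsing the same infoset: the default value is written out with a normal
# representation, so the content length is now non-zero (3 characters here).
elem.unparsed_content_length = len(elem.value)
```

With a single shared cache slot, the second assignment would overwrite the first, and an expression evaluated after the round trip could see the wrong length.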

For SimpleLiteralNilElementRep, it should be the length of the NilLiteralCharacters or NilElementLiteralContent regions. (Note: the word "Content" appears there, implying that we think of the nil literal representation as content.) This applies to both parsing and unparsing.

For elements of complex type, I think for both ComplexLiteralNilElementRep and ComplexEmptyElementRep, the dfdl:contentLength should be zero when parsing. When unparsing, again a complex default may be created (because default values for interior elements of the complex type might be filled in as part of the augmented infoset), and the dfdl:contentLength might not be zero if these default values have non-zero content length. Again I think this is ok.

For dfdl:contentLength, we should clarify that the length should also include the contributions of any escape characters, escape-escape characters, and escapeBlockStart/End characters. (This is implied, because such characters are in the "value" regions of simple types, and value regions are always contained in the content region, but I think the clarification is still helpful.)

Similarly we need to clarify what dfdl:valueLength does.

For SimpleEmptyElementRep the dfdl:valueLength should be zero.
For SimpleLiteralNilElementRep, the dfdl:valueLength should be zero, because a nilled element has no value.

There is a corner case of SimpleLiteralNilElementRep for a nillable simple element of type xs:string. Because a literal nil representation and a string value are ambiguous, this case should be handled by calling dfdl:contentLength instead of dfdl:valueLength. So a nillable string element with literal nil nilValue="nil" should have dfdl:valueLength of zero, but dfdl:contentLength (in characters) of 3. The same element, not nilled and containing the string "nil" as its value, would have dfdl:valueLength of 3 (characters) and dfdl:contentLength of 3 (characters).
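
The proposed results for this corner case can be tabulated with a small Python sketch. The function name lengths and its parameters are hypothetical; it ignores padding, escaping, and framing, and simply encodes the rule suggested above for a nillable string element measured in characters.

```python
from typing import Tuple


def lengths(representation: str, is_nilled: bool) -> Tuple[int, int]:
    """Return (contentLength, valueLength) in characters.

    Assumes no padding or escaping, so the content region is exactly the
    representation text. A nilled element is treated as having no value.
    """
    content_length = len(representation)
    value_length = 0 if is_nilled else len(representation)
    return content_length, value_length


nilled = lengths("nil", is_nilled=True)       # literal nil matched
not_nilled = lengths("nil", is_nilled=False)  # ordinary string value "nil"
empty = lengths("", is_nilled=False)          # empty representation
```

Under these assumptions the nilled element has content length 3 and value length 0, while the non-nilled element holding the string "nil" has 3 for both.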

For complex type elements, dfdl:valueLength is already defined to be the same as dfdl:contentLength.

For elements that are not represented (that is, elements that have the dfdl:inputValueCalc property on them), I believe both dfdl:valueLength and dfdl:contentLength should cause an SDE, as this has to be an error on the part of the schema author. (An argument can be made that these should return zero, however. See next paragraph.)

Note however that these functions can be called on elements of complex type that contain elements that are not represented. Such contained non-represented elements contribute zero to the content length in all cases. (Consistency with this is why calling dfdl:valueLength or dfdl:contentLength directly on a non-represented element might want to return zero, instead of SDE.)
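
As a rough illustration of the zero contribution, here is a Python sketch. The Elem class and content_length function are hypothetical, and the model is deliberately simplified: it sums only the children's content lengths and ignores separators, initiators, terminators, and alignment, all of which a real complex content region would include. It also leaves open what a direct call on a non-represented element should do.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Elem:
    """Hypothetical element model; own_length is the simple content length."""
    name: str
    represented: bool = True  # False models an element with dfdl:inputValueCalc
    own_length: int = 0
    children: List["Elem"] = field(default_factory=list)


def content_length(e: Elem) -> int:
    """Contribution of e to an enclosing content length (framing ignored)."""
    if not e.represented:
        return 0  # non-represented elements contribute nothing
    if e.children:
        return sum(content_length(c) for c in e.children)
    return e.own_length


record = Elem("record", children=[
    Elem("a", own_length=4),
    Elem("total", represented=False),  # computed value, not in the data
    Elem("b", own_length=2),
])
record_length = content_length(record)
```

Here record_length is 6: the computed element "total" adds nothing to the enclosing element's length.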

dfdl:valueLength is already specified to exclude the length of padding characters that are trimmed/added.
I believe we should explicitly state that it *includes* the length of escape, escape-escape, and escapeBlockStart/End characters.
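
A small Python sketch of what that would mean in practice. The escape function, the delimiter ",", the escape character "\", and the pad width of 8 are all assumptions made for illustration. For simplicity the escape character is also used as its own escape-escape character.

```python
def escape(value: str, delimiter: str = ",", esc: str = "\\") -> str:
    """Prefix each delimiter or escape character in value with esc."""
    out = []
    for ch in value:
        if ch == esc or ch == delimiter:
            out.append(esc)
        out.append(ch)
    return "".join(out)


logical = "a,b"                 # logical value: 3 characters
escaped = escape(logical)       # representation with the escape character added
padded = escaped.ljust(8, " ")  # assumed right padding to width 8

value_length = len(escaped)     # counts the escape character, not the padding
padded_length = len(padded)     # full padded representation
```

Under the proposed wording, dfdl:valueLength for this element would be 4 characters (the three logical characters plus one escape character), and the four padding characters would not be counted.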

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy
