Mike

I realise that this is the first of your two e-mails on this subject, but I've added some comments anyway to keep things in context...more to come on your 2nd mail.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:	"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To:	Tim Kimber/UK/IBM@IBMGB
Cc:	Steve Hanson/UK/IBM@IBMGB
Date:	11/06/2011 01:56
Subject:	RE: A selection of example data formats

Tim,

I need schemas in order to interpret these puzzles. I can propose some, but I’m not sure they would match the scenarios you are trying to illustrate.

You mention lengthKind="explicit" somewhere, which I think makes some of these puzzles very easy.

To me, <element name="a" type="string" default="AAAAA" dfdl:lengthKind="explicit" dfdl:length="5"/> is really easy. The default will ONLY be used on unparsing to fill in something missing from the infoset. When parsing, 5 characters are consumed. If they’re not available because the end of the data or of the parent has been reached, then it’s a processing error, but the default would never be used.
<smh> Tim and I agree on this when the dfdl:length property (which can be an expression) evaluates to > 0 </smh>.
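
The parsing behaviour described above for lengthKind="explicit" can be sketched roughly as follows. This is only an illustration in Python, not an excerpt from any DFDL processor: the names parse_explicit and ProcessingError are hypothetical, and the sketch assumes a character-based length greater than zero (the case on which everyone agrees).

```python
from typing import Tuple


class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def parse_explicit(data: str, pos: int, length: int) -> Tuple[str, int]:
    """Consume exactly `length` characters starting at `pos`.

    No delimiter scanning is done and no default value is consulted:
    either the characters are there, or parsing fails.
    """
    end = pos + length
    if end > len(data):
        # End of data (or parent) reached before `length` characters were seen.
        raise ProcessingError(
            f"needed {length} characters at offset {pos}, "
            f"only {len(data) - pos} available"
        )
    return data[pos:end], end
```

With this sketch, parse_explicit("AAAAABBBBB", 0, 5) yields the value "AAAAA" and the new offset 5, while parse_explicit("AAA", 0, 5) raises the error rather than falling back to a default.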

Our current draft spec says that delimiter scanning does not even occur while these 5 characters are being consumed. (We have a separate topic to consider whether that is right or not. I think it is right: if you want to look for a terminator as well as check for a fixed length, you have to use lengthKind="delimited" and a dfdl:assert to check the length. But I digress.)

<smh> Digression. Using dfdl:assert only helps with the last element in a sequence of fixed length elements. </smh>

I do think we need to consolidate and clarify terminology somewhat, and there’s an important special case we need to clarify.

Missing = Known Not To Exist = Not Existing
Existing = Not Missing

The trick is this: Missing means there is no place where the element could appear. The reason for this stilted language is that we’re trying to allow elements whose declarations admit a representation of exactly zero bits to be classified as Existing, not Missing. So the burden of evidence in the bit stream has to fall on the other side, toward showing that something is NOT there.

Empty implies Existing, but with no bits in the content region. The usual examples would be adjacent delimiters, as in “AAAAA,,CCCCC”, or initiators with nothing after them, as in “A:AAAAA, B:, C:CCCCC”.
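
As a rough illustration of the adjacent-delimiter example above, the Python sketch below splits the data on the separator and labels each content region. The function name classify_fields is hypothetical, and a plain string split stands in for real delimiter scanning (no escape schemes, no initiators).

```python
from typing import List, Tuple


def classify_fields(data: str, separator: str = ",") -> List[Tuple[str, str]]:
    """Label each separated content region as 'empty' or 'non-empty'.

    Every region found this way Exists, because the separators mark a
    place where it could appear; Empty only says its content has no bits.
    """
    result = []
    for content in data.split(separator):
        label = "empty" if content == "" else "non-empty"
        result.append((content, label))
    return result
```

For “AAAAA,,CCCCC” the middle region comes back as ("", "empty"): the two separators show where the element is, even though its content occupies nothing.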

This very special case where an element, including framing, legitimately occupies zero bits of the data stream is the trick. This is the ambiguity that some of your examples are about. This is why there’s this emphasis in the spec language Alan worked so hard on, about whether we can show things *don’t exist* in the data stream. It’s because we want this special case to be allowed to be “existing” even though there are no bits showing evidence of existence at all.

To make things work, I think this special case must be classified as Existing, not Missing. Alternatively, we can introduce a special term for this situation. (Call it a Ghost element, for example: it exists, but you can’t see it.)

The trick here is resolving the point of uncertainty, and determining that the length is zero without unbounded speculation. We have to be very careful how much we require speculation to do for us. Specifically, we do *not* want to say that something’s length is zero because speculation forward after this element led to a successful parse. That’s the so-called “squeeze” algorithm for determining where something ends (its length), and I really worry about it. I think there are possibly other places where our draft spec implies this kind of squeeze algorithm for length. (Ex: I am worried about when separator and terminator are the same character. But I digress.)

An element’s length is zero if it has no bits of framing (no initiator or terminator of its own, no alignment or skip crud) and we immediately encounter terminating markup (parent separator, parent terminator, ...) or end-of-parent/data. Assuming the element is required, has a default, and its DFDL length properties admit a zero-length representation, then in this case we would generate the default as its infoset value and consume zero bits of the input stream. Example:

<smh> We need to be very careful when using the term 'length'. Length in DFDL only ever means the content region. The function dfdl:representationLength() is defined as omitting framing (see spec 23.5.3). If we want to talk about the total absence of framing as well, then we need a new term. Tim and I agree with your second sentence if it uses length to mean content, but not if it includes framing, as it then does not cover the case where (e.g.) I have a dfdl:initiator, zero-length content, and a dfdl:separator in force. Note that you allow this case in your 2nd email, so maybe you changed your mind as you progressed through Tim's puzzles? </smh>

<sequence dfdl:terminator=";">
<element name="speedLimit" type="int" dfdl:initiator="" dfdl:terminator="" default="100" dfdl:lengthKind="delimited" />
</sequence>

So the data stream “;” would produce an infoset containing element speedLimit with value 100.

(I intentionally avoided the string case, because of the empty string being a legitimate value, which disables defaulting really.)
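
The speedLimit example can be sketched in Python as below. This illustrates only the rule as proposed in this message, for an element with no framing of its own inside a parent with a terminator; the name parse_delimited_int is hypothetical and a simple string search stands in for delimiter scanning.

```python
from typing import Tuple


def parse_delimited_int(
    data: str, pos: int, parent_terminator: str, default: int
) -> Tuple[int, int]:
    """Parse a required int element that has no initiator or terminator.

    The content runs up to the parent's terminator, or to end of data.
    If the terminating markup is met immediately, the content length is
    zero: the default becomes the infoset value and nothing is consumed.
    """
    end = data.find(parent_terminator, pos)
    if end == -1:
        end = len(data)  # end of data also ends the element
    content = data[pos:end]
    if content == "":
        return default, pos  # zero bits consumed, default applied
    # The parent's terminator is left for the parent to consume.
    return int(content), end
```

Under these assumptions, parse_delimited_int(";", 0, ";", 100) gives the value 100 at offset 0, and parse_delimited_int("55;", 0, ";", 100) gives 55 at offset 2. The decision is made from the markup encountered immediately, not by speculating forward.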

<smh> Tim and I have discussed what to do with empty content for xs:string. Spec section 13 currently implies that for this to mean empty string in the infoset, a default value would have to be supplied, set to empty string. I believe this was to make optional and required elements consistent, i.e., for an optional element, empty content does not mean empty string, it means 'missing' (as currently defined). Ergo the same applies to a required element, and defaulting then takes place. </smh>

I’ll send an annotated version of your puzzles in a separate message.

From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Thursday, June 02, 2011 4:41 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson
Subject: A selection of example data formats

Mike,

Steve asked me to forward this text file, which I put together as background material for our discussions about the parsing of DFDL elements and groups.

Key issues:
- The specification uses the terms 'empty', 'missing' and 'known not to exist' in reference to elements. We need to work out what these terms mean so that the spec can be made clearer.
- In my opinion, the terms 'missing' and 'known not to exist' should not have different meanings - it invites criticism. If 'missing' means something different from 'known not to exist' then we need a different word or phrase.
- The application of default values for missing required elements in the parser is problematic. I think Steve may have sent you an email about this, so I won't outline the issues here (Steve, please can you forward your email to me).

Disclaimer: This set of data formats does not highlight all of the unresolved questions around the parsing of groups - only the ones that were in play at the time I produced the document.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
