Re: [DFDL-WG] A selection of example data formats #1

15 Jun 2011

Mike,

I realise that this is the first of your two e-mails on this subject, but 
I've added some comments anyway to keep things in context...more to come 
on your 2nd mail.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 11/06/2011 01:56
Subject: RE: A selection of example data formats

Tim, 

I need schemas in order to interpret these puzzles.  I can propose some, 
but I’m not sure they would match the scenarios you are trying to 
illustrate.

You mention lengthKind="explicit" somewhere, which I think makes some of 
these puzzles very easy. 

To me, <element name="a" type="string" default="AAAAA" 
dfdl:lengthKind="explicit" dfdl:length="5"/> is really easy. The default 
will ONLY be used on unparsing to fill in something missing from the 
infoset.  When parsing, 5 characters are consumed. If they’re not 
available because end of data/parent occurred, then it’s a processing 
error, but the default would never be used. 
<smh> Tim and I agree on this when the dfdl:length property (which can be 
an expression) evaluates to > 0 </smh>. 
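
A minimal Python sketch of the parsing behaviour described above, assuming 
character data and a length greater than zero. The names parse_explicit and 
ProcessingError are hypothetical and are not taken from the spec or from 
any DFDL implementation.

```python
from __future__ import annotations


class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def parse_explicit(data: str, pos: int, length: int) -> tuple[str, int]:
    """Consume exactly `length` characters starting at `pos`.

    No delimiter scanning takes place and no default value is applied:
    either the characters are there, or parsing fails.
    """
    end = pos + length
    if end > len(data):
        available = max(len(data) - pos, 0)
        raise ProcessingError(
            f"needed {length} characters at offset {pos}, found {available}"
        )
    return data[pos:end], end
```

With this sketch, parse_explicit("AAAAA;", 0, 5) returns ("AAAAA", 5), while 
parse_explicit("AAA", 0, 5) raises ProcessingError rather than falling back 
to the default "AAAAA". A separator inside the 5 characters is consumed as 
ordinary content.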

Our current draft spec says that delimiter scanning does not even occur 
while these 5 characters are being consumed. (We have a separate topic to 
consider whether that is right. I think it is: if you want to look for a 
terminator as well as check for a fixed length, you have to use 
lengthKind="delimited" and a dfdl:assert to check the length. But I 
digress.)

<smh> Digression. Using dfdl:assert only helps with the last element in a 
sequence of fixed length elements. </smh>

I do think we need to consolidate and clarify terminology somewhat, and 
there’s an important special case we need to clarify.

Missing = Known Not To Exist = Not Existing
Existing = Not Missing

The trick is this: Missing means there is no place where the element could 
appear.  The reason for this stilted language is that we’re trying to 
allow elements whose declarations admit representations of exactly zero 
bits, to be classified as Existing, not missing. So the evidence in the 
bit stream has to go on the other side, toward showing something is NOT 
there. 

Empty implies Existing, but with no bits in the content region of the 
grammar. The usual examples would be adjacent delimiters: "AAAAA,,CCCCC", 
or initiators: "A:AAAAA, B:, C:CCCCC"
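
To make the two examples concrete, here is a rough Python sketch that 
splits the data on the separator and strips each element's initiator, 
leaving only the content regions. The function name content_regions and 
its parameters are hypothetical, and real DFDL delimiter scanning is more 
involved than str.split.

```python
def content_regions(data, separator, initiators=None):
    """Return the content region of each separated element.

    An element whose content region is the empty string is Empty, but it
    still Exists: its place in the data is marked by the delimiters.
    """
    pieces = data.split(separator)
    if initiators is not None and len(initiators) != len(pieces):
        raise ValueError("expected one initiator per element")
    regions = []
    for index, piece in enumerate(pieces):
        if initiators is not None:
            initiator = initiators[index]
            if not piece.startswith(initiator):
                raise ValueError(f"initiator {initiator!r} not found")
            piece = piece[len(initiator):]
        regions.append(piece)
    return regions
```

Both content_regions("AAAAA,,CCCCC", ",") and 
content_regions("A:AAAAA, B:, C:CCCCC", ", ", ["A:", "B:", "C:"]) return 
["AAAAA", "", "CCCCC"]: the middle element is present with zero-length 
content.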

This very special case where an element, including framing, legitimately 
occupies zero bits of the data stream is the trick. This is the ambiguity 
that some of your examples are about. This is why there’s this emphasis in 
the spec language Alan worked so hard on, about whether we can show things 
*don’t exist* in the data stream. It’s because we want this special case 
to be allowed to be “existing” even though there are no bits showing 
evidence of existence at all.

To make things work, I think this special case must be classified as 
Existing, not Missing. Alternatively we can introduce a special term for 
this situation. (Call it a Ghost element, for example: it exists, but you 
can’t see it.)

The trick here is resolving the point of uncertainty, and determining that 
the length is zero without unbounded speculation.  We have to be very 
careful how much we require speculation to do for us. Specifically, we do 
*not* want to say that something’s length is zero because speculation 
forward after this element led to a successful parse. That’s the so-called 
“squeeze” algorithm for determining where something ends (its length), and 
I really worry about it. I think there are possibly other places where our 
draft spec implies this kind of squeeze algorithm for length. (Ex: I am 
worried about when separator and terminator are the same character. But I 
digress.)

An element’s length is zero if it has no bits of framing (no initiator or 
terminator of its own, no alignment or skip crud) and we immediately 
encounter terminating markup (parent separator, parent terminator, ...) or 
end-of-parent/data.  Assuming the element is required, has a default, and 
the DFDL length properties admit a zero-length representation, we would 
then generate the default as its infoset value and consume zero bits of 
the input stream. Example:

<smh> We need to be very careful when using the term 'length'. Length in 
DFDL only ever means content region. The function 
dfdl:representationLength() is defined as omitting framing (see spec 
23.5.3). If we want to talk about total absence of framing as well, then 
we need a new term.  Tim and I agree with your second sentence if it uses 
length to mean content, but not if it includes framing, as it does not 
cover the case (e.g.) where I have a dfdl:initiator, zero-length content, 
and a dfdl:separator in force. Note that you allow this case in your 2nd 
email, so maybe you changed your mind as you progressed through Tim's 
puzzles?  </smh>

<sequence dfdl:terminator=";">
  <element name="speedLimit" type="int" dfdl:initiator="" dfdl:terminator=""
           default="100" dfdl:lengthKind="delimited" />
</sequence>

So the data stream “;” would produce an infoset containing element 
speedLimit with value 100. 
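
The defaulting behaviour proposed here can be sketched in Python as 
follows. This illustrates the proposal under discussion, not settled spec 
behaviour, and the function name parse_speed_limit is hypothetical.

```python
def parse_speed_limit(data, terminator=";", default=100):
    """Parse a delimited int element that has no framing of its own.

    Returns (value, characters consumed by the element). The parent's
    terminator is located but is not counted as part of the element.
    """
    end = data.find(terminator)
    if end == -1:
        raise ValueError("sequence terminator not found")
    content = data[:end]
    if content == "":
        # Terminating markup met immediately: the element occupies zero
        # characters, so the default supplies the infoset value.
        return default, 0
    return int(content), end
```

Here parse_speed_limit(";") returns (100, 0), and parse_speed_limit("70;") 
returns (70, 2).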

(I intentionally avoided the string case, because of the empty string 
being a legitimate value, which disables defaulting really.)

<smh> Tim and I have discussed what to do with empty content for 
xs:string. Spec section 13 currently implies that for this to mean empty 
string in the infoset, a default value would have to be supplied, set to 
empty string. I believe this was to make optional and required elements 
consistent, ie, for an optional element, empty content does not mean empty 
string, it means 'missing' (as currently defined). Ergo same for a 
required element, and defaulting then takes place. </smh>

I’ll send an annotated version of your puzzles in a separate message.

From: Tim Kimber [mailto:KIMBERT@uk.ibm.com] 
Sent: Thursday, June 02, 2011 4:41 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson
Subject: A selection of example data formats

Mike, 

Steve asked me to forward this text file that I have put together. I put 
it together as background material for our discussions about the parsing 
of DFDL elements and groups. 

Key issues: 
- The specification uses the terms 'empty', 'missing' and 'known not to 
exist' in reference to elements. We need to work out what these terms mean 
so that the spec can be made clearer. 
- In my opinion, the terms 'missing' and 'known not to exist' should not 
have different meanings - it invites criticism. If 'missing' means 
something different from 'known not to exist' then we need a different 
word or phrase. 
- The application of default values for missing required elements in the 
parser is problematic. I think Steve may have sent you an email about 
this, so I won't outline the issues here (Steve, please can you forward 
your email to me). 

Disclaimer: This set of data formats does not highlight all of the 
unresolved questions around the parsing of groups - only the ones that 
were in play at the time I produced the document. 

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742
