Mike
I realise that this is the first of your two e-mails on this subject, but
I've added some comments anyway to keep things in context...more to come
on your 2nd mail.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Mike Beckerle"
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 11/06/2011 01:56
Subject: RE: A selection of example data formats
Tim,
I need schemas in order to interpret these puzzles. I can propose some,
but I’m not sure they would match the scenarios you are trying to
illustrate.
You mention lengthKind="explicit" somewhere, which I think makes some of
these puzzles very easy.
To me, <element name="a" type="string" default="AAAAA"
dfdl:lengthKind="explicit" dfdl:length="5"/> is really easy. The default
will ONLY be used on unparsing, to fill in something missing from the
infoset. When parsing, 5 characters are consumed. If they’re not
available because end of data or end of parent occurred, then it’s a
processing error, but the default would never be used.
<smh> Tim and I agree on this when the dfdl:length property (which can be
an expression) evaluates to > 0 </smh>.
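A minimal Python sketch of the parsing behaviour described above, for illustration only. It assumes a character-based length greater than zero; the names parse_explicit and ProcessingError are hypothetical and are not taken from the spec or from any DFDL implementation.

```python
from typing import Tuple


class ProcessingError(Exception):
    """Raised when the data cannot satisfy the declared fixed length."""


def parse_explicit(data: str, pos: int, length: int) -> Tuple[str, int]:
    """Consume exactly `length` characters of `data`, starting at `pos`.

    No delimiter scanning and no defaulting takes place here: either the
    characters are available, or the parse fails.
    """
    end = pos + length
    if end > len(data):
        # End of data (or parent) reached early: a processing error.
        # The schema default is never substituted when parsing.
        raise ProcessingError(
            f"needed {length} characters at offset {pos}, "
            f"found {len(data) - pos}"
        )
    return data[pos:end], end
```

With this sketch, parse_explicit("AAAAABBBBB", 0, 5) returns the value "AAAAA" and the new offset 5, while parse_explicit("AAA", 0, 5) raises ProcessingError rather than falling back to the default.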
Our current draft spec says that while these 5 characters are being
consumed, delimiter scanning does not occur at all. (We have a separate
topic to consider whether that is right or not. I think it is right: if
you want to look for a terminator as well as check for a fixed length, you
have to use lengthKind="delimited" together with a dfdl:assert to check
the length. But I digress.)
<smh> Digression. Using dfdl:assert only helps with the last element in a
sequence of fixed length elements. </smh>
I do think we need to consolidate and clarify terminology somewhat, and
there’s an important special case we need to clarify.
Missing = Known Not To Exist = Not Existing
Existing = Not Missing
The trick is this: Missing means there is no place where the element could
appear. The reason for this stilted language is that we’re trying to
allow elements whose declarations admit representations of exactly zero
bits to be classified as Existing, not Missing. So the evidence in the
bit stream has to go on the other side, toward showing something is NOT
there.
Empty implies Existing, but with zero bits in the grammar's content region.
The usual examples would be adjacent delimiters: “AAAAA,,CCCCC”, or
initiators: “A:AAAAA, B:, C:CCCCC”.
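To make the two examples concrete, here is a rough Python sketch that extracts the content region of each element and shows the middle one coming out Empty (zero-length content, but Existing). The function name content_regions is hypothetical, and the sketch assumes the separator is "," in the first example and ", " (comma plus space) in the second; the original data does not say which was intended.

```python
def content_regions(data, separator, initiators=None):
    """Split `data` on `separator` and strip each field's initiator.

    Returns the content region of every field. A field whose content
    region is the empty string is Empty: it exists in the data, because
    its delimiters are present, but it has zero bits of content.
    """
    fields = data.split(separator)
    if initiators is None:
        return fields
    if len(fields) != len(initiators):
        raise ValueError("number of fields does not match number of initiators")
    regions = []
    for field, initiator in zip(fields, initiators):
        if not field.startswith(initiator):
            raise ValueError(f"expected initiator {initiator!r} in {field!r}")
        regions.append(field[len(initiator):])
    return regions
```

Under those assumptions, both content_regions("AAAAA,,CCCCC", ",") and content_regions("A:AAAAA, B:, C:CCCCC", ", ", ["A:", "B:", "C:"]) give three content regions, of which the second is the empty string.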
This very special case where an element, including framing, legitimately
occupies zero bits of the data stream is the trick. This is the ambiguity
that some of your examples are about. This is why there’s this emphasis in
the spec language Alan worked so hard on, about whether we can show things
*don’t exist* in the data stream. It’s because we want this special case
to be allowed to be “existing” even though there are no bits showing
evidence of existence at all.
To make things work, I think this special case must be classified as
Existing, not Missing. Alternatively we can introduce a special term for
this situation. (Call it a Ghost element, for example. It exists, but you
can’t see it.)
The trick here is resolving the point of uncertainty, and determining that
the length is zero without unbounded speculation. We have to be very
careful how much we require speculation to do for us. Specifically, we do
*not* want to say that something’s length is zero because speculation
forward after this element led to a successful parse. That’s the so-called
“squeeze” algorithm for determining where something ends (its length), and
I really worry about it. I think there are possibly other places where our
draft spec implies this kind of squeeze algorithm for length. (Ex: I am
worried about when separator and terminator are the same character. But I
digress.)
An element’s length is zero if it has no bits of framing (no initiator or
terminator of its own, no alignment or skip crud) and we immediately
encounter terminating markup (parent separator, parent terminator, ...) or
end-of-parent/data. Assuming the element is required, has a
default, and its DFDL length properties admit a zero-length
representation, then in this case we would generate the default as its
infoset value and consume zero bits of the input stream. Example:
<smh> We need to be very careful when using the term 'length'. Length in
DFDL only ever means content region. The function
dfdl:representationLength() is defined as omitting framing (see spec
23.5.3). If we want to talk about total absence of framing as well then we
need a new term. Tim and I agree with your second sentence if it uses
length to mean content, but not if it includes framing, as it then does
not cover the case where, for example, I have a dfdl:initiator,
zero-length content, and a dfdl:separator in force. Note that you allow
this case in your 2nd email, so maybe you changed your mind as you
progressed through Tim's puzzles? </smh>
<sequence dfdl:terminator=";">
  <element name="speedLimit" type="int" dfdl:initiator="" dfdl:terminator=""
           default="100" dfdl:lengthKind="delimited" />
</sequence>
So the data stream “;” would produce an infoset containing element
speedLimit with value 100.
(I intentionally avoided the string case, because of the empty string
being a legitimate value, which disables defaulting really.)
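The speedLimit example can be sketched in Python as follows. This is only an illustration of the proposed rule, not spec behaviour: the function name parse_speed_limit is hypothetical, and the sketch hard-wires the assumptions of the example (a required int element with no framing of its own, a default of 100, and a parent terminator of ";").

```python
def parse_speed_limit(data: str, terminator: str = ";", default: int = 100) -> int:
    """Parse one delimited int element, ended by the parent's terminator.

    If the terminator is encountered immediately, the content region has
    zero length, so zero bits of content are consumed and the default
    becomes the infoset value.
    """
    end = data.find(terminator)
    if end == -1:
        raise ValueError("parent terminator not found")
    content = data[:end]
    if content == "":
        # Zero-length representation of a required element with a default.
        return default
    return int(content)
```

Here parse_speed_limit(";") yields 100, and parse_speed_limit("55;") yields 55. Note that the length is decided purely by seeing the terminating markup at the current position, with no speculative parsing of anything that follows.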
<smh> Tim and I have discussed what to do with empty content for
xs:string. Spec section 13 currently implies that for this to mean empty
string in the infoset, a default value would have to be supplied, set to
empty string. I believe this was to make optional and required elements
consistent, ie, for an optional element, empty content does not mean empty
string, it means 'missing' (as currently defined). Ergo same for a
required element, and defaulting then takes place. </smh>
I’ll send an annotated version of your puzzles in a separate message.
From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Thursday, June 02, 2011 4:41 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson
Subject: A selection of example data formats
Mike,
Steve asked me to forward this text file that I have put together. I put
it together as background material for our discussions about the parsing
of DFDL elements and groups.
Key issues:
- The specification uses the terms 'empty', 'missing' and 'known not to
exist' in reference to elements. We need to work out what these terms mean
so that the spec can be made clearer.
- In my opinion, the terms 'missing' and 'known not to exist' should not
have different meanings - it invites criticism. If 'missing' means
something different from 'known not to exist' then we need a different
word or phrase.
- The application of default values for missing required elements in the
parser is problematic. I think Steve may have sent you an email about
this, so I won't outline the issues here. (Steve, please can you forward
your email to me?)
Disclaimer: This set of data formats does not highlight all of the
unresolved questions around the parsing of groups - only the ones that
were in play at the time I produced the document.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU