Mike
I realise that this is the first of
your two e-mails on this subject, but I've added some comments anyway to
keep things in context...more to come on your 2nd mail.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 11/06/2011 01:56
Subject: RE: A selection of example data formats
Tim,
I need schemas in order to
interpret these puzzles. I can propose some, but I’m not sure they
would match the scenarios you are trying to illustrate.
You mention somewhere lengthKind="explicit"...
which I think makes some of these puzzles very easy.
To me, <element name="a"
type="string" default="AAAAA" dfdl:lengthKind="explicit" dfdl:length="5"/>
is really easy. The default will ONLY be used on unparsing to fill in something
missing from the infoset. When parsing, 5 characters are consumed.
If they’re not available because end of data/parent occurred, then it’s
a processing error, but the default would never be used.
<smh> Tim and I agree on
this when the dfdl:length property (which can be an expression) evaluates
to > 0 </smh>.
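A minimal Python sketch of the explicit-length behaviour described above may help keep the discussion concrete. It is not taken from the spec or from any DFDL implementation; the names parse_explicit and ProcessingError are hypothetical, and the length is assumed to be counted in characters and to be greater than zero.

```python
from typing import Tuple


class ProcessingError(Exception):
    """Hypothetical error type standing in for a DFDL processing error."""


def parse_explicit(data: str, pos: int, length: int) -> Tuple[str, int]:
    """Consume exactly `length` characters starting at `pos`.

    No delimiter scanning takes place, and no default is applied when the
    data runs out: short data is simply an error.
    """
    end = pos + length
    if end > len(data):
        available = len(data) - pos
        raise ProcessingError(
            f"need {length} characters at offset {pos}, only {available} available"
        )
    return data[pos:end], end
```

Under these assumptions a delimiter character inside the five characters is just content, and three remaining characters produce an error rather than the default "AAAAA".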
Our current draft spec says
that when consuming these 5 characters, delimiter scanning is not even
occurring. (Whether that is right is a separate topic for us to consider.
I think it is right: if you want to look for a terminator
as well as check for a fixed length, you have to use lengthKind="delimited"
and add a dfdl:assert to check the length. But I digress.)
<smh> Digression. Using
dfdl:assert only helps with the last element in a sequence of fixed length
elements. </smh>
I do think we need to consolidate
and clarify terminology somewhat, and there’s an important special case
we need to clarify.
Missing = Known Not To Exist
= Not Existing
Existing = Not Missing
The trick is this: Missing
means there is no place where the element could appear. The reason
for this stilted language is that we’re trying to allow elements whose
declarations admit representations of exactly zero bits, to be classified
as Existing, not missing. So the evidence in the bit stream has to go on
the other side, toward showing something is NOT there.
Empty implies Existing, but
with no bits in the content region of the grammar. The usual examples would
be adjacent delimiters: “AAAAA,,CCCCC”, or initiators: “A:AAAAA, B:,
C:CCCCC”
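As a rough illustration of the Empty case, here is a small Python sketch that splits a record on its separator and reports which fields have zero-length content. It is only a sketch: classify_fields is a hypothetical helper, the separator is assumed to be a bare comma with no surrounding spaces, and initiators are assumed to be supplied positionally. Every slot it returns is Existing in the sense above, because the separators show there is a place where the element could appear.

```python
def classify_fields(data, separator=",", initiators=()):
    """Return (content, label) for each separator-delimited field.

    The label is "empty" when the content region has zero length after any
    initiator is removed, and "non-empty" otherwise.
    """
    result = []
    for index, raw in enumerate(data.split(separator)):
        # Positional initiator for this field, or none if not supplied.
        initiator = initiators[index] if index < len(initiators) else ""
        if not raw.startswith(initiator):
            raise ValueError(f"field {index} does not start with {initiator!r}")
        content = raw[len(initiator):]
        result.append((content, "empty" if content == "" else "non-empty"))
    return result
```

With adjacent delimiters the middle field of "AAAAA,,CCCCC" is labelled empty, and with initiators ("A:", "B:", "C:") the same holds for "A:AAAAA,B:,C:CCCCC". The sketch does not attempt the harder question of the zero-bit element with no framing at all.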
This very special case where
an element, including framing, legitimately occupies zero bits of the data
stream is the trick. This is the ambiguity that some of your examples are
about. This is why there’s this emphasis in the spec language Alan worked
so hard on, about whether we can show things *don’t exist* in the
data stream. It’s because we want this special case to be allowed to be
“existing” even though there are no bits showing evidence of existence
at all.
To make things work, I think
this special case must be classified as Existing, not Missing. Alternatively
we can introduce a special term for this situation. (Call it a Ghost
element, for example. It exists, but you can’t see it.).
The trick here is resolving
the point of uncertainty, and determining that the length is zero without
unbounded speculation. We have to be very careful how much we require
speculation to do for us. Specifically, we do *not* want to say
that something’s length is zero because speculation forward after this
element led to a successful parse. That’s the so-called “squeeze” algorithm
for determining where something ends (its length), and I really worry about
it. I think there are possibly other places where our draft spec implies
this kind of squeeze algorithm for length. (Ex: I am worried about when
separator and terminator are the same character. But I digress.)
An element’s length is zero
if it has no bits of framing (no initiator nor terminator of its own, no
alignment or skip crud) and we immediately encounter terminating markup
( parent separator, parent terminator….), or we encounter end-of-parent/data.
Assuming the element is required, has a default, and the length dfdl
properties admit a zero-length representation, then in this case we would
generate the default as its infoset value, and consume zero bits of the
input stream. Example:
<smh> We need to be very
careful when using the term 'length'. Length in DFDL only ever means content
region. The function dfdl:representationLength() is defined as omitting
framing (see spec 23.5.3). If we want to talk about total absence of framing
as well then we need a new term. Tim and I agree with your second
sentence if it uses length to mean content, but not if it includes framing,
as it does not cover the case (eg) where I have a dfdl:initiator, zero
length content, and a dfdl:separator in force. Note that you allow this
case in your 2nd email, so maybe you changed your mind as you progressed
through Tim's puzzles? </smh>.
<sequence dfdl:terminator=";">
  <element name="speedLimit" type="int" dfdl:initiator="" dfdl:terminator=""
           default="100" dfdl:lengthKind="delimited"/>
</sequence>
So the data stream “;”
would produce an infoset containing element speedLimit with value 100.
(I intentionally avoided
the string case, because of the empty string being a legitimate value,
which disables defaulting really.)
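The speedLimit case can be sketched in Python as follows. This is an illustration of the proposal only, not settled spec behaviour: parse_speed_limit is a hypothetical function, the element is assumed to have no framing of its own, and the zero-length decision is made solely by meeting the parent terminator immediately, with no forward speculation.

```python
def parse_speed_limit(data, terminator=";", default=100):
    """Parse a delimited int that is ended by the parent's terminator.

    Returns (value, content_length). When the terminator is met immediately,
    the content length is zero and the default becomes the infoset value.
    """
    end = data.find(terminator)
    if end == -1:
        raise ValueError("sequence terminator not found")
    content = data[:end]
    if content == "":
        # Zero-length content: nothing consumed for the element itself.
        return default, 0
    return int(content), len(content)
```

For the data stream ";" this yields the value 100 with zero characters of content consumed, while "70;" yields 70 with two characters consumed.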
<smh> Tim and I have discussed
what to do with empty content for xs:string. Spec section 13 currently
implies that for this to mean empty string in the infoset, a default value
would have to be supplied, set to empty string. I believe this was to make
optional and required elements consistent, ie, for an optional element,
empty content does not mean empty string, it means 'missing' (as currently
defined). Ergo same for a required element, and defaulting then takes place.
</smh>
I’ll send an annotated version
of your puzzles in a separate message.
From: Tim Kimber [mailto:KIMBERT@uk.ibm.com]
Sent: Thursday, June 02, 2011 4:41 PM
To: mbeckerle.dfdl@gmail.com
Cc: Steve Hanson
Subject: A selection of example data formats
Mike,
Steve asked me to forward this text file that I have put together. I put
it together as background material for our discussions about the parsing
of DFDL elements and groups.
Key issues:
- The specification uses the terms 'empty', 'missing' and 'known not to
exist' in reference to elements. We need to work out what these terms mean
so that the spec can be made clearer.
- In my opinion, the terms 'missing' and 'known not to exist' should not
have different meanings - it invites criticism. If 'missing' means something
different from 'known not to exist' then we need a different word or phrase.
- The application of default values for missing required elements in
the parser is problematic. I think Steve may have sent you an email
about this, so I won't outline the issues here ( Steve, please can you
forward your email to me ).
Disclaimer : This set of data formats does not highlight all of the unresolved
questions around the parsing of groups - only the ones that were in play
at the time I produced the document.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU