For next DFDL WG call. Some thoughts on whether lengthKind 'delimited'
should be able to model this without resorting to asserts. Read from
bottom.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 11/06/2012 11:50 -----
From: Tim Kimber/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/05/2012 21:23
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format
Thanks Mike - useful input. I've added my comments in <tk> tags
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 30/05/2012 19:59
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format
Hmmm.
We discussed at one time whether there are actually 2 different delimiting
schemes. One is what we have now. Let me call this "delimited1". In
delimited1, an enclosing parent's delimiter cannot be used in isolation to
find the extent of the data, because child elements might have escape
schemes defined which escape even the parent delimiter, so you still have
to use the recursive definition of the children when parsing. This is a
very powerful mode of parsing. However, many things that might be errors
(such as putting a binary field in the middle of a bunch of text fields)
would be tolerated by this regime, because scanning would be turned on/off
appropriately.
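A minimal sketch of why the parent delimiter cannot be searched for in
isolation under delimited1. It is a toy model rather than the behaviour of
any DFDL processor; the function name find_unescaped, the backslash escape
character and the sample data are all assumptions made for illustration.

```python
def find_unescaped(data: str, delimiter: str, escape: str = "\\", start: int = 0) -> int:
    """Return the index of the first unescaped delimiter at or after start, or -1."""
    i = start
    while i < len(data):
        if data.startswith(escape, i):
            # Skip the escape character and the single character it protects.
            i += len(escape) + 1
            continue
        if data.startswith(delimiter, i):
            return i
        i += 1
    return -1


# A child escape scheme hides the first ";" from the scan for the parent delimiter.
sample = r"ab\;cd;rest"
naive_end = sample.find(";")             # ignores the child's escape scheme
aware_end = find_unescaped(sample, ";")  # honours the child's escape scheme
```

The naive scan stops at the escaped delimiter inside the child, while the
escape-aware scan finds the real end of the parent's content.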
I struggle, however, with whether delimited1 is really the same thing as
"implicit". I mean if you define an element as 'implicit' but it has a
terminator, then after you unwind from the recursion you are still going
to then look for the terminator, so it's not like the delimiters are being
ignored.
<tk>
I think of it this way. The lengthKind property is about the length of the
*content* region. So 'delimited1' is, I think, the same as 'implicit' for
the purposes of finding the length of the content region. If the complex
element has a terminator then the terminator will be expected at the byte
offset that immediately follows the end of the content region - whether
lengthKind is 'delimited' or 'implicit'. In other words, I'm modifying
your description of the behaviour to "after you unwind from the recursion
you are still going to then look for the terminator at the byte offset
immediately following the element's content"
</tk>
The other definition of delimited (let's call it delimited2), would be
where you get to completely disregard the children when searching for the
parent delimiter. Many things appearing within the children would be SDE.
E.g., binary format children would be an SDE, etc. Delimited2 would imply
that the children are all representation="text", and the scan for the
parent delimiter would be irrespective of any delimiters and escape
schemes being put in place by child elements. So for example, the last
child inside a delimited2 parent could have length kind = "endOfData" just
fine, because we can isolate the "box" of data first, and then parse the
children within it, with the last child extending to the end of the "box".
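A rough sketch of the delimited2 idea, assuming all-text data: the "box" is
isolated by the parent terminator alone, and the children are then parsed
inside it. The function name parse_box_first, the two-child layout and the
sample data are hypothetical.

```python
def parse_box_first(data: str, parent_terminator: str, first_len: int):
    """Isolate the parent's content by its terminator alone, then parse the children."""
    end = data.find(parent_terminator)
    if end < 0:
        raise ValueError("parent terminator not found")
    box = data[:end]
    if len(box) < first_len:
        raise ValueError("box too short for the fixed-length child")
    head = box[:first_len]  # fixed-length first child
    tail = box[first_len:]  # last child extends to the end of the box
    rest = data[end + len(parent_terminator):]
    return head, tail, rest


head, tail, rest = parse_box_first("ABCrest of record\nnext", "\n", 3)
```

Note that the box is scanned once to find its end and then read again to
extract the children.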
<tk>
You mean 'endOfParent' but it doesn't change your point, which is valid.
My concern with your description is the implication that the parser needs
to scan the same data multiple times. Maybe there are ways to analyse the
model and avoid that necessity for many types of model, but that may be
easier said than done.
My proposal was to respect the lengthKind of each child element within the
parent delimited element, but to check for the terminator of the element,
of its main group, and for any other enclosing terminating delimiters
before continuing to parse any member of the group. I'm prepared to be
convinced that this approach is shot full of logical inconsistencies, btw.
</tk>
...mike
On Wed, May 30, 2012 at 1:10 PM, Steve Hanson <smh(a)uk.ibm.com> wrote:
Hi Mike
Interested in your opinion on this one...it was prompted by looking at the
best way to model a format where each record consisted of fixed length
optional fields 1 to n followed by an EOR indicator, where missing
trailing fields are suppressed. Kind of analogous to suppressing trailing
delimiters for empty fields.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 30/05/2012 18:05 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 12:27
Subject: Fw: DFDL and the truncated SAP File IDoc format
I've thought about this a bit more...
The already-existing rule about lengthKind=delimited versus
lengthKind=implicit only applies when the parser is about to parse the
content region of an element, and needs to decide whether to recurse into
its content. If the element's own lengthKind is 'delimited' then it does
not recurse. The rule that you are proposing goes further than that, and
requires that lengthKind=delimited is taken literally; the length of the
complex element truly is defined by the in-scope delimiters, including its
own terminator. I like that rule, actually - it gives real meaning to
lengthKind=delimited. The problem is defining the behaviour, because the
rule has implications for the parsing of the element's group. Before
parsing each member of the group (required or not, I think), the parser
must check for in-scope delimiters. This only needs to happen if the
immediate parent of the group is an element with lengthKind=delimited or
endOfParent. I'm sure there are edge cases around this ( what about
embedded groups ) so we should discuss this with Mike.
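A minimal sketch of the check described above, as a toy model only. It
assumes a record of fixed-length text fields with a line feed terminator;
the names parse_group, parse_record and RECORD_MEMBERS are hypothetical, and
embedded groups are not handled.

```python
LF = "\n"

# (name, length, required) for each member of the group; the layout is illustrative.
RECORD_MEMBERS = [
    ("A", 10, True),
    ("B", 10, True),
    ("C", 10, False),
    ("D", 10, False),
    ("E", 10, False),
]


def parse_group(data, pos, members, in_scope_terminators):
    """Parse fixed-length members, checking for in-scope terminators before each one."""
    fields = {}
    for name, length, required in members:
        if any(data.startswith(t, pos) for t in in_scope_terminators):
            if required:
                raise ValueError(f"terminator found where required element {name} was expected")
            continue  # optional member is absent
        if pos + length > len(data):
            raise ValueError(f"not enough data for element {name}")
        fields[name] = data[pos:pos + length]
        pos += length
    return fields, pos


def parse_record(data, pos=0):
    """Parse one record and expect its terminator immediately after the content."""
    fields, pos = parse_group(data, pos, RECORD_MEMBERS, [LF])
    if not data.startswith(LF, pos):
        raise ValueError("terminator not found immediately after the content")
    return fields, pos + len(LF)


fields, next_pos = parse_record("a" * 10 + "b" * 10 + "c" * 10 + LF + "next record")
```

In this sketch a fixed-length field whose content starts with a line feed
would be mistaken for the end of the record, which is exactly the claim the
modeller makes by choosing lengthKind 'delimited'.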
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
----- Forwarded by Tim Kimber/UK/IBM on 15/05/2012 12:12 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 11:19
Subject: Re: DFDL and the truncated SAP File IDoc format
When a modeller sets lengthKind to 'delimited' they are implicitly
claiming that the element's content region will not contain any of the
in-scope delimiters ( unless they are escaped ). That makes it safe for
the parser to look for *all* in-scope delimiters when scanning. When they
set lengthKind='explicit' they are not making any such claim.
Well...nearly. We already have a rule in DFDL that distinguishes between
a strict behaviour when lengthKind='implicit' and a lax-but-more-efficient
behaviour when lengthKind=delimited. I think that may be the justification
for your rule.
This has prompted me to think about how we discuss this delimited/implicit
distinction in the DFDL specification. I think it might be useful to cast
the discussion in terms of what is allowed in the content of the element.
If the parser might encounter the already-in-scope delimiters as part of
its content ( either within explicit-length fields or as the delimiters of
child elements/groups ) then lengthKind must be 'implicit'. If the parser
can safely assume that delimiters never occur within the element's
content, or that they are always escaped, then lengthKind='delimited' is
the better choice.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM
Date: 15/05/2012 09:20
Subject: DFDL and the truncated SAP File IDoc format
Hi Tim
Looking at Emma's format got me thinking about errata 3.3.
3.3. Section 12.3. Clarify that when property is lengthKind 'explicit',
'implicit' (simple only), 'prefixed' or 'pattern', it means that delimiter
scanning is turned off and in-scope delimiters are not looked for within
or between elements.
I am absolutely clear on why the parser would not want to look for
in-scope delimiters within such elements. I'm also happy not to look for
delimiters between elements if the element is required. But why shouldn't
the parser look between elements when the element is optional? Or at
least when the remaining content is all optional? There's an analogy here
with trailing separator suppression, that I don't think we spotted before.
Were we worried that users would be using unescaped characters because
the data is fixed length?
If my format was some required fixed length fields followed by some
optional fixed length fields, with an indicator for end of record, I would
like to be able to model it very simply, as follows.
<xs:element name="record" dfdl:lengthKind="delimited"
            dfdl:terminator="%LF;">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="A" type="xs:string" dfdl:lengthKind="explicit"
                  dfdl:length="10" />
      <xs:element name="B" type="xs:string" dfdl:lengthKind="explicit"
                  dfdl:length="10" />
      <xs:element name="C" type="xs:string" dfdl:lengthKind="explicit"
                  dfdl:length="10" minOccurs="0" />
      <xs:element name="D" type="xs:string" dfdl:lengthKind="explicit"
                  dfdl:length="10" minOccurs="0" />
      <xs:element name="E" type="xs:string" dfdl:lengthKind="explicit"
                  dfdl:length="10" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
If DFDL doesn't allow this it means I need either
dfdl:lengthKind="pattern" on the record element, or I need an assert on
each element checking the content is not line feed.
You can argue that using 'pattern' instead of 'delimited' is no big deal,
but using 'delimited' is a more natural fit and what a modeler would think
of first.
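To make the 'pattern' alternative concrete, here is a rough Python sketch of
what it amounts to: a regular expression finds the extent of the record, and
the fixed-length fields are then sliced out of it. The regular expression
and the names RECORD_PATTERN and parse_record_by_pattern are assumptions for
illustration, not taken from any DFDL processor.

```python
import re

RECORD_PATTERN = re.compile(r"[^\n]*")  # assumed pattern: everything up to the line feed
FIELD_NAMES = ["A", "B", "C", "D", "E"]
FIELD_LEN = 10
REQUIRED_FIELDS = 2  # A and B are required; C, D and E are optional


def parse_record_by_pattern(data: str, pos: int = 0):
    """Find the record content with a pattern, then slice it into fixed-length fields."""
    content = RECORD_PATTERN.match(data, pos).group(0)
    end = pos + len(content)
    if not data.startswith("\n", end):
        raise ValueError("record terminator not found")
    if len(content) % FIELD_LEN != 0:
        raise ValueError("record content is not a whole number of fields")
    count = len(content) // FIELD_LEN
    if count < REQUIRED_FIELDS or count > len(FIELD_NAMES):
        raise ValueError("unexpected number of fields")
    fields = {
        name: content[i * FIELD_LEN:(i + 1) * FIELD_LEN]
        for i, name in enumerate(FIELD_NAMES[:count])
    }
    return fields, end + 1


fields, next_pos = parse_record_by_pattern("0123456789abcdefghij\nXYZ")
```

The pattern does the job of the terminator scan, but the modeller has to
know to write it.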
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412