Agreed on the last call that the behaviour of dfdl:lengthKind
'delimited' will not be changed. Specifically, the parser will not attempt
to look for in-scope delimiters between child elements whose lengthKind is
not 'delimited'.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 04/07/2012 11:53 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg@ogf.org
Date: 11/06/2012 11:51
Subject: Fw: DFDL and the truncated SAP File IDoc format
For next DFDL WG call. Some thoughts
on whether lengthKind 'delimited' should be able to model this without
resorting to asserts. Read from bottom.
Regards
Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 11/06/2012 11:50 -----
From: Tim Kimber/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/05/2012 21:23
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format
Thanks Mike - useful input. I've added
my comments in <tk> tags
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 30/05/2012 19:59
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format
Hmmm.
We discussed at one time whether there are actually 2 different delimiting
schemes. One is what we have now. Let me call this "delimited1".
In delimited1, an enclosing parent's delimiter cannot be used in isolation
to find the extent of the data, because child elements might have escape
schemes defined which escape even the parent delimiter, so you still have
to use the recursive definition of the children when parsing. This
is a very powerful mode of parsing. However, many things that might be
errors (such as putting a binary field in the middle of a bunch of text
fields) would be tolerated by this regime, because scanning would be
turned on and off appropriately.
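A minimal Python sketch of why the parent's delimiter cannot be located in isolation under delimited1. It assumes a single-character escape scheme in which the escape protects only the one character that follows it; the function name and the backslash escape are illustrative assumptions, and real DFDL escape schemes are richer than this.

```python
def find_unescaped(data: str, delimiter: str, escape: str = "\\") -> int:
    """Return the index of the first unescaped delimiter, or -1 if there is none."""
    i = 0
    while i < len(data):
        if data.startswith(escape, i):
            i += len(escape) + 1   # skip the escape and the character it protects
            continue
        if data.startswith(delimiter, i):
            return i
        i += 1
    return -1


# The child field "ab;cd" escapes the parent's ';' delimiter.
record = "ab\\;cd;ef"
naive = record.find(";")             # ignores the child's escape scheme
aware = find_unescaped(record, ";")  # honours the child's escape scheme
```

The naive scan stops inside the child's content, while the escape-aware scan finds the real end of the parent, which is why the children's definitions still have to be consulted.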
I struggle, however, with whether delimited1 is really the same thing as
"implicit". I mean if you define an element as 'implicit' but
it has a terminator, then after you unwind from the recursion you are still
going to then look for the terminator, so it's not like the delimiters
are being ignored.
<tk>
I think of it this way. The
lengthKind property is about the length of the *content* region. So 'delimited1'
is, I think, the same as 'implicit' for the purposes of finding the length
of the content region. If the complex element has a terminator then the
terminator will be expected at the byte offset that immediately follows
the end of the content region - whether lengthKind is 'delimited' or 'implicit'.
In other words, I'm modifying your description of the behaviour to "after
you unwind from the recursion you are still going to then look for the
terminator at the byte offset immediately following the element's content"
</tk>
The other definition of delimited (let's call it delimited2) would be
where you get to completely disregard the children when searching for the
parent delimiter. Many things appearing within the children would be a
schema definition error (SDE). E.g., binary format children would be an
SDE. Delimited2 would imply
that the children are all representation="text", and the scan
for the parent delimiter would be irrespective of any delimiters and escape
schemes being put in place by child elements. So for example, the last
child inside a delimited2 parent could have length kind = "endOfData"
just fine, because we can isolate the "box" of data first, and
then parse the children within it, with the last child extending to the
end of the "box".
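The "box first, children second" idea can be sketched in Python as follows. This is only an illustration under simplifying assumptions: all children are text, every child except the last has a fixed width, and no escape schemes are in play. The function and parameter names are hypothetical.

```python
def parse_delimited2(data, terminator, leading_widths):
    """Isolate the parent's box first, then parse the children inside it.

    leading_widths gives the fixed widths of every child except the last.
    Returns (child values, offset just past the parent's terminator).
    """
    end = data.find(terminator)   # the scan ignores the children entirely
    if end < 0:
        raise ValueError("parent terminator not found")
    box = data[:end]

    values, pos = [], 0
    for width in leading_widths:
        values.append(box[pos:pos + width])
        pos += width
    values.append(box[pos:])      # the last child extends to the end of the box
    return values, end + len(terminator)


values, next_offset = parse_delimited2("AAAABBBBrest of line\nnext", "\n", [4, 4])
```

Note that the data inside the box is visited twice here: once to find the terminator and once to parse the children.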
<tk>
You mean 'endOfParent' but it doesn't
change your point, which is valid.
My concern with your description is
the implication that the parser needs to scan the same data multiple times.
Maybe there are ways to analyse the model and avoid that necessity for
many types of model, but that may be easier said than done.
My proposal was to respect the lengthKind
of each child element within the parent delimited element, but to check
for the terminator of the element, of its main group, and for any other
enclosing terminating delimiters before continuing to parse any member
of the group. I'm prepared to be convinced that this approach is shot full
of logical inconsistencies, btw.
</tk>
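A rough Python sketch of the proposal in the <tk> comment above: each child keeps its own lengthKind, but the enclosing terminators are checked before each member of the group is parsed. It models only explicit-length children, and the function name and data layout are assumptions made for illustration, not anything defined by DFDL.

```python
def parse_group(data, pos, children, enclosing_terminators):
    """Parse fixed-length children, stopping early at an enclosing terminator.

    children is a list of (name, length) pairs.
    enclosing_terminators holds the terminators of the element, its group,
    and any other enclosing constructs.
    Returns (values by name, offset where parsing stopped).
    """
    values = {}
    for name, length in children:
        # Check the in-scope terminators before parsing the next member.
        if any(data.startswith(t, pos) for t in enclosing_terminators):
            break                 # the remaining members are treated as absent
        if pos + length > len(data):
            raise ValueError(f"not enough data for element {name}")
        values[name] = data[pos:pos + length]
        pos += length
    return values, pos


values, pos = parse_group(
    "1111222233\nXX", 0,
    [("A", 4), ("B", 4), ("C", 2), ("D", 2)],
    ["\n"],
)
```

In this sketch the data is read once, left to right, so no region is scanned twice. The cost is that a child whose content legitimately begins with a terminator character would be cut short.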
...mike
On Wed, May 30, 2012 at 1:10 PM, Steve Hanson <smh@uk.ibm.com> wrote:
Hi Mike
Interested in your opinion on this one...it was prompted by looking at
the best way to model a format where each record consisted of fixed length
optional fields 1 to n followed by an EOR indicator, where missing trailing
fields are suppressed. Kind of analogous to suppressing trailing
delimiters for empty fields.
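To make the record layout concrete, here is a small Python sketch that writes such a record. The field width of 10 and the line feed as EOR indicator are assumptions chosen for illustration, as is the rule that only trailing fields may be omitted (a missing field in the middle would shift the positions of the fields after it).

```python
def write_record(fields, width=10, eor="\n"):
    """Serialise fixed-length fields, suppressing missing trailing fields.

    fields is an ordered list of strings; None marks a missing field.
    """
    present = list(fields)
    while present and present[-1] is None:
        present.pop()             # drop missing trailing fields
    if any(f is None for f in present):
        raise ValueError("only trailing fields may be missing")
    if any(len(f) > width for f in present):
        raise ValueError("field value exceeds the fixed width")
    return "".join(f.ljust(width) for f in present) + eor


short_record = write_record(["a", "b", None, None], width=3)
full_record = write_record(["a", "b", "c", "d"], width=3)
```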
Regards
Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 30/05/2012 18:05 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 12:27
Subject: Fw: DFDL and the truncated SAP File IDoc format
I've thought about this a bit more...
The already-existing rule about lengthKind=delimited versus lengthKind=implicit
only applies when the parser is about to parse the content region of an
element, and needs to decide whether to recurse into its content. If the
element's own lengthKind is 'delimited' then it does not recurse.
The rule that you are proposing goes further than that, and requires that
lengthKind=delimited is taken literally; the length of the complex element
truly is defined by the in-scope delimiters, including its own terminator.
I like that rule, actually - it gives real meaning to lengthKind=delimited.
The problem is defining the behaviour, because the rule has implications
for the parsing of the element's group. Before parsing each member of the
group (required or not, I think), the parser must check for in-scope
delimiters. This only needs to happen if the immediate parent of the group
is an element with lengthKind='delimited' or 'endOfParent'. I'm sure there
are edge cases around this (what about embedded groups?) so we should
discuss this with Mike.
regards,
Tim Kimber
----- Forwarded by Tim Kimber/UK/IBM on 15/05/2012 12:12 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 11:19
Subject: Re: DFDL and the truncated SAP File IDoc format
When a modeller sets lengthKind to 'delimited' they are implicitly
claiming that the element's content region will not contain any of the
in-scope delimiters (unless they are escaped). That makes it safe for
the parser to look for *all* in-scope delimiters when scanning. When they
set lengthKind='explicit' they are not making any such claim. Well...nearly.
We already have a rule in DFDL that distinguishes between a strict
behaviour when lengthKind='implicit' and a lax-but-more-efficient behaviour
when lengthKind='delimited'. I think that may be the justification for your
rule.
This has prompted me to think about how we discuss this delimited/implicit
distinction in the DFDL specification. I think it might be useful to cast
the discussion in terms of what is allowed in the content of the element.
If the parser might encounter the already-in-scope delimiters as part of
its content (either within explicit-length fields or as the delimiters
of child elements/groups) then lengthKind must be 'implicit'. If
the parser can safely assume that delimiters never occur within the element's
content, or that they are always escaped, then lengthKind='delimited' is
the better choice.
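The hazard behind this distinction can be shown with a few lines of Python. The data and the comma separator are invented for illustration: a 4-byte explicit-length field happens to contain the separator byte, followed by a delimited text field.

```python
SEPARATOR = b","

# A 4-byte explicit-length field, a separator, then a delimited text field.
data = b"\x00,\x01\x02,tail"

# Scanning for the separator everywhere assumes it never occurs in content.
scan_split = data.split(SEPARATOR)

# Honouring the first child's explicit length, then expecting the separator.
fixed, rest = data[:4], data[4:]
if not rest.startswith(SEPARATOR):
    raise ValueError("separator not found after the fixed-length field")
text = rest[len(SEPARATOR):]
```

The scan yields three pieces because it splits inside the fixed-length field, whereas the length-driven parse recovers the two fields that were actually written.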
regards,
Tim Kimber
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM
Date: 15/05/2012 09:20
Subject: DFDL and the truncated SAP File IDoc format
Hi Tim
Looking at Emma's format got me thinking about errata 3.3.
    3.3. Section 12.3. Clarify that when property is lengthKind 'explicit',
    'implicit' (simple only), 'prefixed' or 'pattern', it means that
    delimiter scanning is turned off and in-scope delimiters are not looked
    for within or between elements.
I am absolutely clear on why the parser would not want to look for in-scope
delimiters within such elements. I'm also happy not to look for delimiters
between elements if the element is required. But why shouldn't the parser
look between elements when the element is optional? Or at least when
the remaining content is all optional? There's an analogy here with
trailing separator suppression, that I don't think we spotted before. Were
we worried that users would be using unescaped characters because the data
is fixed length?
If my format was some required fixed length fields followed by some optional
fixed length fields, with an indicator for end of record, I would like
to be able to model it very simply, as follows.
<xs:element name="record" dfdl:lengthKind="delimited" dfdl:terminator="%LF;">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="A" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="B" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="C" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="D" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="E" type="xs:string"
                  dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
If DFDL doesn't allow this it means I need either dfdl:lengthKind="pattern"
on the record element, or I need an assert on each element checking the
content is not line feed.
You can argue that using 'pattern' instead of 'delimited' is no big deal,
but using 'delimited' is a more natural fit and what a modeler would think
of first.
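For comparison, the effect of the 'pattern' workaround can be sketched in Python with a regular expression. The pattern `[^\n]*` is an assumption about what the length pattern for the record might look like; the element names, the width of 10 and the two required fields are taken from the schema above. This is an illustration of the idea, not of any DFDL processor's implementation.

```python
import re

CONTENT = re.compile(r"[^\n]*")     # assumed pattern: everything up to the line feed
NAMES = ["A", "B", "C", "D", "E"]   # element names from the schema above
REQUIRED = 2                        # A and B do not have minOccurs="0"
WIDTH = 10


def parse_record(data, pos=0):
    """Return (fields by name, offset just past the record's terminator)."""
    content = CONTENT.match(data, pos).group()   # the pattern can match empty
    count, remainder = divmod(len(content), WIDTH)
    if remainder or not REQUIRED <= count <= len(NAMES):
        raise ValueError("content is not 2 to 5 fields of 10 characters")
    end = pos + len(content)
    if not data.startswith("\n", end):
        raise ValueError("terminator not found")
    fields = {NAMES[i]: content[i * WIDTH:(i + 1) * WIDTH] for i in range(count)}
    return fields, end + 1


data = "A" * 10 + "B" * 10 + "C" * 10 + "\n" + "next record"
fields, next_offset = parse_record(data)
```

Here the pattern fixes the extent of the record first, and the number of optional fields present follows from the content length.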
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412