Agreed on the last call that dfdl:lengthKind 'delimited' will not be changed. Specifically, the parser will not attempt to look for in-scope delimiters between child elements whose lengthKind is not 'delimited'.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 04/07/2012 11:53 -----

From:        Steve Hanson/UK/IBM
To:        dfdl-wg@ogf.org
Date:        11/06/2012 11:51
Subject:        Fw: DFDL and the truncated SAP File IDoc format



For next DFDL WG call. Some thoughts on whether lengthKind 'delimited' should be able to model this without resorting to asserts. Read from bottom.

Regards

Steve Hanson

----- Forwarded by Steve Hanson/UK/IBM on 11/06/2012 11:50 -----

From:        Tim Kimber/UK/IBM
To:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc:        Steve Hanson/UK/IBM@IBMGB
Date:        30/05/2012 21:23
Subject:        Re: Fw: DFDL and the truncated SAP File IDoc format



Thanks Mike - useful input. I've added my comments in <tk> tags.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742  
Internal tel. 246742




From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:        Steve Hanson/UK/IBM@IBMGB
Cc:        Tim Kimber/UK/IBM@IBMGB
Date:        30/05/2012 19:59
Subject:        Re: Fw: DFDL and the truncated SAP File IDoc format




Hmmm.

We discussed at one time whether there are actually two different delimiting schemes. One is what we have now; let me call it "delimited1". In delimited1, an enclosing parent's delimiter cannot be used in isolation to find the extent of the data, because child elements might have escape schemes defined that escape even the parent delimiter, so you still have to use the recursive definition of the children when parsing. This is a very powerful mode of parsing. However, many things that might be errors (such as putting a binary field in the middle of a bunch of text fields) would be tolerated by this regime, because scanning would be turned on and off appropriately.

I struggle, however, with whether delimited1 is really the same thing as "implicit". I mean, if you define an element as 'implicit' but it has a terminator, then after you unwind from the recursion you are still going to look for the terminator, so it's not as if the delimiters are being ignored.
<tk>
I think of it this way. The lengthKind property is about the length of the *content* region. So 'delimited1' is, I think, the same as 'implicit' for the purposes of finding the length of the content region. If the complex element has a terminator then the terminator will be expected at the byte offset that immediately follows the end of the content region, whether lengthKind is 'delimited' or 'implicit'. In other words, I'm modifying your description of the behaviour to "after you unwind from the recursion you are still going to then look for the terminator at the byte offset immediately following the element's content".
</tk>
The other definition of delimited (let's call it delimited2) would be where you get to completely disregard the children when searching for the parent delimiter. Many things appearing within the children would be an SDE (schema definition error); binary format children, for example. Delimited2 would imply that the children are all representation="text", and the scan for the parent delimiter would be irrespective of any delimiters and escape schemes put in place by child elements. So, for example, the last child inside a delimited2 parent could have lengthKind "endOfData" just fine, because we can isolate the "box" of data first, and then parse the children within it, with the last child extending to the end of the "box".
<tk>
You mean 'endOfParent' but it doesn't change your point, which is valid.
My concern with your description is the implication that the parser needs to scan the same data multiple times. Maybe there are ways to analyse the model and avoid that necessity for many types of model, but that may be easier said than done.
My proposal was to respect the lengthKind of each child element within the parent delimited element, but to check for the terminator of the element, of its main group, and for any other enclosing terminating delimiters before continuing to parse any member of the group. I'm prepared to be convinced that this approach is shot full of logical inconsistencies, btw.
</tk>

...mike




On Wed, May 30, 2012 at 1:10 PM, Steve Hanson <smh@uk.ibm.com> wrote:
Hi Mike

Interested in your opinion on this one...it was prompted by looking at the best way to model a format where each record consisted of fixed length optional fields 1 to n followed by an EOR indicator, where missing trailing fields are suppressed.  Kind of analogous to suppressing trailing delimiters for empty fields.  


Regards

Steve Hanson

----- Forwarded by Steve Hanson/UK/IBM on 30/05/2012 18:05 -----


From:        Tim Kimber/UK/IBM
To:        Steve Hanson/UK/IBM@IBMGB
Date:        15/05/2012 12:27
Subject:        Fw: DFDL and the truncated SAP File IDoc format



I've thought about this a bit more...


The existing rule about lengthKind=delimited versus lengthKind=implicit only applies when the parser is about to parse the content region of an element and needs to decide whether to recurse into its content. If the element's own lengthKind is 'delimited' then it does not recurse. The rule that you are proposing goes further than that, and requires that lengthKind=delimited is taken literally: the length of the complex element truly is defined by the in-scope delimiters, including its own terminator. I like that rule, actually - it gives real meaning to lengthKind=delimited. The problem is defining the behaviour, because the rule has implications for the parsing of the element's group. Before parsing each member of the group (required or not, I think), the parser must check for in-scope delimiters. This only needs to happen if the immediate parent of the group is an element with lengthKind=delimited or endOfParent. I'm sure there are edge cases around this (what about embedded groups?) so we should discuss this with Mike.


regards,

Tim Kimber


----- Forwarded by Tim Kimber/UK/IBM on 15/05/2012 12:12 -----


From:        Tim Kimber/UK/IBM
To:        Steve Hanson/UK/IBM@IBMGB
Date:        15/05/2012 11:19
Subject:        Re: DFDL and the truncated SAP File IDoc format



When a modeller sets lengthKind to 'delimited' they are implicitly claiming that the element's content region will not contain any of the in-scope delimiters (unless they are escaped). That makes it safe for the parser to look for *all* in-scope delimiters when scanning. When they set lengthKind='explicit' they are not making any such claim. Well...nearly. We already have a rule in DFDL that distinguishes between a strict behaviour when lengthKind='implicit' and a lax-but-more-efficient behaviour when lengthKind='delimited'. I think that may be the justification for your rule.


This has prompted me to think about how we discuss this delimited/implicit distinction in the DFDL specification. I think it might be useful to cast the discussion in terms of what is allowed in the content of the element. If the parser might encounter the already-in-scope delimiters as part of the element's content (either within explicit-length fields or as the delimiters of child elements/groups) then lengthKind must be 'implicit'. If the parser can safely assume that delimiters never occur within the element's content, or that they are always escaped, then lengthKind='delimited' is the better choice.


regards,

Tim Kimber





From:        Steve Hanson/UK/IBM
To:        Tim Kimber/UK/IBM
Date:        15/05/2012 09:20
Subject:        DFDL and the truncated SAP File IDoc format



Hi Tim


Looking at Emma's format got me thinking about errata 3.3.


3.3. Section 12.3. Clarify that when property is lengthKind 'explicit', 'implicit' (simple only), 'prefixed' or 'pattern', it means that delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

I am absolutely clear on why the parser would not want to look for in-scope delimiters within such elements. I'm also happy not to look for delimiters between elements if the element is required. But why shouldn't the parser look between elements when the element is optional? Or at least when the remaining content is all optional? There's an analogy here with trailing separator suppression that I don't think we spotted before. Were we worried that users would be using unescaped characters because the data is fixed length?


If my format was some required fixed length fields followed by some optional fixed length fields, with an indicator for end of record, I would like to be able to model it very simply, as follows.  


<xs:element name="record" dfdl:lengthKind="delimited" dfdl:terminator="%LF;">
  <xs:complexType>
    <xs:sequence>
      <xs:element name="A" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="B" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="C" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="D" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="E" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>


If DFDL doesn't allow this it means I need either dfdl:lengthKind="pattern" on the record element, or I need an assert on each element checking the content is not line feed.

You can argue that using 'pattern' instead of 'delimited' is no big deal, but using 'delimited' is a more natural fit and what a modeler would think of first.
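
For comparison, the 'pattern' alternative mentioned above might look roughly like the sketch below. It is illustrative only: the dfdl:lengthPattern value [^\n]* (everything up to, but not including, the line feed) is an assumption and not taken from the real format, and text representation and encoding are assumed to come from an in-scope dfdl:format. The intent is that the pattern bounds the record's content first, so an optional field that would run past that bound fails to parse and is treated as absent.

```xml
<!-- Sketch only: the lengthPattern value is an assumption, not part of the original format. -->
<xs:element name="record" dfdl:lengthKind="pattern" dfdl:lengthPattern="[^\n]*" dfdl:terminator="%LF;">
  <xs:complexType>
    <xs:sequence>
      <!-- Required fixed-length fields -->
      <xs:element name="A" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="B" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <!-- Optional trailing fixed-length fields, omitted when the record is truncated -->
      <xs:element name="C" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="D" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="E" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
```

The only differences from the 'delimited' version are the lengthKind and the added lengthPattern on the record element; the child declarations are unchanged.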


Regards

Steve Hanson

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU




--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412

