June 2012 - dfdl-wg - lists.ogf.org

Fw: Action 173: DFDL String Literals Analysis for review
by Steve Hanson 18 Jun '12

I have reviewed and updated Mike's spreadsheet at the link below.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 18/06/2012 15:59 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg(a)ogf.org
Date: 23/05/2012 18:11
Subject: Action 173: DFDL String Literals Analysis for review

----- Forwarded by Steve Hanson/UK/IBM on 23/05/2012 18:10 -----
From: "Mike Beckerle (Google Docs)" <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 21/04/2012 17:40
Subject: String Literals Analysis (smh(a)uk.ibm.com)

I've shared String Literals Analysis

Message from mbeckerle.dfdl(a)gmail.com:
Online editable spreadsheet of all the various string-literal situations. This is better than that UML mixin diagram.

Click to open: String Literals Analysis

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

DFDL spec & error codes (was Fw: TDML question)
by Steve Hanson 18 Jun '12

Added to WG agenda.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 18/06/2012 09:38 -----
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM@IBMGB
Cc: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Date: 14/06/2012 09:19
Subject: Re: TDML question

We should discuss this at the WG when we have got our current backlog and action 140 out of the way. It's all part of the deferred conformance suite action 166.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Tim Kimber/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 13/06/2012 21:42
Subject: Re: TDML question

Hi Mike,

Fair question. IBM requires all diagnostic messages from its software products to be identified by a unique error code. In the IBM test driver program, it is this code that we check for.

The XML Schema specification actually assigns unique strings to the various types of error that can occur (e.g. cvc-*). If the DFDL specification did the same then the TDML format would be able to specify that the content of the <errors> tag is a list of defined error codes. For DFDL v1.0, I think implementers are free to use it in a way that fits their own requirements.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB, Tim Kimber/UK/IBM@IBMGB
Date: 13/06/2012 20:45
Subject: TDML question

I'm enhancing the daffodil TDML runner. I want to keep it compatible with the IBM one so that we can use TDML files as an interchange medium for discussing bugs/semantics/etc.

I added the error feature where one can put expected errors into the TDML file.
I have a question about the <error>...</error> element: how are the string contents of these error elements used? I tentatively just have it search the error messages created by the parse for these error strings. If any actual error message contains the error string, then that error "passes".

Here's my example, which parses a 2-character integer. It will fail because the text is AA and the number is base 10.

<ts:testSuite xmlns:ts={ tdml } suiteName="theSuiteName">
  <ts:parserTestCase ID="some identifier" name="firstUnitTest" root="data">
    <ts:document>AA</ts:document>
    <ts:errors>
      <ts:error>convert</ts:error>
      <ts:error>xs:int</ts:error>
    </ts:errors>
  </ts:parserTestCase>
</ts:testSuite>

So my test passes so long as the words "convert" and "xs:int" are found in the error message that is generated.

Is this consistent with your TDML file usage?

...mikeb

--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
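
Mike's tentative matching rule can be sketched in a few lines. This is only an illustration of the substring check described in his message, not the Daffodil or IBM TDML runner code: the function names and the sample diagnostic text are hypothetical, and case-sensitive matching is an assumption.

```python
def missing_expected_errors(expected, actual_messages):
    """Return the expected error strings that occur in none of the actual messages."""
    return [e for e in expected if not any(e in msg for msg in actual_messages)]


def negative_test_passes(expected, actual_messages):
    # Assumption: a negative test must produce at least one diagnostic, and every
    # expected string must appear as a substring of some diagnostic.
    return bool(actual_messages) and not missing_expected_errors(expected, actual_messages)


if __name__ == "__main__":
    # Hypothetical diagnostic text for the "AA" example.
    messages = ["Parse error: unable to convert 'AA' to xs:int"]
    print(negative_test_passes(["convert", "xs:int"], messages))
    print(missing_expected_errors(["convert", "xs:long"], messages))
```

Under the error-code approach Tim describes, the expected list would presumably hold defined codes and the comparison would be an exact match on the code rather than a substring search of the message text.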

OGF DFDL WG Call Agenda 2012-06-18/19
by Steve Hanson 18 Jun '12

Please find agenda for the above call on GridForge at https://forge.ogf.org/sf/docman/do/downloadDocument/projects.dfdl-wg/docman…

We have quite a few items building up, so there are two calls this week, with a single agenda as per link.

Mon 18th @ 16:00 UK (extra call)
Tues 19th @ 15:00 UK (regular time)

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

OGF DFDL WG Call Minutes 2012-06-12
by Steve Hanson 12 Jun '12

Please find minutes from the above call on GridForge at https://forge.ogf.org/sf/docman/do/downloadDocument/projects.dfdl-wg/docman…

Extra call next Monday to progress agenda backlog.

Regards
Steve Hanson
Architect, DFDL, IBM SWG
Co-Chair, OGF DFDL Working Group
Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

Invitation: OGF DFDL Working Group weekly extra call (18 Jun 16:00 GDT in IBM Hursley DE3S09)
by Steve Hanson 12 Jun '12

OGF DFDL WG weekly call dial-in details.

Passcode for Participants: 5381214

Canada Toll-Free 888-426-6840
China Toll-Free 10-800-711-1071 CHINA NETCOM GROUP USERS
China Toll-Free 10-800-110-0996 CHINA TELECOM SOUTH USERS
France Toll-Free 0800-94-0558
Germany Toll-Free 0800-000-1018
India Toll-Free 000-800-100-1176
Ireland Toll-Free 1-800-943-427
Israel Toll-Free 1-809-417-783
United Kingdom Caller Paid 0-20-30596451
United Kingdom Toll-Free 0800-368-0638
USA Caller Paid 215-861-6239
USA Toll-Free 888-426-6840

Other international numbers available - e-mail smh(a)uk.ibm.com.

OGF DFDL Home: http://www.ogf.org/dfdl
GridForge DFDL: http://forge.ogf.org/projects/dfdl-wg/

Re: [DFDL-WG] Action 174: Making DFDL implementations easier
by Steve Hanson 12 Jun '12

Interesting, but there's a problem - variables are currently an optional feature! You could swap variables for paths, which is what I think Mike proposed in an earlier mail?

exprSubset = / | / exprList
atom = . | .. | identifier
exprList = atom | atom / exprList

That's essentially equivalent to what MRM provides for its 'repeat reference' and 'length reference' properties. It doesn't handle things like an array count being 0 based, where you need the ability to add 1 to the result. And you can't do a simple comparison, which would make it easy to add choices and discriminators to an implementation. The trouble is that once you add in operators and literals you've pulled in a lot of the language.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Tim Kimber/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: dfdl-wg(a)ogf.org, dfdl-wg-bounces(a)ogf.org, Steve Hanson/UK/IBM@IBMGB
Date: 25/05/2012 14:29
Subject: Re: [DFDL-WG] Action 174: Making DFDL implementations easier

My current thinking is:
- Minimum expression language is a simple variable reference, e.g. {$myLength} or {$myParentStructureLength}.
- That would require the ability to declare DFDL variables, set them, read them and put them into and out of scope. And restore their value after backtracking if backtracking is supported in the implementation.

This would make all the XPath functions optional, and would also make it unnecessary to use an XPath query engine.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB, dfdl-wg(a)ogf.org, dfdl-wg-bounces(a)ogf.org
Date: 25/05/2012 13:58
Subject: Re: [DFDL-WG] Action 174: Making DFDL implementations easier

But the expression language is highly restricted in the subset.

On Fri, May 25, 2012 at 7:59 AM, Steve Hanson <smh(a)uk.ibm.com> wrote:

Tim

I tend to agree with endOfParent as optional.

Expression language was kept in the core to handle occursCountKind 'expression' which is also in the core. The examples of binary data we have seen from NSA and ESA both have occurs counts in the data, in the same way as COBOL does. When you have untagged binary data, it's typically the way to provide an array size. It was felt that dropping this from the core would have left us with too little capability.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg(a)ogf.org, dfdl-wg-bounces(a)ogf.org
Date: 25/05/2012 10:07
Subject: Re: [DFDL-WG] Action 174: Making DFDL implementations easier

1) I would make endOfParent an optional feature - there are not many formats that require it.
3) There are many formats that do not require the expression language - it is only required when a property value or an assert/discriminator needs to query already-parsed data. On that basis, I think the entire expression language feature should be optional.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Steve Hanson/UK/IBM@IBMGB
To: dfdl-wg(a)ogf.org
Date: 25/05/2012 00:08
Subject: [DFDL-WG] Action 174: Making DFDL implementations easier
Sent by: dfdl-wg-bounces(a)ogf.org

Agreed on list, just need to answer questions 1) and 3) below.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 23/05/2012 18:21 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg(a)ogf.org
Date: 15/05/2012 09:55
Subject: Making DFDL implementations easier

Please see below for a proposal to make an additional set of DFDL features optional. The goal is to make it considerably easier to create a minimal conforming DFDL processor for binary data.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 11/05/2012 12:50 -----

Feature                                             Detection
Text representation for types other than String     dfdl:representation="text" for Number, Calendar or Boolean types
Delimiters                                          dfdl:separator <> "" or dfdl:initiator <> "" or dfdl:terminator <> "" or dfdl:lengthKind="delimited"
BCD calendars                                       dfdl:binaryCalendarRep="bcd"
Multiple schemas                                    xs:include or xs:import in xsd
Named Formats                                       dfdl:defineFormat or dfdl:ref
Choices                                             xs:choice in xsd **
Arrays where size not known in advance              dfdl:occursCountKind 'implicit', 'parsed', 'stopValue' **
Advanced expressions                                Advanced features of the DFDL expression language (tbd)

** Including one of these features means that speculative parsing is needed.

Remaining questions:
1) What about lengthKind 'endOfParent' ?
2) Is leaving out choices too restrictive?
3) Expression language subset

The result is that a minimal conformant DFDL implementation just needs to support the following annotations and properties, and does not need speculative parsing.

dfdl:element
dfdl:sequence
dfdl:format

byteOrder
encoding
utf16width
alignment
alignmentUnits (bytes)
fillByte
leadingSkip
trailingSkip
lengthKind (explicit, implicit)
length
lengthUnits (bytes, characters)
representation (binary)
textPadKind
textTrimKind
textStringJustification
textStringPadCharacter
truncateSpecifiedLengthString
decimalSigned
binaryNumberRep
binaryVirtualDecimalPoint
binaryFloatRep (ieee)
binaryBooleanTrueRep
binaryBooleanFalseRep
binaryCalendarRep (binarySeconds, binaryMilliseconds)
binaryCalendarEpoch
sequenceKind (ordered)
occursCountKind (fixed, expression)
occursCount

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

--
dfdl-wg mailing list
dfdl-wg(a)ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
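
The path-only subset quoted in Steve's reply at the top of this thread (exprSubset, atom, exprList) is small enough to recognise without an XPath engine. The following is a minimal sketch of a recogniser for that grammar as written, not part of any DFDL implementation: the function names are hypothetical, and the identifier pattern is an assumption because the thread does not define `identifier`.

```python
import re

# Assumption: an identifier is a simple unprefixed name; the thread does not say.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_-]*")


def is_atom(token):
    # atom = . | .. | identifier
    return token in (".", "..") or _IDENTIFIER.fullmatch(token) is not None


def in_expr_subset(expr):
    # exprSubset = / | / exprList
    if not expr.startswith("/"):
        return False
    rest = expr[1:]
    if rest == "":
        return True
    # exprList = atom | atom / exprList, i.e. one or more atoms separated by "/"
    return all(is_atom(token) for token in rest.split("/"))


if __name__ == "__main__":
    for candidate in ["/", "/record/count", "/record/../count",
                      "/record/count + 1", "$myLength"]:
        print(candidate, in_expr_subset(candidate))
```

The last two candidates are rejected, which illustrates the two limitations raised in the thread: the subset cannot add 1 to a 0-based count, and it has no variable references.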

OGF DFDL WG Call Agenda 2012-06-12
by Steve Hanson 11 Jun '12

Please find agenda for the above call on GridForge at https://forge.ogf.org/sf/docman/do/downloadDocument/projects.dfdl-wg/docman…

We have quite a few items building up, and could do with a second call this week.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

Fw: DFDL and the truncated SAP File IDoc format
by Steve Hanson 11 Jun '12

For next DFDL WG call. Some thoughts on whether lengthKind 'delimited' should be able to model this without resorting to asserts. Read from bottom.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 11/06/2012 11:50 -----
From: Tim Kimber/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/05/2012 21:23
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format

Thanks Mike - useful input. I've added my comments in <tk> tags

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 30/05/2012 19:59
Subject: Re: Fw: DFDL and the truncated SAP File IDoc format

Hmmm. We discussed at one time whether there are actually 2 different delimiting schemes.

One is what we have now. Let me call this "delimited1". In delimited1, an enclosing parent's delimiter cannot be used in isolation to find the extent of the data, because child elements might have escape schemes defined which escape even the parent delimiter, so you still have to use the recursive definition of the children when parsing. This is a very powerful mode of parsing. However, many things that might be errors (putting a binary field in the middle of a bunch of text fields), would be tolerated by this regime, because scanning would be turned on/off appropriately.

I struggle, however, with whether delimited1 is really the same thing as "implicit". I mean if you define an element as 'implicit' but it has a terminator, then after you unwind from the recursion you are still going to then look for the terminator, so it's not like the delimiters are being ignored.

<tk> I think of it this way.
The lengthKind property is about the length of the *content* region. So 'delimited1' is, I think, the same as 'implicit' for the purposes of finding the length of the content region. If the complex element has a terminator then the terminator will be expected at the byte offset that immediately follows the end of the content region - whether lengthKind is 'delimited' or 'implicit'. In other words, I'm modifying your description of the behaviour to "after you unwind from the recursion you are still going to then look for the terminator at the byte offset immediately following the element's content" </tk>

The other definition of delimited (let's call it delimited2), would be where you get to completely disregard the children when searching for the parent delimiter. Many things appearing within the children would be SDE. E.g., binary format children would be an SDE, etc. Delimited2 would imply that the children are all representation="text", and the scan for the parent delimiter would be irrespective of any delimiters and escape schemes being put in place by child elements.

So for example, the last child inside a delimited2 parent could have length kind = "endOfData" just fine, because we can isolate the "box" of data first, and then parse the children within it, with the last child extending to the end of the "box".

<tk> You mean 'endOfParent' but it doesn't change your point, which is valid. My concern with your description is the implication that the parser needs to scan the same data multiple times. Maybe there are ways to analyse the model and avoid that necessity for many types of model, but that may be easier said than done. My proposal was to respect the lengthKind of each child element within the parent delimited element, but to check for the terminator of the element, of its main group, and for any other enclosing terminating delimiters before continuing to parse any member of the group.
I'm prepared to be convinced that this approach is shot full of logical inconsistencies, btw. </tk>

...mike

On Wed, May 30, 2012 at 1:10 PM, Steve Hanson <smh(a)uk.ibm.com> wrote:

Hi Mike

Interested in your opinion on this one...it was prompted by looking at the best way to model a format where each record consisted of fixed length optional fields 1 to n followed by an EOR indicator, where missing trailing fields are suppressed. Kind of analogous to suppressing trailing delimiters for empty fields.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 30/05/2012 18:05 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 12:27
Subject: Fw: DFDL and the truncated SAP File IDoc format

I've thought about this a bit more...

The already-existing rule about lengthKind=delimited versus lengthKind=implicit only applies when the parser is about to parse the content region of an element, and needs to decide whether to recurse into its content. If the element's own lengthKind is 'delimited' then it does not recurse.

The rule that you are proposing goes further than that, and requires that lengthKind=delimited is taken literally; the length of the complex element truly is defined by the in-scope delimiters, including its own terminator. I like that rule, actually - it gives real meaning to lengthKind=delimited. The problem is defining the behaviour, because the rule has implications for the parsing of the element's group. Before parsing each member of the group (required or not, I think), the parser must check for in-scope delimiters. This only needs to happen if the immediate parent of the group is an element with lengthKind=delimited or endOfParent.

I'm sure there are edge cases around this (what about embedded groups) so we should discuss this with Mike.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

----- Forwarded by Tim Kimber/UK/IBM on 15/05/2012 12:12 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/05/2012 11:19
Subject: Re: DFDL and the truncated SAP File IDoc format

When a modeller sets lengthKind to 'delimited' they are implicitly claiming that the element's content region will not contain any of the in-scope delimiters (unless they are escaped). That makes it safe for the parser to look for *all* in-scope delimiters when scanning. When they set lengthKind='explicit' they are not making any such claim.

Well...nearly. We already have a rule in DFDL that distinguishes between a strict behaviour when lengthKind='implicit' and a lax-but-more-efficient behaviour when lengthKind=delimited. I think that may be the justification for your rule.

This has prompted me to think about how we discuss this delimited/implicit distinction in the DFDL specification. I think it might be useful to cast the discussion in terms of what is allowed in the content of the element. If the parser might encounter the already-in-scope delimiters as part of its content (either within explicit-length fields or as the delimiters of child elements/groups) then lengthKind must be 'implicit'. If the parser can safely assume that delimiters never occur within the element's content, or that they are always escaped, then lengthKind='delimited' is the better choice.

regards,
Tim Kimber, Common Transformation Team, Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 246742

From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM
Date: 15/05/2012 09:20
Subject: DFDL and the truncated SAP File IDoc format

Hi Tim

Looking at Emma's format got me thinking about errata 3.3.

3.3. Section 12.3.
Clarify that when property is lengthKind 'explicit', 'implicit' (simple only), 'prefixed' or 'pattern', it means that delimiter scanning is turned off and in-scope delimiters are not looked for within or between elements.

I am absolutely clear on why the parser would not want to look for in-scope delimiters within such elements. I'm also happy not to look for delimiters between elements if the element is required. But why shouldn't the parser look between elements when the element is optional? Or at least when the remaining content is all optional? There's an analogy here with trailing separator suppression, that I don't think we spotted before. Were we worried that users would be using unescaped characters because the data is fixed length?

If my format was some required fixed length fields followed by some optional fixed length fields, with an indicator for end of record, I would like to be able to model it very simply, as follows.

<xs:element name="record" dfdl:lengthKind="delimited" dfdl:terminator="%LF;" >
  <xs:complexType>
    <xs:sequence>
      <xs:element name="A" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="B" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" />
      <xs:element name="C" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="D" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
      <xs:element name="E" type="xs:string" dfdl:lengthKind="explicit" dfdl:length="10" minOccurs="0" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

If DFDL doesn't allow this it means I need either dfdl:lengthKind="pattern" on the record element, or I need an assert on each element checking the content is not line feed. You can argue that using 'pattern' instead of 'delimited' is no big deal, but using 'delimited' is a more natural fit and what a modeler would think of first.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
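
The behaviour Steve asks for in the oldest message (explicit-length children inside a 'delimited' record, with the record terminator looked for before each optional child) can be sketched as follows. This is a toy illustration, not how any DFDL processor is implemented: the field table mirrors the A to E elements of his schema, the name `parse_record` is hypothetical, and checking only before optional fields (rather than before every member of the group, as Tim suggests) is an assumption.

```python
# (name, length in characters, required) for the five fields in the schema.
FIELDS = [("A", 10, True), ("B", 10, True),
          ("C", 10, False), ("D", 10, False), ("E", 10, False)]
TERMINATOR = "\n"  # the %LF; terminator of the record element


def parse_record(data, pos=0):
    """Parse one record starting at pos; return (fields, offset after terminator)."""
    record = {}
    for name, length, required in FIELDS:
        # Look for the in-scope terminator only between elements, and only
        # when the next element is optional. Never scan inside a field.
        if not required and data.startswith(TERMINATOR, pos):
            break
        value = data[pos:pos + length]
        if len(value) < length:
            raise ValueError(f"field {name}: expected {length} characters, found {len(value)}")
        record[name] = value
        pos += length
    if not data.startswith(TERMINATOR, pos):
        raise ValueError(f"terminator not found at offset {pos}")
    return record, pos + len(TERMINATOR)


if __name__ == "__main__":
    truncated = "A" * 10 + "B" * 10 + "C" * 10 + "\n"
    print(parse_record(truncated))
```

Note the ambiguity this rule accepts: an optional field whose first character is a line feed would be read as the end of the record, which is the unescaped-character concern raised in the thread.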
