validating expressions on elements in a choice or unordered sequence

When we were implementing unordered sequences, this raised some questions around evaluating relative paths in expressions, for elements in a choice or unordered sequence:

DFDL spec (gwdrp-dfdl-v1.0.4 section 15): "When processing a choice group the parser validates any contained path expressions. If a path expression contained inside a choice branch refers to any other branch of the choice, then it is a schema definition error."

1. I'm not clear what benefit this restriction on path expressions gives. It seems redundant since in any single instance of a choice group, if the branch being processed exists, then by definition none of its sibling branches exist. Any expression path referring to a non-existent branch would correctly return <empty sequence>.

If the choice group is inside a repeating structure, then expressions referring to choice branches within other instances of the choice could be useful. Should an expression referring to branches in other instances of a choice cause a schemadef error?

Example: expression on el_b could be { fn:count(../../el_choice/el_a) }

- parent [sequence]
  - el_choice [minOccurs=5 maxOccurs=5] [choice]
    - el_a
    - el_b

2. Should an expression that potentially refers to branches in the choice cause a schemadef error?

Example: identically named elements in and out of a choice. Expression on el_c could be { fn:count(../el_a) }

- parent [sequence]
  - el_a
  - el_b
  - [embedded choice group]
    - el_a
    - el_c

Regards,
Mark Frost
_____________________________________
MP 211, IBM Hursley, Winchester, SO21 2JN
Phone: (01962) 817009 or x247009
Email: frostmar@uk.ibm.com

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

Comments inline

On Fri, Apr 11, 2014 at 6:22 AM, Mark Frost <FROSTMAR@uk.ibm.com> wrote:
> When we were implementing unordered sequences, this raised some questions around evaluating relative paths in expressions, for elements in a choice or unordered sequence:
>
> DFDL spec (gwdrp-dfdl-v1.0.4 section 15): "When processing a choice group the parser validates any contained path expressions. If a path expression contained inside a choice branch refers to any other branch of the choice, then it is a schema definition error."
>
> 1. I'm not clear what benefit this restriction on path expressions gives. It seems redundant since in any single instance of a choice group, if the branch being processed exists, then by definition none of its sibling branches exist. Any expression path referring to a non-existent branch would correctly return <empty sequence>.
Typically in XPath, such paths would just be empty-sequence at runtime. Making it an SDE hoists the error to (hopefully) compile time, and making it SDE (non-recoverable) changes the way one must write expressions. You can't write utter nonsense paths and have them be runnable.
> If the choice group is inside a repeating structure, then expressions referring to choice branches within other instances of the choice could be useful. Should an expression referring to branches in other instances of a choice cause a schemadef error?
Should be no issue if you are looking at, say, position() - n. If you reach to something that doesn't exist, then you'll get empty sequence. My experience so far with XPath is that this notion that non-existence returns empty sequence is painful at best and a nightmare at worst. Expressions that are utter nonsense are accepted, executed, and silently fail by returning empty sequence. The most common mistake is writing /a/b/c when you needed /ns1:a/ns2:b/ns3:c.
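The silent-failure pitfall is easy to reproduce outside DFDL. A minimal sketch using Python's standard xml.etree.ElementTree, which supports only a small XPath subset (it is not an XPath 2.0 or DFDL expression engine); the document and namespace URIs are made up for illustration:

```python
import xml.etree.ElementTree as ET

# Hypothetical instance document: every element is namespace-qualified.
DOC = """\
<ns1:a xmlns:ns1="urn:example:one" xmlns:ns2="urn:example:two">
  <ns2:b><ns2:c>42</ns2:c></ns2:b>
</ns1:a>"""

root = ET.fromstring(DOC)          # root is the ns1:a element
NS = {"ns2": "urn:example:two"}    # prefix map used only by the qualified path

# Unqualified path: the names do not match, yet no error is raised.
unqualified = root.findall("b/c")

# Qualified path: selects the intended element.
qualified = root.findall("ns2:b/ns2:c", NS)

print(len(unqualified), len(qualified))  # prints "0 1"
```

The unqualified path is accepted and simply selects nothing, which is the behaviour that a schema definition error would turn into a hard failure.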
> Example: expression on el_b could be { fn:count(../../el_choice/el_a) }
>
> - parent [sequence]
>   - el_choice [minOccurs=5 maxOccurs=5] [choice]
>     - el_a
>     - el_b
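To make this example concrete, here is a rough sketch, assuming the infoset is built incrementally and the expression is evaluated each time an el_b is parsed. It uses Python's xml.etree.ElementTree as a stand-in for a DFDL expression evaluator, and the sequence of branches is invented. Since ElementTree cannot climb above its start element, the count is taken from parent directly, which selects the same nodes as ../../el_choice/el_a evaluated on el_b:

```python
import xml.etree.ElementTree as ET

# Branch taken by each of the five el_choice occurrences (made-up data).
branches = ["el_a", "el_b", "el_a", "el_a", "el_b"]

parent = ET.Element("parent")
counts_seen_by_el_b = []

for branch in branches:
    # Each occurrence of el_choice holds exactly one branch of the choice.
    occurrence = ET.SubElement(parent, "el_choice")
    ET.SubElement(occurrence, branch)
    if branch == "el_b":
        # Stand-in for fn:count(../../el_choice/el_a), evaluated over
        # the occurrences that have been parsed so far.
        counts_seen_by_el_b.append(len(parent.findall("el_choice/el_a")))

print(counts_seen_by_el_b)  # prints "[1, 3]"
```

Within its own occurrence the path finds nothing, since that occurrence holds el_b; the non-zero counts come entirely from earlier occurrences of the choice.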
> 2. Should an expression that potentially refers to branches in the choice cause a schemadef error?
>
> Example: identically named elements in and out of a choice. Expression on el_c could be { fn:count(../el_a) }
>
> - parent [sequence]
>   - el_a
>   - el_b
>   - [embedded choice group]
>     - el_a
>     - el_c
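One way to see why this is only a "potential" reference: model groups do not appear as nodes in the infoset, so a path step under parent can see elements declared in both the sequence and the embedded choice. A rough sketch of a static check that lists the schema components a step could select; the Component class and the schema-path notation are hypothetical, not taken from the spec or from any implementation:

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    kind: str                      # "element", "sequence", or "choice"
    name: str = ""                 # element name; empty for model groups
    children: list = field(default_factory=list)


def child_elements(group, path):
    """Yield (schema_path, element) for elements reachable as XPath children,
    looking through nested model groups, which path steps cannot see."""
    for i, c in enumerate(group.children):
        here = f"{path}/{c.kind}[{i}]"
        if c.kind == "element":
            yield here, c
        else:
            yield from child_elements(c, here)


# Structure of the second example (the encoding itself is made up).
el_c = Component("element", "el_c")
choice = Component("choice", children=[Component("element", "el_a"), el_c])
parent_seq = Component("sequence", children=[
    Component("element", "el_a"),
    Component("element", "el_b"),
    choice,
])

# Static candidates for the step el_a in ../el_a evaluated on el_c.
matches = [p for p, e in child_elements(parent_seq, "parent") if e.name == "el_a"]
print(matches)
```

Two distinct schema components match the same step: one outside the choice, and one that is a sibling branch of el_c and so can never coexist with it at run time.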
I'd love to restrict this, because we're looking at having to create a DFDL expression language implementation for performance reasons, and complex things like this require a very complex implementation tantamount to a query engine.

I would claim that these two el_a elements are different, and we could choose to restrict a DFDL path expression to return only nodes described by the same schema component, with "same schema component" meaning the same path from the document element to the schema component, where an element, group, or type reference counts as part of that path. So two different element references to the same global element would be two different schema components.

But I suspect that this is too restrictive, and implementations are just going to have to be sophisticated enough to execute queries like this one, and a good implementation will optimize simpler cases for faster execution.

...mikeb

I would be quite uncomfortable with DFDL not being a 'proper subset' of XPath 2.0. I understand the motivation (having personally been involved in coding a query engine for DFDL) but I think the cure would be worse than the complaint.

Consistent with that, I think I agree with Mark's suggestion - a DFDL processor should just 'do what an XPath processor would do'.

regards,
Tim Kimber, IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Mark Frost/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 11/04/2014 13:23
Subject: Re: [DFDL-WG] validating expressions on elements in a choice or unordered sequence
Sent by: dfdl-wg-bounces@ogf.org

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

It is certainly easier if we can just do the same as XPath 2.0 stipulates. But I think that this misses the point here.

The XPath error for statically detecting that an expression refers to something that can never exist is XPST0008, which says:

"It is a static error if an expression refers to an element name, attribute name, schema type name, namespace prefix, or variable name that is not defined in the static context, except for an ElementName in an ElementTest or an AttributeName in an AttributeTest."

The static context has the notion of "In-scope schema definitions" being "a generic term for all the element declarations, attribute declarations, and schema type definitions that are in scope during processing of an expression." It doesn't define exactly what is meant by "in-scope", but XPath assumes that it acts on a complete instance of an XDM.

In DFDL we are different to typical XPath usage as we are applying expressions during parsing when the document is incomplete. We can use that as the justification for applying extra constraints, which is exactly why there are additional rules in section 23.1.

So, if there are scenarios where a rule is going to be restrictive then we should consider dropping it. If there are not, but it makes the life of an implementer harder because it is hard to code the rule, then we should consider dropping it. Otherwise keep it.

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org
Date: 11/04/2014 14:03
Subject: Re: [DFDL-WG] validating expressions on elements in a choice or unordered sequence
Sent by: dfdl-wg-bounces@ogf.org

Just curious, has anyone in the group compared DFDL with Smooks, pros and cons?

Also, if I were to parse the DFDL using IBM DFDL code, is there a setting that I can call to do partial parsing? And when parsing happens, does it stop on the first parsing error, or does it try to parse as much as possible to include all the errors encountered?

thanks

On 2014-04-29, Bing Lu wrote:
> Also, if I were to parse the DFDL using IBM DFDL code, is there a setting that I can call to do partial parsing? And when parsing happens, does it stop on the first parsing error, or does it try to parse as much as possible to include all the errors encountered?
I too would like to see both the proper upper and lower parsing complexity of the language proven. Not asymptotically, but over the minor oh. After that, I'd really like to see the best known constants as of last week, and the whole rationale why the language took this precise, hypermicromanaged route where you have to ask about the existence of a formal grammar in the first place.

Seriously, I've pretty much been unemployed for two years after you already did two years of hard work. I'm still not seeing the promise even in IBM's code. What is this, another SGML?!? When is this thing going to parse my favourite, random format, out of the box?

--
Sampo Syreeni, aka decoy - decoy@iki.fi, http://decoy.iki.fi/front
+358-40-3255353, 025E D175 ABE5 027C 9494 EEB0 E090 8BA9 0509 85C2

I am not aware of a formal comparison with Smooks.

For partial parsing you use a loop of parseNext() calls instead of a single parseAll() call. That puts you in control of how far down the data you parse. In the Java sample provided with IBM DFDL, the code to do this is present but commented out.

Two things can happen when a processing error occurs. If inside a point of uncertainty, such as a choice branch or an optional element, then the processing error is taken to indicate that the component does not exist, and the parser then backtracks and tries an alternative (so the error is suppressed). If not inside a point of uncertainty, then IBM's parser currently treats that as a fatal error and stops the parse. The spec (section 2.1) allows for more creative behaviour:

"It is expected that DFDL implementations will provide additional mechanisms for dealing with effective processing errors, such as the means of specifying retry points or the means of skipping some data so as to recover from the error in some way. The DFDL specification language does not provide features for specify such mechanisms"

If the data is well-formed (i.e., no processing error occurs), then switching on validation will report all validation errors (section 2.5). It is possible using an assert to throw a recoverable error after which the parser will continue (section 2.5).

Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Bing Lu <mfcplus@yahoo.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 30/04/2014 02:11
Subject: [DFDL-WG] questions
Sent by: dfdl-wg-bounces@ogf.org
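The two behaviours described in the message above (a caller-controlled loop for partial parsing, and processing errors that are suppressed only inside a point of uncertainty) can be illustrated with a toy parser. This is a sketch in Python under invented assumptions: the record format, function names, and error class are all hypothetical, and none of it is the IBM DFDL API.

```python
class ProcessingError(Exception):
    """Raised when the data does not match the component being parsed."""


def parse_int(data, pos):
    end = pos
    while end < len(data) and data[end].isdigit():
        end += 1
    if end == pos:
        raise ProcessingError(f"expected digits at offset {pos}")
    return ("int", int(data[pos:end])), end


def parse_word(data, pos):
    end = pos
    while end < len(data) and data[end].isalpha():
        end += 1
    if end == pos:
        raise ProcessingError(f"expected letters at offset {pos}")
    return ("word", data[pos:end]), end


def parse_choice(data, pos, branches):
    """Point of uncertainty: a failing branch is suppressed and the next tried."""
    for branch in branches:
        try:
            return branch(data, pos)   # every branch restarts at the same offset
        except ProcessingError:
            continue                   # backtrack and try the alternative
    raise ProcessingError(f"no choice branch matched at offset {pos}")


def parse_next(data, pos):
    """Parse one comma-terminated record; errors raised here are fatal."""
    item, pos = parse_choice(data, pos, [parse_int, parse_word])
    if pos >= len(data) or data[pos] != ",":
        raise ProcessingError(f"expected ',' at offset {pos}")
    return item, pos + 1


def parse_some(data, limit):
    """Caller-controlled loop: stop after `limit` records or at end of data."""
    items, pos = [], 0
    while pos < len(data) and len(items) < limit:
        item, pos = parse_next(data, pos)
        items.append(item)
    return items, pos


items, consumed = parse_some("12,abc,7,!!,", limit=2)
print(items, consumed)
```

With limit=2 the loop stops after two records and never reaches the malformed fourth record. Raising the limit lets the parse run into it, at which point no choice branch matches and the error propagates as a fatal one.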
participants (6)
- Bing Lu
- Mark Frost
- Mike Beckerle
- Sampo Syreeni
- Steve Hanson
- Tim Kimber