Action 145: 'dispatch' way of discriminating a choice for better performance (updated)

The enveope/payload style of data format is quite common, where the envelope provides control information and the payload contains the business data. Examples are SWIFT and SAP IDocs. Typically the envelope contains a tag that identifies the payload, which can be one of many types. For SWIFT there are 300 possible types. To model this today in DFDL requires an xs:choice with each type modeled as an xs:element branch of the choice. A discriminator on each xs:element refers back to the envelope tag element thus enabling the choice to be resolved. There are two issues with this approach. 1) Performance. Even if the elements in the branches are ordered for expected frequency, there will still be cases when tens or hundreds of discriminators need to be evaluated before the choice is resolved. 2) Tight coupling. When a new type is added, a new element branch needs to be added to the choice. Action 145 proposes a mechanism to solve issue #1 and which opens the door to a possible extension to DFDL to solve issue #2 - namely a faster way to resolve a choice. Details: A new dfdl:choice property is added called dfdl:choiceBranchRef of type DFDL Expression. The expression must evaluate to a QName which corresponds to one of the element branches of the choice, and asserts 'known to exist' for that branch. Rules: - The property behaves like dfdl:ref and dfdl:hiddenGroupRef in that it is not possible to set a value in scope by a dfdl:format annotation, and is only set at its point of use. This is because there is nothing sensible that could be set in scope. But it has the benefit that adding support for the property to existing DFDL implementations will not suddenly cause errors to appear in existing DFDL schemas. - Empty string is not an allowed value. - The property is only used when parsing. - All branches must be local elements or element references. It is a schema definition error if any branch is a sequence, a choice or a group reference. - It is a processing error if the QName does not resolve to one of the branches when parsing.. - It is a schema definition error if a choice has the property set and also has dfdl:initiatedContent="yes" set locally. - Because the expression must return a QName, the expression language must provide a constructor for creating a QName from a string. XPath 2.0 provides such a function, xs:QName(), it's just not in the DFDL subset today. The string must be a lexical QName, ie, <prefix>:<name> and the prefix must be bound in what XPath calls the 'static context'. - DFDL should also include the XPath 2.0 function fn:QName() in its subset. This creates a QName from a namespace string and a name string. If you take SWIFT MT103 payload as an example, the tag in the envelope says "103" but a DFDL schema would actually model the global MT103 element with name "Document" and namespace ="urn:swift:xsd:fin.103.2011". So the dfdl:choiceBranchRef expression would have to look like: {fn:QName(fn:concat(fn:concat('urn:swift:xsd:fin.', FinMessage/Block2/MessageType), ".2011"), 'Document')} So we now have the ability to derive a QName and apply it before we start to process a choice. That makes the processing time for each branch of the choice independent of its order in the schema. We still have issue #2 so when a new payload is added, a new branch must be added to the choice. A solution to this is to allows xs:any wildcard elements back into DFDL, then provide a property dfdl:wildcardRef which works in the same way as dfdl:choiceRef. So at the point of encountering the wildcard we know its resolution in the schema. This obviously will require some further discussion, but you can see how this ability to evaluate an expression and return a QName can be used in multiple ways. Regards Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848 Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

The whole point of this thing is to be faster, not more general, so my reaction is too much XPath expression complexity here. Consider this. If the tag is some 2-character code that does NOT want to be the same as the element names (for example because they're digits, so they can't be the element names exactly since element names have to begin with an alpha char. Digits also aren't useful from readability perspective as names), then you'll need a big lookup table in the choiceBranchRef expression that translates from the codes to the QNames. We don't have a case statement in the expression language, so you've just moved the big linear evaluation chain out of evaluating choice discriminators one after another and into a big if-then-else nest in the choiceBranchRef expression. I don't see a performance gain here. I suggest dropping the QName stuff, and requiring a dfdl:choiceID property on the elements that is an NCName. (Well, we might want those QName functions anyway in the expression language. But I wouldn't use them for this rapid choice dispatch feature. You could certainly use them in discriminators. ) The expression would then have to evaluate to a value that is matched against this choiceID. I suggest exact match, not respecting ignoreCase for example. That eliminates all the QName complexity and is amenable to high-speed compact lookup table implementation. I tend to think the element names want to be a little bit more descriptive than these tag values would want to be so using the element names as the tags feels undesirable to me. Particularly because we want the tags to be conveniently computed, for example by just grabbing a fixed-length string out of a data field. You end up with something like this: <element name="tag" type="string" dfdl:length="{ 2 }" ..../> <choice dfdl:choiceBranchRef="{ ../tag }"> <element name="someName" dfdl:choiceID="02" .../> <element name="anotherName" dfdl:choiceID="73" .../> .... </choice> As for the wild-card issue. I think we can finesse this. Consider this model: <element name="tag" type="string" dfdl:length="{ 2 }" ..../> <choice> <!-- fast dispatch for known record types --> <choice dfdl:choiceBranchRef="{ ../tag }"> <element name="someName" dfdl:choiceID="02" .../> <element name="anotherName" dfdl:choiceID="73" .../> .... </choice> <!-- wildcard --> <element name="extensionRecord"> <complexType> <sequence> <!-- keep tag copy in the extension --> <element name="extensionType" type="string" dfdl:inputValueCalc="../../tag" .../> .... </sequence> </complexType> </element> </choice> The inner choice uses the fast dispatch. The outer choice lets me also have an alternative that absorbs a more general syntax to provide some way for a user to model their extensions to the choice set. The extensionType field captures the tag and stores it inside the extension record where it won't get disassociated. The user's "extensionRecord" would not be a special DFDL wildcard construct, just an element format they create which is general enough to parse their extensions. Or this extension could be predefined as part of a standard, to accept any standard-defined acceptable extension record a user might need so long as there is some set of rules all extension records must obey. Given this, do we really need special wildcard constructs? ...mikeb On Wed, Mar 21, 2012 at 10:10 AM, Steve Hanson <smh@uk.ibm.com> wrote:
The enveope/payload style of data format is quite common, where the envelope provides control information and the payload contains the business data. Examples are SWIFT and SAP IDocs. Typically the envelope contains a tag that identifies the payload, which can be one of many types. For SWIFT there are 300 possible types. To model this today in DFDL requires an xs:choice with each type modeled as an xs:element branch of the choice. A discriminator on each xs:element refers back to the envelope tag element thus enabling the choice to be resolved.
There are two issues with this approach.
1) Performance. Even if the elements in the branches are ordered for expected frequency, there will still be cases when tens or hundreds of discriminators need to be evaluated before the choice is resolved.
2) Tight coupling. When a new type is added, a new element branch needs to be added to the choice.
Action 145 proposes a mechanism to solve issue #1 and which opens the door to a possible extension to DFDL to solve issue #2 - namely a faster way to resolve a choice.
Details:
A new dfdl:choice property is added called dfdl:choiceBranchRef of type DFDL Expression. The expression must evaluate to a QName which corresponds to one of the element branches of the choice, and asserts 'known to exist' for that branch. Rules:
- The property behaves like dfdl:ref and dfdl:hiddenGroupRef in that it is not possible to set a value in scope by a dfdl:format annotation, and is only set at its point of use. This is because there is nothing sensible that could be set in scope. But it has the benefit that adding support for the property to existing DFDL implementations will not suddenly cause errors to appear in existing DFDL schemas.
- Empty string is not an allowed value.
- The property is only used when parsing.
- All branches must be local elements or element references. It is a schema definition error if any branch is a sequence, a choice or a group reference.
- It is a processing error if the QName does not resolve to one of the branches when parsing..
- It is a schema definition error if a choice has the property set and also has dfdl:initiatedContent="yes" set locally.
- Because the expression must return a QName, the expression language must provide a constructor for creating a QName from a string. XPath 2.0 provides such a function, xs:QName(), it's just not in the DFDL subset today. The string must be a lexical QName, ie, <prefix>:<name> and the prefix must be bound in what XPath calls the 'static context'.
- DFDL should also include the XPath 2.0 function fn:QName() in its subset. This creates a QName from a namespace string and a name string. If you take SWIFT MT103 payload as an example, the tag in the envelope says "103" but a DFDL schema would actually model the global MT103 element with name "Document" and namespace ="urn:swift:xsd:fin.103.2011". So the dfdl:choiceBranchRef expression would have to look like: {fn:QName(fn:concat(fn:concat('urn:swift:xsd:fin.', FinMessage/Block2/MessageType), ".2011"), 'Document')}
So we now have the ability to derive a QName and apply it before we start to process a choice. That makes the processing time for each branch of the choice independent of its order in the schema.
We still have issue #2 so when a new payload is added, a new branch must be added to the choice. A solution to this is to allows xs:any wildcard elements back into DFDL, then provide a property dfdl:wildcardRef which works in the same way as dfdl:choiceRef. So at the point of encountering the wildcard we know its resolution in the schema. This obviously will require some further discussion, but you can see how this ability to evaluate an expression and return a QName can be used in multiple ways.
Regards
Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848
________________________________
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
participants (2)
-
Mike Beckerle
-
Steve Hanson