Action 145: 'dispatch' way of discriminating a choice for better performance (updated)

21 Mar 2012

The envelope/payload style of data format is quite common, where the
envelope provides control information and the payload contains the 
business data. Examples are SWIFT and SAP IDocs. Typically the envelope 
contains a tag that identifies the payload, which can be one of many 
types. For SWIFT there are 300 possible types. To model this today in DFDL 
requires an xs:choice with each type modeled as an xs:element branch of 
the choice. A discriminator on each xs:element refers back to the envelope 
tag element thus enabling the choice to be resolved.

There are two issues with this approach.

1) Performance. Even if the elements in the branches are ordered for 
expected frequency, there will still be cases when tens or hundreds of 
discriminators need to be evaluated before the choice is resolved.

2) Tight coupling. When a new type is added, a new element branch needs to 
be added to the choice.
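
To make issue 1 concrete, here is a minimal Python sketch of resolving a
choice by evaluating discriminators in schema order. It is only an
illustration, not how any DFDL processor is implemented: the branch names,
the 300 made-up tags "100" to "399" and the resolve_linear helper are all
hypothetical.

```python
def resolve_linear(branches, tag):
    """Test each branch's discriminator in schema order.

    Returns the matching branch name and the number of discriminators
    that had to be evaluated to find it.
    """
    evaluations = 0
    for name, discriminator in branches:
        evaluations += 1
        if discriminator(tag):
            return name, evaluations
    raise LookupError(f"no branch matches tag {tag!r}")


# Hypothetical choice with 300 branches; each discriminator compares the
# envelope tag with the branch's own message type (made-up tags 100..399).
branches = [
    (f"MT{n}", (lambda tag, n=n: tag == str(n)))
    for n in range(100, 400)
]

first = resolve_linear(branches, "100")  # branch listed first in the choice
last = resolve_linear(branches, "399")   # branch listed last in the choice
print(first, last)
```

In this sketch the cost of resolving the choice grows with the position of
the matching branch, which is why ordering the branches by expected
frequency only helps for the common cases.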

Action 145 proposes a mechanism to solve issue #1, namely a faster way to
resolve a choice. It also opens the door to a possible extension to DFDL
to solve issue #2.

Details:

A new dfdl:choice property is added called dfdl:choiceBranchRef of type 
DFDL Expression. The expression must evaluate to a QName which corresponds 
to one of the element branches of the choice, and asserts 'known to exist' 
for that branch.  Rules:

- The property behaves like dfdl:ref and dfdl:hiddenGroupRef in that it
cannot be given a value in scope by a dfdl:format annotation, and can
only be set at its point of use. This is because there is nothing sensible
that could be set in scope. It also has the benefit that adding support
for the property to existing DFDL implementations will not suddenly cause
errors to appear in existing DFDL schemas.

- Empty string is not an allowed value.

- The property is only used when parsing.

- All branches must be local elements or element references. It is a 
schema definition error if any branch is a sequence, a choice or a group 
reference. 

- It is a processing error if the QName does not resolve to one of the
branches when parsing.

- It is a schema definition error if a choice has the property set and 
also has dfdl:initiatedContent="yes" set locally.

- Because the expression must return a QName, the expression language must
provide a constructor for creating a QName from a string. XPath 2.0
provides such a function, xs:QName(), but it is not in the DFDL subset
today. The string must be a lexical QName, i.e. <prefix>:<name>, and the
prefix must be bound in what XPath calls the 'static context'.

- DFDL should also include the XPath 2.0 function fn:QName() in its
subset. This creates a QName from a namespace string and a name string. If
you take the SWIFT MT103 payload as an example, the tag in the envelope
says "103" but a DFDL schema would actually model the global MT103 element
with name "Document" and namespace "urn:swift:xsd:fin.103.2011".
So the dfdl:choiceBranchRef expression would have to look like:
{fn:QName(fn:concat(fn:concat('urn:swift:xsd:fin.', 
FinMessage/Block2/MessageType), ".2011"), 'Document')}

So we now have the ability to derive a QName and apply it before we start 
to process a choice. That makes the processing time for each branch of the 
choice independent of its order in the schema.
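
As a rough illustration of that idea, the Python sketch below derives the
QName from the envelope tag and uses it as a key to select the branch
directly. It is a sketch under assumptions, not a DFDL implementation:
build_qname, resolve_dispatch, ProcessingError and the table of branches
are hypothetical names, and the QName is modelled simply as a
(namespace, local name) pair, mirroring the fn:QName() expression above.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error (hypothetical class)."""


def build_qname(message_type):
    # Mirrors the fn:QName(fn:concat(...), 'Document') expression:
    # the namespace is derived from the tag, the local name is fixed.
    namespace = "urn:swift:xsd:fin." + message_type + ".2011"
    return (namespace, "Document")


def resolve_dispatch(branches_by_qname, message_type):
    """Pick the branch whose QName matches, without testing other branches."""
    qname = build_qname(message_type)
    try:
        return branches_by_qname[qname]
    except KeyError:
        # The QName does not resolve to any branch of the choice.
        raise ProcessingError(f"no branch for {qname}") from None


# Hypothetical choice with 300 element branches keyed by QName
# (made-up message types 100..399).
branches_by_qname = {
    build_qname(str(n)): f"branch for MT{n}" for n in range(100, 400)
}

print(resolve_dispatch(branches_by_qname, "103"))
```

The lookup does the same amount of work whichever branch is selected, and
a tag that maps to no branch surfaces as an error rather than as a long
run of failed discriminators.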

We still have issue #2: when a new payload is added, a new branch must
be added to the choice. A solution to this is to allow xs:any wildcard
elements back into DFDL, then provide a property dfdl:wildcardRef which
works in the same way as dfdl:choiceBranchRef. So at the point of
encountering the wildcard we know its resolution in the schema. This
obviously will require some further discussion, but you can see how this
ability to evaluate an expression and return a QName can be used in
multiple ways.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
