[DFDL-WG] How to determine the length of an element which has text representation

17 Nov 2009

The current version of the specification (v0.36) does not clearly specify 
how an element which has a specified length should be parsed.
- Section 14.3, when describing dfdl:length, says "Only used when lengthKind 
is 'explicit'".
- The precedence rules say that when lengthKind="delimited", no other 
properties are consulted
- Section 17.3.2 carries a comment saying that the section is incorrect. The 
comment contains a couple of rather ambiguous statements about what the 
behaviour should be.

Alan proposes that the behaviour should be as follows: 
- When dfdl:length has a value, the length of the field must always conform 
to that value. 
- When there is terminating markup in scope ( terminators or separators ) 
the parser always uses them. 
- If a text field has a defined dfdl:length AND there is terminating 
markup in scope, then the parser should first scan to find the actual 
length, then check the actual length against dfdl:length and raise a 
processing error if they do not match.
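To make the proposed behaviour concrete, here is a minimal sketch of "scan first, 
then check". It is not taken from the specification or from any implementation: 
the names ProcessingError and scan_then_check are hypothetical, lengths are 
counted in characters, and escaping and encodings are ignored.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def scan_then_check(data, pos, terminators, declared_length=None):
    """Scan for the nearest terminating markup, then check dfdl:length."""
    # Positions of every terminator that occurs at or after pos.
    hits = [i for i in (data.find(t, pos) for t in terminators) if i != -1]
    if not hits:
        raise ProcessingError("no terminating markup found")
    end = min(hits)
    value = data[pos:end]
    # The declared length is only a check; it never drives the extraction.
    if declared_length is not None and len(value) != declared_length:
        raise ProcessingError(
            f"scanned length {len(value)} != dfdl:length {declared_length}"
        )
    return value, end


value, end = scan_then_check("ABC,DEF;", 0, [",", ";"], declared_length=3)
print(value, end)  # prints: ABC 3
```

Note that the terminators list has to be supplied by the caller, which is the 
"markup in scope" bookkeeping that the parser would need to maintain.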

I favour the following alternative rules
- dfdl:lengthKind always determines the method that the parser will use to 
find the length of the element
- if lengthKind='explicit' or 'implicit' or 'prefixed' then the length is 
extracted without scanning. 
- if lengthKind='delimited' then the length is extracted by scanning and 
no check is performed against dfdl:length
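As a rough illustration of these rules, the sketch below dispatches purely on 
lengthKind. The function name extract and the ProcessingError class are 
hypothetical, only 'explicit' and 'delimited' are modelled ('implicit' and 
'prefixed' would obtain the length differently but also without scanning), and 
lengths are counted in characters.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def extract(data, pos, length_kind, length=None, terminators=()):
    """Pick the extraction method from lengthKind alone."""
    if length_kind == "explicit":
        # No scanning: any terminators in scope are not consulted.
        if length is None:
            raise ProcessingError("explicit lengthKind needs dfdl:length")
        end = pos + length
        if end > len(data):
            raise ProcessingError("not enough data for declared length")
        return data[pos:end], end
    if length_kind == "delimited":
        # Scanning only: dfdl:length is ignored, not even validated.
        hits = [i for i in (data.find(t, pos) for t in terminators) if i != -1]
        if not hits:
            raise ProcessingError("no terminating markup found")
        end = min(hits)
        return data[pos:end], end
    raise NotImplementedError(f"lengthKind {length_kind!r} is not modelled here")


data = "AB,CD;rest"
print(extract(data, 0, "explicit", length=5, terminators=[",", ";"]))
print(extract(data, 0, "delimited", length=5, terminators=[",", ";"]))
```

The first call returns the five characters "AB,CD" even though a separator 
appears inside them, which is the "switching off scanning" case. The second call 
stops at the first separator and never looks at the length.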

The alternative rules have the following advantages:
- they provide a way of switching off scanning within the scope of a 
delimited structure. The proposed rules do not.
- they are easier to implement ( parser doesn't have to keep track of 
whether there is any terminating markup in scope - lengthKind always 
provides the rule )
- they are slightly easier to explain to users for the same reason

They do have the following drawbacks:
- dfdl:length is completely ignored when lengthKind='delimited'. It is not 
even used to validate the extracted length. Some users might not like 
this.
- there are known scenarios ( e.g. SWIFT 52B ) where it is necessary to 
check the length of a delimited field in order to choose the correct 
branch of a choice. Checking dfdl:length would make it easy to do that.

re: the ignoring of dfdl:length, we *could* make a rule that the length is 
checked after the delimited scan has been performed. But then it would be 
necessary to ensure that dfdl:length was un-set for the far more usual 
case where the length is not important.
I think the control of backtracking in the 52B scenario is an edge case. 
In most cases where delimited fields have a known length we can safely 
leave the length checking to the schema validator, or perhaps to a more 
functional complex validation layer. For 52B, the user will have to create 
a dfdl:assert to trigger the required processing error when the length is 
incorrect.
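A sketch of how an assert can drive the choice may help here. This is Python 
standing in for the schema, not DFDL syntax, and it does not reproduce the real 
52B layout: the branch names "short" and "long", their lengths, and all 
function names are hypothetical.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def scan_field(data, pos, terminator):
    """Delimited extraction: the length comes from scanning only."""
    end = data.find(terminator, pos)
    if end == -1:
        raise ProcessingError("terminator not found")
    return data[pos:end], end


def branch_with_length(name, required_length):
    """Build a choice branch whose assert checks the scanned length."""
    def parse(data, pos, terminator):
        value, end = scan_field(data, pos, terminator)
        # Plays the role of the dfdl:assert: a wrong length is a
        # processing error, so the enclosing choice backtracks.
        if len(value) != required_length:
            raise ProcessingError(f"{name}: expected length {required_length}")
        return name, value, end
    return parse


def parse_choice(branches, data, pos, terminator):
    """Try each branch in order, backtracking on a processing error."""
    for branch in branches:
        try:
            return branch(data, pos, terminator)
        except ProcessingError:
            continue  # backtrack to pos and try the next branch
    raise ProcessingError("no branch of the choice matched")


branches = [branch_with_length("short", 4), branch_with_length("long", 6)]
print(parse_choice(branches, "ABCDEF\n", 0, "\n"))
```

Here the first branch scans successfully but fails its assert, so the choice 
falls through to the second branch without dfdl:length being consulted.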

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742
