A small correction: the parsing rules I propose, which I think are also
what is currently in the spec, are:
- for fixed length 'text' elements (lengthKind
is 'implicit' or 'explicit') that also have terminating markup (a terminator,
or an in-scope separator or terminator), the parser should scan for the
markup and then check the length
- for fixed length 'text' elements (lengthKind
is 'implicit' or 'explicit') with no terminating markup the length is used
- for fixed length 'binary' fields,
which are not scannable, with terminating markup, the length should
be used to extract the field and the parser should then scan for the markup.
(I'm not sure this is a realistic scenario but it is allowed.)
- for fixed length 'binary' fields without
terminating markup, the length should be used
- for fixed length complex elements
with terminating markup, each child is treated as above. When the
end of the complex element is found, the length found is compared to the
fixed length
- for fixed length complex elements
without terminating markup the length is used to extract the element and
that 'buffer' is parsed for the children.
- I was not suggesting that dfdl:length
should be examined for any lengthKind other than explicit
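The rules for simple elements above could be sketched in Python roughly as follows. This is only an illustration, not spec text: the function name extract_simple, the ProcessingError class and the single-terminator model are all assumptions made for the example, lengths are counted in bytes, and for the binary case the markup is simply expected immediately after the field.

```python
class ProcessingError(Exception):
    """Hypothetical: a recoverable parse failure that can trigger backtracking."""


def extract_simple(data, pos, representation, length, terminator=None):
    """Extract one fixed-length simple element starting at pos.

    Returns (value, new_pos). All names here are illustrative assumptions.
    """
    if representation == "text" and terminator is not None:
        # Text with terminating markup: scan for the markup, then check the length.
        end = data.find(terminator, pos)
        if end < 0:
            raise ProcessingError("terminating markup not found")
        if end - pos != length:
            raise ProcessingError(f"found length {end - pos}, expected {length}")
        return data[pos:end], end + len(terminator)

    # Text without markup, or binary (not scannable): the length is used.
    end = pos + length
    if end > len(data):
        raise ProcessingError("not enough data for the fixed length")
    value = data[pos:end]
    if terminator is not None:
        # Binary with terminating markup: look for the markup after the field.
        if not data.startswith(terminator, end):
            raise ProcessingError("terminating markup not found after field")
        end += len(terminator)
    return value, end
```

Note that in the binary case a byte inside the field that happens to look like the terminator does not end the field, because the field is never scanned.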
Notes:
Because lengthKind 'explicit' is used
to specify either a fixed length or a reference to a length field, it isn't
possible to tell the two apart, so we have to treat them the same way. However,
if the found length doesn't match the 'fixed' length it should be a processing
error and cause backtracking, but if the reference length doesn't match it
should be a hard error. Perhaps we need a way to distinguish between these cases.
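To make the distinction concrete, here is a minimal Python sketch. The length_is_reference flag and both exception classes are hypothetical; no such flag exists today, which is exactly the gap described above.

```python
class ProcessingError(Exception):
    """Hypothetical: recoverable, so the parser may backtrack."""


class HardError(Exception):
    """Hypothetical: unrecoverable, parsing stops."""


def check_found_length(found, expected, length_is_reference):
    """Compare the scanned length with the length given by dfdl:length."""
    if found == expected:
        return
    message = f"found length {found}, expected {expected}"
    if length_is_reference:
        # The expected length came from a length field in the data.
        raise HardError(message)
    # The expected length was a fixed constant in the schema.
    raise ProcessingError(message)
```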
There need to be similar rules for
the other lengthKinds, e.g. prefixed, with terminating markup.
I will put this on the agenda for this
week's call.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898
From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org
Date: 17/11/2009 01:42
Subject: [DFDL-WG] How to determine the length of an element which has text representation
The current version of the specification (v0.36) does not clearly specify
how an element which has a specified length should be parsed.
- Section 14.3, when describing dfdl:length, says "Only
used when lengthKind is ’explicit’"
- The precedence rules say that when lengthKind="delimited",
no other properties are consulted
- Section 17.3.2 has a comment saying that it is incorrect. The comment
contains a couple of rather ambiguous statements about what the behaviour
should be.
Alan proposes that the behaviour should be as follows:
- When dfdl:length has a value, the length of the field must always conform
to that value.
- When there is terminating markup in scope (terminators or separators)
the parser always uses it.
- If a text field has a defined dfdl:length AND there is terminating markup
in scope, then the parser should first scan to find the actual length,
then check the actual length against dfdl:length and raise a processing
error if they do not match.
I favour the following alternative rules
- dfdl:lengthKind always determines the method that the parser will use
to find the length of the element
- if lengthKind='explicit' or 'implicit' or 'prefixed' then the length
is extracted without scanning.
- if lengthKind='delimited' then the length is extracted by scanning and
no check is performed against dfdl:length
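As a rough illustration of these alternative rules, here is a Python sketch in which lengthKind alone selects the extraction method. The function name is invented, the caller is assumed to have already resolved the length (from the schema, an expression or a prefix), and a single terminator stands in for all terminating markup.

```python
def extract_by_length_kind(data, pos, length_kind, length=None, terminator=None):
    """Return (value, new_pos); lengthKind alone decides how the length is found."""
    if length_kind in ("explicit", "implicit", "prefixed"):
        # No scanning: the already-resolved length is used as is.
        end = pos + length
        if end > len(data):
            raise ValueError("not enough data for the given length")
        return data[pos:end], end
    if length_kind == "delimited":
        # Scanning only: dfdl:length is ignored, not even checked.
        end = data.find(terminator, pos)
        if end < 0:
            raise ValueError("terminating markup not found")
        return data[pos:end], end + len(terminator)
    raise ValueError(f"unsupported lengthKind: {length_kind}")
```

With the data b"A;B;rest", an 'explicit' length of 3 yields b"A;B" even though it contains the separator, while 'delimited' yields b"A" whatever length is supplied.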
The alternative rules have the following advantages:
- they provide a way of switching off scanning within the scope of a delimited
structure. The proposed rules do not.
- they are easier to implement ( parser doesn't have to keep track of whether
there is any terminating markup in scope - lengthKind always provides the
rule )
- they are slightly easier to explain to users for the same reason
They do have the following drawbacks:
- dfdl:length is completely ignored when lengthKind='delimited'. It is
not even used to validate the extracted length. Some users might not like
this.
- there are known scenarios ( e.g. SWIFT 52B ) where it is necessary to
check the length of a delimited field in order to choose the correct branch
of a choice. Checking dfdl:length would make it easy to do that.
re: the ignoring of dfdl:length, we *could* make a rule that the length
is checked after the delimited scan has been performed. But then it would
be necessary to ensure that dfdl:length was un-set for the far more usual
case where the length is not important.
I think the control of backtracking in the 52B scenario is an edge case.
In most cases where delimited fields have a known length we can safely
leave the length checking to the schema validator, or perhaps to a more
functional complex validation layer. For 52B, the user will have to create
a dfdl:assert to trigger the required processing error when the length
is incorrect.
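To illustrate the idea (in Python rather than DFDL, and without modelling the real SWIFT 52B layout): each choice branch scans a delimited field and then applies an assert-like length check. A failed check raises a processing error, which makes the parser move on to the next branch. The branch names and lengths used here are invented for the example.

```python
class ProcessingError(Exception):
    """Hypothetical: recoverable failure that lets a choice try its next branch."""


def parse_delimited_with_assert(data, terminator, expected_length):
    """Scan for the terminator, then assert on the length that was found."""
    end = data.find(terminator)
    if end < 0:
        raise ProcessingError("terminating markup not found")
    value = data[:end]
    if len(value) != expected_length:
        # Plays the role of the dfdl:assert on the length.
        raise ProcessingError(f"length {len(value)} != {expected_length}")
    return value


def parse_choice(data, terminator, branches):
    """Try each (name, expected_length) branch in order, backtracking on failure."""
    for name, expected_length in branches:
        try:
            return name, parse_delimited_with_assert(data, terminator, expected_length)
        except ProcessingError:
            continue  # backtrack and try the next branch
    raise ProcessingError("no choice branch matched")
```

For example, with branches [("short_form", 4), ("long_form", 8)] and the data b"ABCDEFGH/", the first branch fails its length check and the second is selected.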
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg