Consider the following schema:
<xs:element name="array" minOccurs="1" maxOccurs="1">
  <xs:complexType>
    <xs:sequence dfdl:sequenceKind="ordered"
                 dfdl:separatorPosition="infix"
                 dfdl:separatorPolicy="required"
                 dfdl:separator=",">
      <xs:element name="array_item" type="xs:string"
                  minOccurs="2" maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>
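Allowed data streams and the resulting info sets (rendered as XML) are:

Row | Data stream           | Info set
----+-----------------------+----------------------------------------
1   | item_value,item_value | <array>
    |                       |   <array_item>item_value</array_item>
    |                       |   <array_item>item_value</array_item>
    |                       | </array>
----+-----------------------+----------------------------------------
2   | item_value,           | <array>
    |                       |   <array_item>item_value</array_item>
    |                       | </array>
----+-----------------------+----------------------------------------
3   | ,item_value           | <array>
    |                       |   <array_item>item_value</array_item>
    |                       | </array>
----+-----------------------+----------------------------------------
4   | ,                     | <array>
    |                       | </array>
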
Notice rows 2 and 3. The parser has applied the rules in the DFDL
specification and has treated the zero-length elements as 'missing'.
Furthermore, these missing elements are not required, so they are omitted
from the info set. This is a problem: the receiver of the info set has no
way to reliably determine whether the array_item was the first or second
item in the array. If presented to the DFDL serializer, both info sets will
produce the data stream for row 2.
Note that this is a problem only for
arrays. A sequence of differently-named optional elements will not be ambiguous
because the element names in the info set can be used to determine which
elements were present in the data.
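
The ambiguity can be reproduced with a few lines of Python. This is only a simplified model of the behaviour described above, not a real DFDL parser; the function name parse_array is made up for illustration, and the model ignores minOccurs entirely.

```python
SEPARATOR = ","

def parse_array(data: str) -> list[str]:
    """Simplified model: split on the separator and drop every
    zero-length occurrence, because it is treated as 'missing'."""
    return [item for item in data.split(SEPARATOR) if item != ""]

row2 = parse_array("item_value,")   # second occurrence is zero-length
row3 = parse_array(",item_value")   # first occurrence is zero-length

# Both data streams collapse to the same info set, so the original
# position of the surviving array_item cannot be recovered.
print(row2 == row3)
```
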
Possible fixes:
a) Change the definition of 'required' from 'all occurrences up to
minOccurs' to 'all occurrences before the final non-missing occurrence'.
In scenarios like the one above, non-required occurrences would be put
into the infoset with a default value (assuming that a default was defined
in the model).
b) Provide a DFDL property that controls whether elements with zero-length
content are treated as missing. The presence of one or more delimiters
(a separator, initiator or terminator) implies that an element is present
in the data. Currently, DFDL unconditionally treats an element as 'missing'
if its content region is zero-length, regardless of whether there were any
delimiters for that element. In this scenario, if the parser acted on that
information then the info sets would be distinguishable. The suggested name
for the property is 'dfdl:emptyValueMissingPolicy', with values 'missing'
and 'included'.
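
To make fix b) concrete, here is a rough Python sketch of how such a policy might behave. It is a model only: the property is a proposal and is not in the DFDL specification, the function name parse_array_with_policy is hypothetical, and a zero-length occurrence is represented as an empty string.

```python
def parse_array_with_policy(data: str, policy: str = "missing",
                            separator: str = ",") -> list[str]:
    """Model of the proposed emptyValueMissingPolicy property.

    'missing'  : zero-length occurrences are dropped (current behaviour).
    'included' : zero-length occurrences delimited by a separator are kept.
    """
    if policy not in ("missing", "included"):
        raise ValueError(f"unknown policy: {policy!r}")
    occurrences = data.split(separator)
    if policy == "included":
        return occurrences
    return [item for item in occurrences if item != ""]

# With 'included', rows 2 and 3 give different info sets.
print(parse_array_with_policy("item_value,", "included"))
print(parse_array_with_policy(",item_value", "included"))
```
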
Fix a) would require the parser to keep track of the last-reported
occurrence of an array element. When a non-missing occurrence was
encountered, it would have to put any previously skipped non-required
occurrences into the infoset first.
An example might help: one,,,four
Occurrences 2 and 3 would initially be omitted from the infoset because
they are zero-length. Upon encountering occurrence 4, the parser would have
to put occurrences 2 and 3 into the infoset with the xs:default value before
putting 4 into the infoset.
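
The bookkeeping for fix a) could look something like the Python sketch below. It is a simplified model, not a real parser: the function name parse_with_backfill and the default value "dflt" are invented for illustration, and the sketch assumes a single default applies to every occurrence.

```python
def parse_with_backfill(data: str, default: str,
                        separator: str = ",") -> list[str]:
    """Model of fix a): skipped zero-length occurrences are emitted with
    the default value as soon as a later non-missing occurrence is seen."""
    infoset: list[str] = []
    skipped = 0  # zero-length occurrences seen since the last emitted one
    for occurrence in data.split(separator):
        if occurrence == "":
            skipped += 1
            continue
        # A non-missing occurrence: first back-fill the skipped ones.
        infoset.extend([default] * skipped)
        skipped = 0
        infoset.append(occurrence)
    # Trailing zero-length occurrences stay omitted.
    return infoset

print(parse_with_backfill("one,,,four", default="dflt"))
```

Under this model, rows 2 and 3 of the table no longer collide: "item_value," yields one item, while ",item_value" yields a defaulted item followed by item_value.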
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742