Re: [DFDL-WG] Arrays with empty elements

14 Feb 2011

Thanks Alan - I've just realised that my example XSD snippet was 
incorrect. I was intending to make *all* occurrences optional, so the 
schema should have been:

<xs:element name="array" minOccurs="1" maxOccurs="1">
  <xs:complexType>
    <xs:sequence dfdl:sequenceKind="ordered" dfdl:separatorPosition="infix"
                 dfdl:separatorPolicy="required" dfdl:separator=",">
      <xs:element name="array_item" type="xs:string" minOccurs="0"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

From:   Alan Powell/UK/IBM
To:     Tim Kimber/UK/IBM@IBMGB
Cc:     dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date:   14/02/2011 14:55
Subject:        Re: [DFDL-WG] Arrays with empty elements

Tim

I believe that requiring 'all occurrences up to minOccurs' was intended 
to mean 'the first minOccurs occurrences are required', not 'keep looking 
until you find minOccurs occurrences in the data stream'. Your row 3 would 
be an error if the element didn't have a default. 

A zero-length occurrence is always treated as missing, which means that if 
you did want the empty string in the infoset you would need to have the 
empty string as the default.
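
The reading above can be sketched in Python. This is a toy model, not a DFDL processor: the function name parse_required_prefix and its parameters are invented for illustration, the separator is fixed to ',', and positions that do not appear in the data at all are outside the sketch.

```python
def parse_required_prefix(data, min_occurs, default=None, separator=","):
    """Toy model: the first min_occurs positions are required.

    A zero-length occurrence is treated as missing. A missing required
    occurrence takes the default, or is an error if there is no default.
    A missing optional occurrence is simply omitted.
    """
    infoset = []
    for index, occurrence in enumerate(data.split(separator)):
        if occurrence != "":
            infoset.append(occurrence)
        elif index < min_occurs:
            if default is None:
                raise ValueError(
                    f"required occurrence {index + 1} is missing and has no default"
                )
            infoset.append(default)
        # Otherwise: optional and missing, so nothing is added.
    return infoset


# Row 3 with minOccurs=2 and the empty string as the default.
print(parse_required_prefix(",item_value", min_occurs=2, default=""))
```

With default left as None, the same call raises ValueError, which corresponds to row 3 being an error when the element has no default.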

Regards

Alan Powell

Development - MQSeries, Message Broker, ESB
IBM Software Group, Application and Integration Middleware Software
-------------------------------------------------------------------------------------------------------------------------------------------
IBM
MP211, Hursley Park
Hursley, SO21 2JN
United Kingdom
Phone: +44-1962-815073
e-mail: alan_powell@uk.ibm.com

From:   Tim Kimber/UK/IBM@IBMGB
To:     dfdl-wg@ogf.org
Date:   14/02/2011 13:01
Subject:        [DFDL-WG] Arrays with empty elements
Sent by:        dfdl-wg-bounces@ogf.org

Consider the following schema: 

<xs:element name="array" minOccurs="1" maxOccurs="1">
  <xs:complexType>
    <xs:sequence dfdl:sequenceKind="ordered" dfdl:separatorPosition="infix"
                 dfdl:separatorPolicy="required" dfdl:separator=",">
      <xs:element name="array_item" type="xs:string" minOccurs="2"
                  maxOccurs="unbounded"/>
    </xs:sequence>
  </xs:complexType>
</xs:element>

Allowed data streams and the resulting info sets (rendered as XML) are: 

Row 1: item_value,item_value 
<array> 
        <array_item>item_value</array_item> 
        <array_item>item_value</array_item> 
</array> 

Row 2: item_value, 
<array> 
        <array_item>item_value</array_item> 
</array> 

Row 3: ,item_value 
<array> 
        <array_item>item_value</array_item> 
</array> 

Row 4: , 
<array> 
</array>

Notice rows 2 and 3. The parser has applied the rules in the DFDL 
specification, and has treated the zero-length elements as 'missing'. 
Furthermore, these missing elements are not required, so they are omitted 
from the info set. This is not good - the receiver of the info set has no 
way to reliably determine whether the array_item was the first or second 
item in the array. If presented to the DFDL serializer, both info sets 
will produce the data stream for row 2. 
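
The ambiguity can be reproduced with a few lines of Python. This is only a toy model of the behaviour described here, not a DFDL parser: it assumes every occurrence is optional, splits on ',' and drops zero-length occurrences. The names parse_items and render are invented for the sketch.

```python
def parse_items(data, separator=","):
    """Split on the separator and drop zero-length occurrences as 'missing'."""
    return [item for item in data.split(separator) if item != ""]


def render(items):
    """Render the toy infoset as XML text, in the style of the rows above."""
    lines = ["<array>"]
    lines += [f"        <array_item>{item}</array_item>" for item in items]
    lines.append("</array>")
    return "\n".join(lines)


row2 = parse_items("item_value,")
row3 = parse_items(",item_value")
print(render(row2))
print(render(row3))
print("same infoset:", row2 == row3)
```

Both inputs reduce to the same single-item list, so nothing in the result records whether the surviving item was first or second.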

Note that this is a problem only for arrays. A sequence of 
differently-named optional elements will not be ambiguous because the 
element names in the info set can be used to determine which elements were 
present in the data. 

Possible fixes: 
a) Change the definition of 'required' from 'all occurrences up to 
minOccurs' to 'all occurrences before the final non-missing occurrence'. 
In scenarios like the one above, non-required occurrences would be put 
into the infoset with a default value ( assuming that a default was 
defined in the model ). 
b) Provide a DFDL property that controls whether elements with zero-length 
content are treated as missing. 
The presence of one or more delimiters (a separator, initiator or 
terminator) implies that an element is present in the data. Currently, 
DFDL unconditionally treats an element as 'missing' if its content region 
is zero-length, regardless of whether there were any delimiters for that 
element. 
If the parser acted on the delimiter information in this scenario, the info 
sets would be distinguishable. A suggested name for the property is 
'dfdl:emptyValueMissingPolicy', with values 'missing' and 'included'. 
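
A rough Python sketch of how option b) might behave follows. The property is only a proposal, so the policy argument and the function name parse_with_policy are hypothetical, and the model ignores initiators and terminators and treats every occurrence as optional.

```python
def parse_with_policy(data, empty_value_missing_policy="missing", separator=","):
    """Toy model of the proposed 'dfdl:emptyValueMissingPolicy' property.

    'missing'  : zero-length occurrences are dropped (the current behaviour).
    'included' : zero-length occurrences implied by a separator are kept.
    """
    if data == "":
        # No content and no delimiters: nothing implies an occurrence.
        return []
    occurrences = data.split(separator)
    if empty_value_missing_policy == "included":
        return occurrences
    if empty_value_missing_policy == "missing":
        return [item for item in occurrences if item != ""]
    raise ValueError(f"unknown policy: {empty_value_missing_policy!r}")


for stream in ("item_value,", ",item_value"):
    print(repr(stream), parse_with_policy(stream, "included"))
```

Under 'included' the two streams give lists that differ in the position of the empty string, so the receiver can tell which occurrence carried the value.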

a) would require the parser to keep track of the last-reported occurrence 
of an array element. When a non-missing occurrence was encountered it 
would have to put any previously-skipped non-required occurrences into the 
infoset first. 
An example might help:  one,,,four 
Occurrences 2 and 3 would initially be omitted from the infoset because 
they are zero-length. Upon encountering occurrence 4, the parser would 
have to put occurrences 2 and 3 into the infoset with the xs:default value 
before putting 4 into the infoset. 
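
The bookkeeping for a) can be sketched in Python as follows. It is a simplified model under stated assumptions: the name parse_with_backfill is invented, the separator is ',', every occurrence is treated as optional, and the default value "dflt" stands in for whatever xs:default the model defines.

```python
def parse_with_backfill(data, default, separator=","):
    """Toy model of option a): remember skipped zero-length occurrences and
    emit them with the default once a later non-missing occurrence appears."""
    infoset = []
    skipped = 0  # zero-length occurrences seen since the last emitted item
    for occurrence in data.split(separator):
        if occurrence == "":
            skipped += 1
            continue
        # A non-missing occurrence: first backfill the ones skipped before it.
        infoset.extend([default] * skipped)
        skipped = 0
        infoset.append(occurrence)
    # Trailing zero-length occurrences are never backfilled, so they stay omitted.
    return infoset


print(parse_with_backfill("one,,,four", default="dflt"))
```

In this model ',item_value' yields the default followed by the value, while 'item_value,' yields the value alone, so rows 2 and 3 no longer collapse to the same infoset.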

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 

--
  dfdl-wg mailing list
  dfdl-wg@ogf.org
  http://www.ogf.org/mailman/listinfo/dfdl-wg
