How to deal with variable length elements?


Robert E. McGrath

14 Mar 2006

3:25 p.m.

Folks,

I'm trying to build up a story about how to handle arrays using DFDL.

But first, I need to check if I understand the basics.

Here is an example of a 1D array in XML, modeled as two elements: an integer indicating how many elements, followed by an array of zero or more floats.

Looking at my XML textbooks, the following seems like the correct XML schema for this notion.

+++

<xs:element type="float" name="floatType" />

<xs:complexType name="floatArray">
  <xs:sequence>
    <xs:element name="nelems" type="int" />
    <xs:element name="x" ref="floatType" minOccurs="0" maxOccurs="./nelems" />
  </xs:sequence>
</xs:complexType>

+++

Do I have this correct? (I'm pretty sure the 'maxOccurs' isn't correct, so I hope someone will tell me the right way to do this.)

If so, the next question will be "how do I annotate this with DFDL?", e.g., when the data is precisely one binary (or text encoded) int, followed by some binary (or text encoded) floats.


Steve Hanson

14 Mar 2006

4:31 p.m.

New subject: [dfdl-wg] How to deal with variable length elements?

Hi Bob, quite timely, as I've been looking at DFDL array properties but limiting myself to 1-dim only, just to establish the basics. I've annotated your XML below with DFDL annotations to describe the array. I'm assuming that there is no markup (separators etc.) in your format, i.e., it is just an integer followed by floats. I fully realise that my scheme does not handle multi-dim or sparse arrays.

Your maxOccurs I've corrected to "unbounded". Remember that XML can always tell the number of items in an array by using the tags, hence there is never any need to include a count in XML.

+++

<xs:element type="float" name="floatType" />

<xs:complexType name="floatArray">
  <xs:sequence>
    <xs:element name="nelems" type="int">
      <xs:annotation><xs:appinfo source="http://dataformat.org">
        <dfdl:element repType="binaryInteger" signed="false"
                      lengthKind="fixed" length="4" />
      </xs:appinfo></xs:annotation>
    </xs:element>
    <xs:element name="x" ref="floatType" minOccurs="0" maxOccurs="unbounded">
      <xs:annotation><xs:appinfo source="http://dataformat.org">
        <dfdl:element repType="binaryFloat" floatType="IEEEExtendedIntel"
                      lengthKind="fixed" length="4"
                      occursDeterminedBy="xpath" occursPath="./nelems" />
      </xs:appinfo></xs:annotation>
    </xs:element>
  </xs:sequence>
</xs:complexType>

+++

Here's a second variation where the number of occurrences is fixed. If so, we assume maxOccurs holds the actual number. (That's up for debate; maybe we need a separate DFDL occurs count independent of min/maxOccurs?)

+++

<xs:complexType name="floatArray2">
  <xs:sequence>
    <xs:element name="x" ref="floatType" minOccurs="0" maxOccurs="10">
      <xs:annotation><xs:appinfo source="http://dataformat.org">
        <dfdl:element repType="binaryFloat" floatType="IEEEExtendedIntel"
                      lengthKind="fixed" length="4"
                      occursDeterminedBy="maxOccurs" />
      </xs:appinfo></xs:annotation>
    </xs:element>
  </xs:sequence>
</xs:complexType>

+++

Here's a third where the number of occurrences is given by a terminating value.

+++

<xs:complexType name="floatArray3">
  <xs:sequence>
    <xs:element name="x" ref="floatType" minOccurs="0" maxOccurs="unbounded">
      <xs:annotation><xs:appinfo source="http://dataformat.org">
        <dfdl:element repType="binaryFloat" floatType="IEEEExtendedIntel"
                      lengthKind="fixed" length="4"
                      occursDeterminedBy="value"
                      occursTerminatingValueKind="logical"
                      occursTerminatingValue="-99999" />
      </xs:appinfo></xs:annotation>
    </xs:element>
  </xs:sequence>
</xs:complexType>

+++

Here's the definition of DFDL occursDeterminedBy, which seems to me to capture all the possibilities for establishing the number:

"Enum. Valid values 'maxOccurs', 'xpath', 'value', 'markup'. Specifies how the actual number of occurrences is to be established. 'maxOccurs' means use the value of maxOccurs, 'xpath' means use the value of a named field earlier in the data, 'value' means there is a special terminating value, 'markup' means that separators and/or initiators dictate the number."

Regards, Steve

Steve Hanson
WebSphere Message Brokers, IBM United Kingdom Ltd, Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
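The three variations above ('xpath', 'maxOccurs', 'value') can be sketched in Python as follows. This is only an illustration under stated assumptions: floats are taken to be 4-byte little-endian IEEE values and the count a 4-byte unsigned integer, and the function name `read_float_array` and its parameters are hypothetical, not part of any DFDL implementation. The 'markup' mode is left out because the thread does not give a separator format.

```python
import struct

def read_float_array(data, occurs_determined_by, max_occurs=None, terminator=None):
    """Decode floats from bytes; the mode decides how many elements to read.

    Assumptions (not from the thread): 4-byte little-endian IEEE floats,
    and for 'xpath' a leading 4-byte unsigned count playing the role of 'nelems'.
    """
    offset = 0
    if occurs_determined_by == "xpath":
        # The count comes from a field earlier in the data.
        (count,) = struct.unpack_from("<I", data, offset)
        offset += 4
    elif occurs_determined_by == "maxOccurs":
        # The schema's maxOccurs is taken as the actual count.
        count = max_occurs
    elif occurs_determined_by == "value":
        # Unknown count: read until the terminating value is seen.
        count = None
    else:
        raise ValueError("unsupported mode: " + occurs_determined_by)

    values = []
    while count is None or len(values) < count:
        (x,) = struct.unpack_from("<f", data, offset)
        offset += 4
        if count is None and x == terminator:
            break  # the terminator itself is not part of the array
        values.append(x)
    return values
```

For example, `read_float_array(struct.pack("<I3f", 3, 1.0, 2.5, -4.0), "xpath")` reads the count first and then exactly three floats. In 'value' mode, running out of data before the terminator raises `struct.error`, which seems a reasonable way to flag malformed input in a sketch like this.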

Robert E. McGrath

6:55 p.m.


Thanks Steve. I'll look through these carefully.

On Tuesday 14 March 2006 10:31, Steve Hanson wrote:

...

I fully realise that my scheme does not handle multi-dim or sparse arrays.

This is OK. My approach is to show that we can handle 1D arrays, and then try to show that we can do multi-D arrays via a hidden 1D array. One step at a time! The second step may or may not be possible within a single DFDL schema. I don't know yet.

---
Robert E. McGrath, Ph.D.
National Center for Supercomputing Applications
University of Illinois, Urbana-Champaign
1205 West Clark
Urbana, Illinois 61801
(217)-333-6549
mcgrath@ncsa.uiuc.edu

Robert E. McGrath

15 Mar 2006

4:06 p.m.


Following Steve's sketch, how should this be represented as conversions?

This is a bit different from other examples because there is a "for each" operation here. Perhaps this could be abstractly viewed as:

==
<<XML element with multiple occurs, as shown yesterday>>

Iterator conversion:
    relevant props: minOccurs="0" maxOccurs="<<setting>>", et al.

Float conversion:
    relevant props: data description of element

Data: read as bytes
==

Here is why I want to break out the "for each" as a separate operation. We will need to deal with the case where the stored data is not necessarily a memory image of the desired XML element. I.e., the numbers might be in an alternative order, or might have implied values not in the data, or might not be contiguous in the storage.

In these cases, we want to substitute an alternative "Iterator" that understands where to find (or how to compute) the 'nth' element. But we still want to use the same float conversion for each element.

If we can support this, then we can let users deal with whatever clever storage schemes might be used, to generate a 1D array of elements in a known order. The latter can be used for further conversions or by XSL in a portable and generic way.
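The separation of an "Iterator" conversion from a "Float" conversion proposed above might look roughly like this in Python. It is a sketch only: the function names (`float_conversion`, `contiguous_iterator`, `reversed_iterator`, `read_array`) are hypothetical, the element encoding is assumed to be a 4-byte little-endian IEEE float, and the reversed layout is just one made-up instance of "numbers in an alternative order".

```python
import struct

def float_conversion(raw):
    """Decode one element (assumed: 4 bytes, little-endian IEEE float)."""
    return struct.unpack("<f", raw)[0]

def contiguous_iterator(data, count, size=4):
    """Default iterator: element n is stored at byte offset n * size."""
    for n in range(count):
        yield data[n * size:(n + 1) * size]

def reversed_iterator(data, count, size=4):
    """Alternative iterator: elements are stored last-to-first."""
    for n in range(count):
        start = (count - 1 - n) * size
        yield data[start:start + size]

def read_array(data, count, iterator, convert=float_conversion):
    """Apply the same per-element conversion whichever iterator is plugged in."""
    return [convert(raw) for raw in iterator(data, count)]
```

Swapping `contiguous_iterator` for `reversed_iterator` changes where each element is found without touching `float_conversion`, which is the point of breaking out the "for each" step.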

Robert E. McGrath

17 Mar 2006

3:54 p.m.


Following up on my email from earlier this week: I think there was a major flaw in what I wrote, and it is quite an "interesting" challenge.

Let me review: I am thinking about how to describe reading data into a 1D array. Steve provided a markup for the XML element. The challenge I'm looking at is that the data need not be an image of the memory layout.

To give one example, a very sparse array might be stored as a series of (index, value) pairs for the non-empty places, all others implied to be zero or fill or whatever. The goal is to have the XML array be fully populated from this sparse form (or whatever layout) on disk. (Please assume for now that this is a reasonable goal!)

The XML and DFDL will tell us the data type, and presumably we know the extent of the data on disk. But we need to decode the storage to generate all the element values and fills.

In my earlier email, I offered a description that included an 'Iterator' conversion. I now think this is inadequate. In fact you need two cooperating 'Iterators'! Ick!

Here is my revised pipeline. Data is read from bottom to top. I sketch what each conversion is tasked to do. I think the 'Decoder' needs to know info from both the 'Iterator' (it asks for each element in the order it wants them) and 'Float' (it tells the size of the 'value' to get).

==
<<XML element with multiple occurs: 1D array>>
    ^
    |
Iterator conversion:
    relevant props: minOccurs="0" maxOccurs="<<setting>>", et al.
    Get 'maxOccurs' elements of type datatype.
    ^
    |
Float conversion:
    relevant props: data description of element
    Decode bytes
    ^
    |
Decoder conversion: produces the bytes of the 'nth' _value_ in the array.
    Input: what position is needed.
           May need separators and other props: depends on encoding.
    Output: sizeof datatype bytes, the _value_
    Side effect: after the whole array is read, consumes all the storage.
                 Difficult to characterize the intermediate state.
    ^
    |
Data: read as bytes
==
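A minimal Python sketch of the revised pipeline for the sparse (index, value) example. Everything concrete here is an assumption made for illustration: each stored pair is taken to be a 4-byte unsigned index followed by a 4-byte little-endian float, the fill value is 0.0, and the names `SparseDecoder`, `nth`, and `iterate` are hypothetical.

```python
import struct

FILL = struct.pack("<f", 0.0)  # assumed fill for positions not stored

class SparseDecoder:
    """'Decoder' conversion: returns the bytes of the nth value on request."""

    def __init__(self, data, value_size=4):
        # Scan all (index, value) pairs once and remember each value's raw bytes.
        self.pairs = {}
        record = 4 + value_size
        for off in range(0, len(data), record):
            (index,) = struct.unpack_from("<I", data, off)
            self.pairs[index] = data[off + 4:off + record]

    def nth(self, n):
        # Stored positions return their bytes; all others return the fill.
        return self.pairs.get(n, FILL)

def float_conversion(raw):
    """'Float' conversion: decode the bytes handed over by the decoder."""
    return struct.unpack("<f", raw)[0]

def iterate(decoder, occurs):
    """'Iterator' conversion: ask for positions 0..occurs-1 in order."""
    return [float_conversion(decoder.nth(n)) for n in range(occurs)]
```

With two stored pairs, (1, 2.5) and (4, -1.0), `iterate(SparseDecoder(data), 6)` yields a fully populated six-element array. Reading all the storage up front sidesteps the "intermediate state" difficulty noted in the pipeline, at the cost of holding the whole index in memory.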

Mike Beckerle

7:14 p.m.


Hmmm. I think layered value calculation formulas which allow for a magic "myIndex" variable are perhaps an important device to make this class of layering possible. This makes the iteration over the elements implicit.

I have one example, which is the one where there is a vector of strings where the lengths of all the strings are stored first, separately from all the character data. A formula involving "myIndex" is used to glue the two pieces together.

Here's the example as per our prototype from last summer:

...mikeb
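The prototype example itself is not included in the archived message. As a stand-in, here is a hedged Python sketch of the idea described: all string lengths are stored first, then all character data, and a per-element formula in terms of the element's own index locates each string. The layout details (4-byte little-endian unsigned lengths, ASCII text) and the names `string_at` and `read_strings` are assumptions, not the prototype's actual syntax.

```python
import struct

def string_at(data, count, my_index):
    """Locate element 'my_index' using only the index and the stored lengths.

    Assumed layout: 'count' 4-byte unsigned lengths, then all characters.
    """
    lengths = struct.unpack_from("<%dI" % count, data, 0)
    chars_start = 4 * count
    # The formula: skip the characters of every earlier string.
    start = chars_start + sum(lengths[:my_index])
    return data[start:start + lengths[my_index]].decode("ascii")

def read_strings(data, count):
    """The iteration is implicit: the same formula is evaluated per index."""
    return [string_at(data, count, i) for i in range(count)]
```

Because each element's position is a pure function of its index, no explicit iterator state is carried between elements, which matches the remark that the iteration becomes implicit.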
