RE: [dfdl-wg] How to deal with variable length elements?

Hi Mike,

I still don't quite understand. Could you put it in the context of a more complete example?

Thanks,

Martin

_____

From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Mike Beckerle
Sent: Wednesday, April 12, 2006 12:16 PM
To: Westhead, Martin (Martin)
Cc: dfdl-wg@ggf.org; Robert E. McGrath; owner-dfdl-wg@ggf.org
Subject: RE: [dfdl-wg] How to deal with variable length elements?

The reason for the special @dfdl:index is the XPath rule that position() is always evaluated in the context of the expression containing the call to position(). We need to index another structure with our own index position. We can't do this in straight XPath. If we added a "Let x = position() in ..." construct (XQuery has this), then we wouldn't need @dfdl:index.

...mikeb

Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA 01581
voice and FAX 508-599-7148
home/mobile office 508-915-4795

"Westhead, Martin (Martin)" <westhead@avaya.com>
Sent by: owner-dfdl-wg@ggf.org
04/05/2006 10:26 AM
To: Mike Beckerle/Worcester/IBM@IBMUS, "Robert E. McGrath" <mcgrath@ncsa.uiuc.edu>
cc: <dfdl-wg@ggf.org>
Subject: RE: [dfdl-wg] How to deal with variable length elements?

Hi,

This looks fine to me (modulo old syntax) except I don't understand the need to introduce this construct: "@dfdl:index". I guess in today's terms we might think of this as a value in the context. However, I claim that we don't need it.

I think we can achieve the same effect with the XPath function position(). position() should tell you where you are in the current sequence. If you want to know where you are in the parent sequence you can use ../position(). If your index starts from 0 use position()-1. If you store two elements in your sequence for every index (e.g. flat array of x-y coordinates) use position()/2.

What does @dfdl:index do that you don't get out of position()?

Cheers,

Martin

_____

From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of Mike Beckerle
Sent: Friday, March 17, 2006 2:14 PM
To: Robert E. McGrath
Cc: dfdl-wg@ggf.org; owner-dfdl-wg@ggf.org
Subject: Re: [dfdl-wg] How to deal with variable length elements?

Hmmm. I think layered value calculation formulas which allow for a magic "myIndex" variable are perhaps an important device to make this class of layering possible. This makes the iteration over the elements implicit.

I have one example, which is the one where there is a vector of strings where the lengths of all the strings are stored first, separately from all the character data. A formula involving "myIndex" is used to glue the two pieces together.

Here's the example as per our prototype from last summer:

...mikeb

"Robert E. McGrath" <mcgrath@ncsa.uiuc.edu>
Sent by: owner-dfdl-wg@ggf.org
03/17/2006 10:54 AM
To: dfdl-wg@ggf.org
cc:
Subject: Re: [dfdl-wg] How to deal with variable length elements?

Following up on my email earlier this week: I think there was a major flaw in what I wrote, and it is quite an "interesting" challenge.

Let me review: I am thinking about how to describe reading data into a 1D array. Steve provided a markup for the XML element.

The challenge I'm looking at is that the data need not be an image of the memory layout. To give one example, a very sparse array might be stored as a series of (index, value) pairs for the non-empty places, all others implied to be zero or fill or whatever.

The goal is to have the XML array be fully populated from this sparse form (or whatever layout) on disk. (Please assume for now that this is a reasonable goal!)

The XML and DFDL will tell us the data type, and presumably we know the extent of the data on disk. But we need to decode the storage to generate all the element values and fills.

In my earlier email, I offered a description that included an 'Iterator' conversion. I now think this is inadequate.
In fact you need two cooperating 'Iterators'! Ick!

Here is my revised pipeline. Data is read from bottom to top. I sketch what each conversion is tasked to do.

I think the 'Decoder' needs to know info from both the 'Iterator' (it asks for each element in the order it wants them) and 'Float' (it tells the size of the 'value' to get).

==

<<XML element with multiple occurs: 1D array>>
    ^
    |
Iterator conversion:
    relevant props: minOccurs="0" maxOccurs="<<setting>>", et al
    Get 'maxOccurs' elements of type datatype.
    ^
    |
Float conversion:
    relevant props: data description of element
    Decode bytes
    ^
    |
Decoder conversion: produces the bytes of the 'nth' _value_ in the array.
    Input: what position is needed.
           may need separators and other props: depends on encoding
    Output: sizeof datatype bytes, the _value_
    Side effect: after whole array is read, consumes all the storage.
                 Difficult to characterize the intermediate state.
    ^
    |
Data: read as bytes
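
The sparse-array case described in this message can be sketched in Python as follows. This is only an illustration under assumptions that are not in the thread: the on-disk layout (little-endian uint32 index followed by a float32 value), the function name decode_sparse, and the default fill of 0.0 are all hypothetical. The loop over the stored pairs plays roughly the role of the 'Decoder', and the pre-filled result list stands in for the 'Iterator' that must yield every position, stored or not.

```python
import struct


def decode_sparse(data: bytes, extent: int, fill: float = 0.0) -> list[float]:
    """Expand stored (index, value) pairs into a fully populated 1D array.

    Assumed (hypothetical) layout: each pair is a little-endian uint32
    index followed by a float32 value, with no separators.
    """
    pair = struct.Struct("<If")
    if len(data) % pair.size != 0:
        raise ValueError("storage is not a whole number of (index, value) pairs")

    # Every position starts as the fill value; only stored positions change.
    array = [fill] * extent
    for index, value in pair.iter_unpack(data):
        if index >= extent:
            raise ValueError(f"stored index {index} is outside extent {extent}")
        array[index] = value
    return array


# Two non-empty places (positions 1 and 4) in an array of extent 6.
stored = struct.pack("<If", 1, 2.5) + struct.pack("<If", 4, -1.0)
dense = decode_sparse(stored, extent=6)
print(dense)  # [0.0, 2.5, 0.0, 0.0, -1.0, 0.0]
```

Note that the whole storage has to be consumed before any unfilled position can be reported with confidence, which matches the "side effect" remark in the pipeline sketch.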

I did some thinking on this topic. At least some IBMers thought this was coherent.

OK, the question is about this fragment of a DFDL schema that I sent before. This is from the 'stringWithAllLengthsFirst' example.

Note the <dfdl:storedLengthCalc>, which is old property syntax but gives the 'expression' that calculates the length of this string element. The element is an array named 'data' of strings, but the length of the array itself is stored elsewhere, and the lengths of the variable-length strings in this array are also stored elsewhere, in another array named 'rephdr/stringLengths'.

    <xs:element name="data" type="xs:string" maxOccurs="unbounded">
      <xs:annotation>
        <xs:appinfo source="http://dataformat.org/">
          <!-- dataFormat's about attribute lets you narrow the scope of the -->
          <!-- properties it defines. The allowed values are array and -->
          <!-- arrayElement. arrayElement is the default. -->
          <dfdl:dataFormat about="array" repLengthUnitKind="elements">
            <dfdl:storedLengthCalc> ../rephdr/count </dfdl:storedLengthCalc>
          </dfdl:dataFormat>
          <dfdl:dataFormat about="arrayElement" repLengthUnitKind="characters"
                           repType="text" charset="US-ASCII">
            <!-- Attributes in the DFDL namespace are special. They allow the -->
            <!-- DFDL author to access the Instance's runtime metadata. In this -->
            <!-- case we're using @dfdl:index, which stores the current Instance's -->
            <!-- position in its parent array. -->
            <dfdl:storedLengthCalc> ../../rephdr/stringLengths[@dfdl:index] </dfdl:storedLengthCalc>
          </dfdl:dataFormat>

Now suppose we changed:

    ../../rephdr/stringLengths[@dfdl:index]

to:

    ../../rephdr/stringLengths[position()]

This wouldn't mean the same thing. In this case position() is the position inside the rephdr/stringLengths vector, not in 'this vector I'm populating/parsing', which in the example is a vector of strings.

However, if we could write:

    Let $pos = position();   // in my context. This means 'my position'
    in ../../rephdr/stringLengths[$pos]

that would work and avoid us introducing a magic dfdl:index variable.

"Westhead, Martin (Martin)" <westhead@avaya.com>
04/12/2006 12:19 PM
To: Mike Beckerle/Worcester/IBM@IBMUS
cc: <dfdl-wg@ggf.org>, "Robert E. McGrath" <mcgrath@ncsa.uiuc.edu>, <owner-dfdl-wg@ggf.org>
Subject: RE: [dfdl-wg] How to deal with variable length elements?

Hi Mike,

I still don't quite understand. Could you put it in the context of a more complete example?

Thanks,

Martin
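
To make the 'stringWithAllLengthsFirst' discussion concrete, here is a minimal Python sketch. The byte layout is an assumption for illustration only (a 1-byte count, then that many 1-byte lengths, then the US-ASCII character data); the thread does not specify one, and the function names are hypothetical. The loop variable index plays the role of @dfdl:index: it is the position in the array being populated, used to look into a different array. It is 0-based here only because Python lists are; the thread does not say which base @dfdl:index uses. The second function mimics the XPath 1.0 rule that a numeric predicate [n] keeps the node whose own position equals n, which is why stringLengths[position()] cannot pick out a single length.

```python
def parse_all_lengths_first(data: bytes) -> list[str]:
    """Parse a record whose string lengths are all stored before the text.

    Assumed (hypothetical) layout: 1-byte count, `count` 1-byte lengths,
    then the characters of every string back to back in US-ASCII.
    """
    count = data[0]
    string_lengths = list(data[1:1 + count])
    if len(string_lengths) != count:
        raise ValueError("record is too short to hold all the lengths")

    offset = 1 + count
    strings = []
    for index in range(count):  # `index` stands in for @dfdl:index
        length = string_lengths[index]  # index into the *other* array
        chunk = data[offset:offset + length]
        if len(chunk) != length:
            raise ValueError(f"string {index} is truncated")
        strings.append(chunk.decode("ascii"))
        offset += length
    return strings


def select_by_own_position(nodes: list[int]) -> list[int]:
    """Mimic the XPath 1.0 predicate nodes[position()].

    position() is evaluated for each candidate node of `nodes`, so the
    test compares every node's position with itself and keeps them all.
    """
    kept = []
    for position, node in enumerate(nodes, start=1):
        predicate_value = position  # position() in the predicate's own context
        if predicate_value == position:
            kept.append(node)
    return kept


record = bytes([3, 3, 5, 2]) + b"abchelloxy"
strings = parse_all_lengths_first(record)
print(strings)  # ['abc', 'hello', 'xy']

all_lengths = select_by_own_position([3, 5, 2])
print(all_lengths)  # [3, 5, 2]
```

The second result shows the problem raised in the message: the predicate returns every stored length rather than the one belonging to the string currently being parsed, so some way of carrying the outer position into the predicate is needed.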
participants (2)
- Mike Beckerle
- Westhead, Martin (Martin)