
I'd like to propose that the kind of thing you are talking about below, Jim, where you get to decide what the presentation of the data is, always be done using layers. The point is that this lets us pick a base-level DFDL/XSD schema that is the one implied by the physical representation of the data, and use that as the starting point for layer transformations.

Martin and I discussed this privately at some point last fall as the "XML all the way down" approach. That is, at each layer, the layer above gets to think of the next layer down as if it were XML described by some XSD. The trick is the bottom-most layer. Here there is no distinction between the shape of the actual physical representation and the shapes described in the XSD. If the physical representation has a certain nested hierarchy, then the base-level XSD describes exactly that hierarchy and not any sort of mapping of it to something different in shape.

This lowest-level one-to-one mapping is how we get the induction of layer transformations started, and it allows the first such transformation to be no different conceptually from any higher-level one. All transformations appear to be XML to XML conceptually, rather than the bottom one being something-else to XML, followed by XML-to-XML ones at all higher levels.

Now, the alternative is to have the lowest-level model be something different from XSD, and I'm sure this could be made workable too, but I thought we were all reconciled that any such model would be so similar to XSD (having parent-child hierarchy, choice, types, names, dimensions, etc.) that making it different from XSD would just be confusing. If you're with me so far, then the only nub was that XSD doesn't directly model multi-dimensional arrays. Which brings us to the topic of the day.
So with that, I've been forging ahead, worrying only about the lowest-layer schema, the one intended to correspond directly to the structure of the data. There I'd like to suggest that multi-dimensional arrays have a base-level model in DFDL that looks like the one I first proposed over a week ago, with attributes and all. E.g., like this:

    <a x="5" y="7">4.32</a>
    <a x="5" y="8">2.212</a>

This is a suitable representation for doing layered transformation into whatever other form you'd like, as it lets XPath-like expressions retrieve individual array elements by index position. Of course, if you actually convert your data into XML you pay the cost of these attributes in the text size, but if you're using a DFDL API-based system, the implementation would be free to respond to requests for the "..../a/@x" attribute, returning 5 in the above example, without actually using any storage for that attribute.
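To make the index-based retrieval concrete, here is a minimal Python sketch using the standard library's xml.etree.ElementTree. The `<array>` wrapper element and the helper name `get_cell` are hypothetical, introduced only for illustration; only the `<a x=".." y="..">` shape comes from the example above.

```python
import xml.etree.ElementTree as ET

# Base-level rendering of a 2-D array. The <array> wrapper is an assumption,
# added only so that the fragment is a well-formed document.
DOC = """
<array>
  <a x="5" y="7">4.32</a>
  <a x="5" y="8">2.212</a>
</array>
"""

def get_cell(root, x, y):
    """Return the value at index (x, y), or None if no such element exists."""
    # ElementTree supports a limited XPath subset, including attribute predicates.
    node = root.find(f"./a[@x='{x}'][@y='{y}']")
    return None if node is None else float(node.text)

root = ET.fromstring(DOC)
print(get_cell(root, 5, 8))
```

A full XPath engine would allow richer expressions, but the attribute predicates are enough to show that each array element is addressable by its index values.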
-----Original Message-----
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Thursday, March 03, 2005 10:38 AM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] How to handle multi-dimensional arrays - version 2
Here's a slightly different formulation of the multi-dimension stuff.
1) no longer dictates the XSD for representing the array. This cuts both ways, since you no longer really have an XSD model for multi-dimensional arrays. That is, it is up to the author of the DFDL schema to ensure that the needed information about the array (the coordinates of each element) makes it into the logical model in a useful way.
I didn't realize we were proposing to extend the XML schema to have a multidimensional array type, versus providing a way for DFDL to read and internally represent a multidimensional array. The latter seems descriptive and the former prescriptive.
We have a different sense of descriptive and prescriptive. To me, descriptive does NOT mean you get to choose and populate an arbitrary XML schema from data of any shape/form. To me, "descriptive" means you tell me what your data looks like. But the schema you use to tell me that format is structured directly as a function of what the data format is like. That is, the schema you provide must be written in a way which informs me (I'm the DFDL processor here) about what the data looks like. The schema is highly constrained in its shape/form by what the data's representation is. One of the members of my team describes this as "it's an input schema, not an output schema."

It's a little confusing because DFDL will "prescribe" the schema, but does not "prescribe" the format of the data described by that schema. Rather, it uses the "prescribed schema" to "describe" the data format. You do have freedom in what your data format actually is; you don't have freedom to choose what the description of it looks like.

I think neither prescriptive nor descriptive data format says anything about the data transformation problem, which is "how do I get data into a specific form". That's just plain different, and I claim it is the same whether the data started out in XML or in a file described by DFDL: given data with a certain logical structure, transform it into a different logical structure. Use XQuery or XSLT or a variety of non-XML-oriented technologies to do so.

What makes all this confusing for DFDL is that we have some representations that are complex enough to need layered multi-step descriptions, and once you have that, there's no stopping you from using it to do all sorts of transformation from one format to another. So it feels like you can have your cake and eat it too, which is to say you can pick your XML Schema and populate it from quite differently structured data.
And that is probably true, but at the bottom level of the stack of layers you have to have a vocabulary and model for directly describing the structure of the data so as to get the whole ball rolling. And at this bottom layer, the needs of describing the data format completely dictate what the schema is like.

Bringing this back to multidimensional arrays, it seems we have this choice:

1) add some new construct to XSD to support base-level description of multi-dimensional arrays
2) choose constructs already present in XSD and use them in a canonical or "base" way to describe multi-dimensional arrays

Currently I'm advocating (2), since there is enough expressive power there. XML/XSD basically can provide an arbitrary map from element names and attribute names and values onto values, and a multi-dimensional array is a subset of this kind of mapping functionality. So we define a canonical way of using these constructs to describe MD arrays and we're done.
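The "array as a subset of a general map" idea can be sketched in Python as follows. This is only an illustration under assumptions: the `<array>` wrapper and the function names `as_index_map` and `is_dense` are hypothetical, and the density check is one possible reading of what a canonical use of the attributes might require.

```python
import xml.etree.ElementTree as ET
from itertools import product

def as_index_map(root):
    """Read canonical <a x=".." y=".."> children into {(x, y): value}."""
    cells = {}
    for node in root.findall("a"):
        key = (int(node.get("x")), int(node.get("y")))
        if key in cells:
            raise ValueError(f"duplicate index {key}")
        cells[key] = float(node.text)
    return cells

def is_dense(cells):
    """True if the keys cover a full rectangle of indices with no holes."""
    if not cells:
        return True
    xs = [x for x, _ in cells]
    ys = [y for _, y in cells]
    expected = set(product(range(min(xs), max(xs) + 1),
                           range(min(ys), max(ys) + 1)))
    return set(cells) == expected

# The <array> wrapper is an assumption made so the input is well-formed.
root = ET.fromstring(
    '<array><a x="5" y="7">4.32</a><a x="5" y="8">2.212</a></array>'
)
cells = as_index_map(root)
print(cells[(5, 8)], is_dense(cells))
```

A general XML document is the unconstrained map; the array case is the same map with integer-tuple keys that fill a rectangle.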
2) I added in the complexity of calculating the array size, actually the lower and upper bounds of each dimension, dynamically based on data. This makes the example more real.
This still works out pretty well. I'm still pondering whether I like this better or not. I'm thinking about perhaps some sort of pseudo attributes which are guaranteed to be put into XML if you actually render to XML, but where a DFDL API-based implementation can choose not to realize them.
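One way to picture such pseudo-attributes is the Python sketch below. The class name `LazyArray`, its methods, and the choice that y varies fastest in storage are all assumptions made for illustration, not anything DFDL defines.

```python
class LazyArray:
    """A 2-D array over flat storage whose x/y 'attributes' are computed."""

    def __init__(self, values, x_bounds, y_bounds):
        self.values = list(values)
        self.x_lo, self.x_hi = x_bounds  # inclusive lower/upper bounds
        self.y_lo, self.y_hi = y_bounds
        self.y_len = self.y_hi - self.y_lo + 1
        x_len = self.x_hi - self.x_lo + 1
        if len(self.values) != x_len * self.y_len:
            raise ValueError("value count does not match the bounds")

    def attr(self, position, name):
        """Answer a request for the x or y of the element at a flat position."""
        # Assumes y varies fastest in storage; this is a choice of the sketch.
        x_off, y_off = divmod(position, self.y_len)
        return {"x": self.x_lo + x_off, "y": self.y_lo + y_off}[name]

    def to_xml(self):
        """Realize the pseudo-attributes only when rendering to XML text."""
        return "\n".join(
            f'<a x="{self.attr(i, "x")}" y="{self.attr(i, "y")}">{v}</a>'
            for i, v in enumerate(self.values)
        )

arr = LazyArray([4.32, 2.212], x_bounds=(5, 5), y_bounds=(7, 8))
print(arr.attr(0, "x"))
print(arr.to_xml())
```

The index values occupy no storage per element; they appear as attribute text only when `to_xml` is called.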
I think this example removes some of the prescriptive nature of the first one, but I'd like to be able to format my array however I want, e.g. as
<row><elem>3</elem><elem>2</elem></row> <row><elem>5</elem><elem>6</elem></row> ...
Or even
<states><state>Alabama</state><state>Alaska</state></states> <population><pop>34.2</pop><pop>10.6</pop></population> ....
(an array containing state names, population and other data, perhaps serialized in the file as all info for each state together).
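As a rough Python sketch of that regrouping, assuming the file holds one record per state, the per-state records can be turned into the column-wise shape. The `<data>` wrapper and the function name `to_columns` are hypothetical, and the numbers are just the placeholder values from the example.

```python
import xml.etree.ElementTree as ET

# Records as they might sit in the file: all fields for one state together.
# The numbers are the placeholder values from the example above.
records = [("Alabama", 34.2), ("Alaska", 10.6)]

def to_columns(records):
    """Build <states> and <population> siblings from per-state records."""
    root = ET.Element("data")  # hypothetical wrapper element
    states = ET.SubElement(root, "states")
    population = ET.SubElement(root, "population")
    for name, pop in records:
        ET.SubElement(states, "state").text = name
        ET.SubElement(population, "pop").text = str(pop)
    return root

print(ET.tostring(to_columns(records), encoding="unicode"))
```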
If DFDL could separate the reading of such an array from how it is output in the schema, I could do any of this. Having multiple layers is a start - DFDL reads the array into something that is addressable along the lines Mike proposes, and then the contents of that layer are referenced via XPath to provide values in some structure I define in XSD. The only piece missing (I think) is that we haven't yet defined how to access iterators, i.e. if I have an element <elem minOccurs="1" maxOccurs="5">, how can I say that element n (n = 1...5) has dfdl:runtimevalue <a x="n" y="1">, which would put just the first column of a into the element sequence. If, in Mike's example, I could define the x and y dimensions independent of an array-reading context, just so I can use them in value references for dfdl:runtimevalue elements, I think we'd be all set.
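The iterator being asked for can be sketched in Python like this. It is an illustration only: the `<array>` and `<col>` element names and the function `column` are hypothetical, the sample values are borrowed from the row example earlier in this message, and nothing here reflects actual DFDL syntax.

```python
import xml.etree.ElementTree as ET

# Base layer in the attribute-indexed form; <array> is an assumed wrapper.
base = ET.fromstring(
    "<array>"
    '<a x="1" y="1">3</a><a x="1" y="2">2</a>'
    '<a x="2" y="1">5</a><a x="2" y="2">6</a>'
    "</array>"
)

def column(base, y, n_min, n_max):
    """Emit one <elem> per n in n_min..n_max, valued from a[@x=n][@y=y]."""
    out = ET.Element("col")  # hypothetical output element
    for n in range(n_min, n_max + 1):
        cell = base.find(f"./a[@x='{n}'][@y='{y}']")
        if cell is None:
            break  # fewer occurrences than the upper bound allows
        ET.SubElement(out, "elem").text = cell.text
    return out

print(ET.tostring(column(base, y=1, n_min=1, n_max=5), encoding="unicode"))
```

The loop variable n plays the role of the iterator: it exists independently of how the array was read and is used only to build the value reference.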
This type of capability would allow all sorts of useful things - including the array-to-set-of-vectors conversions outlined here, as well as subsampling, expansion/contraction of sparse arrays (where the array is stored as a sequence of x, y, value triples for only the nonzero elements), etc.
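For the sparse case, a minimal Python sketch of expansion and contraction might look like the following. The function names, the 0-based indexing, and the sample triples are assumptions for illustration.

```python
def expand(triples, x_len, y_len, fill=0.0):
    """Expand (x, y, value) triples into a dense list of rows (0-based)."""
    dense = [[fill] * y_len for _ in range(x_len)]
    for x, y, value in triples:
        dense[x][y] = value
    return dense

def contract(dense, fill=0.0):
    """Inverse: keep only the entries that differ from the fill value."""
    return [(x, y, v)
            for x, row in enumerate(dense)
            for y, v in enumerate(row)
            if v != fill]

triples = [(0, 1, 4.32), (2, 0, 2.212)]
dense = expand(triples, x_len=3, y_len=2)
print(dense)
```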
One other minor point - if the order of x and y in the DFDL file is important (as it is in the example), do we need a <dfdl:array storageOrder="firstDimensionChangesFirst"> option? Or can we just list y first and then x?
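What such a storage-order option would control can be sketched in Python as below. The function name `index_of` and its boolean parameter are hypothetical stand-ins for the proposed option, and 0-based offsets and indices are assumed.

```python
def index_of(offset, x_len, y_len, first_dimension_changes_first=True):
    """Map a 0-based flat offset to a 0-based (x, y) index."""
    if first_dimension_changes_first:
        y, x = divmod(offset, x_len)   # x varies fastest
    else:
        x, y = divmod(offset, y_len)   # y varies fastest
    return x, y

# The same six stored values land at different indices under each order.
for order in (True, False):
    print(order, [index_of(i, 2, 3, order) for i in range(6)])
```

Either an explicit option or the listing order of the dimensions would have to pin down which branch applies.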
Jim