RE: [dfdl-wg] How to handle multi-dimensional arrays - version 2

3 Mar 2005

I'd like to propose that the kind of thing you are talking about below, Jim,
where you get to decide what the presentation of the data is, is always done
using layers. The point is that this lets us pick a base-level DFDL/XSD
schema that is the one implied by the physical representation of the data,
and use that as the starting point for layer transformations.

Martin and I discussed this privately last fall at some point as the "XML
all the way down" approach. I.e., at each layer, the layer above gets to
think of the next layer down as if it were XML described by some XSD. The
trick is the bottom-most-layer. Here there is no distinction between the
shape of the actual physical representation and the shapes described in the
XSD. If the physical rep has certain nested hierarchy, then the base-level
XSD describes exactly that hierarchy and not any sort of mapping of it to
something different in shape. This lowest level one-to-one mapping is how we
get the induction of layer transformations started, and it allows for the
first such transformation to be no different conceptually from any higher
level. All transformations appear to be XML to XML conceptually, rather than
the bottom one being something-else to XML, followed by all higher-level XML
to XML ones.

Now, the alternative is to have the lowest-level model be something
different from XSD, and I'm sure this could be made workable too, but I
thought we were all reconciled that any such model would be so similar to
XSD, having parent-child hierarchy, choice, types, names, dimensions, etc.
that making it different from XSD would just be confusing. 

If you're with me so far, then the only nub was that XSD doesn't directly
model multi-dimensional arrays. Which brings us to the topic of the day.

So with that I've been forging ahead on the path of worrying only about the
lowest-layer schema where it is intended to correspond directly to the
structure of the data, and there I'd like to suggest that multi-dimensional
arrays have a base-level model in DFDL that looks like the one I proposed
first over a week ago, with attributes and all. E.g., like this:

<a x="5" y="7">4.32</a>
<a x="5" y="8">2.212</a>

This is a suitable representation for doing layered transformation into
whatever other form you'd like, as it lets xpath-like expressions retrieve
individual array elements by index positions.
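
To make the index-based retrieval concrete, here is a rough Python sketch using the standard library's ElementTree. The enclosing <array> element and the element_at helper are my own assumptions for illustration (the example above shows only the <a> elements), and ElementTree implements only a subset of XPath.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrapper element <array>; only the <a> elements come from
# the example above.
doc = ET.fromstring(
    '<array>'
    '<a x="5" y="7">4.32</a>'
    '<a x="5" y="8">2.212</a>'
    '</array>'
)


def element_at(root, x, y):
    """Return the value stored at index (x, y), or None if there is none."""
    # Two chained attribute predicates select one array element by position.
    node = root.find(f"a[@x='{x}'][@y='{y}']")
    return None if node is None else float(node.text)


print(element_at(doc, 5, 8))
```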

Of course if you actually convert your data into XML you'd pay the cost of
these attributes in the text size, but if you're using a DFDL API-based
system, then the implementation would be free to respond to requests for the
"..../a/@x" attribute by returning 5 (in the above example) without actually
using any storage for that attribute.
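
A minimal sketch of that idea in Python, not a proposal for the actual API: the class name VirtualArray, its methods, and the assumption that y varies fastest in storage are all hypothetical. Only the values are stored; x and y are computed from an element's position when somebody asks for them.

```python
class VirtualArray:
    """Flat value storage; the x/y 'attributes' are derived, never stored."""

    def __init__(self, values, x_bounds, y_bounds):
        self.values = list(values)
        self.x_lo, x_hi = x_bounds      # inclusive lower/upper bound of x
        self.y_lo, y_hi = y_bounds      # inclusive lower/upper bound of y
        self.y_len = y_hi - self.y_lo + 1
        x_len = x_hi - self.x_lo + 1
        if len(self.values) != x_len * self.y_len:
            raise ValueError("value count does not match the bounds")

    def attribute(self, position, name):
        """Answer a request for @x or @y of the element at a 0-based position."""
        if not 0 <= position < len(self.values):
            raise IndexError(position)
        if name == "x":
            return self.x_lo + position // self.y_len
        if name == "y":
            # Assumption: y is the fastest-varying dimension in storage.
            return self.y_lo + position % self.y_len
        raise KeyError(name)

    def text(self, position):
        """Return the element content at a 0-based position."""
        return self.values[position]


arr = VirtualArray([4.32, 2.212], x_bounds=(5, 5), y_bounds=(7, 8))
print(arr.attribute(0, "x"), arr.attribute(1, "y"), arr.text(1))
```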
...
-----Original Message-----
From: Myers, James D [mailto:jim.myers@pnl.gov] 
Sent: Thursday, March 03, 2005 10:38 AM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] How to handle multi-dimensional arrays - version 2
...
Here's a slightly different formulation of the
multi-dimension stuff.
...
1) no longer dictates the XSD for representing the array. 
This cuts both ways since you no longer really have an XSD
model for
...
multi-dimensional arrays. That is, it is up to the author of the DFDL
Schema to ensure the needed information about the array (coordinates
of each element) makes it into the logical model in a useful way.
I didn't realize we were proposing to extend the XML schema 
to have a multidimensional array type, versus providing a way 
for DFDL to read and internally represent a multidimensional 
array. The latter seems descriptive and the former prescriptive.
We have a different sense of descriptive and prescriptive. To me descriptive
does NOT mean you get to choose and populate an arbitrary XML schema from
data of any shape/form.  To me "descriptive" means you tell me what your
data looks like. But the schema you use to tell me that format is structured
directly as a function of what the data format is like. That is, the Schema
you provide must be written in a way which informs me (I'm the DFDL
processor here) about what the data looks like. The schema is highly
constrained in its shape/form by what the data's representation is. 

One of the members of my team describes this as "it's an input schema, not
an output schema."

It's a little confusing because DFDL will "prescribe" the schema, but does
not "prescribe" the format of the data described by that schema. Rather it
uses the "prescribed schema" to "describe" the data format. You do have
freedom in what your data format actually is; you don't have freedom to
choose what the description of it looks like. 

I think neither prescriptive nor descriptive data format says anything about
the data transformation problem, which is "how do I get data into a
specific form". That's just plain different, and I claim it is the same
whether the data started out in XML or in a file described by DFDL. Given
data with a certain logical structure, transform it into a different logical
structure. Use XQuery or XSLT or a variety of non-XML oriented technologies
to do so.

What makes all this confusing for DFDL is that we have some representations
that are complex enough to need layered multi-step descriptions, and once
you have that, there's no stopping you from using it to do all sorts of
transformation from one format to another. So it feels like you can have
your cake and eat it too, which is to say you can pick your XML Schema and
populate it from quite differently structured data. And that is probably
true, but at the bottom level of the stack of layers you have to have a
vocabulary and model for directly describing the structure of the data so as
to get the whole ball rolling. And at this bottom layer, the needs of
describing the data format completely dictate what the schema is like.  

Bringing this back to multidimensional arrays, it seems we have this choice:

1) add some new construct to XSD to support base-level description of
multi-dimensional arrays
2) choose constructs already present in XSD and use them in a canonical or
"base" way to describe multi-dimensional arrays

Currently I'm advocating (2), since there is enough expressive power there.
XML/XSD basically can provide an arbitrary map from element names and
attribute names and values onto values, and a multi-dimensional array is a
subset of this kind of mapping functionality. So we define a canonical way
of using these constructs to describe MD arrays and we're done.
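
As a sketch of what option (2) amounts to, the following Python turns a two-dimensional array into the attribute-based form using nothing beyond ordinary elements and attributes. The function name to_canonical, the <array> wrapper, and the lower-bound parameters are assumptions of mine, not part of any agreed DFDL design.

```python
import xml.etree.ElementTree as ET


def to_canonical(rows, x_lo=0, y_lo=0, tag="a"):
    """Map a list of rows onto <a x=".." y="..">value</a> elements."""
    root = ET.Element("array")          # hypothetical wrapper element
    for i, row in enumerate(rows):
        for j, value in enumerate(row):
            # The (x, y) attribute pair is the key; the text is the value.
            cell = ET.SubElement(root, tag, x=str(x_lo + i), y=str(y_lo + j))
            cell.text = repr(value)
    return root


canonical = to_canonical([[4.32, 2.212]], x_lo=5, y_lo=7)
print(ET.tostring(canonical, encoding="unicode"))
```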
...
...
2) I added in the complexity of calculating the array size,
actually
...
the lower and upper bounds of each dimension, dynamically based on 
data. This makes the example more real.
This still works out pretty well. I'm still pondering 
whether I like 
this better or not. I'm thinking about perhaps some sort of pseudo 
attributes which are guaranteed to be put into XML if you actually 
render to XML, but where a DFDL API-based implementation can choose 
not to realize them.
I think this example removes some of the prescriptive nature 
of the first one, but I'd like to be able to format my array 
however I want, e.g. as
<row><elem>3</elem><elem>2</elem></row>
<row><elem>5</elem><elem>6</elem></row>
...
Or even
<states><state>Alabama</state><state>Alaska</state></states>
<population><pop>34.2</pop><pop>10.6</pop></population>
....
(an array containing state names, population and other data, 
perhaps serialized in the file as all info for each state together).
If DFDL could separate the reading of such an array from how 
it is output in the schema, I could do any of this. Having 
multiple layers is a start - DFDL reads the array in to 
something that is addressable along the lines Mike proposes 
and then the contents of that layer are referenced via xpath 
to provide values in some structure I define in XSD. The only 
piece missing (I think) is that we haven't yet defined how to 
access iterators, i.e. if I have an element <elem minOccurs="1"
maxOccurs="5"> , how can I say that element n (n = 1...5) 
has dfdl:runtimevalue <a x="n" y="1">, which would put just 
the first column of a into the element sequence. If, in 
Mike's example, I could define the x and y dimensions 
independent of an array-reading context, just so I can use 
them in value references for dfdl:runtimevalue elements, I 
think we'd be all set.
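
A rough Python sketch of the iterator being asked for here, under the assumption that the base layer is the attribute-based form with a hypothetical <array> wrapper: occurrence n of <elem> takes its value from the base-layer element with x = n and a fixed y. The function column_as_elems and the <column> output element are invented for illustration.

```python
import xml.etree.ElementTree as ET

# Base layer holding the 2x2 example (rows 3,2 and 5,6) in attribute form.
base = ET.fromstring(
    '<array>'
    '<a x="1" y="1">3</a><a x="1" y="2">2</a>'
    '<a x="2" y="1">5</a><a x="2" y="2">6</a>'
    '</array>'
)


def column_as_elems(base_layer, y, max_occurs, tag="elem"):
    """Emit up to max_occurs <elem> children, one per x = 1..max_occurs."""
    out = ET.Element("column")          # hypothetical output element
    for n in range(1, max_occurs + 1):
        src = base_layer.find(f"a[@x='{n}'][@y='{y}']")
        if src is None:                 # ran past the end of the dimension
            break
        ET.SubElement(out, tag).text = src.text
    return out


first_column = column_as_elems(base, y=1, max_occurs=5)
print(ET.tostring(first_column, encoding="unicode"))
```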
This type of capability would allow all sorts of useful 
things - including the array to set of vectors conversions 
outlined here as well as subsampling, expansion/contraction 
of sparse arrays (where the array is stored as a sequence of 
x,y, value triples for only nonzero elements), etc.
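
For the sparse case, a small Python sketch of the expansion and contraction, assuming 0-based indices and a known array size (both assumptions of the sketch; the function names are hypothetical):

```python
def expand_sparse(triples, x_size, y_size, fill=0.0):
    """Build a dense x_size by y_size array from (x, y, value) triples."""
    dense = [[fill] * y_size for _ in range(x_size)]
    for x, y, value in triples:
        dense[x][y] = value
    return dense


def contract_dense(dense, fill=0.0):
    """Keep only the elements that differ from the fill value."""
    return [
        (x, y, value)
        for x, row in enumerate(dense)
        for y, value in enumerate(row)
        if value != fill
    ]


stored = [(0, 1, 4.32), (2, 0, 2.212)]
dense = expand_sparse(stored, x_size=3, y_size=2)
print(dense)
print(contract_dense(dense))
```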
One other minor point - if the order of x and y in the DFDL 
file is important (as it is in the example), do we need a 
<dfdl:array storageOrder="firstDimensionChangesFirst"> 
option? Or can we just list y first and then x?
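
To show what such an option would change, here is a Python sketch that maps a 0-based storage position to (x, y) under either ordering. The function and its parameters are hypothetical; only the option name firstDimensionChangesFirst comes from the question above.

```python
def coords(position, x_len, y_len, first_dimension_changes_first):
    """Return the 0-based (x, y) of the element at a storage position."""
    if first_dimension_changes_first:
        # x cycles fastest: storage order is (0,0), (1,0), ..., (0,1), ...
        return position % x_len, position // x_len
    # y cycles fastest: storage order is (0,0), (0,1), ..., (1,0), ...
    return position // y_len, position % y_len


for pos in range(6):
    print(pos, coords(pos, 2, 3, True), coords(pos, 2, 3, False))
```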
Jim

mike.beckerle＠ascentialsoftware.com