I tend to agree that there are 2 inherent
logical structures in this scenario. DFDL scope, in my opinion, should
be restricted to parsing the physical stream and populating the logical
structure which is compliant with the structure of the physical stream, and
vice versa. We have numerous options and technologies (XSLT, XSD<->XSD
mappers, good old programming languages, XQuery) which do a pretty good job
of transforming one logical structure to another logical structure. Building
some kind of annotations which would allow a physical stream to map to
a completely different logical structure will make the DFDL language very
complex.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia@ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----
"Myers, James D" <jim.myers@pnl.gov>
Sent by: owner-dfdl-wg@ggf.org
11/19/2004 11:05 AM
To: dfdl-wg@gridforum.org
cc:
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
I was thinking that step
1 involved recognizing the <first/> and <data> elements and
creating a sequence of <myfirst>here's the data</myfirst>,
<mymiddle>more data</mymiddle> and <mylast>... elements,
and then assembling the new layer by some sort of choice that concatenates
the relevant myfirst, optional mymiddle, and mylast elements for each item.
I think that requires a way to
make a choice based on the <first/>, <middle/>, <last/>
elements and populate either a <myfirst>, <mymiddle>, or <mylast>
element (all subtypes of string?) with the contents of the following data
element, which I think we can do in DFDL. This is just our standard choice
flag that decides which of several options exist.
Then, I think you'd need logic
to decide how many elements represent one item, which I think we have,
followed by a way to concatenate these elements to produce a string source,
which again I think we have (same as saying a complex can be built from
two floats referenced from another layer instead of from a float stream).
This part is the same problem as having a text file where one <CR>
separates lines and <CR><CR> separates paragraphs and you want
to create single strings (from a variable number of lines) for each paragraph.
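The line/paragraph analogy above can be sketched in a few lines of Python. This is only an illustration of the grouping step, not anything DFDL defines: the function name `paragraphs` and the choice to join lines with a single space are assumptions.

```python
def paragraphs(text: str, sep: str = "\r") -> list[str]:
    """Build one string per paragraph.

    One `sep` ends a line; two consecutive `sep`s end a paragraph.
    Joining the lines with a space is an assumption of this sketch.
    """
    result = []
    for chunk in text.split(sep + sep):
        # Drop empty pieces left by leading or trailing separators.
        lines = [line for line in chunk.split(sep) if line]
        if lines:
            result.append(" ".join(lines))
    return result


sample = "line one\rline two\r\rsecond paragraph\rcontinues here"
print(paragraphs(sample))
# → ['line one line two', 'second paragraph continues here']
```

The same shape (a variable number of pieces collapsed into one string per group) is what the first/middle/last assembly needs.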
Again, I won't argue that this
is simple and fun, but I think the machinery exists and is the same as
that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf
Of mike.beckerle@ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
You are thinking along the lines
I was; however, the challenge is that I cannot find a way to do this using
multilayer so I'm uncomfortable suggesting that it's possible at all anymore.
Here's some reasoning why.
In particular, it's the intersection
of the induction across the items with the first, middle*, last thing,
and the spanning that seems to defy my efforts to cut it up into progressive
transformation layer by layer. In some conversations I've referred to this
problem as the "non-conforming trees" problem. The fundamental
shapes of the trees are not compatible, and expressing the transformation
between them isn't easily done via induction of any kind on one or the
other of the trees.
To me the First, Middle*, Last
thing is very problematic. It's effectively a little regular language (in
the formal sense) that has to be recognized. Generally this requires a
finite-state-machine, and what makes FSMs interesting and complex is always
the way you diagnose malformed data in addition to recognizing correct
data.
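To make the finite-state-machine point concrete, here is a minimal procedural sketch in Python of a recognizer for the pattern WHOLE | FIRST MIDDLE* LAST, including the malformed-data diagnosis. It assumes segments have already been reduced to (tag, data) pairs; the names `assemble_items` and `MalformedSegments` are hypothetical, and the sample text is illustrative, chosen so the pieces concatenate cleanly.

```python
class MalformedSegments(ValueError):
    """Raised when segment tags do not match WHOLE | FIRST MIDDLE* LAST."""


def assemble_items(segments):
    """Assemble logical items from (tag, data) pairs with a two-state machine.

    States: "idle" (between items) and "spanning" (inside FIRST..LAST).
    """
    items = []
    state = "idle"
    parts = []
    for index, (tag, data) in enumerate(segments):
        if state == "idle":
            if tag == "WHOLE":
                items.append(data)
            elif tag == "FIRST":
                parts = [data]
                state = "spanning"
            else:
                raise MalformedSegments(
                    f"segment {index}: unexpected {tag} between items")
        else:  # state == "spanning"
            if tag == "MIDDLE":
                parts.append(data)
            elif tag == "LAST":
                parts.append(data)
                items.append("".join(parts))
                state = "idle"
            else:
                raise MalformedSegments(
                    f"segment {index}: unexpected {tag} inside a spanning item")
    if state == "spanning":
        raise MalformedSegments("input ended before a LAST segment")
    return items


segments = [
    ("WHOLE", "The first item"),
    ("FIRST", "Thi"),
    ("MIDDLE", "s is the sec"),
    ("LAST", "ond item"),
    ("WHOLE", "The third"),
]
print(assemble_items(segments))
# → ['The first item', 'This is the second item', 'The third']
```

Note that more than half of the branches exist only to diagnose bad input, which is the part a purely declarative description tends to leave unspecified.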
Now, a finite-state-machine is,
to my mind, the ultimate procedural abstraction, the quintessential opposite
of "declarative" expression. To be declarative about an FSM you
end up saying "recognize this regular language", and providing
a description of the regular language, which is, of course, just begging
the question of how it actually works.
(And for us, we're not really
talking about a regular language of character text, but a pattern of usage
in the binary data layout that obeys the pattern of a regular language.
So it's not like having a little regular expression thing for validating
text strings helps with this problem.)
I guess I'm arguing that a black
box approach to this is not only acceptable, but is highly likely to be
the only "good" way to do it. In light of this I've suggested
a rep property called "streamFormat" (perhaps it should be renamed
"recordFormat"), which takes values from the set VS, V, VBS, FB,
FBS, etc., covering all these well-defined legacy data formats (there are 19
of them, I think). In addition, one should be able to extend this by
introducing a blackbox transformation.
And ... here's the rub...if that's
true for this case, then other "hard" examples like run-length
encoding seem also in this category.
There are several "leaps of
faith" just made in these arguments, so I'd still like people to take
this "XML challenge" and see if there's some magic I'm overlooking.
...mikeb
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
Without digging too much into the details,
I'd say this is an example where multi-layer comes in. The DFDL would describe
a hidden layer in which the first, middle, last data elements would be
identified and put into a list, and then that hidden list would be used
as the input to create items in the output layer.
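As a rough sketch of that two-layer idea, the Python below works on the simulated BLOCK/SEGMENT input from Mike's original message. It is ordinary procedural code, not DFDL: the `<ROOT>` wrapper, the function names `hidden_layer` and `output_layer`, and the sample text (chosen so the pieces concatenate cleanly) are all assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

# Sample input; the <ROOT> wrapper is an assumption so the text parses
# as a single XML document.
SAMPLE = """<ROOT>
  <BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>s is the sec</DATA></SEGMENT></BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>The third</DATA></SEGMENT>
  </BLOCK>
</ROOT>"""


def hidden_layer(xml_text):
    """Layer 1: flatten BLOCK/SEGMENT nesting into (tag, data) pairs."""
    root = ET.fromstring(xml_text)
    pairs = []
    for segment in root.iter("SEGMENT"):  # document order, across blocks
        # The type tag is the child element that is not DATA.
        tag = next(child.tag for child in segment if child.tag != "DATA")
        pairs.append((tag, segment.findtext("DATA", default="")))
    return pairs


def output_layer(pairs):
    """Layer 2: build ITEM elements from the hidden list.

    No validation is done here; well-formed input is assumed.
    """
    items, buffer = [], []
    for tag, data in pairs:
        buffer.append(data)
        if tag in ("WHOLE", "LAST"):  # these tags close an item
            item = ET.Element("ITEM")
            item.text = "".join(buffer)
            items.append(item)
            buffer = []
    return items


for item in output_layer(hidden_layer(SAMPLE)):
    print(ET.tostring(item, encoding="unicode"))
```

Once the hidden list exists, block boundaries no longer matter to the second layer, which is the simplification the layering is meant to buy.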
I think this is conceptually similar to
one of our run-length encoding examples (more complex of course). If you
read a sequence of ints and then a sequence of floats and need to output
a sequence of floats with int[i] repeats of float[i], it would be easiest
to create a hidden layer representing the int and float sequences and to
then produce output from that. If you don't think about a layer, even this
example gets painful - I need to read an int, skip forward somewhere to
find a float, skip back to get the next int, etc.
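For the run-length case, a small Python sketch shows how simple the output step becomes once the two sequences exist as a hidden layer. The function name `expand_runs` and the decision to reject negative counts are assumptions of this sketch.

```python
def expand_runs(counts, values):
    """Emit values[i] repeated counts[i] times, in order.

    `counts` and `values` play the role of the hidden layer: both
    sequences have been read in full before any output is produced.
    """
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    output = []
    for count, value in zip(counts, values):
        if count < 0:
            raise ValueError(f"negative repeat count: {count}")
        output.extend([value] * count)
    return output


print(expand_runs([2, 0, 3], [1.5, 9.0, -4.25]))
# → [1.5, 1.5, -4.25, -4.25, -4.25]
```

Without the intermediate lists, the same loop would have to seek back and forth between the int region and the float region of the stream.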
Mike's full example, not starting with the
XML-ized version, might be something that requires more than one layer
- read the original into something with the XML schema Mike defines, then
a layer making a sequence of data elements, and then something that has
the desired logical output.
I guess I would claim that this would not
be too bad a way to describe a fairly complex format in terms of a fairly
different logical structure. Whether one *should* do this in DFDL is a
separate question: it might make more sense to a) write a black box parser
to get to items, or b) use DFDL to get to the initial schema Mike wrote and
use XSLT afterwards to convert to the desired logical structure. I think
there are enough relatively simple cases where we need the multilayer
functionality in DFDL that we have to have it, which means it will then be
possible to deal with complex transformations in DFDL even if doing so is
not simple or practical.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf
Of mike.beckerle@ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg@gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem -
IBM Format VS records as XML
I've come up with a way to articulate
the difficulties I'm having with DFDL for complex file formats.
This problem may not be that hard
for someone with more XML, XPath or XQuery experience, so I'd appreciate
it if you could look it over and, if necessary, even run it by your resident
XML experts.
In case the emailer mangles all
the line lengths, I've also attached the below as a file.
<!-- Example motivated by DFDL
for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq
for details on IBM Format-VS -->
<!-- Logically, our data is
this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this
"logical" schema -->
<!-- But the below is the input data we're starting from. What you see
below simulates the structural issues of IBM Format-VS by converting
the problem into an XML to XML transformation problem -->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole item
(Duh!). This element is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the first part
of an item. -->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data from
the center of an item -->
<DATA>s is the sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from the
end of the item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/> <!-- This second segment in this block is a WHOLE segment.
However, in general the 2nd segment of a block could be a WHOLE or the
FIRST segment of another multi-segment multi-block spanning item -->
<DATA>The third</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or more
SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across segments
it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into the
desired logical form? Or is this beyond the call of duty for DFDL?
Goals include being as declarative as possible and, ideally, doing it
as a set of XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested)
for the input data structure -->