You are thinking along the lines I was; however, the
challenge is that I cannot find a way to do this using multilayer, so I'm
no longer comfortable suggesting that it's possible at all. Here's my
reasoning.
In particular, it's the intersection of the induction
across the items with the first, middle*, last thing, and the spanning, that
seems to defy my efforts to cut it up into a progressive transformation, layer by
layer. In some conversations I've referred to this problem as the
"non-conforming trees" problem. The fundamental shapes of the trees are not
compatible, and expressing the transformation between them isn't easily done
via induction of any kind on one or the other of the
trees.
To me the First, Middle*, Last thing is very problematic.
It's effectively a little regular language (in the formal sense) that has to
be recognized. Generally this requires a finite-state-machine, and what makes
FSMs interesting and complex is always the way you diagnose malformed
data in addition to recognizing correct data.
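A minimal sketch of the kind of recognizer this implies, in Python. It assumes the segments have already been reduced to (tag, data) pairs; the state names, the transition table, and the function name `assemble_items` are hypothetical and chosen only for illustration. The tags mirror the WHOLE/FIRST/MIDDLE/LAST markers from the XML example further down the thread.

```python
from typing import Iterable, List, Tuple

# Hypothetical state names for the recognizer.
IDLE, SPANNING = "IDLE", "SPANNING"

# Transition table: (state, tag) -> next state. Any (state, tag) pair
# missing from this table is treated as malformed input.
TRANSITIONS = {
    (IDLE, "WHOLE"): IDLE,
    (IDLE, "FIRST"): SPANNING,
    (SPANNING, "MIDDLE"): SPANNING,
    (SPANNING, "LAST"): IDLE,
}


def assemble_items(segments: Iterable[Tuple[str, str]]) -> List[str]:
    """Recognize (WHOLE | FIRST MIDDLE* LAST)* and rebuild the items."""
    state = IDLE
    parts: List[str] = []
    items: List[str] = []
    for index, (tag, data) in enumerate(segments):
        next_state = TRANSITIONS.get((state, tag))
        if next_state is None:
            # Diagnosing malformed data is where most of the work goes.
            raise ValueError(
                f"segment {index}: unexpected {tag} while in state {state}"
            )
        parts.append(data)
        if next_state == IDLE:
            # A WHOLE or LAST segment completes the current item.
            items.append("".join(parts))
            parts = []
        state = next_state
    if state != IDLE:
        raise ValueError("input ended in the middle of a spanning item")
    return items
```

Note that the recognizer itself is only the four-entry table; the rest is bookkeeping and error reporting, which is the part a purely declarative description tends to leave unspecified.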
Now, a finite-state-machine is, to my mind, the ultimate
procedural abstraction, the quintessential opposite of "declarative"
expression. To be declarative about an FSM you end up saying "recognize this
regular language", and providing a description of the regular language, which
is, of course, just begging the question of how it actually works.
(And for us, we're not really talking about a regular
language of character text, but a pattern of usage in the binary data layout
that obeys the pattern of a regular language. So it's not like having a little
regular expression thing for validating text strings helps with this
problem.)
I guess I'm arguing that a black box approach to this is
not only acceptable, but is highly likely to be the only "good" way to do it.
In light of this I've suggested a rep property called "streamFormat" (perhaps
it should be renamed "recordFormat"), which takes values from the set VS, V, VBS,
FB, FBS, etc.: all the well-defined legacy data formats (there are 19 of
them, I think). In addition, one should be able to extend this set by
introducing a blackbox transformation.
And ... here's the rub...if that's true for this case,
then other "hard" examples like run-length encoding seem to fall into this
category as well.
There are several "leaps of faith" in these
arguments, so I'd still like people to take this "XML challenge" and see if
there's some magic I'm overlooking.
...mikeb
Without digging too
much into the details, I'd say this is an example where multi-layer comes
in. The DFDL would describe a hidden layer in which the first, middle, last
data elements would be identified and put into a list, and then that hidden
list would be used as the input to create items in the output
layer.
I think this is
conceptually similar to one of our run-length encoding examples (more
complex of course). If you read a sequence of ints and then a sequence of
floats and need to output a sequence of floats with int[i] repeats of
float[i], it would be easiest to create a hidden layer representing the int
and float sequences and to then produce output from that. If you don't think
about a layer, even this example gets painful - I need to read an int, skip
forward somewhere to find a float, skip back to get the next int,
etc.
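The two-step idea above can be sketched roughly as follows, in Python. The binary layout is an assumption made only for illustration (n big-endian 4-byte ints followed by n big-endian 4-byte floats), and the function names `read_hidden_layer` and `expand` are hypothetical.

```python
import struct
from typing import List, Tuple


def read_hidden_layer(buf: bytes, n: int) -> Tuple[List[int], List[float]]:
    """Hidden layer: parse n ints, then n floats, into two plain lists.

    Assumed layout (not from any spec): big-endian, 4 bytes per value.
    """
    counts = list(struct.unpack_from(f">{n}i", buf, 0))
    values = list(struct.unpack_from(f">{n}f", buf, 4 * n))
    return counts, values


def expand(counts: List[int], values: List[float]) -> List[float]:
    """Output layer: emit counts[i] repeats of values[i]."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    out: List[float] = []
    for count, value in zip(counts, values):
        if count < 0:
            raise ValueError(f"negative repeat count: {count}")
        out.extend([value] * count)
    return out
```

Once the hidden layer exists, the output step is a simple induction over paired elements; without it, the same logic has to interleave seeks between the two regions of the input.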
Mike's full
example, not starting with the XML-ized version, might be something
that requires more than one layer - read the original into something with
the XML schema Mike defines, then a layer making a sequence of data
elements, and then something that has the desired logical
output.
I guess I would
claim that this would not be too bad a way to describe a fairly complex
format in terms of a fairly different logical structure. Whether one
*should* do this in DFDL, or whether it would make more sense to a) write a
black box parser to get to items, or b) use DFDL to get to the initial
schema Mike wrote and use XSLT afterwards to convert to the desired logical
structure, is a separate question. I think there are enough relatively simple
cases where we need the multilayer functionality in DFDL that we have to have it,
which means it will then be possible to deal with complex transformations in
DFDL even if doing so is not simple or practical.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle@ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg@gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
I've come up with a way to articulate the difficulties I'm having
with DFDL for complex file formats.
This problem may not be that hard for someone with more XML, XPath
or XQuery experience, so I'd appreciate it if you could look it over
and if necessary even run it by your resident XML
experts.
In case the emailer mangles all the line lengths, I've also
attached the below as a file.
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema
-->
<sequence>
<element name="ITEM" type="string"
minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
<!-- But the data below is the input we're starting from. What
you see below simulates the structural issues of IBM Format-VS,
while converting the problem into an XML-to-XML transformation problem
-->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This
element is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the first part of an item.
-->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data from the center of an
item -->
<DATA>s is t</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/>
<DATA>he sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from the end of the
item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/><!-- This second
segment in this block is a WHOLE segment. However
in general the 2nd segment of a block could be a WHOLE or the
FIRST segment of another multi-segment multi-block spanning item
-->
<DATA>Third item</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into
BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs
-->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2
or more SEGMENTs -->
<!-- Spanning data is broken on arbitrary
boundaries across segments it spans -->
<!-- Spanning involves a
FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means
zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into
the desired logical form?
Or is this beyond the call of duty for DFDL?
Goals include being as declarative as possible and, ideally, doing it
as a set of XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure
-->
<complexType
name="Format_VS_t">
<sequence>
<element name="BLOCK" type="Block_t" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
</complexType>
<complexType
name="Block_t">
<sequence>
<element name="SEGMENT" type="Segment_t" minOccurs="1"
maxOccurs="2"/>
</sequence>
</complexType>
<complexType
name="Segment_t">
<sequence>
<choice>
<element name="WHOLE"/>
<element name="FIRST"/>
<element name="LAST"/>
<element name="MIDDLE"/>
</choice>
<element name="DATA"
type="string"/>
</sequence>
</complexType>
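For comparison, here is a rough procedural sketch in Python of the XML-to-XML step on the sample input, using the standard library's ElementTree. It is one possible "black box" treatment, not a DFDL answer to the question. The wrapping root element `Format_VS` and the function name `to_logical` are assumptions (the schema above only defines types), and the sketch assumes well-formed FIRST, MIDDLE*, LAST ordering rather than diagnosing violations.

```python
import xml.etree.ElementTree as ET
from typing import List

# Sample input from this message, wrapped in a hypothetical root element.
SAMPLE = """<Format_VS>
<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
<BLOCK>
<SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
<SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
</BLOCK>
</Format_VS>"""


def to_logical(xml_text: str) -> List[str]:
    """Flatten BLOCK/SEGMENT structure and join spanning DATA into items."""
    root = ET.fromstring(xml_text)
    items: List[str] = []
    parts: List[str] = []
    # Block boundaries carry no logical meaning, so walk every SEGMENT
    # in document order regardless of which BLOCK holds it.
    for segment in root.iterfind("BLOCK/SEGMENT"):
        tag = segment[0].tag  # first child is the type tag element
        parts.append(segment.findtext("DATA", default=""))
        if tag in ("WHOLE", "LAST"):
            # WHOLE and LAST both terminate an item.
            items.append("".join(parts))
            parts = []
    return items


def to_item_xml(items: List[str]) -> str:
    """Serialize each logical item as an ITEM element, one per line."""
    lines = []
    for text in items:
        item = ET.Element("ITEM")
        item.text = text
        lines.append(ET.tostring(item, encoding="unicode"))
    return "\n".join(lines)
```

The point of the sketch is that the transformation is short once it is allowed to carry state (`parts`) across segments and blocks; the open question in this thread is how to say the same thing declaratively.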