Without digging too much into the details, I'd say this is an example where
multiple layers come in. The DFDL would describe a hidden layer in which the
first, middle, and last data elements are identified and put into a list, and
that hidden list would then be used as the input to create items in the output
layer.

I think this is conceptually similar to one of our run-length encoding examples
(though more complex, of course). If you read a sequence of ints and then a
sequence of floats and need to output a sequence of floats with int[i] repeats
of float[i], it would be easiest to create a hidden layer representing the int
and float sequences and then produce output from that. If you don't think in
terms of a layer, even this example gets painful - I need to read an int, skip
forward somewhere to find a float, skip back to get the next int, etc.
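
To make the run-length case concrete, here is a minimal sketch in Python rather than DFDL. It assumes the raw input holds n ints followed by n floats and that n is known up front; the function names are made up for illustration and are not part of any DFDL proposal.

```python
from typing import List, Sequence, Tuple


def read_hidden_layer(raw: Sequence[float], n: int) -> Tuple[List[int], List[float]]:
    """Hidden layer: split the raw stream into the int and float sequences.

    Assumption: the first n entries are repeat counts, the next n are values.
    """
    counts = [int(x) for x in raw[:n]]
    values = [float(x) for x in raw[n:2 * n]]
    return counts, values


def expand(counts: Sequence[int], values: Sequence[float]) -> List[float]:
    """Output layer: emit values[i] repeated counts[i] times."""
    out: List[float] = []
    for count, value in zip(counts, values):
        out.extend([value] * count)
    return out


raw = [2, 1, 3, 0.5, 1.5, 2.5]
counts, values = read_hidden_layer(raw, 3)
result = expand(counts, values)
print(result)
```

Once the hidden layer exists, the output step is a plain loop over paired sequences, with no skipping back and forth in the input.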

Mike's full example, not starting with the XML-ized version, might be something
that requires more than one layer - read the original into something with the
XML schema Mike defines, then a layer making a sequence of data elements, and
then something that has the desired logical output.
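
The layering can be sketched outside DFDL to show the shape of the idea. The Python below is only an illustration under stated assumptions: the BLOCK elements from Mike's XML-ized input are wrapped in a made-up ROOT element so they parse as one document, the first child of each SEGMENT is taken to be its type tag, and the function names are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

# Mike's sample input, wrapped in a hypothetical ROOT element.
SAMPLE = """<ROOT>
<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
<BLOCK>
  <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
  <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
</BLOCK>
</ROOT>"""


def segment_layer(root: ET.Element) -> List[Tuple[str, str]]:
    """Hidden layer: flatten all blocks into (type tag, data) pairs in order."""
    segments = []
    for seg in root.iter("SEGMENT"):
        kind = list(seg)[0].tag  # assumed: the type tag is the first child
        segments.append((kind, seg.findtext("DATA", default="")))
    return segments


def item_layer(segments: List[Tuple[str, str]]) -> List[str]:
    """Output layer: join FIRST, MIDDLE*, LAST runs; pass WHOLE through."""
    items: List[str] = []
    parts: Optional[List[str]] = None  # None means no spanning item is open
    for kind, data in segments:
        if kind in ("WHOLE", "FIRST"):
            if parts is not None:
                raise ValueError(f"{kind} inside an unfinished spanning item")
            if kind == "WHOLE":
                items.append(data)
            else:
                parts = [data]
        elif kind in ("MIDDLE", "LAST"):
            if parts is None:
                raise ValueError(f"{kind} without a preceding FIRST")
            parts.append(data)
            if kind == "LAST":
                items.append("".join(parts))
                parts = None
        else:
            raise ValueError(f"unknown segment type {kind}")
    if parts is not None:
        raise ValueError("input ended inside a spanning item")
    return items


items = item_layer(segment_layer(ET.fromstring(SAMPLE)))
print(items)
```

The block boundaries disappear in the hidden layer, which is why the output layer never needs to know whether a LAST segment shared its block with another segment.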

I guess I would claim that this would not be too bad a way to describe a fairly
complex format in terms of a fairly different logical structure. Whether one
*should* do this in DFDL is a separate question: it might make more sense to
a) write a black box parser to get to items, or b) use DFDL to get to the
initial schema Mike wrote and use XSLT afterwards to convert to the desired
logical structure. I think there are enough relatively simple cases where we
need the multilayer functionality in DFDL that we have to have it, which means
it will then be possible to deal with complex transformations in DFDL even if
doing so is not simple or practical.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of mike.beckerle@ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg@gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML

I've come up with a way to articulate the difficulties I'm having with DFDL for
complex file formats.

This problem may not be that hard for someone with more XML, XPath, or XQuery
experience, so I'd appreciate it if you could look it over and, if necessary,
even run it by your resident XML experts.

In case the emailer mangles all the line lengths, I've also attached the below
as a file.

<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>Third item</ITEM>
<!-- That is, data having this "logical" schema -->
<sequence>
  <element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<!-- But the below is the input data we're starting from. What you see below
     simulates the structural issues of IBM Format-VS, by converting the
     problem into an XML-to-XML transformation problem. -->
<BLOCK>
  <SEGMENT>
    <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This element
                  is really a type tag. -->
    <DATA>The first item</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <FIRST/> <!-- a FIRST segment holds the first part of an item. -->
    <DATA>Thi</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/> <!-- a MIDDLE segment holds data from the center of an item -->
    <DATA>s is t</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>he sec</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <LAST/> <!-- a LAST segment holds data from the end of the item. -->
    <DATA>ond item</DATA>
  </SEGMENT>
  <SEGMENT>
    <WHOLE/> <!-- This second segment in this block is a WHOLE segment.
                  However, in general the 2nd segment of a block could be a
                  WHOLE or the FIRST segment of another multi-segment
                  multi-block spanning item -->
    <DATA>Third item</DATA>
  </SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each BLOCK contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT either holds a WHOLE item or is one part of an item that
     spans 2 or more SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across the segments it
     spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
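
The observations above amount to a small grammar over segment type tags: (WHOLE | FIRST MIDDLE* LAST)*. As a rough check of that reading, the Python sketch below encodes each tag as a single letter and tests the sequence with a regular expression. The single-letter encoding and the function name are assumptions made for illustration only.

```python
import re
from typing import Iterable

# Hypothetical one-letter codes for the four segment type tags.
CODES = {"WHOLE": "W", "FIRST": "F", "MIDDLE": "M", "LAST": "L"}

# (WHOLE | FIRST MIDDLE* LAST)* written over the one-letter codes.
ITEM_GRAMMAR = re.compile(r"(?:W|FM*L)*")


def is_well_formed(tags: Iterable[str]) -> bool:
    """Return True if the tag sequence follows the spanning structure."""
    # Unknown tags map to "?", which the grammar can never match.
    encoded = "".join(CODES.get(tag, "?") for tag in tags)
    return ITEM_GRAMMAR.fullmatch(encoded) is not None


sample_tags = ["WHOLE", "FIRST", "MIDDLE", "MIDDLE", "LAST", "WHOLE"]
print(is_well_formed(sample_tags))
```

The tag sequence used here is the one in the sample input, read across block boundaries.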
<!-- The question: how can we express the transformation into the desired
     logical form? Or is this beyond the call of duty for DFDL?
     Goals include being as declarative as possible and, ideally, doing it as
     a set of XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure -->
<complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
  </sequence>
</complexType>
<complexType name="Segment_t">
  <sequence>
    <choice>
      <element name="WHOLE">
      </element>
      <element name="FIRST">
      </element>
      <element name="LAST">
      </element>
      <element name="MIDDLE">
      </element>
    </choice>
    <element name="DATA" type="string"/>
  </sequence>
</complexType>