I've come up with a way to articulate the difficulties I'm having with DFDL for complex file formats.
This problem may not be that hard for someone with more XML, XPath, or XQuery experience, so I'd appreciate it if you could look it over and, if necessary, even run it by your resident XML experts.
In case the mailer mangles the line lengths, I've also attached the text below as a file.
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>Third item</ITEM>
<!-- That is, data having this "logical" schema: -->
<sequence>
  <element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<!-- But the data below is the input we're starting from. It simulates the
     structural issues of IBM Format-VS, recast as an XML-to-XML
     transformation problem. -->
<BLOCK>
  <SEGMENT>
    <WHOLE/> <!-- A WHOLE segment holds a whole item (Duh!). This element is really a type tag. -->
    <DATA>The first item</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <FIRST/> <!-- A FIRST segment holds the first part of an item. -->
    <DATA>Thi</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/> <!-- A MIDDLE segment holds data from the center of an item. -->
    <DATA>s is t</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>he sec</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <LAST/> <!-- A LAST segment holds data from the end of the item. -->
    <DATA>ond item</DATA>
  </SEGMENT>
  <SEGMENT>
    <WHOLE/> <!-- The second segment in this block is a WHOLE segment. In general,
                  however, the 2nd segment of a block could be either a WHOLE segment
                  or the FIRST segment of another item that spans multiple segments
                  and blocks. -->
    <DATA>Third item</DATA>
  </SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs. -->
<!-- Each BLOCK contains 1 or 2 SEGMENTs. -->
<!-- Each SEGMENT either holds a WHOLE item or holds one part of an item that spans 2 or more SEGMENTs. -->
<!-- Spanning data is broken on arbitrary boundaries across the segments it spans. -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
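To make the FIRST, MIDDLE*, LAST observation concrete, here is a rough Python sketch that checks whether a flat sequence of segment type tags follows that grammar. The single-letter encoding and the name is_well_formed are made up for illustration; BLOCK boundaries are ignored on purpose, since items may span them. This only restates the observations above and is not a DFDL answer.

```python
import re

# Hypothetical one-letter code for each SEGMENT type tag.
CODES = {"WHOLE": "W", "FIRST": "F", "MIDDLE": "M", "LAST": "L"}

# Each item is either a single WHOLE segment, or FIRST, MIDDLE*, LAST.
ITEM_SEQUENCE = re.compile(r"(?:W|FM*L)*")


def is_well_formed(segment_types):
    """Return True if the type tags, read in document order, form whole items."""
    encoded = "".join(CODES[t] for t in segment_types)
    return ITEM_SEQUENCE.fullmatch(encoded) is not None


# Type tags of the sample input, in document order.
sample = ["WHOLE", "FIRST", "MIDDLE", "MIDDLE", "LAST", "WHOLE"]
print(is_well_formed(sample))              # True
print(is_well_formed(["FIRST", "WHOLE"]))  # False: the spanning item never ends
```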
<!-- The question: how can we express the transformation into the desired logical form?
     Or is this beyond the call of duty for DFDL?
     The goals are to be as declarative as possible and, ideally, to do it as a set of
     XML Schema annotations in the GGF DFDL style. -->
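For reference, here is the transformation written out procedurally in Python, just to pin down what the desired result is. It is a sketch under assumptions: the ROOT and ITEMS wrapper elements are hypothetical (added only so the fragments parse and serialize as single documents), and the function names are invented. It is imperative, so it does not meet the declarative goal and says nothing about whether DFDL annotations can express the same thing.

```python
import xml.etree.ElementTree as ET

# The sample input, wrapped in a hypothetical ROOT element so it parses.
SAMPLE = """<ROOT>
<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
<BLOCK>
  <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
  <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
</BLOCK>
</ROOT>"""


def logical_items(root):
    """Reassemble items from SEGMENTs in document order, ignoring BLOCK boundaries."""
    items = []
    parts = None  # None means no spanning item is currently open
    for segment in root.iter("SEGMENT"):
        kind = segment[0].tag  # the first child is the type tag
        data = segment.findtext("DATA", default="")
        if kind in ("WHOLE", "FIRST"):
            if parts is not None:
                raise ValueError(f"{kind} segment inside a spanning item")
            if kind == "WHOLE":
                items.append(data)
            else:
                parts = [data]
        elif kind in ("MIDDLE", "LAST"):
            if parts is None:
                raise ValueError(f"{kind} segment without a FIRST segment")
            parts.append(data)
            if kind == "LAST":
                items.append("".join(parts))
                parts = None
        else:
            raise ValueError(f"unknown segment type: {kind}")
    if parts is not None:
        raise ValueError("input ended inside a spanning item")
    return items


def to_item_xml(items):
    """Serialize the items under a hypothetical ITEMS wrapper element."""
    out = ET.Element("ITEMS")
    for text in items:
        ET.SubElement(out, "ITEM").text = text
    return ET.tostring(out, encoding="unicode")


items = logical_items(ET.fromstring(SAMPLE))
print(items)
print(to_item_xml(items))
```

The only state carried between segments is the list of parts of the currently open item, which is exactly the cross-block context that a purely per-element schema annotation would somehow have to express.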
<!-- Here's an XSD (untested) for the input data structure: -->
<complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
  </sequence>
</complexType>
<complexType name="Segment_t">
  <sequence>
    <choice>
      <element name="WHOLE"/>
      <element name="FIRST"/>
      <element name="LAST"/>
      <element name="MIDDLE"/>
    </choice>
    <element name="DATA" type="string"/>
  </sequence>
</complexType>