dfdl-wg
- 3032 discussions

Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by Suman Kalia 19 Nov '04
19 Nov '04
Jim -- I agree with most of your assertions, and you have phrased it right:
"relatively compliant with physical structure". Examples from programming
languages would be the COBOL "OCCURS DEPENDING ON" clause and, as you
mentioned in the example, "a previous value in the structure
indicating which field in the choice will be present or how many
occurrences a subsequent field will have". These are among the most
common constructs in programming-language data structures.
I think the DFDL standard is addressing a very critical requirement, "rendering
a logical structure to a relatively compliant physical format and vice
versa", which to my knowledge no other public standard has addressed so
far, and this work is/will be very complementary to other
standards.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 02:05 PM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 12:34 PM
To
dfdl-wg(a)gridforum.org
cc
Subject
RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS
records as XML
Unfortunately, there's a slippery slope here - there are no ints on the
disk, just logical ones and zeros that you can transform into a second
logical structure composed of ints, assuming you specify byte order. I
think we have a whole stream of examples beyond that - removing
delimiters, using a length prefix to define the length of a subsequent
structure, etc. - that we see as minor transformations to something still
relatively "compliant" with the physical structure, but that, I believe,
require the same machinery as things I think we will all agree are beyond
the scope of what DFDL should aim for.
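To make the byte-order and length-prefix point concrete, here is a minimal Python sketch. It is only an illustration: the 2-byte unsigned prefix, the function name read_length_prefixed, and the sample bytes are assumptions, not anything defined by DFDL or by this thread.

```python
import struct

def read_length_prefixed(buf, byte_order=">"):
    """Read a 2-byte unsigned length, then that many payload bytes.

    byte_order is a struct prefix: ">" for big-endian, "<" for little-endian.
    Returns (length, payload, remaining bytes).
    """
    (length,) = struct.unpack_from(byte_order + "H", buf, 0)
    payload = buf[2:2 + length]
    if len(payload) != length:
        raise ValueError("buffer is shorter than its length prefix claims")
    return length, payload, buf[2 + length:]

# Hypothetical sample: prefix bytes 00 03, then "ABC", then one trailing byte.
raw = bytes([0x00, 0x03, 0x41, 0x42, 0x43, 0xFF])
length, payload, rest = read_length_prefixed(raw, ">")

# The same two prefix bytes read as little-endian give a different int.
little_endian_length = struct.unpack_from("<H", raw, 0)[0]
```

Read as big-endian, the prefix is 3 and the payload is the next three bytes; read as little-endian, the same bits give 768, so the "int" only exists once a byte order has been chosen.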
In practice, I think people should get out of DFDL as soon as possible,
just as you say - use other technologies once you get an initial
structure. But I think there are cases where you have to stay in DFDL -
anything where I have to transform the initial physically-compliant
structure to interpret subsequent fields - x and y ints tell me how many
pixel repeats, an int greater than another int read previously implies a
different subsequent structure, etc. And again, the minimal machinery to
do that lets you go farther than you'd want people to go in practice.
There may also be reasonable use cases where the ability to stay in DFDL
is important. For example, take digital preservation, where I might want
to map all document files to a standardized schema, regardless of whether
they were Word, PDF, etc. Being able to specify the full descriptions in one
file that then requires only one parser to interpret all formats *might*
be worth the cost of doing complex things in DFDL. I don't think our goal for
a version 1 should be to support such use, but I don't think we can meet
our simple goals without 'accidentally' making it possible.
I'd be happy to be proved wrong - seems like a deep point that would be
cool to understand. I'm not sure how we get to a 'proof' though - we're
trying to prove that there exists something DFDL as currently formulated
can't describe. So - we either need to find that example or turn to some
sort of logic formalism to discover what primitive(s) we're missing that
keep us from emulating some class of parser/programming. (Or find something
in DFDL that we don't need to support the examples we do want to
target...).
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of
Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example problem -
IBM Format VS records as XML
I tend to agree that there are 2 inherent logical structures in this scenario.
DFDL scope, in my opinion, should be restricted to parsing the physical
stream and populating the logical structure which is compliant with the
structure of the physical stream, and vice versa. We have numerous options and
technologies (XSLT, XSD<->XSD mappers, good old programming languages,
XQuery) which do a pretty good job of transforming one logical structure to
another logical structure. Building some kind of annotations which would
allow a physical stream to map to a completely different logical structure
will make the DFDL language very complex.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 11:05 AM
To
dfdl-wg(a)gridforum.org
cc
Subject
RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS
records as XML
I was thinking that step 1 involved recognizing the <first/> and <data>
elements and creating a sequence of <myfirst>here's the data</myfirst>,
<mymiddle>more data</mymiddle> and <mylast>... elements and then
assembling the new layer by some sort of choice to concatenate the
relevant myfirst, optional mymiddle, and mylast elements for each item.
I think that requires a way to make a choice based on the <first/>,
<middle/>, <last/> elements and populate a <myfirst>, <mymiddle>,
or <mylast> element (all subtypes of string?) with the contents of the
following data element, which I think we can do in DFDL. This is just our
standard choice flag that decides which of several options exist.
Then, I think you'd need logic to decide how many elements represent one
item, which I think we have, followed by a way to concatenate these
elements to produce a string source, which again I think we have (same as
saying a complex can be built from two floats referenced from another
layer instead of from a float stream). This part is the same problem as
having a text file where one <CR> separates lines and <CR><CR> separates
paragraphs and you want to create single strings (from a variable number
of lines) for each paragraph.
Again, I won't argue that this is simple and fun, but I think the
machinery exists and is the same as that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of
mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem -
IBM Format VS records as XML
You are thinking along the lines I was; however, the challenge is that I
cannot find a way to do this using multilayer so I'm uncomfortable
suggesting that it's possible at all anymore. Here's some reasoning why.
In particular, it's the intersection of the induction across the items
with the first, middle*, last thing, and the spanning that seems to defy
my efforts to cut it up into progressive transformation layer by layer. In
some conversations I've referred to this problem as the "non-conforming
trees" problem. The fundamental shapes of the trees are not compatible,
and expressing the transformation between them isn't easily done via
induction of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's effectively
a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs
interesting and complex is always the way you diagnose malformed data in
addition to recognizing correct data.
Now, a finite-state-machine is, to my mind, the ultimate procedural
abstraction, the quintessential opposite of "declarative" expression. To
be declarative about a FSM you end up saying "recognize this regular
language", and providing a description of the regular language, which is
of course, just begging the question of how it actually works.
(And for us, we're not really talking about a regular language of
character text, but a pattern of usage in the binary data layout that
obeys the pattern of a regular language. So it's not like having a little
regular expression thing for validating text strings helps with this
problem.)
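As a small illustration of "providing a description of the regular language", the FIRST, MIDDLE*, LAST pattern can be written as a regular expression over segment type tags. This is a Python sketch under stated assumptions: the one-letter encoding of each tag and the names SYMBOL, ITEM_SEQUENCE and is_well_formed are invented here purely so the standard re module can be used; nothing like this is part of DFDL.

```python
import re

# Assumed encoding: one letter per segment type tag.
SYMBOL = {"WHOLE": "W", "FIRST": "F", "MIDDLE": "M", "LAST": "L"}

# An item is either a WHOLE segment or FIRST, zero or more MIDDLE, LAST.
ITEM_SEQUENCE = re.compile(r"(?:W|FM*L)*")

def is_well_formed(tags):
    """Return True if the tag sequence is a valid run of items."""
    encoded = "".join(SYMBOL[tag] for tag in tags)
    return ITEM_SEQUENCE.fullmatch(encoded) is not None

ok = is_well_formed(["WHOLE", "FIRST", "MIDDLE", "MIDDLE", "LAST", "WHOLE"])
bad = is_well_formed(["FIRST", "WHOLE"])
```

Note that this only accepts or rejects a sequence. It says nothing about where malformed data went wrong, which is the part identified above as what makes real FSMs interesting and complex.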
I guess I'm arguing that a black box approach to this is not only
acceptable, but is highly likely to be the only "good" way to do it. In
light of this I've suggested a rep property called "streamFormat" (perhaps
should be renamed "recordFormat"), which gets values from the set VS, V,
VBS, FB, FBS, etc.: all these well-defined legacy data formats (there
are 19 of them I think). In addition, one should be able to extend this by
introducing a black box transformation.
And ... here's the rub...if that's true for this case, then other "hard"
examples like run-length encoding also seem to be in this category.
There are several "leaps of faith" just made in these arguments, so I'd
still like people to take this "XML challenge" and see if there's some
magic I'm overlooking.
...mikeb
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML
Without digging too much into the details, I'd say this is an example
where multi-layer comes in. The DFDL would describe a hidden layer in
which the first, middle, last data elements would be identified and put
into a list, and then that hidden list would be used as the input to
create items in the output layer.
I think this is conceptually similar to one of our run-length encoding
examples (more complex of course). If you read a sequence of ints and then
a sequence of floats and need to output a sequence of floats with int[i]
repeats of float[i], it would be easiest to create a hidden layer
representing the int and float sequences and to then produce output from
that. If you don't think about a layer, even this example gets painful - I
need to read an int, skip forward somewhere to find a float, skip back to
get the next int, etc.
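The run-length example above can be sketched in a few lines of Python once the int and float sequences are treated as an already-parsed hidden layer. The function name expand_run_lengths and the sample values are hypothetical; this shows only the logical step from the hidden layer to the output, not how DFDL would express it.

```python
def expand_run_lengths(counts, values):
    """Output values[i] repeated counts[i] times, in order."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    output = []
    for count, value in zip(counts, values):
        if count < 0:
            raise ValueError("repeat counts must be non-negative")
        output.extend([value] * count)
    return output

# Hidden layer: the two sequences as read from the file (made-up values).
hidden_ints = [2, 0, 3]
hidden_floats = [1.5, 9.0, -4.0]
expanded = expand_run_lengths(hidden_ints, hidden_floats)
```

Working from the two complete sequences avoids the skip-forward, skip-back reading pattern described above.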
Mike's full example, not starting with the XML-ized version, might be
something that requires more than one layer - read the original into
something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired logical
output.
I guess I would claim that this would not be too bad a way to describe a
fairly complex format in terms of a fairly different logical structure.
Whether one *should* do this in DFDL is a separate question: it might make
more sense to a) write a black box parser to get to items, or b) use DFDL to
get to the initial schema Mike wrote and use XSLT afterwards to convert to the
desired logical structure. I think there are enough relatively simple cases
where we need the multilayer functionality in DFDL that we
have to have it, which means it will then be possible to deal with complex
transformations in DFDL even if not simple/practical.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of
mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML
I've come up with a way to articulate the difficulties I'm having with
DFDL for complex file formats.
This problem may not be that hard for someone with more XML, XPath or
XQuery experience, so I'd appreciate it if you could look it over and, if
necessary, even run it by your resident XML experts.
In case the emailer mangles all the line lengths, I've also attached the
below as a file.
<!-- Example motivated by DFDL for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS -->
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema -->
<sequence>
<element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<!-- But the below is the input data we're starting from. What you see below
     simulates the structural issues of IBM Format-VS, but converts the problem
     into an XML to XML transformation problem -->
<BLOCK>
  <SEGMENT>
    <WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This element is really a type tag. -->
    <DATA>The first item</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <FIRST/> <!-- a FIRST segment holds the first part of an item. -->
    <DATA>Thi</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/> <!-- a MIDDLE segment holds data from the center of an item -->
    <DATA>s is t</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <MIDDLE/>
    <DATA>he sec</DATA>
  </SEGMENT>
</BLOCK>
<BLOCK>
  <SEGMENT>
    <LAST/> <!-- a LAST segment holds data from the end of the item. -->
    <DATA>ond item</DATA>
  </SEGMENT>
  <SEGMENT>
    <WHOLE/> <!-- This second segment in this block is a WHOLE segment. However,
                  in general the 2nd segment of a block could be a WHOLE or the
                  FIRST segment of another multi-segment multi-block spanning item -->
    <DATA>Third item</DATA>
  </SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or more SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across segments it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into the desired logical form?
     Or is this beyond the call of duty for DFDL?
     Goals include to be as declarative as possible, and ideally, do it as a set of
     XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure -->
<complexType name="Format_VS_t">
  <sequence>
    <element name="BLOCK" type="Block_t" minOccurs="0" maxOccurs="unbounded"/>
  </sequence>
</complexType>
<complexType name="Block_t">
  <sequence>
    <element name="SEGMENT" type="Segment_t" minOccurs="1" maxOccurs="2"/>
  </sequence>
</complexType>
<complexType name="Segment_t">
  <sequence>
    <choice>
      <element name="WHOLE">
      </element>
      <element name="FIRST">
      </element>
      <element name="LAST">
      </element>
      <element name="MIDDLE">
      </element>
    </choice>
    <element name="DATA" type="string"/>
  </sequence>
</complexType>
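For comparison with the declarative goal, here is what a procedural "black box" version of the XML-to-XML challenge could look like in Python. It is a sketch under assumptions: the BLOCK elements are wrapped in a hypothetical FORMAT_VS root element so the input is one well-formed document, and the function name reassemble_items is invented here. It is not DFDL and not a proposed solution to the challenge.

```python
import xml.etree.ElementTree as ET

TYPE_TAGS = {"WHOLE", "FIRST", "MIDDLE", "LAST"}

def reassemble_items(xml_text):
    """Turn BLOCK/SEGMENT input into a list of logical ITEM strings."""
    root = ET.fromstring(xml_text)
    items = []
    parts = None  # None means we are not inside a spanning item
    for segment in root.iter("SEGMENT"):
        children = list(segment)
        if (len(children) != 2 or children[0].tag not in TYPE_TAGS
                or children[1].tag != "DATA"):
            raise ValueError("malformed SEGMENT")
        kind = children[0].tag
        data = children[1].text or ""
        if kind in ("WHOLE", "FIRST"):
            if parts is not None:
                raise ValueError(kind + " inside an unfinished spanning item")
            if kind == "WHOLE":
                items.append(data)
            else:
                parts = [data]
        else:  # MIDDLE or LAST
            if parts is None:
                raise ValueError(kind + " without a preceding FIRST")
            parts.append(data)
            if kind == "LAST":
                items.append("".join(parts))
                parts = None
    if parts is not None:
        raise ValueError("input ended before a LAST segment")
    return items

# The sample input from the message, wrapped in the assumed root element.
SAMPLE = """<FORMAT_VS>
  <BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
  <BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
  <BLOCK>
    <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
    <SEGMENT><WHOLE/><DATA>Third item</DATA></SEGMENT>
  </BLOCK>
</FORMAT_VS>"""

items = reassemble_items(SAMPLE)
```

The loop ignores BLOCK boundaries entirely and walks segments in document order, which is exactly the flattening step that is hard to state as an induction over either tree. The last item comes out as "Third item" because that is what the sample DATA element contains.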

Hi,
I think I understand Suman's issue with annotations on the Schema tree.
(Suman, please tell me if I am right here.) The problem is that
lexically there are many trees in an XSD. Whilst in practice these can
clearly be considered as a single tree (including, I think, even the
simple type hierarchies) by placing all the type definitions inline,
this is not the way they appear to the user. So for example if I have a
file with conflicting annotations looking like:
<xs:complexType name="triple">
<xs:annotation>
<xs:appinfo>
<dfdlFromBinary/>
</xs:appinfo>
</xs:annotation>
<xs:sequence>
<xs:element name="first"/>
<xs:element name="second"/>
<xs:element name="third"/>
</xs:sequence>
</xs:complexType>
<xs:complexType name="data">
<xs:annotation>
<xs:appinfo>
<dfdlFromStrings/>
</xs:appinfo>
</xs:annotation>
<xs:sequence>
<xs:element name="triple"/>
</xs:sequence>
</xs:complexType>
So what I imagined is that we would assume that the "triple" type is
considered _inside_ the scope of the "data" type and so the
"dfdlFromBinary" tag wins.
On the other hand the user sees two trees of equal depth with
conflicting annotations. The examples can obviously get much more intricate.
The issue is really that the scope of the annotations is not lexically
defined. At some level this is just like having globally included
variables in a programming language. On the other hand we have arbitrary
levels of these.
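To make the "innermost annotation wins" reading concrete, here is a minimal Python sketch. It assumes, as in the example above, that an element named after a type is of that type; the dictionary layout and the function name `effective_annotations` are hypothetical and not part of any DFDL proposal.

```python
# Hypothetical model of the two complexTypes from the example above.
# Each type has an optional annotation and a list of child element names;
# a child whose name matches a type name is assumed to be of that type.
TYPES = {
    "triple": {"annotation": "dfdlFromBinary",
               "children": ["first", "second", "third"]},
    "data": {"annotation": "dfdlFromStrings",
             "children": ["triple"]},
}


def effective_annotations(type_name, inherited=None, path=""):
    """Walk the inlined tree; the nearest enclosing annotation wins."""
    type_def = TYPES[type_name]
    current = type_def["annotation"] or inherited
    result = {}
    for child in type_def["children"]:
        child_path = f"{path}/{child}" if path else child
        if child in TYPES:
            # Inline the referenced type so its own annotation shadows ours.
            result.update(effective_annotations(child, current, child_path))
        else:
            result[child_path] = current
    return result


resolved = effective_annotations("data", path="data")
```

Under this reading every leaf below data/triple resolves to "dfdlFromBinary", even though the two type definitions sit side by side in the file, which is exactly the non-lexical scoping described above.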
Suman, is this the problem?
If this is the problem, and we agree that it is too confusing to the
user (my opinion is still out on this), then I see that the conclusion
is to adopt an approach similar to IBM's, in which annotations can appear
only on <element> and <attribute> tags. Even the top level of the file
is confusing since there may be many files involved. I guess we can also
have runtime defaults and default settings set in the standard. I don't
like this conclusion incidentally, can someone convince me it is the
wrong one?
Martin

RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS rec ords as XML
by Myers, James D 19 Nov '04
Unfortunately, there's a slippery slope here - there are no ints
on the disk, just logical ones and zeros that you can transform into a
second logical structure composed of ints, assuming you specify byte
order. I think we have a whole stream of examples beyond that - removing
delimiters, using a length prefix to define the length of a subsequent
structure, etc. - that we see as minor transformations to something
still relatively "compliant" with the physical structure, but, I
believe, require the same machinery as things I think we will all agree
are beyond the scope of what DFDL should aim for.
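A minimal Python sketch of the two "minor" transformations mentioned above, byte order and a length prefix, using the standard `struct` module. The sample bytes and variable names are made up for illustration.

```python
import struct

# Six hypothetical bytes "on the disk": a 4-byte count followed by data.
raw = bytes([0x00, 0x00, 0x00, 0x02, 0x68, 0x69])

# The same four bytes yield different ints depending on the declared byte order.
big = struct.unpack(">I", raw[:4])[0]
little = struct.unpack("<I", raw[:4])[0]

# Treating the big-endian value as a length prefix decides how much of the
# remaining stream belongs to the next field.
payload = raw[4:4 + big]
```

Even here, the value just read has to be available while the next field is interpreted, which is the same machinery the harder examples need.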
In practice, I think people should get out of DFDL as soon as
possible just as you say - use other technologies once you get an
initial structure. But I think there are cases where you have to stay in
DFDL - anything where I have to transform the initial
physically-compliant structure to interpret subsequent fields - x and y
ints tell me how many pixel repeats, an int greater than another int
read previously implies a different subsequent structure, etc. And
again, the minimal machinery to do that lets you go farther than you'd
want people to go in practice.
There may also be reasonable use cases where the ability to stay
in DFDL is important. For example, take digital preservation, where I
might want to map all document files to a standardized schema,
regardless of whether it was Word, PDF, etc. Being able to specify the
full descriptions in one file that then requires only one parser to
interpret all formats *might* be worth the cost to do complex things in
DFDL. I don't think our goal for a version 1 should be to support such
use, but I don't think we can meet our simple goals without
'accidentally' making it possible.
I'd be happy to be proved wrong - seems like a deep point that
would be cool to understand. I'm not sure how we get to a 'proof' though
- we're trying to prove that there exists something DFDL as currently
formulated can't describe. So - we either need to find that example or
turn to some sort of logic formalism to discover what primitive(s) we're
missing that keep us from emulating some class of parser/programming. (Or
find something in DFDL that we don't need to support the examples we do
want to target...).
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On
Behalf Of Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg(a)gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example
problem - IBMFormat VS rec ords as XML
I tend to agree that there are 2 inherent logical structures in this
scenario. DFDL scope in my opinion should be restricted to parsing the
physical stream and populating the logical structure which is compliant
with the structure of physical stream and vice versa. We have numerous
options and technologies (XSLT, XSD<->XSD mappers, good old programming
languages, XQuery) which do a pretty good job of transforming one logical
structure to another logical structure. Building some kinds of
annotations which would allow a physical stream to map to a completely
different logical structure will make the DFDL language very complex.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36
AM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 11:05 AM
To
dfdl-wg(a)gridforum.org
cc
Subject
RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat
VS rec ords as XML
I was thinking that step 1 involved recognizing the <first/>
and <data> elements and creating a sequence of <myfirst>here's the
data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>... elements
and then assembling the new layer by some sort of choice to concatenate
the relevant myfirst, optional mymiddle, and mylast elements for each
item.
I think that requires a way to make a choice based on the
<first/>, <middle/>, <last/> elements and populate either a <myfirst>,
<mymiddle>, or <mylast> elements (all subtypes of string?) with the
contents of the following data element, which I think we can do in DFDL.
This is just our standard choice flag that decides which of several
options exist.
Then, I think you'd need logic to decide how many elements
represent one item, which I think we have, followed by a way to
concatenate these elements to produce a string source, which again I
think we have (same as saying a complex can be built from two floats
referenced from another layer instead of from a float stream). This part
is the same problem as having a text file where one <CR> separates lines
and <CR><CR> separates paragraphs and you want to create single strings
(from a variable number of lines) for each paragraph.
Again, I won't argue that this is simple and fun, but I think
the machinery exists and is the same as that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On
Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example
problem - IBMFormat VS rec ords as XML
You are thinking along the lines I was; however, the challenge
is that I cannot find a way to do this using multilayer so I'm
uncomfortable suggesting that it's possible at all anymore. Here's some
reasoning why.
In particular, it's the intersection of the induction across the
items with the first, middle*, last thing, and the spanning that seems
to defy my efforts to cut it up into progressive transformation layer by
layer. In some conversations I've referred to this problem as the
"non-conforming trees" problem. The fundamental shapes of the trees are
not compatible, and expressing the transformation between them isn't
easily done via induction of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's
effectively a little regular language (in the formal sense) that has to
be recognized. Generally this requires a finite-state-machine, and what
makes FSMs interesting and complex is always the way you diagnose
malformed data in addition to recognizing correct data.
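For reference, the regular language in question is (WHOLE | FIRST MIDDLE* LAST)*. A minimal Python sketch of a two-state recognizer that also reassembles the items and diagnoses malformed input follows. It assumes the segments have already been flattened into (tag, data) pairs in stream order; the function name `reassemble` and the error messages are hypothetical.

```python
def reassemble(segments):
    """Recognize (WHOLE | FIRST MIDDLE* LAST)* and return the joined items.

    `segments` is an iterable of (tag, data) pairs in stream order.
    Raises ValueError on any sequence outside the language.
    """
    items = []
    parts = None  # None: between items; list: inside a spanning item
    for position, (tag, data) in enumerate(segments):
        if parts is None:
            if tag == "WHOLE":
                items.append(data)
            elif tag == "FIRST":
                parts = [data]
            else:
                raise ValueError(f"segment {position}: {tag} without FIRST")
        else:
            if tag == "MIDDLE":
                parts.append(data)
            elif tag == "LAST":
                parts.append(data)
                items.append("".join(parts))
                parts = None
            else:
                raise ValueError(
                    f"segment {position}: {tag} inside a spanning item")
    if parts is not None:
        raise ValueError("stream ended inside a spanning item")
    return items


# The segments of the XML example, with the BLOCK boundaries dropped.
stream = [
    ("WHOLE", "The first item"),
    ("FIRST", "Thi"),
    ("MIDDLE", "s is t"),
    ("MIDDLE", "he sec"),
    ("LAST", "ond item"),
    ("WHOLE", "Third item"),
]
items = reassemble(stream)
```

Most of the branches exist only to report malformed data, which is the point made above about where the complexity of an FSM lies.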
Now, a finite-state-machine is, to my mind, the ultimate
procedural abstraction, the quintessential opposite of "declarative"
expression. To be declarative about a FSM you end up saying "recognize
this regular language", and providing a description of the regular
language, which is of course, just begging the question of how it
actually works.
(And for us, we're not really talking about a regular language
of character text, but a pattern of usage in the binary data layout that
obeys the pattern of a regular language. So it's not like having a
little regular expression thing for validating text strings helps with
this problem.)
I guess I'm arguing that a black box approach to this is not
only acceptable, but is highly likely to be the only "good" way to do
it. In light of this I've suggested a rep property called "streamFormat"
(perhaps should be renamed "recordFormat"), which gets values from the
set VS, V, VBS, FB, FBS, etc. etc. all these well-defined legacy data
formats (there are 19 of them I think). In addition, one should be able
to extend this by introduction of a blackbox transformation.
And ... here's the rub...if that's true for this case, then
other "hard" examples like run-length encoding seem also in this
category.
There's several "leaps of faith" just made in these arguments,
so I'd still like people to take this "XML challenge" and see if there's
some magic I'm overlooking.
...mikeb
________________________________
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example
problem - IBM Format VS rec ords as XML
Without digging too much into the details, I'd say this is an
example where multi-layer comes in. The DFDL would describe a hidden
layer in which the first, middle, last data elements would be identified
and put into a list, and then that hidden list would be used as the
input to create items in the output layer.
I think this is conceptually similar to one of our run-length
encoding examples (more complex of course). If you read a sequence of
ints and then a sequence of floats and need to output a sequence of
floats with int[i] repeats of float[i], it would be easiest to create a
hidden layer representing the int and float sequences and to then
produce output from that. If you don't think about a layer, even this
example gets painful - I need to read an int, skip forward somewhere to
find a float, skip back to get the next int, etc.
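A minimal Python sketch of the run-length example, assuming the hidden layer has already produced the two sequences as lists. The names `expand`, `counts` and `values` are hypothetical.

```python
def expand(counts, values):
    """Emit counts[i] copies of values[i], in order."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    output = []
    for count, value in zip(counts, values):
        output.extend([value] * count)
    return output


# Hidden layer: the int sequence and the float sequence, read separately.
counts = [2, 0, 3]
values = [1.5, 9.0, 4.25]
expanded = expand(counts, values)
```

With both sequences held in a layer, the output step is a single pass and needs no skipping back and forth in the stream.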
Mike's full example, not starting with the XML-ized version,
might be something that requires more than one layer - read the original
into something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired
logical output.
I guess I would claim that this would not be too bad a way to
describe a fairly complex format in terms of a fairly different logical
structure. Whether one *should* do this in DFDL, or whether it would
make more sense to a) write a black box parser to get to items, or b)
use DFDL to get to the initial schema Mike wrote and use XSLT afterwards
to convert to the desired logical structure, is a separate question. I think there are enough
cases where we need the multilayer functionality in DFDL that are
relatively simple that we have to have it, which means it will then be
possible to deal with complex transformations in DFDL even if not
simple/practical.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On
Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS rec ords as XML
I've come up with a way to articulate the difficulties I'm
having with DFDL for complex file formats.
This problem may not be that hard for someone with more XML,
XPath or XQuery experience, so I'd appreciate it if you could look it
over and if necessary even run it by your resident XML experts.
In case the emailer mangles all the line lengths, I've also
attached the below as a file.
<!-- Example motivated by DFDL for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS -->
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema -->
<sequence>
<element name="ITEM" type="string" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
<!-- But the below is the input data we're starting from. What
you see below simulates
the structural issues of IBM Format-VS, but converting the
problem into an XML to XML
transformation problem -->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This
element is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the first part of an
item. -->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data from the center of
an item -->
<DATA>s is t</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/>
<DATA>he sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from the end of the
item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/><!-- This second segment in this block is a WHOLE
segment. However
in general the 2nd segment of a block could be a
WHOLE or the
FIRST segment of another multi-segment
multi-block spanning item -->
<DATA>Third item</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or
more SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across
segments it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure.
-->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into
the desired logical form?
Or is this beyond the call of duty for DFDL?
Goals include to be as declarative as possible, and ideally,
do it as a set of
XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure -->
<complexType name="Format_VS_t">
<sequence>
<element name="BLOCK" type="Block_t" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
</complexType>
<complexType name="Block_t">
<sequence>
<element name="SEGMENT" type="Segment_t" minOccurs="1"
maxOccurs="2"/>
</sequence>
</complexType>
<complexType name="Segment_t">
<sequence>
<choice>
<element name="WHOLE">
</element>
<element name="FIRST">
</element>
<element name="LAST">
</element>
<element name="MIDDLE">
</element>
</choice>
<element name="DATA" type="string"/>
</sequence>
</complexType>

RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS rec ords as XML
by Myers, James D 19 Nov '04
I was thinking that step 1 involved recognizing the <first/>
and <data> elements and creating a sequence of <myfirst>here's the
data</myfirst>, <mymiddle>more data</mymiddle> and <mylast>... elements
and then assembling the new layer by some sort of choice to concatenate
the relevant myfirst, optional mymiddle, and mylast elements for each
item.
I think that requires a way to make a choice based on the
<first/>, <middle/>, <last/> elements and populate either a <myfirst>,
<mymiddle>, or <mylast> elements (all subtypes of string?) with the
contents of the following data element, which I think we can do in DFDL.
This is just our standard choice flag that decides which of several
options exist.
Then, I think you'd need logic to decide how many elements
represent one item, which I think we have, followed by a way to
concatenate these elements to produce a string source, which again I
think we have (same as saying a complex can be built from two floats
referenced from another layer instead of from a float stream). This part
is the same problem as having a text file where one <CR> separates lines
and <CR><CR> separates paragraphs and you want to create single strings
(from a variable number of lines) for each paragraph.
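A minimal Python sketch of the paragraph analogy, assuming a bare carriage return as the separator and that the lines of a paragraph are joined with a single space (the join character is an assumption; the text above does not say). The function name `paragraphs` is hypothetical.

```python
def paragraphs(text, sep="\r"):
    """Join the lines of each paragraph into a single string.

    One `sep` separates lines; two consecutive `sep` separate paragraphs.
    """
    result = []
    for block in text.split(sep + sep):
        lines = [line for line in block.split(sep) if line]
        if lines:
            result.append(" ".join(lines))
    return result


sample = "line one\rline two\r\rsecond paragraph\r"
joined = paragraphs(sample)
```

The structure is the same as for the segments: a variable number of lower-layer elements is grouped by a marker and concatenated into one upper-layer string.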
Again, I won't argue that this is simple and fun, but I think
the machinery exists and is the same as that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On
Behalf Of mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example
problem - IBMFormat VS rec ords as XML
You are thinking along the lines I was; however, the challenge
is that I cannot find a way to do this using multilayer so I'm
uncomfortable suggesting that it's possible at all anymore. Here's some
reasoning why.
In particular, it's the intersection of the induction across the
items with the first, middle*, last thing, and the spanning that seems
to defy my efforts to cut it up into progressive transformation layer by
layer. In some conversations I've referred to this problem as the
"non-conforming trees" problem. The fundamental shapes of the trees are
not compatible, and expressing the transformation between them isn't
easily done via induction of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's
effectively a little regular language (in the formal sense) that has to
be recognized. Generally this requires a finite-state-machine, and what
makes FSMs interesting and complex is always the way you diagnose
malformed data in addition to recognizing correct data.
Now, a finite-state-machine is, to my mind, the ultimate
procedural abstraction, the quintessential opposite of "declarative"
expression. To be declarative about a FSM you end up saying "recognize
this regular language", and providing a description of the regular
language, which is of course, just begging the question of how it
actually works.
(And for us, we're not really talking about a regular language
of character text, but a pattern of usage in the binary data layout that
obeys the pattern of a regular language. So it's not like having a
little regular expression thing for validating text strings helps with
this problem.)
I guess I'm arguing that a black box approach to this is not
only acceptable, but is highly likely to be the only "good" way to do
it. In light of this I've suggested a rep property called "streamFormat"
(perhaps should be renamed "recordFormat"), which gets values from the
set VS, V, VBS, FB, FBS, etc. etc. all these well-defined legacy data
formats (there are 19 of them I think). In addition, one should be able
to extend this by introduction of a blackbox transformation.
And ... here's the rub...if that's true for this case, then
other "hard" examples like run-length encoding seem also in this
category.
There's several "leaps of faith" just made in these arguments,
so I'd still like people to take this "XML challenge" and see if there's
some magic I'm overlooking.
...mikeb
________________________________
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL
example problem - IBM Format VS rec ords as XML
Without digging too much into the details, I'd say this
is an example where multi-layer comes in. The DFDL would describe a
hidden layer in which the first, middle, last data elements would be
identified and put into a list, and then that hidden list would be used
as the input to create items in the output layer.
I think this is conceptually similar to one of our
run-length encoding examples (more complex of course). If you read a
sequence of ints and then a sequence of floats and need to output a
sequence of floats with int[i] repeats of float[i], it would be easiest
to create a hidden layer representing the int and float sequences and to
then produce output from that. If you don't think about a layer, even
this example gets painful - I need to read an int, skip forward
somewhere to find a float, skip back to get the next int, etc.
Mike's full example, not starting with the XML-ized
version, might be something that requires more than one layer - read the
original into something with the XML schema Mike defines, then a layer
making a sequence of data elements, and then something that has the
desired logical output.
I guess I would claim that this would not be too bad a
way to describe a fairly complex format in terms of a fairly different
logical structure. Whether one *should* do this in DFDL, or whether it
would make more sense to a) write a black box parser to get to items, or
b) use DFDL to get to the initial schema Mike wrote and use XSLT
afterwards to convert to the desired logical structure, is a separate question. I think there
are enough cases where we need the multilayer functionality in DFDL that
are relatively simple that we have to have it, which means it will then
be possible to deal with complex transformations in DFDL even if not
simple/practical.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org
[mailto:owner-dfdl-wg@ggf.org] On Behalf Of
mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example
problem - IBM Format VS rec ords as XML
I've come up with a way to articulate the
difficulties I'm having with DFDL for complex file formats.
This problem may not be that hard for someone
with more XML, XPath or XQuery experience, so I'd appreciate it if you
could look it over and if necessary even run it by your resident XML
experts.
In case the emailer mangles all the line
lengths, I've also attached the below as a file.
<!-- Example motivated by DFDL for IBM Format-VS
-->
<!-- see http://tinyurl.com/3s2bq for details on
IBM Format-VS -->
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema
-->
<sequence>
<element name="ITEM" type="string"
minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<!-- But the below is the input data we're
starting from. What you see below simulates
the structural issues of IBM Format-VS, but
converting the problem into an XML to XML
transformation problem -->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole
item (Duh!). This element is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the
first part of an item. -->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data
from the center of an item -->
<DATA>s is t</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/>
<DATA>he sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from
the end of the item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/><!-- This second segment in this
block is a WHOLE segment. However
in general the 2nd segment of a
block could be a WHOLE or the
FIRST segment of another
multi-segment multi-block spanning item -->
<DATA>Third item</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the
item spans 2 or more SEGMENTs -->
<!-- Spanning data is broken on arbitrary
boundaries across segments it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST
segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments.
-->
<!-- The question: how can we express the
transformation into the desired logical form?
Or is this beyond the call of duty for
DFDL?
Goals include to be as declarative as
possible, and ideally, do it as a set of
XML Schema annotations in the GGF DFDL
style. -->
<!-- here's an XSD (untested) for the input data
structure -->
<complexType name="Format_VS_t">
<sequence>
<element name="BLOCK" type="Block_t"
minOccurs="0" maxOccurs="unbounded"/>
</sequence>
</complexType>
<complexType name="Block_t">
<sequence>
<element name="SEGMENT"
type="Segment_t" minOccurs="1" maxOccurs="2"/>
</sequence>
</complexType>
<complexType name="Segment_t">
<sequence>
<choice>
<element name="WHOLE">
</element>
<element name="FIRST">
</element>
<element name="LAST">
</element>
<element name="MIDDLE">
</element>
</choice>
<element name="DATA" type="string"/>
</sequence>
</complexType>

Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS rec ords as XML
by Suman Kalia 19 Nov '04
I tend to agree that there are 2 inherent logical structures in this scenario.
DFDL scope in my opinion should be restricted to parsing the physical
stream and populating the logical structure which is compliant with the
structure of physical stream and vice versa. We have numerous options and
technologies (XSLT, XSD<->XSD mappers, good old programming languages,
XQuery) which do a pretty good job of transforming one logical structure to
another logical structure. Building some kinds of annotations which would
allow a physical stream to map to a completely different logical structure
will make the DFDL language very complex.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia(a)ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----
"Myers, James D" <jim.myers(a)pnl.gov>
Sent by: owner-dfdl-wg(a)ggf.org
11/19/2004 11:05 AM
To
dfdl-wg(a)gridforum.org
cc
Subject
RE: [dfdl-wg] simple way to study hard DFDL example problem - IBMFormat VS
rec ords as XML
I was thinking that step 1 involved recognizing the <first/> and <data>
elements and creating a sequence of <myfirst>here's the data</myfirst>,
<mymiddle>more data</mymiddle> and <mylast>... elements and then
assembling the new layer by some sort of choice to concatenate the
relevant myfirst, optional mymiddle, and mylast elements for each item.
I think that requires a way to make a choice based on the <first/>,
<middle/>, <last/> elements and populate either a <myfirst>, <mymiddle>,
or <mylast> elements (all subtypes of string?) with the contents of the
following data element, which I think we can do in DFDL. This is just our
standard choice flag that decides which of several options exist.
Then, I think you'd need logic to decide how many elements represent one
item, which I think we have, followed by a way to concatenate these
elements to produce a string source, which again I think we have (same as
saying a complex can be built from two floats referenced from another
layer instead of from a float stream). This part is the same problem as
having a text file where one <CR> separates lines and <CR><CR> separates
paragraphs and you want to create single strings (from a variable number
of lines) for each paragraph.
Again, I won't argue that this is simple and fun, but I think the
machinery exists and is the same as that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of
mike.beckerle(a)ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem -
IBMFormat VS rec ords as XML
You are thinking along the lines I was; however, the challenge is that I
cannot find a way to do this using multilayer so I'm uncomfortable
suggesting that it's possible at all anymore. Here's some reasoning why.
In particular, it's the intersection of the induction across the items
with the first, middle*, last thing, and the spanning that seems to defy
my efforts to cut it up into progressive transformation layer by layer. In
some conversations I've referred to this problem as the "non-conforming
trees" problem. The fundamental shapes of the trees are not compatible,
and expressing the transformation between them isn't easily done via
induction of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's effectively
a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs
interesting and complex is always the way you diagnose malformed data in
addition to recognizing correct data.
Now, a finite-state-machine is, to my mind, the ultimate procedural
abstraction, the quintessential opposite of "declarative" expression. To
be declarative about a FSM you end up saying "recognize this regular
language", and providing a description of the regular language, which is
of course, just begging the question of how it actually works.
(And for us, we're not really talking about a regular language of
character text, but a pattern of usage in the binary data layout that
obeys the pattern of a regular language. So it's not like having a little
regular expression thing for validating text strings helps with this
problem.)
I guess I'm arguing that a black box approach to this is not only
acceptable, but is highly likely to be the only "good" way to do it. In
light of this I've suggested a rep property called "streamFormat" (perhaps
should be renamed "recordFormat"), which gets values from the set VS, V,
VBS, FB, FBS, etc. etc. all these well-defined legacy data formats (there
are 19 of them I think). In addition, one should be able to extend this by
introduction of a blackbox transformation.
And ... here's the rub...if that's true for this case, then other "hard"
examples like run-length encoding seem also in this category.
There's several "leaps of faith" just made in these arguments, so I'd
still like people to take this "XML challenge" and see if there's some
magic I'm overlooking.
...mikeb
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg(a)gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS rec ords as XML
Without digging too much into the details, I'd say this is an example
where multi-layer comes in. The DFDL would describe a hidden layer in
which the first, middle, last data elements would be identified and put
into a list, and then that hidden list would be used as the input to
create items in the output layer.
I think this is conceptually similar to one of our run-length encoding
examples (more complex of course). If you read a sequence of ints and then
a sequence of floats and need to output a sequence of floats with int[i]
repeats of float[i], it would be easiest to create a hidden layer
representing the int and float sequences and to then produce output from
that. If you don't think about a layer, even this example gets painful - I
need to read an int, skip forward somewhere to find a float, skip back to
get the next int, etc.
Mike's full example, not starting with the XML-ized version, might be
something that requires more than one layer - read the original into
something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired logical
output.
I guess I would claim that this would not be too bad a way to describe a
fairly complex format in terms of a fairly different logical structure.
Whether one *should* do this in DFDL, or whether it would make more sense
to a) write a black box parser to get to items, or b) use DFDL to get to
the initial schema Mike wrote and use XSLT afterwards to convert to the
desired logical structure. I think there are enough cases where we need
the multilayer functionality in DFDL that are relatively simple that we
have to have it, which means it will then be possible to deal with complex
transformations in DFDL even if not simple/practical.
Jim
-----Original Message-----
From: owner-dfdl-wg(a)ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf Of
mike.beckerle(a)ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg(a)gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML
I've come up with a way to articulate the difficulties I'm having with
DFDL for complex file formats.
This problem may not be that hard for someone with more XML, XPath or
XQuery experience, so I'd appreciate it if you could look it over and, if
necessary, even run it by your resident XML experts.
In case the emailer mangles all the line lengths, I've also attached the
below as a file.
<!-- Example motivated by DFDL for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq for details on IBM Format-VS -->
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema -->
<sequence>
<element name="ITEM" type="string" minOccurs="0" maxOccurs="unbounded"/>
</sequence>
<!-- But the below is the input data we're starting from. What you see
below simulates the structural issues of IBM Format-VS, but converts the
problem into an XML-to-XML transformation problem -->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!). This element
is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the first part of an item. -->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data from the center of an item
-->
<DATA>s is t</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/>
<DATA>he sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from the end of the item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/><!-- This second segment in this block is a WHOLE segment.
However, in general the 2nd segment of a block could be a WHOLE or the
FIRST segment of another multi-segment, multi-block spanning item -->
<DATA>The third</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or more SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across the segments it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into the desired
logical form? Or is this beyond the call of duty for DFDL?
Goals include being as declarative as possible and, ideally, doing it as
a set of XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure -->
<complexType name="Format_VS_t">
<sequence>
<element name="BLOCK" type="Block_t" minOccurs="0"
maxOccurs="unbounded"/>
</sequence>
</complexType>
<complexType name="Block_t">
<sequence>
<element name="SEGMENT" type="Segment_t" minOccurs="1"
maxOccurs="2"/>
</sequence>
</complexType>
<complexType name="Segment_t">
<sequence>
<choice>
<element name="WHOLE">
</element>
<element name="FIRST">
</element>
<element name="LAST">
</element>
<element name="MIDDLE">
</element>
</choice>
<element name="DATA" type="string"/>
</sequence>
</complexType>
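For comparison, here is a minimal procedural sketch in Python of the transformation the challenge asks for. It is not DFDL and is not offered as the declarative answer. It only pins down what any solution has to compute. The ROOT and ITEMS wrapper elements and all function names are assumptions made for illustration, since the fragments in the post have no single root element.

```python
import xml.etree.ElementTree as ET

# Sample input from the post, wrapped in a hypothetical ROOT element.
SAMPLE = """<ROOT>
<BLOCK><SEGMENT><WHOLE/><DATA>The first item</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><FIRST/><DATA>Thi</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>s is t</DATA></SEGMENT></BLOCK>
<BLOCK><SEGMENT><MIDDLE/><DATA>he sec</DATA></SEGMENT></BLOCK>
<BLOCK>
  <SEGMENT><LAST/><DATA>ond item</DATA></SEGMENT>
  <SEGMENT><WHOLE/><DATA>The third</DATA></SEGMENT>
</BLOCK>
</ROOT>"""


def segments(root):
    """Hidden layer: flatten all BLOCKs into (type tag, data) pairs."""
    for seg in root.iter("SEGMENT"):
        tag = seg[0].tag  # first child is the WHOLE/FIRST/MIDDLE/LAST tag
        yield tag, seg.findtext("DATA", default="")


def items(pairs):
    """Small state machine that reassembles items from segment pairs."""
    parts = None  # None means we are between items
    for tag, data in pairs:
        if tag == "WHOLE" and parts is None:
            yield data
        elif tag == "FIRST" and parts is None:
            parts = [data]
        elif tag == "MIDDLE" and parts is not None:
            parts.append(data)
        elif tag == "LAST" and parts is not None:
            parts.append(data)
            yield "".join(parts)
            parts = None
        else:
            raise ValueError(f"unexpected {tag} segment")
    if parts is not None:
        raise ValueError("input ended inside a spanning item")


def to_logical(xml_text):
    """Build the logical ITEM sequence under a hypothetical ITEMS root."""
    out = ET.Element("ITEMS")
    for text in items(segments(ET.fromstring(xml_text))):
        ET.SubElement(out, "ITEM").text = text
    return out


print(ET.tostring(to_logical(SAMPLE), encoding="unicode"))
```

Note that the BLOCK boundaries play no part once the segments are flattened. All of the difficulty sits in the state carried from one segment to the next.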

RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by mike.beckerle@ascentialsoftware.com 19 Nov '04
I believe you and Jim are actually disagreeing. Jim is saying he's still
optimistic that this transformation, even though complex, can be expressed
directly in DFDL. You are saying this would require XSLT or a Java program
or whatever to do it.
>
> Mike you say you are aware of 19 such legacy formats, and I
> bet there are more. Well IBM's broker has no specific support
> for any of these, nor have we been asked to incorporate them
> into our message model. Maybe we should play the percentages
> game - if we see enough different subsystems that use the
> same cryptic format then it becomes worth building the
> support into DFDL.
>
Ascential supports 6 or 7 of these formats today. Batch systems will
encounter this more than online ones. You get them when a mainframe job
writes out a tape, and you then read that tape on a Unix tape drive, either
directly or first into a file. Alternatively, you pick up a mainframe file
via FTP or some such and operate on it directly on other systems.
Mainframe software handles all the VS blocking and such in the lower
layers, as you know (not to mention the tape label); Unix software does
none of this, so you just get the raw bytes.
My point is not as much about these 19 or more particular formats, but the
issue of how much complexity we go after.
In the past we've looked at things like logical arrays with
run-length-encoded representations and the suggestion has been there that
DFDL might be able to directly express this transformation without need to
go outside DFDL.
I've come to believe there are certain limits to this complexity, and I
think perhaps tree-shape compatibility is at the core of them. Building a
DFDL description for data that ultimately requires a parser of LR(k)
sophistication to interpret correctly seems clearly a non-starter. Where
this line is drawn is important.
...mikeb

Fw: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by Steve Hanson 19 Nov '04
I agree with Jim that two DFDL layers are required, one that describes the
original logical structure and one to describe the desired logical
structure. The key thing to recognise is that there are two logical
structures here, and that a transformation of some kind (XSL, Java program,
...) is required to get one from the other.
I don't think we should get DFDL to treat IBM Format VS records as a purely
physical representation of some ideal logical structure - that gets way too
complicated and imposes a big burden on all DFDL implementations.
This is a pretty subjective area - it poses the philosophical question
"when does the physical format become so cryptic that it can be viewed as
changing the logical structure itself".
A structure that asks the same question is an IMS segment. These impose
themselves on the data such that the data is carved into segments that are
preceded with an LLZZ field, the LL containing the segment length. Do you
view the logical structure as a sequence of segments, or do you view it as
the content of the segments where the owning segment # is a physical
property of each field? On a project I worked on in the past, we took the
latter view, which meant that this IMS specific concept found its way into
the physical model, and we had to write specific code to parse & write
segments. I am not convinced that was the right decision.
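To make the first of those two views concrete, here is a rough Python sketch that treats the data as a sequence of LLZZ-prefixed segments and leaves interpretation of the content to a later step. The details are assumptions rather than anything stated above: LL is taken to be a 2-byte big-endian count that includes the 4-byte LLZZ prefix itself, ZZ is read but ignored, and the function name is made up.

```python
import struct


def split_llzz_segments(buf):
    """Split a byte buffer into segment payloads using LLZZ prefixes.

    Assumption: LL is a big-endian unsigned 16-bit length that counts the
    whole segment, including the 4 bytes of LLZZ. ZZ is not interpreted.
    """
    payloads = []
    pos = 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            raise ValueError("truncated LLZZ prefix")
        ll, _zz = struct.unpack_from(">HH", buf, pos)
        if ll < 4 or pos + ll > len(buf):
            raise ValueError(f"bad segment length {ll} at offset {pos}")
        payloads.append(buf[pos + 4:pos + ll])
        pos += ll
    return payloads


example = struct.pack(">HH", 7, 0) + b"abc" + struct.pack(">HH", 4, 0)
print(split_llzz_segments(example))
```

Under the second view, this splitting would be hidden inside the parser and each field would instead carry its owning segment number as a physical property.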
Mike, you say you are aware of 19 such legacy formats, and I bet there are
more. Well, IBM's broker has no specific support for any of these, nor have
we been asked to incorporate them into our message model. Maybe we should
play the percentages game - if we see enough different subsystems that use
the same cryptic format then it becomes worth building the support into
DFDL.
Regards, Steve
Steve Hanson
WebSphere Business Integration Brokers,
IBM Hursley, England
Internet: smh(a)uk.ibm.com
Phone (+44)/(0) 1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 19/11/2004 16:13 -----
From: mike.beckerle@ascentialsoftware.com
Sent by: owner-dfdl-wg@ggf.org
To: jim.myers(a)pnl.gov, dfdl-wg(a)gridforum.org
Date: 19/11/2004 15:43
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM
Format VS records as XML
You are thinking along the lines I was; however, the challenge is that I
cannot find a way to do this using multilayer so I'm uncomfortable
suggesting that it's possible at all anymore. Here's some reasoning why.
In particular, it's the intersection of the induction across the items with
the first, middle*, last thing, and the spanning that seems to defy my
efforts to cut it up into progressive transformation layer by layer. In
some conversations I've referred to this problem as the "non-conforming
trees" problem. The fundamental shapes of the trees are not compatible, and
expressing the transformation between them isn't easily done via induction
of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's effectively
a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs
interesting and complex is always the way you diagnose malformed data in
addition to recognizing correct data.
Now, a finite-state-machine is, to my mind, the ultimate procedural
abstraction, the quintessential opposite of "declarative" expression. To be
declarative about a FSM you end up saying "recognize this regular
language", and providing a description of the regular language, which is of
course, just begging the question of how it actually works.
(And for us, we're not really talking about a regular language of character
text, but a pattern of usage in the binary data layout that obeys the
pattern of a regular language. So it's not like having a little regular
expression thing for validating text strings helps with this problem.)
I guess I'm arguing that a black box approach to this is not only
acceptable, but is highly likely to be the only "good" way to do it. In
light of this I've suggested a rep property called "streamFormat" (perhaps
should be renamed "recordFormat"), which gets values from the set VS, V,
VBS, FB, FBS, etc. etc. all these well-defined legacy data formats (there
are 19 of them I think). In addition, one should be able to extend this by
introducing a black-box transformation.
And ... here's the rub ... if that's true for this case, then other "hard"
examples like run-length encoding seem to be in this category too.
There are several "leaps of faith" in these arguments, so I'd still
like people to take this "XML challenge" and see if there's some magic I'm
overlooking.
...mikeb

RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by mike.beckerle@ascentialsoftware.com 19 Nov '04
You are thinking along the lines I was; however, the challenge is that I
cannot find a way to do this using multilayer so I'm uncomfortable
suggesting that it's possible at all anymore. Here's some reasoning why.
In particular, it's the intersection of the induction across the items with
the first, middle*, last thing, and the spanning that seems to defy my
efforts to cut it up into progressive transformation layer by layer. In some
conversations I've referred to this problem as the "non-conforming trees"
problem. The fundamental shapes of the trees are not compatible, and
expressing the transformation between them isn't easily done via induction
of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's effectively a
little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs
interesting and complex is always the way you diagnose malformed data in
addition to recognizing correct data.
Now, a finite-state-machine is, to my mind, the ultimate procedural
abstraction, the quintessential opposite of "declarative" expression. To be
declarative about a FSM you end up saying "recognize this regular language",
and providing a description of the regular language, which is of course,
just begging the question of how it actually works.
(And for us, we're not really talking about a regular language of character
text, but a pattern of usage in the binary data layout that obeys the
pattern of a regular language. So it's not like having a little regular
expression thing for validating text strings helps with this problem.)
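To make the "little regular language" concrete, here is a small Python sketch that writes each segment type tag as a single letter and states the legal sequences as a pattern. The letter encoding and the names are assumptions for illustration only. It applies only after the type tags have been extracted from the data, which is the part a text regular expression does not help with.

```python
import re

# Hypothetical one-letter encoding of the segment type tags.
LETTER = {"WHOLE": "W", "FIRST": "F", "MIDDLE": "M", "LAST": "L"}

# Declarative statement of the language: any number of items, where each
# item is either one WHOLE or a FIRST, zero or more MIDDLEs, and a LAST.
WELL_FORMED = re.compile(r"(?:W|FM*L)*")


def is_well_formed(tags):
    """Return True if the tag sequence belongs to the regular language."""
    word = "".join(LETTER[tag] for tag in tags)
    return WELL_FORMED.fullmatch(word) is not None


print(is_well_formed(["WHOLE", "FIRST", "MIDDLE", "MIDDLE", "LAST", "WHOLE"]))
print(is_well_formed(["FIRST", "WHOLE"]))
```

The pattern says which sequences are legal, but it gives no diagnosis of where a malformed sequence went wrong and does none of the reassembly, which is the gap described above.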
I guess I'm arguing that a black box approach to this is not only
acceptable, but is highly likely to be the only "good" way to do it. In
light of this I've suggested a rep property called "streamFormat" (perhaps
should be renamed "recordFormat"), which gets values from the set VS, V,
VBS, FB, FBS, etc. etc. all these well-defined legacy data formats (there
are 19 of them I think). In addition, one should be able to extend this by
introducing a black-box transformation.
And ... here's the rub ... if that's true for this case, then other "hard"
examples like run-length encoding seem to be in this category too.
There are several "leaps of faith" in these arguments, so I'd still
like people to take this "XML challenge" and see if there's some magic I'm
overlooking.
...mikeb

RE: [dfdl-wg] simple way to study hard DFDL example problem - IBM Format VS records as XML
by Myers, James D 19 Nov '04
Without digging too much into the details, I'd say this is an example
where multi-layer comes in. The DFDL would describe a hidden layer in
which the first, middle, last data elements would be identified and put
into a list, and then that hidden list would be used as the input to
create items in the output layer.
I think this is conceptually similar to one of our run-length encoding
examples (more complex of course). If you read a sequence of ints and
then a sequence of floats and need to output a sequence of floats with
int[i] repeats of float[i], it would be easiest to create a hidden layer
representing the int and float sequences and then to produce output from
that. If you don't think about a layer, even this example gets painful -
I need to read an int, skip forward somewhere to find a float, skip back
to get the next int, etc.
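As a rough illustration of the run-length case, the Python sketch below treats the two sequences as an already-parsed hidden layer and derives the output from it in one pass. The function name and the choice to reject mismatched lengths are assumptions, not part of any DFDL proposal.

```python
from itertools import chain, repeat


def expand_run_lengths(counts, values):
    """Output layer: repeat values[i] exactly counts[i] times, in order.

    counts and values together stand in for the hidden layer, that is,
    the int sequence and the float sequence as they were read.
    """
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    return list(chain.from_iterable(
        repeat(value, count) for count, value in zip(counts, values)
    ))


print(expand_run_lengths([2, 0, 3], [1.5, 9.0, -4.0]))
```

With the hidden layer in hand, the output is a simple pairing of the two sequences, and no skipping back and forth through the input is needed.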
Mike's full example, not starting with the XML-ized version, might be
something that requires more than one layer - read the original into
something with the XML schema Mike defines, then a layer making a
sequence of data elements, and then something that has the desired
logical output.
I guess I would claim that this would not be too bad a way to describe a
fairly complex format in terms of a fairly different logical structure.
Whether one *should* do this in DFDL is a separate question: it might
make more sense to a) write a black box parser to get to items, or b) use
DFDL to get to the initial schema Mike wrote and use XSLT afterwards to
convert to the desired logical structure. I think there are enough
relatively simple cases where we need the multilayer functionality in
DFDL that we have to have it, which means it will then be possible to deal
with complex transformations in DFDL even if doing so is not simple or
practical.
Jim