Jim -- I agree with most of your assertions, and you have phrased it
right: "relatively compliant with physical structure". Examples from
programming languages include the COBOL "OCCURS DEPENDING ON" clause and,
as you mentioned, "a previous value in the structure indicating which
field in the choice will be present or how many occurrences a subsequent
field will have". These constructs occur quite frequently in programming
structures.
I think the DFDL standard is addressing a very critical requirement,
"rendering a logical structure to physical format and vice versa", which
to my knowledge no other public standard has addressed so far, and this
work is/will be very complementary to other standards.
Suman Kalia
IBM Toronto Lab
WebSphere Business Integration Application Connectivity Tools
Tel : 905-413-3923 T/L 969-3923
Fax : 905-413-4850
Internet ID : kalia@ca.ibm.com
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 02:05 PM -----
From: "Myers, James D" <jim.myers@pnl.gov>
Sent by: owner-dfdl-wg@ggf.org
Sent: 11/19/2004 12:34 PM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
Unfortunately, there's a slippery slope here - there are no ints on the
disk, just logical ones and zeros that you can transform into a second
logical structure composed of ints, assuming you specify byte order. I
think we have a whole stream of examples beyond that - removing delimiters,
using a length prefix to define the length of a subsequent structure,
etc. - that we see as minor transformations to something still relatively
"compliant" with the physical structure, but that, I believe, require the
same machinery as things I think we will all agree are beyond the scope of
what DFDL should aim for.
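As a rough illustration of the "minor transformations" mentioned above (not taken from any DFDL draft), the Python sketch below shows that the same four bytes only become an int once a byte order is chosen, and that a length prefix then sizes the field that follows. The record layout (4-byte unsigned length, then payload) and the name read_prefixed_field are assumptions made for this example.

```python
import struct

def read_prefixed_field(buf, byte_order=">"):
    """Read a 4-byte unsigned length prefix, then that many payload bytes.

    byte_order is a struct prefix: ">" for big-endian, "<" for little-endian.
    The layout (4-byte length, then payload) is a hypothetical format.
    """
    (length,) = struct.unpack_from(byte_order + "I", buf, 0)
    payload = buf[4:4 + length]
    if len(payload) != length:
        raise ValueError("length prefix runs past the end of the data")
    return length, payload

# The same four bytes yield different ints depending on the byte order.
raw = b"\x00\x00\x00\x05hello"
big = struct.unpack_from(">I", raw, 0)[0]
little = struct.unpack_from("<I", raw, 0)[0]
```

Read as big-endian, the prefix describes the five payload bytes; read as little-endian, it claims far more data than exists, so the function raises an error.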
In practice, I think people should get out of DFDL as soon as possible,
just as you say - use other technologies once you get an initial
structure. But I think there are cases where you have to stay in DFDL -
anything where I have to transform the initial physically-compliant
structure to interpret subsequent fields - x and y ints tell me how many
pixel repeats, an int greater than another int read previously implies
a different subsequent structure, etc. And again, the minimal machinery
to do that lets you go farther than you'd want people to go in practice.
There may also be reasonable use
cases where the ability to stay in DFDL is important. For example, take
digital preservation, where I might want to map all document files to a
standardized schema, regardless of whether it was Word, PDF, etc. Being
able to specify the full descriptions in one file that then requires only
one parser to interpret all formats *might* be worth the cost to do complex
things in DFDL. I don't think our goal for a version 1 should be to support
such use, but I don't think we can meet our simple goals without 'accidentally'
making it possible.
I'd be happy to be proved wrong
- seems like a deep point that would be cool to understand. I'm not sure
how we get to a 'proof' though - we're trying to prove that there exists
something DFDL as currently formulated can't describe. So - we either need
to find that example or turn to some sort of logic formalism to discover
what primitive(s) we're missing that keep us from emulating some class of
parser/programming. (Or find something in DFDL that we don't need to support
the examples we do want to target...).
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf
Of Suman Kalia
Sent: Friday, November 19, 2004 11:50 AM
To: dfdl-wg@gridforum.org
Subject: Fw: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
I tend to agree that there are two inherent logical structures in this
scenario. DFDL's scope, in my opinion, should be restricted to parsing the
physical stream and populating a logical structure which is compliant with
the structure of the physical stream, and vice versa. We have numerous
options and technologies (XSLT, XSD<->XSD mappers, good old programming
languages, XQuery) which do a pretty good job of transforming one logical
structure into another logical structure. Building annotations which
would allow a physical stream to map to a completely different logical
structure will make the DFDL language very complex.
Suman Kalia
----- Forwarded by Suman Kalia/Toronto/IBM on 11/19/2004 11:36 AM -----
From: "Myers, James D" <jim.myers@pnl.gov>
Sent by: owner-dfdl-wg@ggf.org
Sent: 11/19/2004 11:05 AM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
I was thinking that step 1 involved recognizing the <first/> and
<data> elements and creating a sequence of <myfirst>here's
the data</myfirst>, <mymiddle>more data</mymiddle> and
<mylast>... elements and then assembling the new layer by some sort
of choice to concatenate the relevant myfirst, optional mymiddle, and mylast
elements for each item.
I think that requires a way to make a choice based on the <first/>,
<middle/>, <last/> elements and populate either a <myfirst>,
<mymiddle>, or <mylast> element (all subtypes of string?)
with the contents of the following data element, which I think we can do
in DFDL. This is just our standard choice flag that decides which of several
options exist.
Then, I think you'd need logic to decide how many elements represent one
item, which I think we have, followed by a way to concatenate these elements
to produce a string source, which again I think we have (same as saying
a complex can be built from two floats referenced from another layer instead
of from a float stream). This part is the same problem as having a text
file where one <CR> separates lines and <CR><CR> separates
paragraphs and you want to create single strings (from a variable number
of lines) for each paragraph.
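To make the text-file analogy concrete, here is a minimal Python sketch, offered only as an illustration of the grouping step and not as a DFDL mechanism. The function name paragraphs and the choice to join lines with a single space are assumptions for the example.

```python
def paragraphs(text, cr="\r"):
    """Join lines into one string per paragraph.

    A single cr separates lines; two consecutive cr separate paragraphs.
    Joining the lines of a paragraph with a space is an assumption.
    """
    result = []
    for block in text.split(cr + cr):
        # Drop empty pieces left over from stray separators.
        lines = [line for line in block.split(cr) if line]
        if lines:
            result.append(" ".join(lines))
    return result

sample = "line one\rline two\r\rsecond paragraph"
joined = paragraphs(sample)
```

Each paragraph comes out as one string built from a variable number of input lines, which is the same shape of problem as building one item from a variable number of segments.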
Again, I won't argue that this is simple and fun, but I think the machinery
exists and is the same as that from our simple examples.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf
Of mike.beckerle@ascentialsoftware.com
Sent: Friday, November 19, 2004 10:44 AM
To: Myers, James D; dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
You are thinking along the lines I was; however, the challenge is that
I cannot find a way to do this using multilayer so I'm uncomfortable suggesting
that it's possible at all anymore. Here's some reasoning why.
In particular, it's the intersection of the induction across the items
with the first, middle*, last thing, and the spanning that seems to defy
my efforts to cut it up into progressive transformation layer by layer.
In some conversations I've referred to this problem as the "non-conforming
trees" problem. The fundamental shapes of the trees are not compatible,
and expressing the transformation between them isn't easily done via induction
of any kind on one or the other of the trees.
To me the First, Middle*, Last thing is very problematic. It's effectively
a little regular language (in the formal sense) that has to be recognized.
Generally this requires a finite-state-machine, and what makes FSMs interesting
and complex is always the way you diagnose malformed data in addition to
recognizing correct data.
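For readers who want to see the procedural version being described, the Python sketch below is one possible two-state recognizer for the pattern (WHOLE | FIRST MIDDLE* LAST)*. It is an illustration only: the function name assemble_items, the (tag, data) tuple input, and the error messages are all assumptions, and most of the code is indeed spent diagnosing malformed input.

```python
def assemble_items(segments):
    """Reassemble items from (tag, data) pairs.

    Accepts the pattern (WHOLE | FIRST MIDDLE* LAST)* and raises
    ValueError on anything else. Two states: outside an item, inside one.
    """
    items = []
    parts = None  # None means "outside an item"; a list means "inside"
    for position, (tag, data) in enumerate(segments):
        if parts is None:
            if tag == "WHOLE":
                items.append(data)
            elif tag == "FIRST":
                parts = [data]
            else:
                raise ValueError(f"segment {position}: {tag} without FIRST")
        else:
            if tag == "MIDDLE":
                parts.append(data)
            elif tag == "LAST":
                parts.append(data)
                items.append("".join(parts))
                parts = None
            else:
                raise ValueError(f"segment {position}: {tag} inside an open item")
    if parts is not None:
        raise ValueError("input ended before a LAST segment")
    return items

example = [("WHOLE", "one"), ("FIRST", "tw"), ("MIDDLE", "o-and-"),
           ("LAST", "a-half"), ("WHOLE", "three")]
result = assemble_items(example)
```

Note that the recognizer ignores block boundaries entirely; it only needs the segments in order, which is part of why the block tree and the item tree do not line up.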
Now, a finite-state-machine is, to my mind, the ultimate procedural abstraction,
the quintessential opposite of "declarative" expression. To be
declarative about a FSM you end up saying "recognize this regular
language", and providing a description of the regular language, which
is of course, just begging the question of how it actually works.
(And for us, we're not really talking about a regular language of character
text, but a pattern of usage in the binary data layout that obeys the pattern
of a regular language. So it's not like having a little regular expression
thing for validating text strings helps with this problem.)
I guess I'm arguing that a black box approach to this is not only acceptable,
but is highly likely to be the only "good" way to do it. In light
of this I've suggested a rep property called "streamFormat" (perhaps it
should be renamed "recordFormat"), which takes its value from the
set VS, V, VBS, FB, FBS, etc. - all the well-defined legacy data formats
(there are 19 of them, I think). In addition, one should be able to
extend this set by introducing a black-box transformation.
And ... here's the rub...if that's true for this case, then other "hard"
examples like run-length encoding seem also in this category.
There are several "leaps of faith" just made in these arguments,
so I'd still like people to take this "XML challenge" and see
if there's some magic I'm overlooking.
...mikeb
From: Myers, James D [mailto:jim.myers@pnl.gov]
Sent: Friday, November 19, 2004 9:52 AM
To: dfdl-wg@gridforum.org
Subject: RE: [dfdl-wg] simple way to study hard DFDL example problem
- IBM Format VS records as XML
Without digging too much into the details, I'd say this is an example where
multi-layer comes in. The DFDL would describe a hidden layer in which the
first, middle, last data elements would be identified and put into a list,
and then that hidden list would be used as the input to create items in
the output layer.
I think this is conceptually similar to one of our run-length encoding
examples (more complex of course). If you read a sequence of ints and then
a sequence of floats and need to output a sequence of floats with int[i]
repeats of float[i], it would be easiest to create a hidden layer representing
the int and float sequences and to then produce output from that. If you
don't think about a layer, even this example gets painful - I need to read
an int, skip forward somewhere to find a float, skip back to get the next
int, etc.
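The hidden-layer idea for this run-length example can be sketched in Python as two steps: first parse both sequences into an intermediate structure, then produce the output from it. The binary layout used here (a big-endian int32 count n, then n int32 values, then n float32 values) and both function names are assumptions made for illustration; the thread does not specify a layout.

```python
import struct

def read_hidden_layer(buf):
    """Parse a hypothetical layout: count n, n int32, n float32 (big-endian)."""
    (n,) = struct.unpack_from(">i", buf, 0)
    counts = list(struct.unpack_from(f">{n}i", buf, 4))
    values = list(struct.unpack_from(f">{n}f", buf, 4 + 4 * n))
    return counts, values

def expand_run_lengths(counts, values):
    """Return values[i] repeated counts[i] times, in order."""
    if len(counts) != len(values):
        raise ValueError("counts and values must have the same length")
    output = []
    for count, value in zip(counts, values):
        if count < 0:
            raise ValueError("repeat counts must be non-negative")
        output.extend([value] * count)
    return output

# Two runs: 1.5 repeated three times, then 2.5 once.
data = struct.pack(">i2i2f", 2, 3, 1, 1.5, 2.5)
counts, values = read_hidden_layer(data)
expanded = expand_run_lengths(counts, values)
```

With the intermediate lists in hand, the output step is a simple loop; without them, the reader has to seek back and forth between the int and float regions for every run.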
Mike's full example, not starting with the XML-ized version, might be something
that requires more than one layer - read the original into something with
the XML schema Mike defines, then a layer making a sequence of data elements,
and then something that has the desired logical output.
I guess I would claim that this would not be too bad a way to describe
a fairly complex format in terms of a fairly different logical structure.
Whether one *should* do this in DFDL, or whether it would make more sense
to a) write a black box parser to get to items, or b) use DFDL to get to
the initial schema Mike wrote and use XSLT afterwards to convert to the
desired logical structure, is a separate question. I think there are enough
cases where we need the multilayer functionality in DFDL that are relatively
simple that we have to have it, which means it will then be possible to deal
with complex transformations in DFDL even if not simple/practical.
Jim
-----Original Message-----
From: owner-dfdl-wg@ggf.org [mailto:owner-dfdl-wg@ggf.org] On Behalf
Of mike.beckerle@ascentialsoftware.com
Sent: Thursday, November 18, 2004 9:53 PM
To: dfdl-wg@gridforum.org
Subject: [dfdl-wg] simple way to study hard DFDL example problem -
IBM Format VS records as XML
I've come up with a way to articulate the difficulties I'm having with
DFDL for complex file formats.
This problem may not be that hard for someone with more XML, XPath or XQuery
experience, so I'd appreciate it if you could look it over and if necessary
even run it by your resident XML experts.
In case the emailer mangles all the line lengths, I've also attached the
below as a file.
<!-- Example motivated by DFDL for IBM Format-VS -->
<!-- see http://tinyurl.com/3s2bq
for details on IBM Format-VS -->
<!-- Logically, our data is this: -->
<ITEM>The first item</ITEM>
<ITEM>This is the second item</ITEM>
<ITEM>The third</ITEM>
<!-- That is, data having this "logical" schema -->
<!-- But the below is the input data we're starting from. What you see
below simulates the structural issues of IBM Format-VS, by converting
the problem into an XML to XML transformation problem -->
<BLOCK>
<SEGMENT>
<WHOLE/> <!-- a WHOLE segment holds a whole item (Duh!).
This element is really a type tag. -->
<DATA>The first item</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<FIRST/> <!-- a FIRST segment holds the first part of
an item. -->
<DATA>Thi</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<MIDDLE/> <!-- a MIDDLE segment holds data from the center
of an item -->
<DATA>s is the sec</DATA>
</SEGMENT>
</BLOCK>
<BLOCK>
<SEGMENT>
<LAST/> <!-- a LAST segment holds data from the end of
the item. -->
<DATA>ond item</DATA>
</SEGMENT>
<SEGMENT>
<WHOLE/> <!-- This second segment in this block is a WHOLE segment.
However, in general the 2nd segment of a block could be a WHOLE or
the FIRST segment of another multi-segment, multi-block spanning item. -->
<DATA>The third</DATA>
</SEGMENT>
</BLOCK>
<!-- Some observations: -->
<!-- Data is organized into BLOCKs -->
<!-- Each block contains 1 or 2 SEGMENTs -->
<!-- Each SEGMENT is either a WHOLE item, or the item spans 2 or more
SEGMENTs -->
<!-- Spanning data is broken on arbitrary boundaries across segments
it spans -->
<!-- Spanning involves a FIRST, MIDDLE*, LAST segment structure. -->
<!-- MIDDLE* means zero or more MIDDLE segments. -->
<!-- The question: how can we express the transformation into the desired
logical form? Or is this beyond the call of duty for DFDL?
Goals include being as declarative as possible and, ideally, doing it
as a set of XML Schema annotations in the GGF DFDL style. -->
<!-- here's an XSD (untested) for the input data structure -->