Alan,
I looked at these examples.
There's one thing I think you've overlooked in the way
transforms are specified here: intFromBinary knows that it
will pick exactly 4 bytes off the input stream, and could advertise that
property to the DFDL "system" in some way, whereas intFromAscii might take
anything from 1 to however many characters. E.g., it might be able to tolerate
whitespace of any size, leading zeros, etc. So as a transform it needs to
advertise that the length of the data it consumes can only be determined by
running the transform.
Where I'm coming from is this. It is very
important that a DFDL description of data enable processing the data
efficiently. To me that means that if data is all fixed width, then one should
be able to randomly access fields in the data in constant time. Even if the data
is variable width, one should be able to efficiently skip through it to find the
boundaries without necessarily having to process all the data, convert to common
format, etc.
To achieve this, transformations must support determining
length and determining value separately when possible.
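To make the constant-time access point concrete, here is a minimal Python
sketch. The record layout, field names, and widths are hypothetical (they are
not taken from the example files); the only point is that when every width is
known up front, a field's offset is pure arithmetic and no data is touched.

```python
# Hypothetical fixed-width record: (field name, width in bytes).
RECORD_LAYOUT = [("id", 4), ("count", 4), ("name", 12)]


def field_offset(record_index, field_name):
    """Byte offset of a field, computed without reading any data."""
    record_size = sum(width for _, width in RECORD_LAYOUT)
    offset = record_index * record_size
    for name, width in RECORD_LAYOUT:
        if name == field_name:
            return offset
        offset += width
    raise KeyError(field_name)
```

For example, field_offset(2, "count") skips two whole 20-byte records plus the
4-byte "id" field, using only the layout.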
There are these things I call "length protocols":
1) FIXED_LENGTH: the length is static in the meaning of the
type. E.g., 4 byte length is implicit in the type
"int"
2) STATIC_LENGTH: the length is static as part of the
element definition. E.g., 12 digit packed decimal known from the Cobol FD.
Or a string with exactly 12 characters. (Note that we ignore the implications
of variable-width character encodings like UTF-8 here on purpose; more on that
below.)
3) OUTSIDE_LENGTH: the length is dynamic, and comes
from elsewhere. E.g., consider a stored length prefix field. We probably
don't have to touch the data itself to skip past it, though we did have to
read the length field someplace to know how far to skip.
4) PARSE_LENGTH: the length is dynamic, and computing the
length of the element is as hard as computing the value, so you might as well
do them both simultaneously (e.g., the delimited text situation).
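The four protocols above can be sketched roughly as follows in Python. This is
only an illustration under stated assumptions, not a proposed DFDL interface:
the name field_end, the 4-byte width for FIXED_LENGTH, the 2-byte big-endian
length prefix for OUTSIDE_LENGTH, and the plain (unescaped) delimiter scan for
PARSE_LENGTH are all made up for the example.

```python
import struct
from enum import Enum


class LengthProtocol(Enum):
    FIXED_LENGTH = 1    # length is implied by the type itself
    STATIC_LENGTH = 2   # length is given in the element definition
    OUTSIDE_LENGTH = 3  # length is stored elsewhere, e.g. a prefix field
    PARSE_LENGTH = 4    # length is only known by parsing the data


def field_end(data, pos, protocol, static_len=None, delimiter=None):
    """Return the offset just past the field at pos, without converting it."""
    if protocol is LengthProtocol.FIXED_LENGTH:
        return pos + 4  # assumption: a 4-byte binary int
    if protocol is LengthProtocol.STATIC_LENGTH:
        return pos + static_len
    if protocol is LengthProtocol.OUTSIDE_LENGTH:
        # Assumption: a 2-byte big-endian length prefix precedes the data.
        (n,) = struct.unpack_from(">H", data, pos)
        return pos + 2 + n
    # PARSE_LENGTH: the data must be scanned (no escape handling here).
    end = data.index(delimiter, pos)
    return end + len(delimiter)
```

Only the last branch has to look at the field's contents; the first three skip
the field after reading at most a length prefix.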
Now the character set issue. If the character set is fixed
width, like ASCII, EBCDIC, or UTF-16 (setting aside surrogate pairs), then the
above apply as defined. If
the data format is text and the character set is variable width, like UTF-8 or
Shift-JIS, then 1, 2, and 3 all collapse into 4. I.e., all lengths require you
to parse the characters one by one. However, I'd like this detail to be pushed
down into the DFDL implementation because there are different ways to do it.
E.g., you could do as Java does and convert everything to UTF-16 first and
eliminate the whole issue, or you can try to be more clever.
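A small Python sketch of why a variable-width character set forces a scan.
The function names are hypothetical; the lead-byte patterns are standard
UTF-8. With a fixed-width encoding the byte length of n characters is a
multiplication, while with UTF-8 each character's lead byte has to be
inspected in turn.

```python
def fixed_width_byte_length(n_chars, bytes_per_char):
    """Fixed-width encodings: byte length is known without reading data."""
    return n_chars * bytes_per_char


def utf8_byte_length(data, pos, n_chars):
    """UTF-8: bytes occupied by n_chars characters starting at pos."""
    i = pos
    for _ in range(n_chars):
        lead = data[i]
        if lead < 0x80:            # 0xxxxxxx: 1-byte character
            i += 1
        elif lead >> 5 == 0b110:   # 110xxxxx: 2-byte character
            i += 2
        elif lead >> 4 == 0b1110:  # 1110xxxx: 3-byte character
            i += 3
        elif lead >> 3 == 0b11110: # 11110xxx: 4-byte character
            i += 4
        else:
            raise ValueError("invalid UTF-8 lead byte at offset %d" % i)
    return i - pos
```

So a "string with exactly 12 characters" is 12 bytes in ASCII with no reading
at all, but in UTF-8 its byte length is only known after walking the data.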
I think transforms must advertise the protocols they
support. E.g.,
intFromBinary in your example supports only the
FIXED_LENGTH protocol, and it should
say the length is exactly 4.
intFromAscii should support protocols 2, 3, and 4. Only
protocol 4 supports delimiters and their attendant complexities, like how
embedded delimiters might be quoted or escaped. This "transform" function must
compute both an integer value and the length of the data consumed from
the underlying stream, or, by side effect, advance the stream to the new
position. The point is not to take a position on whether we manage
lengths or have a stateful cursor on the stream; the point is that there are 3
functions to provide. One is parameterized by a static length, one is
parameterized by a dynamic length, and the third is parameterized by delimiters,
escape sequence specifications, etc. All share the numbase
parameter.
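The three functions described above could look something like this in Python.
This is a sketch under assumptions, not the definition in dfdltransforms.xsd:
the function names are hypothetical, each returns (value, new position) rather
than advancing a stateful cursor, and the delimited variant leaves out quoting
and escape handling entirely.

```python
def _to_int(raw, numbase):
    # Tolerates surrounding whitespace and leading zeros.
    return int(raw.decode("ascii").strip(), numbase)


def make_int_from_ascii_static(length, numbase=10):
    """Protocol 2: the length is fixed when the element is defined."""
    def parse(data, pos):
        return _to_int(data[pos:pos + length], numbase), pos + length
    return parse


def int_from_ascii_dynamic(data, pos, length, numbase=10):
    """Protocol 3: the length is supplied at run time, e.g. from a prefix."""
    return _to_int(data[pos:pos + length], numbase), pos + length


def int_from_ascii_delimited(data, pos, delimiter, numbase=10):
    """Protocol 4: length and value are found together by scanning."""
    end = data.index(delimiter, pos)
    return _to_int(data[pos:end], numbase), end + len(delimiter)
```

With the first two, a caller that only wants to skip the field can compute
pos + length and never call the conversion; with the third it cannot.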
This all adds baggage, but I think it is necessary or
things just can't be efficient.
...mikeb
Third try... No zip, just the 3 files important to
the simple transform example....
Second try on sending these examples. I've cut the
set down to the 3 important files so hopefully it will get through this
time.
Here is the example I mentioned yesterday. Look
particularly at dfdltransforms.xsd, BasicAsciiIntExp.xsd, and
BasicBinIntExp.xsd. Note the "Exp" on those last two files indicates that they
are expansions of the information in the original versions of those files.
These make a first stab at giving a fully verbose description of the structure
and the transforms, i.e., it’s working towards the canonical representation
Martin talked about yesterday. The "dfdltransforms" gives the definitions of
transforms and their components.
There are lots of things that can be improved
here.
<<dfdl-examples.zip>>
Alan R. Chappell
chappella@battelle.org
Pacific Northwest National Laboratory
Battelle Seattle Research Center
(206) 528-3228