
Tom,

The short answer (if my understanding is correct) is that we started basically as you describe, but Tara has been implementing on-demand reading of data: only the structure is read up front, to create empty parsing classes that get compiled. When you connect that structure to one or more data sources, nothing happens to start. When you ask for an element, you either read or skip (by knowing or calculating their lengths) everything required to find that element's data (which may not simply be all elements above it in the schema) and return the value; see the rough sketch at the end of this message. I don't think we have a mechanism to free an element's memory once you are done with it, though I think we could add one.

We did at least consider StAX, but mostly at the beginning of the Defuddle effort, when StAX was very new, so relative maturity was part of our decision to use JaxMe.

So I think we can do better with the current architecture than a straight fill-the-structure model. But for truly large data, while we might be able to scale, starting directly from a streaming approach might be better, or at least more natural (presumably it fits the model of the surrounding program better).

Jim

At 06:54 AM 6/2/2006, Tom Sugden wrote:
Hi all,
Apologies for missing the telcon this week due to other work pressures. I haven't made much progress with any implementation, but have been taking a look at the Defuddle code. I have some questions that someone may be able to answer, and a few thoughts for discussion.
The current Defuddle implementation is based upon JaxMe, the Java/XML binding implementation. Presumably JaxMe is used to generate an object model representation of the data format described by the DFDL schema, and the underlying data stream would then be unmarshalled into an instance of that object model. Is this understanding correct?
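For reference, the fill-the-structure model looks roughly like this through the JAXB API that JaxMe implements. This is only a sketch; the package name is a placeholder for whatever classes get generated from the schema:

    import java.io.FileInputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;

    public class BindingExample {
        public static void main(String[] args) throws Exception {
            // Build a context over the generated classes
            // ("org.example.generated" is a placeholder package name).
            JAXBContext ctx = JAXBContext.newInstance("org.example.generated");
            Unmarshaller u = ctx.createUnmarshaller();
            // The whole document is materialized in memory at once.
            Object root = u.unmarshal(new FileInputStream(args[0]));
            System.out.println("root: " + root.getClass().getName());
        }
    }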
If my understanding is correct, I'm concerned that this approach may not be suitable for large data streams, since the entire object model instance would probably have to be assembled and stored in memory, like a DOM tree. Has anybody considered using a streamed pull-parsing approach instead, based upon or similar to StAX (Streaming API for XML)?
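For comparison, a minimal StAX pull-parsing loop over XML looks like the following. The caller drives the parse and only the current event is held in memory, so the memory footprint is independent of document size:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (reader.hasNext()) {
                // Pull the next event; nothing beyond it is buffered.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }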
I was thinking along the lines of parsing the DFDL schema into DOM or some other internal representation, then pull-parsing the data stream, producing a sequence of StAX-like events corresponding to the data in the stream and its structure. During the pull-parsing, the parser would need to maintain context and apply the conversion algorithm to transform parts of the data stream into values of the correct type. These values would then be wrapped in corresponding event objects.
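To make that concrete, the reader's API might look something like the sketch below. All of these names are invented for illustration; I am not suggesting this interface exists anywhere yet:

    // A DFDL-aware pull parser: it walks the raw data stream and hands
    // back StAX-like events whose values have already been converted to
    // typed Java objects according to the schema's conversion rules.
    public interface DfdlStreamReader {
        int START_ELEMENT = 1;
        int VALUE = 2;
        int END_ELEMENT = 3;
        int END_STREAM = 4;

        boolean hasNext();
        int next() throws java.io.IOException; // advance to the next event
        String getName();  // element name from the DFDL schema
        Object getValue(); // value converted to the schema-declared type
    }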
If this approach were viable, then these StAX-like APIs could be used to implement higher-level applications or APIs. For instance, it would be straightforward to produce an XML serialization of any data described by a DFDL schema. One could also imagine binding any data described by a DFDL schema to auto-generated Java beans, or to a DOM object, when desirable. The process might even be reversible, so that data could be written back to a data stream as well as read from one.
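For example, the XML-serialization case could be a simple event pump over the hypothetical DfdlStreamReader sketched above, forwarding each typed event to a standard StAX writer so that arbitrarily large data is serialized without ever building a tree:

    import java.io.OutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class DfdlToXml {
        public static void serialize(DfdlStreamReader in, OutputStream out)
                throws Exception {
            XMLStreamWriter w =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            while (in.hasNext()) {
                switch (in.next()) {
                    case DfdlStreamReader.START_ELEMENT:
                        w.writeStartElement(in.getName());
                        break;
                    case DfdlStreamReader.VALUE:
                        w.writeCharacters(String.valueOf(in.getValue()));
                        break;
                    case DfdlStreamReader.END_ELEMENT:
                        w.writeEndElement();
                        break;
                }
            }
            w.writeeEndDocument();
            w.flush();
        }
    }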
I haven't thought this through very deeply yet and my understanding of the issues is still quite naive, so I will be very interested to hear any comments. Sorry if this avenue has already been explored, or if I've misunderstood the mechanics of Defuddle or JaxMe.
Cheers, Tom
James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
1205 W. Clark St, MC-257
Urbana, IL 61801
217-244-1934
jimmyers@ncsa.uiuc.edu
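P.S. Here is the rough sketch of the read-or-skip idea mentioned above. All names are invented for illustration, and a real implementation would track its current position and compute element offsets and lengths from the schema rather than taking a byte offset as a parameter:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class LazyElementReader {
        private final DataInputStream in;

        public LazyElementReader(InputStream source) {
            this.in = new DataInputStream(source);
        }

        // Skip everything before the requested element (lengths are known
        // or calculable, so skipped elements are never parsed), then read
        // and return only the value that was asked for.
        public int readInt32At(long byteOffset) throws IOException {
            long remaining = byteOffset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("unexpected end of stream");
                }
                remaining -= skipped;
            }
            return in.readInt(); // read only the element we want
        }
    }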