
Tom,

The short answer (if my understanding is correct) is that we started basically as you describe, but Tara has been implementing on-demand reading of data: only the structure is read up front, to create empty parsing classes that get compiled. When you connect that structure to one or more data sources, nothing happens to start. When you ask for an element, you either read or skip (by knowing or calculating their lengths) everything required to find that element's data (which may not simply be all elements above it in the schema) and return the value; see the rough sketch at the end of this message. I don't think we have a mechanism to free an element's memory once you are done with it, though I think we could add one.

We did at least consider StAX, but mostly at the beginning of the Defuddle effort, when StAX was very new, so relative maturity was part of our decision to use JaxMe.

So I think we can do better with the current architecture than a straight fill-the-structure model. But for truly large data, while we might be able to scale, starting directly from a streaming approach might be better, or at least more natural (presumably it fits the model of the surrounding program better).

Jim

At 06:54 AM 6/2/2006, Tom Sugden wrote:
Hi all,
Apologies for missing the telcon this week due to other work pressures. I haven't made much progress with any implementation, but have been taking a look at the Defuddle code. I have some questions that someone may be able to answer, and a few thoughts for discussion.
The current Defuddle implementation is based upon JaxMe, the Java/XML binding implementation. Presumably JaxMe is used to generate an object model representation of the data format described by the DFDL schema, and the underlying data stream would then be unmarshalled into an instance of that object model. Is this understanding correct?
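For reference, the fill-the-structure model looks roughly like this through the JAXB API that JaxMe implements. This is only a sketch; the package name is a placeholder for whatever classes get generated from the schema:

    import java.io.FileInputStream;
    import javax.xml.bind.JAXBContext;
    import javax.xml.bind.Unmarshaller;

    public class BindingExample {
        public static void main(String[] args) throws Exception {
            // Build a context over the generated classes
            // ("org.example.generated" is a placeholder package name).
            JAXBContext ctx = JAXBContext.newInstance("org.example.generated");
            Unmarshaller u = ctx.createUnmarshaller();
            // The whole document is materialized in memory at once.
            Object root = u.unmarshal(new FileInputStream(args[0]));
            System.out.println("root: " + root.getClass().getName());
        }
    }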
If my understanding is correct, I'm concerned that this approach may not be suitable for large data streams, since the entire object model instance would probably have to be assembled and stored in memory, like a DOM tree. Has anybody considered using a streamed pull-parsing approach instead, based upon or similar to StAX (Streaming API for XML)?
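For comparison, a minimal StAX pull-parsing loop over XML looks like the following. The caller drives the parse and only the current event is held in memory, so the memory footprint is independent of document size:

    import java.io.FileInputStream;
    import javax.xml.stream.XMLInputFactory;
    import javax.xml.stream.XMLStreamConstants;
    import javax.xml.stream.XMLStreamReader;

    public class StaxExample {
        public static void main(String[] args) throws Exception {
            XMLInputFactory factory = XMLInputFactory.newInstance();
            XMLStreamReader reader =
                    factory.createXMLStreamReader(new FileInputStream(args[0]));
            while (reader.hasNext()) {
                // Pull the next event; nothing beyond it is buffered.
                if (reader.next() == XMLStreamConstants.START_ELEMENT) {
                    System.out.println("element: " + reader.getLocalName());
                }
            }
            reader.close();
        }
    }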
I was thinking along the lines of parsing the DFDL schema into DOM or some other internal representation, then pull-parsing the data stream, producing a sequence of StAX-like events corresponding to the data in the stream and its structure. During the pull-parsing, the parser would need to maintain context and apply the conversion algorithm to transform parts of the data stream into values of the correct type. These values would then be wrapped in corresponding event objects.
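To make that concrete, the reader's API might look something like the sketch below. All of these names are invented for illustration; I am not suggesting this interface exists anywhere yet:

    // A DFDL-aware pull parser: it walks the raw data stream and hands
    // back StAX-like events whose values have already been converted to
    // typed Java objects according to the schema's conversion rules.
    public interface DfdlStreamReader {
        int START_ELEMENT = 1;
        int VALUE = 2;
        int END_ELEMENT = 3;
        int END_STREAM = 4;

        boolean hasNext();
        int next() throws java.io.IOException; // advance to the next event
        String getName();  // element name from the DFDL schema
        Object getValue(); // value converted to the schema-declared type
    }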
If this approach were viable, then these StAX-like APIs could be used to implement higher-level applications or APIs. For instance, it would be straightforward to produce an XML serialization of any data described by a DFDL schema. One could also imagine binding any data described by a DFDL schema to auto-generated Java beans, or to a DOM object, when desirable. The process might even be reversible, so that data could be written back to a data stream as well as read from one.
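For example, the XML-serialization case could be a simple event pump over the hypothetical DfdlStreamReader sketched above, forwarding each typed event to a standard StAX writer so that arbitrarily large data is serialized without ever building a tree:

    import java.io.OutputStream;
    import javax.xml.stream.XMLOutputFactory;
    import javax.xml.stream.XMLStreamWriter;

    public class DfdlToXml {
        public static void serialize(DfdlStreamReader in, OutputStream out)
                throws Exception {
            XMLStreamWriter w =
                    XMLOutputFactory.newInstance().createXMLStreamWriter(out);
            w.writeStartDocument();
            while (in.hasNext()) {
                switch (in.next()) {
                    case DfdlStreamReader.START_ELEMENT:
                        w.writeStartElement(in.getName());
                        break;
                    case DfdlStreamReader.VALUE:
                        w.writeCharacters(String.valueOf(in.getValue()));
                        break;
                    case DfdlStreamReader.END_ELEMENT:
                        w.writeEndElement();
                        break;
                }
            }
            w.writeeEndDocument();
            w.flush();
        }
    }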
I haven't thought this through very deeply yet and my understanding of the issues is still quite naive, so I will be very interested to hear any comments. Sorry if this avenue has already been explored, or if I've misunderstood the mechanics of Defuddle or JaxMe.
Cheers, Tom
James D. Myers
Associate Director, Cyberenvironments and Technologies, NCSA
1205 W. Clark St, MC-257
Urbana, IL 61801
217-244-1934
jimmyers@ncsa.uiuc.edu
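P.S. Here is the rough sketch of the read-or-skip idea mentioned above. All names are invented for illustration, and a real implementation would track its current position and compute element offsets and lengths from the schema rather than taking a byte offset as a parameter:

    import java.io.DataInputStream;
    import java.io.IOException;
    import java.io.InputStream;

    public class LazyElementReader {
        private final DataInputStream in;

        public LazyElementReader(InputStream source) {
            this.in = new DataInputStream(source);
        }

        // Skip everything before the requested element (lengths are known
        // or calculable, so skipped elements are never parsed), then read
        // and return only the value that was asked for.
        public int readInt32At(long byteOffset) throws IOException {
            long remaining = byteOffset;
            while (remaining > 0) {
                long skipped = in.skip(remaining);
                if (skipped <= 0) {
                    throw new IOException("unexpected end of stream");
                }
                remaining -= skipped;
            }
            return in.readInt(); // read only the element we want
        }
    }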