
Great to hear from you, Jim. So Defuddle is alive still! I hope we can push it towards agreement with the current draft, and then, as you suggested, it can play a role in supporting the needed experiments in layering. I concur that much complexity in DFDL would be better as a library on top of an extensible core, thereby allowing the standard to be decomposed better. I hope Defuddle can help us figure that out.

We've dropped extensibility and most of layering from v1.0 of DFDL only for lack of examples from which to standardize. We have retained little bits of it, e.g., hidden elements, and we recently added something which might be called "generalized markup", where you can specify a type name (restricted to simple types for now) as the delimiter, and an instance of that type can be used as the initiator, terminator, or separator of elements. This seems to be core: a speculative parser needs syntax to go after, otherwise it ends up non-deterministic, but we can generalize what that syntax is, and so get a great deal more generality. We did this to avoid the proliferation of property keywords that is otherwise required for every variation on separators. E.g., can I use a regexp to define what a delimiter looks like? Yes: you set up a simple type which is a string matching a regexp, and use that as the delimiter.

I have been interested for a while in something I call "core DFDL", which is the smallest set of features from which one can bootstrap the rest. Of course this requires the ability to define new properties and data types. I think about this in terms of having only one built-in type, dfdl:bit, plus the expression language. This sounds appealing, but in thinking about it, lots of time gets spent worrying about how to synthesize character sets from bytes, numbers from bytes, etc., rather than the thorny stuff we really need extensibility for, which is the obscure delimiter and nullability features and things like "finalTerminatorCanBeMissing", where the descriptions in English are complicated and a bootstrap would be a better characterization. I think the generalized markup described above may prove to be core. I.e., a sequence of elements with delimiters naturally implies a parser that has to search for delimiters, but what those delimiters are can be quite general.

...mikeb

On Jun 8, 2008, at 1:37 PM, JimMyers <jimmyers@ncsa.uiuc.edu> wrote:
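As a sketch of the generalized-delimiter idea above: the parser only needs some concrete syntax to search for, and that syntax can be as general as a regular expression. The following Python is purely illustrative (`parse_delimited` is an invented helper, not DFDL syntax):

```python
import re
from typing import List

def parse_delimited(data: str, separator_pattern: str) -> List[str]:
    """Split `data` into fields, treating any regex as the separator.

    Illustrates the point that a speculative parser just needs concrete
    syntax to go after; exactly what that syntax is (a literal, a
    character class, a full regexp) can be generalized.
    """
    sep = re.compile(separator_pattern)
    fields, pos = [], 0
    while True:
        m = sep.search(data, pos)
        if m is None:
            fields.append(data[pos:])  # last field has no trailing separator
            return fields
        fields.append(data[pos:m.start()])
        pos = m.end()

# A separator defined as "a comma or semicolon, optionally padded by spaces":
# the kind of variation that would otherwise need its own property keyword.
print(parse_delimited("a , b; c", r"\s*[,;]\s*"))
```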
Rick,
A quick update from NCSA on Defuddle: after we got things started at PNNL and Tara Talbott created the initial version, we did indeed have an 'ice age' with no direct funding for it. Last year we received a small amount of funding at NCSA (where I moved to) from NARA to start Defuddle moving again and to incorporate it into the EU SHAMAN project's digital preservation architecture (lots of things in SHAMAN but the relevant idea here is the iRODS storage broker calling Defuddle to map things to logical models and the Multivalent Browser viewing the logical model).
Due in part to the lack of funding, we've dropped out of the DFDL discussion for a while. Probably the most significant issues where Defuddle doesn't match the draft spec are that, I believe, the idea of layers has been dropped from the version 1 spec plan (which we think is critical), and that there has been a lot of work on the spec to deal with nilability, etc. (some of which I've argued can be avoided if you have layers).
In the next year+, we're rebuilding Defuddle on the latest libraries, doing a bunch of scalability and stress testing, targeting some common file formats (e.g., PNG) to show it works beyond the relatively simple scientific formats we've done to date, exploring uses in the digital library/preservation communities, and looking to extend from XML modeling to RDF/semantic modeling.
The last one of these starts to go in the direction of your question #2 - if you get to RDF, you can start using OWL/rule constraints to assure that the data coming out/going in is semantically what you want (the DFDL group has discussed the backward direction, but at least for Defuddle it is not yet in our actual development plans). Our initial thought for doing this is to just include a GRDDL annotation in the DFDL file which tells you how to map (via XSLT, for example) from the XML created by the current Defuddle to the RDF you want - in essence going from binary to XML and from XML to an RDF logical model in two sequential steps. We haven't thought as deeply about semantic validation as the DFDL group has about XML-level validation (i.e., the discussion of errors that can be caught at the time of format definition, versus XML schema constraint violations on parsing, versus parsing errors themselves, etc.), but I think getting to RDF will allow a lot of what you're talking about with standard semantic web tools.
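To make the two-step pipeline concrete: step 1 (binary to XML) is what Defuddle already does, and step 2 maps that XML logical model to RDF. Jim's note proposes doing step 2 with an XSLT named by a GRDDL annotation; the plain-Python sketch below (with made-up element names and URIs) just illustrates the shape of the mapping:

```python
import xml.etree.ElementTree as ET

# Hypothetical output of step 1 (Defuddle's binary-to-XML parse):
defuddle_xml = """
<record id="r1">
  <start_date>2008-01-01</start_date>
  <end_date>2008-06-30</end_date>
</record>
"""

def xml_to_ntriples(xml_text: str, base: str = "http://example.org/") -> list:
    """Step 2: map the XML logical model to RDF triples (N-Triples text).

    In the actual proposal this mapping would be an XSLT referenced by a
    GRDDL annotation in the DFDL file; plain Python is used here only
    for brevity. All URIs are invented for the example.
    """
    root = ET.fromstring(xml_text)
    subject = f"<{base}{root.get('id')}>"
    triples = []
    for child in root:  # one triple per child element
        predicate = f"<{base}vocab#{child.tag}>"
        triples.append(f'{subject} {predicate} "{child.text}" .')
    return triples

for t in xml_to_ntriples(defuddle_xml):
    print(t)
```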
As we get going on Defuddle again, I hope to get reconnected with the DFDL effort - I've mostly been lurking on the list the past year + - and see how we can help without being disruptive (perhaps looking at post 1.0 changes?).
In any case, I wanted to respond to your question and give a 'what's new' report back to the group since I've been quiet a while. For Defuddle, while I still think we need to grow, things are looking up with a very interested sponsor, international collaboration, and some momentum after the 'ice age'. (As always, I'd be very happy to talk with anyone who'd like to get involved in Defuddle (software development or creating format descriptions and using it, etc.) - Defuddle is open source and I know everyone involved to date would really like to see it become community (versus project) driven.)
Cheers,
Jim
James D. Myers, Ph.D. Associate Director, Cyberenvironments National Center for Supercomputing Applications University of Illinois at Urbana Champaign 1205 W Clark St. Urbana, IL 61801 217-244-1934
----- "RPost" <rp0428@pacbell.net> wrote:
Hi,
I have been performing ETL since the early '80s when there were over 20 different floppy disk formats
and we had to write products like Uniform to copy data from one format to another.
At MicroPro Int'l (of WordStar fame) I was also involved in rewriting domestic versions of software
to support Shift-JIS for display and input for the Japanese market.
More recently I have written and supported ETL software for the telecommunications sector
(using ATIS XML formats) and for banking, which uses Automated Clearing House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer.
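That grammar lends itself to a simple recursive-descent reader. A hedged Python sketch follows; the 'FH'/'BH'/'D'/'BT'/'FT' record tags are invented markers for illustration, not the real ACH record-type codes:

```python
from typing import List, Tuple

def parse_ach_like(lines: List[str]) -> Tuple[str, List[dict], str]:
    """Recursive-descent reader for the grammar:
        FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
    Record tags ('FH', 'BH', 'D', 'BT', 'FT') are made up for the example.
    """
    pos = 0

    def expect(tag: str) -> str:
        nonlocal pos
        if pos >= len(lines) or not lines[pos].startswith(tag):
            raise ValueError(f"expected {tag} at record {pos}")
        rec = lines[pos]
        pos += 1
        return rec

    file_header = expect("FH")
    batches = []
    while pos < len(lines) and lines[pos].startswith("BH"):
        header = expect("BH")
        details = [expect("D")]  # Detail+ requires at least one
        while pos < len(lines) and lines[pos].startswith("D"):
            details.append(expect("D"))
        batches.append({"header": header,
                        "details": details,
                        "trailer": expect("BT")})
    if not batches:
        raise ValueError("expected at least one batch")  # the '+' on the group
    file_trailer = expect("FT")
    return file_header, batches, file_trailer
```

Each `expect` call enforces one terminal of the grammar, so a malformed file fails with the record index where the structure first diverges.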
As you can imagine I have seen a lot of duplication of effort due to the lack of a standard way to
define even the simplest of data formats, let alone the complex ones.
Hence my interest in DFDL, which started with Defuddle.
I have started to read the recent archives to get a sense of where the DFDL project stands
now compared to where it was in 2003-2005 when the great DFDL ice-age began and everyone's
projects (Defuddle, Virtual XML) froze in their tracks.
Also, I have reviewed the recent Core-032.2 document, added comments to it and have a few
general questions about the current state of all things DFDL.
1. Re Parsing (input) only - How complete is the current draft spec in terms of being able
to create schemas and a parser for reading binary files? Would it support the most common delimited
and header/detail/trailer types of files?
From my limited exposure to the drafts and emails, and my early use of Defuddle, it certainly seems like
the parsing part is nearly complete and ready to have implementations created.
2. Will a conforming DFDL processor be required to support both parsing and unparsing?
I have only needed the parsing direction for most of the ETL work I have been involved with in
the last several years. Each company I worked for had its own custom file readers and parsers.
Thus I am very interested in having a product like Defuddle that can read/parse the basics.
3. Can someone clarify the extent, if any, to which DFDL is expected to be used to validate data
content as opposed to data structure? This isn't at all clear anywhere in the spec that I could find.
I added a comment suggesting a statement in either the 'What DFDL is' or the 'What DFDL is not'
section about this.
My assumption is that if a DFDL schema is used to unparse an infoset, the only guarantee is that the
resulting physical structure will be correct, not that the logical content is valid.
Consider an example of a file with records having two date fields: start_date, end_date.
There are two input (parsing) operations that are potentially useful:
1. Physical - Read the date values into internal elements (or XML elements) and validate the
presence/absence/nullable state of each
2. Business/Logical - Validate that the end_date is either null, or if not null is greater than
or equal to the start_date.
Naturally DFDL will support #1, but does it support #2? I wouldn't think so. In the ETL work I do we might
not even know what the business rule is, and even when we do, we always have to deal with 'dirty' data.
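For concreteness, the business rule in #2 is trivial to state in code; the point is that it sits on top of the physical parse rather than inside it. A minimal sketch (the function name is invented):

```python
from datetime import date
from typing import Optional

def check_date_rule(start_date: date, end_date: Optional[date]) -> bool:
    """The example business rule: end_date is either null (None) or,
    if present, greater than or equal to start_date.

    Physical parsing (#1) only establishes the presence/absence/null
    state and format of each field; this check is a separate,
    logical-level step applied to the parsed values.
    """
    return end_date is None or end_date >= start_date
```

With 'dirty' data, a failing check would typically route the record to an error file rather than abort the parse.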
There are two corresponding output (unparsing) operations of interest:
1. Physical - Write the date values in the proper external physical format.
2. Business/Logical - Validate that the date values in the infoset that are to be written meet
the business rule stated above.
Again, I would expect DFDL to support #1 but not #2.
I suggest that some comment or description be added to the spec to make clear the extent to which
DFDL supports the business/logical aspect of the data.
My concern is that users will be misled into thinking they can arbitrarily populate an infoset and
then, using a DFDL schema, create an external file that can be properly used by a native application.
It's one thing to 'roundtrip' data that is sourced from a native application and quite another to
produce valid native files using a non-native application.
Unless, of course, you have ideas about branching out to BRDL - Business Rule Definition Language?
Rick Post
-- dfdl-wg mailing list dfdl-wg@ogf.org http://www.ogf.org/mailman/listinfo/dfdl-wg