
Great to hear from you, Jim. So Defuddle is alive still! I hope we can push it towards agreement with the current draft, and then, as you suggested, it can play a role in supporting the needed experiments in layering. I concur that much complexity in DFDL would be better as a library on top of an extensible core, thereby allowing the standard to be decomposed better. I hope Defuddle can help us figure that out.

We've dropped extensibility and most of layering from v1.0 of DFDL only for lack of examples from which to standardize. We have retained little bits of it, e.g., hidden elements, and we recently added something which might be called "generalized markup", where you can specify a type name (restricted to simple types for now) as the delimiter, and an instance of that type can be used as the initiator, terminator, or separator of elements. This seems to be core: a speculative parser needs syntax to go after, otherwise it ends up non-deterministic, but we can generalize what that syntax is, and so get a great deal more generality. We did this to avoid the proliferation of property keywords that is otherwise required for every variation on separators. E.g., can I use a regexp to define what a delimiter looks like? Yes: you set up a simple type which is a string matching a regexp, and use that as the delimiter.

I have been interested for a while in something I call "core DFDL", which is the smallest set of features from which one can bootstrap the rest. Of course this requires the ability to define new properties and data types. I think about this in terms of having only one built-in type, dfdl:bit, plus the expression language. This sounds appealing, but in thinking about it, lots of time gets spent worrying about how to synthesize character sets from bytes, numbers from bytes, etc., rather than the thorny stuff we really need extensibility for, which is the obscure delimiter and nullability features and things like "finalTerminatorCanBeMissing", where the descriptions in English are complicated and a bootstrap would be a better characterization. I think the generalized markup described above may prove to be core. I.e., a sequence of elements with delimiters naturally implies a parser that has to search for delimiters, but what those delimiters are can be quite general.

...mikeb

On Jun 8, 2008, at 1:37 PM, JimMyers <jimmyers@ncsa.uiuc.edu> wrote:
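As a sketch of the generalized-delimiter idea above: the parser only needs some concrete syntax to search for, and that syntax can be as general as a regular expression. The following Python is purely illustrative (`parse_delimited` is an invented helper, not DFDL syntax):

```python
import re
from typing import List

def parse_delimited(data: str, separator_pattern: str) -> List[str]:
    """Split `data` into fields, treating any regex as the separator.

    Illustrates the point that a speculative parser just needs concrete
    syntax to go after; exactly what that syntax is (a literal, a
    character class, a full regexp) can be generalized.
    """
    sep = re.compile(separator_pattern)
    fields, pos = [], 0
    while True:
        m = sep.search(data, pos)
        if m is None:
            fields.append(data[pos:])  # last field has no trailing separator
            return fields
        fields.append(data[pos:m.start()])
        pos = m.end()

# A separator defined as "a comma or semicolon, optionally padded by spaces":
# the kind of variation that would otherwise need its own property keyword.
print(parse_delimited("a , b; c", r"\s*[,;]\s*"))
```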
Rick,
A quick update from NCSA on Defuddle: after we got things started at PNNL and Tara Talbott created the initial version, we did indeed have an 'ice age' with no direct funding for it. Last year we received a small amount of funding at NCSA (where I moved to) from NARA to start Defuddle moving again and to incorporate it into the EU SHAMAN project's digital preservation architecture (lots of things in SHAMAN but the relevant idea here is the iRODS storage broker calling Defuddle to map things to logical models and the Multivalent Browser viewing the logical model).
Due in part to the lack of funding, we've dropped out of the DFDL discussion for a while. Probably the most significant issues where Defuddle doesn't match the draft spec are that, I believe, the idea of layers has been dropped from the version 1 spec plan (which we think is critical), and that there has been a lot of work on the spec to deal with nilability, etc. (some of which I've argued can be avoided if you have layers).
In the next year+, we're rebuilding Defuddle on the latest libraries, doing a bunch of scalability and stress testing, targeting some common file formats (e.g., PNG) to show it works beyond the relatively simple scientific formats we've done to date, exploring uses in the digital library/preservation communities, and looking to extend from XML modeling to RDF/semantic modeling.
The last one of these starts to go in the direction of your question #2 - if you get to RDF, you can start using OWL/rule constraints to assure that the data coming out/going in is semantically what you want (the DFDL group has discussed the backward direction, but at least for Defuddle it is not yet in our actual development plans). Our initial thought for doing this is to just include a GRDDL annotation in the DFDL file which tells you how to map (via XSLT, for example) from the XML created by the current Defuddle to the RDF you want - in essence going from binary to XML and from XML to an RDF logical model in two sequential steps. We haven't thought as deeply about semantic validation as the DFDL group has about XML-level validation (i.e., the discussion of errors that can be caught at the time of format definition, versus XML schema constraint violations on parsing, versus parsing errors themselves, etc.), but I think getting to RDF will allow a lot of what you're talking about with standard semantic web tools.
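To make the two-step pipeline concrete: step 1 (binary to XML) is what Defuddle already does, and step 2 maps that XML logical model to RDF. Jim's note proposes doing step 2 with an XSLT named by a GRDDL annotation; the plain-Python sketch below (with made-up element names and URIs) just illustrates the shape of the mapping:

```python
import xml.etree.ElementTree as ET

# Hypothetical output of step 1 (Defuddle's binary-to-XML parse):
defuddle_xml = """
<record id="r1">
  <start_date>2008-01-01</start_date>
  <end_date>2008-06-30</end_date>
</record>
"""

def xml_to_ntriples(xml_text: str, base: str = "http://example.org/") -> list:
    """Step 2: map the XML logical model to RDF triples (N-Triples text).

    In the actual proposal this mapping would be an XSLT referenced by a
    GRDDL annotation in the DFDL file; plain Python is used here only
    for brevity. All URIs are invented for the example.
    """
    root = ET.fromstring(xml_text)
    subject = f"<{base}{root.get('id')}>"
    triples = []
    for child in root:  # one triple per child element
        predicate = f"<{base}vocab#{child.tag}>"
        triples.append(f'{subject} {predicate} "{child.text}" .')
    return triples

for t in xml_to_ntriples(defuddle_xml):
    print(t)
```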
As we get going on Defuddle again, I hope to get reconnected with the DFDL effort - I've mostly been lurking on the list the past year + - and see how we can help without being disruptive (perhaps looking at post 1.0 changes?).
In any case, I wanted to respond to your question and give a 'what's new' report back to the group since I've been quiet a while. For Defuddle, while I still think we need to grow, things are looking up with a very interested sponsor, international collaboration, and some momentum after the 'ice age'. (As always, I'd be very happy to talk with anyone who'd like to get involved in Defuddle (software development or creating format descriptions and using it, etc.) - Defuddle is open source and I know everyone involved to date would really like to see it become community (versus project) driven.)
Cheers,
Jim
James D. Myers, Ph.D. Associate Director, Cyberenvironments National Center for Supercomputing Applications University of Illinois at Urbana Champaign 1205 W Clark St. Urbana, IL 61801 217-244-1934
----- "RPost" <rp0428@pacbell.net> wrote:
Hi,
I have been performing ETL since the early '80s when there were over 20 different floppy disk formats
and we had to write products like Uniform to copy data from one format to another.
At MicroPro Int'l (of WordStar fame) I was also involved in rewriting domestic versions of software
to support Shift-JIS for display and input for the Japanese market.
More recently I have written and supported ETL software for the telecommunications sector
(using ATIS XML formats) and for banking, which uses Automated Clearing House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer.
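That grammar lends itself to a simple recursive-descent reader. A hedged Python sketch follows; the 'FH'/'BH'/'D'/'BT'/'FT' record tags are invented markers for illustration, not the real ACH record-type codes:

```python
from typing import List, Tuple

def parse_ach_like(lines: List[str]) -> Tuple[str, List[dict], str]:
    """Recursive-descent reader for the grammar:
        FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
    Record tags ('FH', 'BH', 'D', 'BT', 'FT') are made up for the example.
    """
    pos = 0

    def expect(tag: str) -> str:
        nonlocal pos
        if pos >= len(lines) or not lines[pos].startswith(tag):
            raise ValueError(f"expected {tag} at record {pos}")
        rec = lines[pos]
        pos += 1
        return rec

    file_header = expect("FH")
    batches = []
    while pos < len(lines) and lines[pos].startswith("BH"):
        header = expect("BH")
        details = [expect("D")]  # Detail+ requires at least one
        while pos < len(lines) and lines[pos].startswith("D"):
            details.append(expect("D"))
        batches.append({"header": header,
                        "details": details,
                        "trailer": expect("BT")})
    if not batches:
        raise ValueError("expected at least one batch")  # the '+' on the group
    file_trailer = expect("FT")
    return file_header, batches, file_trailer
```

Each `expect` call enforces one terminal of the grammar, so a malformed file fails with the record index where the structure first diverges.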
As you can imagine I have seen a lot of duplication of effort due to the lack of a standard way to
define even the simplest of data formats, let alone the complex ones.
Hence my interest in DFDL, which started with Defuddle.
I have started to read the recent archives to get a sense of where the DFDL project stands
now compared to where it was in 2003-2005 when the great DFDL ice-age began and everyone's
projects (Defuddle, Virtual XML) froze in their tracks.
Also, I have reviewed the recent Core-032.2 document, added comments to it and have a few
general questions about the current state of all things DFDL.
1. Re Parsing (input) only - How complete is the current draft spec in terms of being able
to create schemas and a parser for reading binary files? Would it support the most common delimited
and header/detail/trailer types of files?
From my limited exposure to the drafts and emails, and my early use of Defuddle, it certainly seems like
the parsing part is nearly complete and ready to have implementations created.
2. Will a conforming DFDL processor be required to support both parsing and unparsing?
I have only needed the parsing direction for most of the ETL work I have been involved with in
the last several years. Each company I worked for had its own custom file readers and parsers.
Thus I am very interested in having a product like Defuddle that can read/parse the basics.
3. Can someone clarify the extent, if any, to which DFDL is expected to be used to validate data
content as opposed to data structure? This isn't at all clear anywhere in the spec that I could find.
I added a comment suggesting a statement in either the 'What DFDL is' or the 'What DFDL is not'
section about this.
My assumption is that if a DFDL schema is used to unparse an infoset, the only guarantee is that the
resulting physical structure will be correct, not that the logical content is valid.
Consider an example of a file with records having two date fields: start_date, end_date.
There are two input (parsing) operations that are potentially useful:
1. Physical - Read the date values into internal elements (or XML elements) and validate the
presence/absence/nullable state of each
2. Business/Logical - Validate that the end_date is either null, or if not null is greater than
or equal to the start_date.
Naturally DFDL will support #1, but does it support #2? I wouldn't think so. In the ETL work I do we might
not even know what the business rule is, and even when we do, we always have to deal with 'dirty' data.
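For concreteness, the business rule in #2 is trivial to state in code; the point is that it sits on top of the physical parse rather than inside it. A minimal sketch (the function name is invented):

```python
from datetime import date
from typing import Optional

def check_date_rule(start_date: date, end_date: Optional[date]) -> bool:
    """The example business rule: end_date is either null (None) or,
    if present, greater than or equal to start_date.

    Physical parsing (#1) only establishes the presence/absence/null
    state and format of each field; this check is a separate,
    logical-level step applied to the parsed values.
    """
    return end_date is None or end_date >= start_date
```

With 'dirty' data, a failing check would typically route the record to an error file rather than abort the parse.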
There are two corresponding output (unparsing) operations of interest:
1. Physical - Write the date values in the proper external physical format.
2. Business/Logical - Validate that the date values in the infoset that are to be written meet
the business rule stated above.
Again, I would expect DFDL to support #1 but not #2.
I suggest that some comment or description be added to the spec to make clear the extent to which
DFDL supports the business/logical aspect of the data.
My concern is that users will be misled into thinking they can arbitrarily populate an infoset and
then, using a DFDL schema, create an external file that can be properly used by a native application.
It's one thing to 'roundtrip' data that is sourced from a native application and quite another to
produce valid native files using a non-native application.
Unless, of course, you have ideas about branching out to BRDL - Business Rule Definition Language?
Rick Post
-- dfdl-wg mailing list dfdl-wg@ogf.org http://www.ogf.org/mailman/listinfo/dfdl-wg