Hi,
I have been performing ETL since the early '80s when there
were over 20 different floppy disk formats
and we had to write products like Uniform to copy data from
one format to another.
At MicroPro Int'l (of WordStar fame) I was also involved in
rewriting domestic versions of software
to support Shift-JIS for display and input for the Japanese
market.
More recently I have written and supported ETL software for
the telecommunications sector
(using ATIS XML formats) and for banking, which uses Automated
Clearing House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader,
(BatchHeader, Detail+, BatchTrailer)+, FileTrailer.
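To make that layout concrete, here is a minimal Python sketch that checks whether a sequence of record types follows it. The record type names come from the layout above; the one-letter codes and the function name follows_layout are made up for illustration and are not how real ACH files identify their records.

```python
import re

# Hypothetical one-letter code per record type (an assumption for this
# sketch; real files identify record types in their own way).
CODES = {
    "FileHeader": "F",
    "BatchHeader": "B",
    "Detail": "D",
    "BatchTrailer": "T",
    "FileTrailer": "Z",
}

# FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
LAYOUT = re.compile(r"F(BD+T)+Z")


def follows_layout(record_types):
    """Return True if the list of record type names matches the layout."""
    try:
        encoded = "".join(CODES[name] for name in record_types)
    except KeyError:
        return False  # an unknown record type can never match
    return LAYOUT.fullmatch(encoded) is not None


good = ["FileHeader", "BatchHeader", "Detail", "Detail",
        "BatchTrailer", "FileTrailer"]
empty_batch = ["FileHeader", "BatchHeader", "BatchTrailer", "FileTrailer"]

print(follows_layout(good))         # True
print(follows_layout(empty_batch))  # False: a batch needs at least one Detail
```

Every in-house reader I have seen re-implements some variant of this check by hand, which is the duplication a declarative format description would remove.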
As you can imagine I have seen a lot of duplication of
effort due to the lack of a standard way to
define even the simplest of data formats, let alone the
complex ones.
Hence my interest in DFDL, which started with Defuddle.
I have started to read the recent archives to get a sense of
where the DFDL project stands
now compared to where it was in 2003-2005 when the great DFDL
ice-age began and everyone's
projects (Defuddle, Virtual XML) froze in their tracks.
Also, I have reviewed the recent Core-032.2 document, added
comments to it and have a few
general questions about the current state of all things
DFDL.
1. Re Parsing (input) only - How complete is the current
draft spec in terms of being able
to create schemas and a parser for reading binary files?
Would it support the most common delimited
and header/detail/trailer types of files?
From my limited exposure to the drafts and emails, and my
early use of Defuddle, it certainly seems like
the parsing part is nearly complete and ready to have
implementations created.
2. Will a conforming DFDL processor be required to support
both parsing and unparsing?
I have only needed the parsing direction for most of the ETL
work I have been involved with in
the last several years. Each company I worked for had their
own custom file readers and parsers.
Thus I am very interested in having a product like Defuddle
that can read/parse the basics.
3. Can someone clarify the extent, if any, to which DFDL is
expected to be used to validate data
content as opposed to data structure? This isn't at all
clear anywhere in the spec that I could find.
I added a comment suggesting a statement in either the 'What
DFDL is' or the 'What DFDL is not'
section about this.
My assumption is that if a DFDL schema is used to unparse an
infoset the only guarantee is that the
resulting physical structure will be correct; nothing is
guaranteed about the logical structure.
Consider an example of a file with records having two date
fields: start_date, end_date.
There are two input (parsing) operations that are
potentially useful:
1. Physical - Read the date values into internal elements
(or XML elements) and validate the
presence/absence/nullable state of each
2. Business/Logical - Validate that the end_date is either
null, or if not null is greater than
or equal to the start_date.
Naturally DFDL will support #1 but does it support #2? I
wouldn't think so. The ETL work I do might
not even know what the business rule is and even when we do
we always have to deal with 'dirty' data.
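The distinction between the two input operations can be sketched in Python as follows. Everything about the record here is an assumption made for illustration: a 16-character fixed-width record holding two YYYYMMDD fields, a blank end_date meaning null, a required start_date, and the function names themselves. It is not DFDL and not taken from the spec.

```python
from datetime import datetime


def parse_date(field):
    """Physical step: convert an 8-character YYYYMMDD field to a date.

    An all-blank field is treated as null (None). Anything else that is
    not a valid YYYYMMDD value raises ValueError.
    """
    if field.strip() == "":
        return None
    return datetime.strptime(field, "%Y%m%d").date()


def parse_record(line):
    """Operation #1 (physical): read both fields and check presence."""
    if len(line) != 16:
        raise ValueError("record must be exactly 16 characters")
    start = parse_date(line[0:8])
    end = parse_date(line[8:16])
    if start is None:
        raise ValueError("start_date is required")  # assumed non-nullable
    return {"start_date": start, "end_date": end}


def meets_business_rule(record):
    """Operation #2 (logical): end_date is null or not before start_date."""
    end = record["end_date"]
    return end is None or end >= record["start_date"]


open_ended = parse_record("20050301        ")
dirty = parse_record("2005030120050201")

print(meets_business_rule(open_ended))  # True: end_date is null
print(meets_business_rule(dirty))       # False: ends before it starts
```

Note that the 'dirty' record parses without complaint: it is physically well formed, and only the separate business-rule check flags it. An ETL job may want to load it anyway, which is why I would expect the two checks to stay separate.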
There are two corresponding output (unparsing) operations of
interest:
1. Physical - Write the date values in the proper external
physical format.
2. Business/Logical - Validate that the date values in the
infoset that are to be written meet
the business rule stated above.
Again, I would expect DFDL to support #1 but not #2.
I suggest that some comment or description be added to the
spec to make clear the extent to which
DFDL supports the business/logical aspect of the data.
My concern is that users will be misled into thinking they
can arbitrarily populate an infoset and
then, using a DFDL schema, create an external file that can
be properly used by a native application.
It's one thing to 'roundtrip' data that is sourced from a
native application and quite another to
produce valid native files using a non-native application.
Unless, of course, you have ideas about branching out to BRDL
- Business Rule Definition Language?
Rick Post