Rick,
Thanks for reviewing the DFDL document.
I will get back to you with responses to your detailed comments.
In the meantime I have added answers
to your questions below.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898
From: "RPost" <rp0428@pacbell.net>
To: <dfdl-wg@ogf.org>
Date: 07/06/2008 23:42
Subject: [DFDL-WG] Comments on draft 32 of DFDL spec
Hi,
I have been performing ETL since the early
'80s when there were over 20 different floppy disk formats
and we had to write products like Uniform
to copy data from one format to another.
At MicroPro Int'l (of WordStar fame) I was
also involved in rewriting domestic versions of software
to support Shift-JIS for display and input
for the Japanese market.
More recently I have written and supported
ETL software for the telecommunications sector
(using ATIS XML formats) and for banking, which
uses Automated Clearing House (ACH) XML standards.
ACH uses a lot of files with a format: FileHeader,
(BatchHeader, Detail+, BatchTrailer)+, FileTrailer.
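To make that layout concrete, here is a minimal sketch that checks whether a sequence of record kinds follows FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer. The one-letter tags and the function name are hypothetical and chosen only for illustration; they are not taken from the ACH standard or from DFDL.

```python
import re

# Hypothetical one-letter tags for each record kind (illustration only).
TAGS = {
    "FileHeader": "F",
    "BatchHeader": "B",
    "Detail": "D",
    "BatchTrailer": "T",
    "FileTrailer": "E",
}

# FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
LAYOUT = re.compile(r"F(?:BD+T)+E")


def has_valid_layout(record_kinds):
    """Return True if the sequence of record kinds matches the layout."""
    try:
        tags = "".join(TAGS[kind] for kind in record_kinds)
    except KeyError:
        return False  # an unknown record kind can never match
    return LAYOUT.fullmatch(tags) is not None
```

This checks only the ordering of records, not the contents of any individual record.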
As you can imagine I have seen a lot of duplication
of effort due to the lack of a standard way to
define even the simplest of data formats,
let alone the complex ones.
Hence my interest in DFDL, which started
with Defuddle.
I have started to read the recent archives
to get a sense of where the DFDL project stands
now compared to where it was in 2003-2005
when the great DFDL ice-age began and everyone's
projects (Defuddle, Virtual XML) froze in
their tracks.
Also, I have reviewed the recent Core-032.2
document, added comments to it and have a few
general questions about the current state
of all things DFDL.
1. Re Parsing (input) only - How complete
is the current draft spec in terms of being able
to create schemas and a parser for reading
binary files? Would it support the most common delimited
and header/detail/trailer types of files?
From my limited exposure to the drafts and
emails, and my early use of Defuddle, it certainly seems like
the parsing part is nearly complete and ready
to have implementations created.
<< AWP >> We believe that the
current spec is able to deal with most common commercial formats.
2. Will a conforming DFDL processor be required
to support both parsing and unparsing?
I have only needed the parsing direction
for most of the ETL work I have been involved with in
the last several years. Each company I worked
for had their own custom file readers and parsers.
Thus I am very interested in having a product
like Defuddle that can read/parse the basics.
<< AWP >> Good question. We have
been assuming that both parsing and unparsing would be required.
3. Can someone clarify the extent, if any,
to which DFDL is expected to be used to validate data
content as opposed to data structure? This
isn't at all clear anywhere in the spec that I could find.
I added a comment suggesting a statement
in either the 'What DFDL is' or the 'What DFDL is not'
section about this.
My assumption is that if a DFDL schema is
used to unparse an infoset the only guarantee is that the
resulting physical structure will be correct
and not the logical structure.
<< AWP >> We have distinguished
between parsing/unparsing and validation and assumed that validation can
be turned off. The validation that is performed is that which is definable
using schema constructs such as enumerations, min/max occurs, min/max length,
etc. More complex validation, such as cross field validation, is outside
the scope of DFDL and would require something such as Schematron to validate
the infoset.
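As a rough illustration of the kind of per-element checks described above (enumerations, min/max occurs, min/max length), here is a small sketch. The function and parameter names are hypothetical; this is not DFDL or XML Schema syntax, only an approximation of what such facets test.

```python
def check_facets(values, *, enumeration=None, min_length=0, max_length=None,
                 min_occurs=0, max_occurs=None):
    """Return a list of violation messages for one repeated string element."""
    problems = []

    # Occurrence checks apply to the element as a whole.
    count = len(values)
    if count < min_occurs:
        problems.append(f"occurs {count} times, fewer than {min_occurs}")
    if max_occurs is not None and count > max_occurs:
        problems.append(f"occurs {count} times, more than {max_occurs}")

    # Enumeration and length checks apply to each value on its own.
    for value in values:
        if enumeration is not None and value not in enumeration:
            problems.append(f"{value!r} is not an allowed value")
        if len(value) < min_length:
            problems.append(f"{value!r} is shorter than {min_length}")
        if max_length is not None and len(value) > max_length:
            problems.append(f"{value!r} is longer than {max_length}")
    return problems
```

Every check here looks at a single element in isolation; none of them can compare one field with another, which is the cross field case.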
Consider an example of a file with records
having two date fields: start_date, end_date.
There are two input (parsing) operations
that are potentially useful:
1. Physical - Read the date values into internal
elements (or XML elements) and validate the
presence/absence/nullable state of each
2. Business/Logical - Validate that the end_date
is either null, or if not null is greater than
or equal to the start_date.
Naturally DFDL will support #1 but does it
support #2? I wouldn't think so. The ETL work I do might
not even know what the business rule is and
even when we do we always have to deal with 'dirty' data.
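The two operations can be sketched separately, which shows why a record can pass #1 and still fail #2. The YYYYMMDD text format, the blank-means-null convention, and the function names are all assumptions made for this example; none of them come from the DFDL draft.

```python
from datetime import datetime


def parse_date(text, fmt="%Y%m%d"):
    """Physical step: blank means null, otherwise a well-formed date.

    The YYYYMMDD layout is an assumption for this sketch.
    Raises ValueError if the text is not a valid date in that layout.
    """
    text = text.strip()
    if not text:
        return None
    return datetime.strptime(text, fmt).date()


def end_not_before_start(start_date, end_date):
    """Business step: end_date is null, or on or after start_date."""
    if end_date is None:
        return True
    if start_date is None:
        # Assumption: an end_date without a start_date breaks the rule.
        return False
    return end_date >= start_date
```

A record with start_date 20080101 and end_date 20071231 parses cleanly in the physical step, yet the business check returns False for it: that is 'dirty' data which is still structurally valid.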
There are two corresponding output (unparsing)
operations of interest:
1. Physical - Write the date values in the
proper external physical format.
2. Business/Logical - validate that the date
values in the infoset that are to be written meet
the business rule stated above.
Again, I would expect DFDL to support #1
but not #2.
I suggest that some comment or description
be added to the spec to make clear the extent to which
DFDL supports the business/logical aspect
of the data.
My concern is that users will be misled into
thinking they can arbitrarily populate an infoset and
then, using a DFDL schema, create an external
file that can be properly used by a native application.
It's one thing to 'roundtrip' data that is
sourced from a native application and quite another to
produce valid native files using a non-native
application.
Unless, of course, you have ideas about branching
out to BRDL - Business Rule Definition Language?
Rick Post
[attachment "ogf-dfdl-v1.0-Core-032.2_rpost.doc" deleted by Alan Powell/UK/IBM]