Comments on draft 32 of DFDL spec

Hi,

I have been performing ETL since the early '80s, when there were over 20 different floppy disk formats and we had to write products like Uniform to copy data from one format to another. At MicroPro Int'l (of WordStar fame) I was also involved in rewriting domestic versions of software to support Shift-JIS for display and input for the Japanese market.

More recently I have written and supported ETL software for the telecommunications sector (using ATIS XML formats) and for banking, which uses Automated Clearing House (ACH) XML standards. ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer.

As you can imagine, I have seen a lot of duplication of effort due to the lack of a standard way to define even the simplest of data formats, let alone the complex ones. Hence my interest in DFDL, which started with Defuddle.

I have started to read the recent archives to get a sense of where the DFDL project stands now compared to where it was in 2003-2005, when the great DFDL ice-age began and everyone's projects (Defuddle, Virtual XML) froze in their tracks. Also, I have reviewed the recent Core-032.2 document, added comments to it and have a few general questions about the current state of all things DFDL.

1. Re Parsing (input) only - How complete is the current draft spec in terms of being able to create schemas and a parser for reading binary files? Would it support the most common delimited and header/detail/trailer types of files? From my limited exposure to the drafts and emails, and my early use of Defuddle, it certainly seems like the parsing part is nearly complete and ready to have implementations created.

2. Will a conforming DFDL processor be required to support both parsing and unparsing? I have only needed the parsing direction for most of the ETL work I have been involved with in the last several years. Each company I worked for had their own custom file readers and parsers. Thus I am very interested in having a product like Defuddle that can read/parse the basics.

3. Can someone clarify the extent, if any, to which DFDL is expected to be used to validate data content as opposed to data structure? This isn't at all clear anywhere in the spec that I could find. I added a comment suggesting a statement in either the 'What DFDL is' or the 'What DFDL is not' section about this. My assumption is that if a DFDL schema is used to unparse an infoset, the only guarantee is that the resulting physical structure will be correct and not the logical structure.

Consider an example of a file with records having two date fields: start_date, end_date.

There are two input (parsing) operations that are potentially useful:

   1. Physical - Read the date values into internal elements (or XML elements) and validate the presence/absence/nullable state of each.
   2. Business/Logical - Validate that the end_date is either null or, if not null, is greater than or equal to the start_date.

Naturally DFDL will support #1, but does it support #2? I wouldn't think so. The ETL work I do might not even know what the business rule is, and even when we do we always have to deal with 'dirty' data.

There are two corresponding output (unparsing) operations of interest:

   1. Physical - Write the date values in the proper external physical format.
   2. Business/Logical - Validate that the date values in the infoset that are to be written meet the business rule stated above.

Again, I would expect DFDL to support #1 but not #2.
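
The distinction drawn above between the physical and the business/logical level can be sketched in a few lines of Python. This is only an illustration and is not DFDL: the 16-character record layout, the YYYYMMDD encoding, the blank-means-null convention, the choice to make start_date required, and the names Record, parse_date_field, parse_record and meets_business_rule are all assumptions made for the example.

```python
from dataclasses import dataclass
from datetime import date, datetime
from typing import Optional


@dataclass
class Record:
    start_date: date            # assumed required in this sketch
    end_date: Optional[date]    # assumed nullable in this sketch


def parse_date_field(text: str, nullable: bool) -> Optional[date]:
    """Physical level: decode one YYYYMMDD field; blank means null."""
    if text.strip() == "":
        if nullable:
            return None
        raise ValueError("required date field is blank")
    return datetime.strptime(text, "%Y%m%d").date()


def parse_record(line: str) -> Record:
    """Physical level: two adjacent 8-character date fields."""
    if len(line) != 16:
        raise ValueError(f"expected 16 characters, got {len(line)}")
    start = parse_date_field(line[0:8], nullable=False)
    assert start is not None  # guaranteed by nullable=False
    end = parse_date_field(line[8:16], nullable=True)
    return Record(start, end)


def meets_business_rule(rec: Record) -> bool:
    """Business/logical level: end_date is null or >= start_date."""
    return rec.end_date is None or rec.end_date >= rec.start_date


# A 'dirty' record: physically well formed, logically wrong.
dirty = parse_record("2008060720080601")
print(meets_business_rule(dirty))  # False
```

The point of keeping the two functions apart is that parse_record succeeds on the dirty record; only the separate rule check flags it, which matches the expectation that a format description covers #1 but not #2.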

I suggest that some comment or description be added to the spec to make clear the extent to which DFDL supports the business/logical aspect of the data. My concern is that users will be misled into thinking they can arbitrarily populate an infoset and then, using a DFDL schema, create an external file that can be properly used by a native application. It's one thing to 'roundtrip' data that is sourced from a native application and quite another to produce valid native files using a non-native application.

Unless, of course, you have ideas about branching out to BRDL - Business Rule Definition Language?

Rick Post
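
As an illustration of the header/detail/trailer layout mentioned in the message above, FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer, the following Python sketch checks whether a sequence of record kinds follows that pattern. It is a minimal sketch under stated assumptions: the one-letter codes and the function name layout_is_valid are hypothetical, and it says nothing about how a real ACH file or a DFDL processor identifies record kinds.

```python
import re
from typing import Iterable

# Hypothetical one-letter codes, one per record kind named in the layout.
CODES = {
    "FileHeader": "F",
    "BatchHeader": "B",
    "Detail": "D",
    "BatchTrailer": "T",
    "FileTrailer": "Z",
}

# FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer
LAYOUT = re.compile(r"F(BD+T)+Z")


def layout_is_valid(record_kinds: Iterable[str]) -> bool:
    """Return True if the record kinds appear in the expected order."""
    try:
        encoded = "".join(CODES[kind] for kind in record_kinds)
    except KeyError:
        return False  # an unknown record kind cannot match the layout
    return LAYOUT.fullmatch(encoded) is not None


kinds = [
    "FileHeader",
    "BatchHeader", "Detail", "Detail", "BatchTrailer",
    "BatchHeader", "Detail", "BatchTrailer",
    "FileTrailer",
]
print(layout_is_valid(kinds))  # True
```

Encoding each record kind as one character lets an ordinary regular expression express the repetition operators (+) directly, so the pattern reads almost the same as the layout written in the message.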

Rick

Thanks for reviewing the DFDL document. I will get back to you with responses to your detailed comments. In the meantime I have added answers to your questions below.

Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898

From: "RPost" <rp0428@pacbell.net>
To: <dfdl-wg@ogf.org>
Date: 07/06/2008 23:42
Subject: [DFDL-WG] Comments on draft 32 of DFDL spec

Hi,

I have been performing ETL since the early '80s, when there were over 20 different floppy disk formats and we had to write products like Uniform to copy data from one format to another. At MicroPro Int'l (of WordStar fame) I was also involved in rewriting domestic versions of software to support Shift-JIS for display and input for the Japanese market.

More recently I have written and supported ETL software for the telecommunications sector (using ATIS XML formats) and for banking, which uses Automated Clearing House (ACH) XML standards. ACH uses a lot of files with a format: FileHeader, (BatchHeader, Detail+, BatchTrailer)+, FileTrailer.

As you can imagine, I have seen a lot of duplication of effort due to the lack of a standard way to define even the simplest of data formats, let alone the complex ones. Hence my interest in DFDL, which started with Defuddle.

I have started to read the recent archives to get a sense of where the DFDL project stands now compared to where it was in 2003-2005, when the great DFDL ice-age began and everyone's projects (Defuddle, Virtual XML) froze in their tracks. Also, I have reviewed the recent Core-032.2 document, added comments to it and have a few general questions about the current state of all things DFDL.

1. Re Parsing (input) only - How complete is the current draft spec in terms of being able to create schemas and a parser for reading binary files? Would it support the most common delimited and header/detail/trailer types of files? From my limited exposure to the drafts and emails, and my early use of Defuddle, it certainly seems like the parsing part is nearly complete and ready to have implementations created.

<< AWP >> We believe that the current spec is able to deal with most common commercial formats.

2. Will a conforming DFDL processor be required to support both parsing and unparsing? I have only needed the parsing direction for most of the ETL work I have been involved with in the last several years. Each company I worked for had their own custom file readers and parsers. Thus I am very interested in having a product like Defuddle that can read/parse the basics.

<< AWP >> Good question. We have been assuming that both parsing and unparsing would be required.

3. Can someone clarify the extent, if any, to which DFDL is expected to be used to validate data content as opposed to data structure? This isn't at all clear anywhere in the spec that I could find. I added a comment suggesting a statement in either the 'What DFDL is' or the 'What DFDL is not' section about this. My assumption is that if a DFDL schema is used to unparse an infoset, the only guarantee is that the resulting physical structure will be correct and not the logical structure.

<< AWP >> We have distinguished between parsing/unparsing and validation, and assumed that validation can be turned off. The validation that is performed is that which is definable using schema constructs such as enumerations, min/max occurs, min/max length, etc. More complex validation, such as cross-field validation, is outside the scope of DFDL and would require something such as Schematron to validate the infoset.

Consider an example of a file with records having two date fields: start_date, end_date.

There are two input (parsing) operations that are potentially useful:

   1. Physical - Read the date values into internal elements (or XML elements) and validate the presence/absence/nullable state of each.
   2. Business/Logical - Validate that the end_date is either null or, if not null, is greater than or equal to the start_date.

Naturally DFDL will support #1, but does it support #2? I wouldn't think so. The ETL work I do might not even know what the business rule is, and even when we do we always have to deal with 'dirty' data.

There are two corresponding output (unparsing) operations of interest:

   1. Physical - Write the date values in the proper external physical format.
   2. Business/Logical - Validate that the date values in the infoset that are to be written meet the business rule stated above.

Again, I would expect DFDL to support #1 but not #2.

I suggest that some comment or description be added to the spec to make clear the extent to which DFDL supports the business/logical aspect of the data. My concern is that users will be misled into thinking they can arbitrarily populate an infoset and then, using a DFDL schema, create an external file that can be properly used by a native application. It's one thing to 'roundtrip' data that is sourced from a native application and quite another to produce valid native files using a non-native application.

Unless, of course, you have ideas about branching out to BRDL - Business Rule Definition Language?

Rick Post

[attachment "ogf-dfdl-v1.0-Core-032.2_rpost.doc" deleted by Alan Powell/UK/IBM]

--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU