
DFDLers,

Suman and I were discussing a particular data format problem that I undertook to solve in DFDL. I thought it would be good to bring the discussion to the whole group.

The problem is tagged data fields. Many computer messaging formats use things like this for some or all of their data fields.

Here are two records containing a first and last name in a tagged format:

    fname:Tim!LName:Stewart;
    LASTNAME:Smith!firstName:Tom;

Notes: the tags have varying forms. fname, firstname, firstName, FIRSTNAME, and FNAME are all accepted as the tag for the first name field. The definition here is that the tag is case insensitive and takes either the fname or the firstname form. Similarly for lastname. Also, the tagged fields can appear in any order, and are optional.

Here's my test file showing how the XML comes out. (I've attached these as files also in case the email system hammers them.)

This is testTaggedData1.xml

    <?xml version="1.0" encoding="iso-8859-1"?>
    <!-- Xerces-J fails if you put an internal DTD here so you can use Entity defs. Too bad. -->
    <dfdlTest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
              xmlns="http://dataformat.org/testCase"
              xmlns:tc="http://dataformat.org/testCase"
              xsi:schemaLocation="http://dataformat.org/testCase ../../xsd/testCase.xsd http://dataformat.org/tests testTaggedData1.dfdl.xsd">
      <inputTest>
        <!-- Tagged data example from Suman Kalia of IBM.
             Each record consists of a first and last name. Each name is tagged,
             so they can appear in either order. Furthermore, the tagging scheme
             is too complex to implement with something simple like an initiator,
             though if we allowed initiators to be regexps that would be able to
             express this example.
        -->
        <data kind="text">fname:Tim!LName:Stewart; LASTNAME:Smith!firstName:Tom; </data>
        <dfdlSchema file="testTaggedData1.dfdl.xsd"/>
        <tc:xmlResult xmlns="http://dataformat.org/tests">
          <myData>
            <custInfo>
              <firstName>Tim</firstName>
              <lastName>Stewart</lastName>
            </custInfo>
            <custInfo>
              <firstName>Tom</firstName>
              <lastName>Smith</lastName>
            </custInfo>
          </myData>
        </tc:xmlResult>
      </inputTest>
    </dfdlTest>

Now, the DFDL itself. There are 3 variants here. I'll start with the simplest one.

You get a very simple DFDL for this if you assume you can have
(a) a way to specify the values of initiators, terminators, and separators as regular expressions
(b) support for xsd:all groups.

This is testTaggedData3.dfdl.xsd (ok, this one has long lines, so the email system is sure to hammer it, so I won't inline it here.)

I think this particular example is pretty straightforward. However, I have two other example DFDL schemas for this which make fewer assumptions.

testTaggedData2.dfdl.xsd still allows one to specify regular expressions for the initiator rep property, but does not allow use of xsd:all, which is a construct I *was* trying to avoid because, well, it's complicated and feels non-primitive. I think you'll agree the complexity goes up significantly.

testTaggedData1.dfdl.xsd eliminates specifying the initiator at all, and specifies the tags by way of an additional field, hidden in a hidden layer, whose value is constrained by an XSD pattern facet to match a specific regular expression. It also does not use xsd:all.

My summary from going through this exercise: we need both xsd:all support and regular expressions for initiators and for all delimiters. I'm happy that one can express these things without needing these constructs, but tagged representations are too commonplace for this much complex construction to be required.

The complex constructions I used in testTaggedData1 and testTaggedData2 would only be needed if the tags were complex formatted entities whose format couldn't be handled by a regular expression.

Note that if the tags are actually not case insensitive, but are really fixed strings, then there is no need for the regular expression capability.

I'm not sure where we should draw the line here. I'm comfortable with xsd:all support, and with plain strings as delimiters or with regexps as delimiters.

...mikeb
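
As a rough illustration of what regular-expression initiators combined with an unordered group would have to do, here is a minimal Python sketch (not DFDL). The names TAG_PATTERNS, parse_record and parse_data are made up for this illustration, and the two patterns are only one assumed reading of "case insensitive, either fname or firstname"; the '!' separator and ';' terminator are taken from the sample data above.

```python
import re

# Assumed tag patterns: case-insensitive, short or long form of each tag.
TAG_PATTERNS = {
    "firstName": re.compile(r"f(?:irst)?name", re.IGNORECASE),
    "lastName": re.compile(r"l(?:ast)?name", re.IGNORECASE),
}


def parse_record(record):
    """Parse one record such as 'fname:Tim!LName:Stewart' into a dict.

    Fields may appear in any order and may be omitted, as in an unordered group.
    """
    result = {}
    for field in record.split("!"):
        tag, sep, value = field.partition(":")
        if not sep:
            raise ValueError(f"missing ':' in field {field!r}")
        for name, pattern in TAG_PATTERNS.items():
            if pattern.fullmatch(tag.strip()):
                if name in result:
                    raise ValueError(f"duplicate field {name}")
                result[name] = value  # the tag is consumed, only the value is kept
                break
        else:
            raise ValueError(f"unknown tag {tag!r}")
    return result


def parse_data(text):
    """Split the input on ';' terminators and parse each non-empty record."""
    return [parse_record(r.strip()) for r in text.split(";") if r.strip()]


records = parse_data("fname:Tim!LName:Stewart; LASTNAME:Smith!firstName:Tom; ")
print(records)
```

The point of the sketch is that the tag is matched and discarded, so only the value reaches the result, which is the behaviour an initiator gives you and a plain data regular expression does not.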

Very good questions. My thoughts, based on the experience with the parser we use with message broker.

Whenever initiators (I will call them tags as it's less typing :) are used, that means the order of fields can be varied, and fields can be omitted. Not surprisingly, customers exploit this. Any parser claiming to support the use of text tags in data must therefore support unordered data (i.e., xsd:all) and missing data. No different from XML instance documents really.

Our parser provides for specifying a single fixed tag (case sensitive) for a field. If a tag could vary in case, or have an alternative form, as your example shows, we would fall back to using a regular expression. But in our case everything matched by the regular expression is treated as data. This latter behaviour is not what you want in this scenario, as the tag ends up being treated as data and anything subsequently processing the data must strip off the tag. The way round this is, as you say, to allow just the initiator to be specified using a regular expression. However, we have not received an explicit requirement for this (yet).

I wasn't sure how to read the 'or' in your last sentence. Personally, for DFDL 1.0 I think that xsd:all support is a must, but that we could probably get away with a single fixed string for a tag, perhaps accompanied by a 'case sensitive' property. However, regular expression support in general is required in order to distinguish data where there is no tag; you can't parse a SWIFT message without it, for example. So maybe allowing a tag to be specified with a regular expression is not a big deal and we should include it in 1.0 anyway.

Final thought on modeling your example. If you know that you will always get either fname & lname, or firstname & lastname, then you could model this as an xsd:choice of two xsd:all groups where each group contains the same child xsd:elements, but with different (fixed) dfdl tags. Regular expression not needed. Obviously this does not scale well, and many users do not like having to add extra 'layers' to their models in this way.

Regards, Steve

Steve Hanson
WebSphere Business Integration Brokers, IBM Hursley, England
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

From: mike.beckerle@ascentialsoftware.com
Sent by: owner-dfdl-wg@ggf.org
To: dfdl-wg@gridforum.org
cc:
Subject: [dfdl-wg] tagged data examples
Date: 07/04/2005 15:26

(See attached file: testTaggedData1.xml)
(See attached file: testTaggedData3.dfdl.xsd)
(See attached file: testTaggedData2.dfdl.xsd)
(See attached file: testTaggedData1.dfdl.xsd)
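
The modelling alternative from the reply above (a choice of two unordered groups, each with fixed, case-sensitive tags) can be sketched in Python roughly as follows. This is an illustration only, not DFDL and not the message broker parser: TAG_SETS, parse_with_tag_set and parse_record_choice are hypothetical names, and the lowercase tag spellings are an assumption.

```python
# Two alternative groups of fixed, case-sensitive tags (assumed spellings).
TAG_SETS = [
    {"fname": "firstName", "lname": "lastName"},
    {"firstname": "firstName", "lastname": "lastName"},
]


def parse_with_tag_set(record, tags):
    """Parse a record against one group; return None if the record does not fit it."""
    result = {}
    for field in record.split("!"):
        tag, sep, value = field.partition(":")
        # Reject missing separators, tags outside this group, and repeated fields.
        if not sep or tag not in tags or tags[tag] in result:
            return None
        result[tags[tag]] = value
    return result


def parse_record_choice(record):
    """Try each group in turn, like the branches of a choice."""
    for tags in TAG_SETS:
        parsed = parse_with_tag_set(record, tags)
        if parsed is not None:
            return parsed
    raise ValueError(f"record matches no tag group: {record!r}")


# Fields in either order are fine as long as all tags come from one group.
print(parse_record_choice("lname:Stewart!fname:Tim"))
```

Under these assumptions a record that mixes the groups (fname with lastname) or varies the case (LName) is rejected, and covering every such variant would need one more group per combination, which is the scaling problem the reply points out.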
participants (2)
- mike.beckerle@ascentialsoftware.com
- Steve Hanson