
DFDLers,

Suman and I were discussing a particular data format problem that I undertook to solve in DFDL. I thought it would be good to bring the discussion to the whole group.

The problem is tagged data fields. Many computer messaging formats use things like this for some or all of their data fields. Here are two records containing a first and last name in a tagged format:

    fname:Tim!LName:Stewart; LASTNAME:Smith!firstName:Tom;

Notes: the tags have varying forms, i.e., fname, firstname, firstName, FIRSTNAME, and FNAME are all accepted as the tag for the first name field. The definition here is that the tag is case insensitive and takes either the fname or the firstname form. Similarly for lastname. Also, the tagged fields can appear in any order, and are optional.

Here's my test file showing how the XML comes out. (I've attached these as files also in case the email system hammers them.)

This is testTaggedData1.xml

<?xml version="1.0" encoding="iso-8859-1"?>
<!-- Xerces-J fails if you put an internal DTD here so you can use Entity defs. Too bad. -->
<dfdlTest xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
          xmlns="http://dataformat.org/testCase"
          xmlns:tc="http://dataformat.org/testCase"
          xsi:schemaLocation="http://dataformat.org/testCase ../../xsd/testCase.xsd http://dataformat.org/tests testTaggedData1.dfdl.xsd">
  <inputTest>
    <!-- Tagged data example from Suman Kalia of IBM.
         Each record consists of a first and last name. Each name is tagged,
         so they can appear in either order. Furthermore, the tagging scheme
         is too complex to implement with something simple like an initiator,
         though if we allowed initiators to be regexps that would be able to
         express this example.
    -->
    <data kind="text">fname:Tim!LName:Stewart; LASTNAME:Smith!firstName:Tom; </data>
    <dfdlSchema file="testTaggedData1.dfdl.xsd"/>
    <tc:xmlResult xmlns="http://dataformat.org/tests">
      <myData>
        <custInfo>
          <firstName>Tim</firstName>
          <lastName>Stewart</lastName>
        </custInfo>
        <custInfo>
          <firstName>Tom</firstName>
          <lastName>Smith</lastName>
        </custInfo>
      </myData>
    </tc:xmlResult>
  </inputTest>
</dfdlTest>

Now, the DFDL itself. There are 3 variants here. I'll start with the simplest one.

You get a very simple DFDL for this if you assume you can have
(a) a way to specify the values of initiators, terminators, and separators as regular expressions, and
(b) support for xsd:all groups.

This is testTaggedData3.dfdl.xsd (ok, this one has long lines, so the email system is sure to hammer it, so I won't inline it here.)

I think this particular example is pretty straightforward. However, I have two other example DFDL schemas for this which make fewer assumptions.

testTaggedData2.dfdl.xsd still allows one to specify regular expressions for the initiator rep property, but does not allow use of xsd:all, which is a construct I *was* trying to avoid because, well, it's complicated and feels non-primitive. I think you'll agree the complexity goes up significantly.

testTaggedData1.dfdl.xsd eliminates specifying the initiator at all, and specifies the tags by way of an additional field, hidden in a hidden layer, whose value is constrained by an XSD pattern facet to match a specific regular expression. It also does not use xsd:all.

My summary from going through this exercise: We need both xsd:all support and regular expressions for initiators and all delimiters. I'm happy that one can express these things without needing these constructs, but tagged representations are too commonplace for this much complex construction to be required.
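
For concreteness, here is the tag-matching rule from the notes above written as two case-insensitive regular expressions, sketched in Python rather than DFDL. This is only an illustration of what regexp-valued initiators would have to match, not any of the attached schemas. The function name parse_records and the reading of "!" as a field separator and ";" as a record terminator are my assumptions, inferred from the sample data.

```python
import re

# Tag patterns per the notes: "fname" or "firstname" (likewise for last name),
# matched case-insensitively.
FIRST_TAG = re.compile(r"f(?:irst)?name", re.IGNORECASE)
LAST_TAG = re.compile(r"l(?:ast)?name", re.IGNORECASE)


def parse_records(text):
    """Parse tagged records into dicts (hypothetical helper, not DFDL).

    Assumes ';' terminates a record, '!' separates fields within a record,
    and ':' separates a tag from its value. Whitespace around a record is
    ignored. Fields may appear in any order and may be absent.
    """
    records = []
    for raw in text.split(";"):
        raw = raw.strip()
        if not raw:
            continue  # nothing after the final terminator
        record = {}
        for field in raw.split("!"):
            tag, sep, value = field.partition(":")
            if not sep:
                raise ValueError(f"field has no tag: {field!r}")
            tag = tag.strip()
            if FIRST_TAG.fullmatch(tag):
                record["firstName"] = value
            elif LAST_TAG.fullmatch(tag):
                record["lastName"] = value
            else:
                raise ValueError(f"unrecognized tag: {tag!r}")
        records.append(record)
    return records
```

Run over the test data from testTaggedData1.xml, this should yield the same two custInfo records shown in the expected XML result, regardless of the order of the fields within each record.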
The complex constructions I used in testTaggedData1 and testTaggedData2 would only be needed if the tags were complex formatted entities, the format of which couldn't be handled by a regular expression.

Note that if the tags are actually not case insensitive, but are really fixed strings, then there is no need for the regular expression capability.

I'm not sure where we should draw the line here. I'm comfortable with xsd:all support, and with either plain strings or regexps as delimiters.

...mikeb