All,

I have completed my review of draft 33 of the standard. I’ve read through the document as a whole several times and spent a considerable amount of time digesting each of the sections. I’ve included nearly all my comments in the document (attached) but because there are so many, and reconciling them with ongoing revisions of the document may be difficult, I’ve included what I feel are the important points in this email (I realize there are a lot of points here, but I guess that’s what happens when someone takes a totally fresh look at things). I would like to note that these are all just suggestions – I commented wherever I had a question or concern for completeness’ sake, but certainly don’t expect all of my feedback to be incorporated – especially since many of the concerns may have already been discussed and addressed in previous iterations of the standard. My only goal and motivation is to help make the best standard possible, and hopefully some of these suggestions will be food for thought.

General

- The document feels overly verbose and explanatory to me. There are many whole sections and blocks of text that, while very valuable, don’t really seem appropriate in a normative standards document. The document should explain “what is,” not necessarily “why it is.” I understand that it was previously discussed as to whether portions of the document should be extracted and instead included in a separate non-normative “DFDL Primer” similar to the way W3C structured the XML Schema standard. My reaction is that doing so would help clean up the document. Using technical books as a metaphor, my own feeling is that a normative standard should be more like “The Definitive UNIX Reference” and less like “Introduction to UNIX” or even “Expert-Level UNIX”. I think the standard falls a little too far into the latter category right now.

- Related to the previous comment, the section that seemed the most out of place to me was the discussion on the parsing and unparsing processes and their relationship to grammar and general parsing concepts (“DFDL Properties Introduction”). Though the discussion was extremely valuable from the standpoint of a potential implementer and may actually be the only way to implement the standard, I think it may fall too far into the “how to implement” category and might be more appropriate in an appendix (marked as non-normative) or a primer if one is created.

- There are certain sections that seemed a little misplaced to me. In every case, it became clear over time why the document was organized the way it was, but some revision may make it easier to digest. Generally, it seemed that some concepts were “spread out” and not organized under encompassing umbrella sections. Though the way it’s currently structured may make the document more componentized, it makes it harder to understand from a “where do I look for all the information on X” standpoint. The most obvious example is all the sections dealing with representation properties such as the list of representation property precedence and the sections on sequences, choices, etc. My thought is that since all those sections are really discussing different representation properties and aspects thereof that it seems reasonable to group them into one overarching section. There are other areas where I thought the organization could be improved including the discussion on element vs. attribute vs. short binding forms (seemed misplaced given that several other non-representation property annotation element attributes such as setVariableName can use the alternate forms, and also broke up the flow of annotation element descriptions) and the glossary (I like the idea of defining specific terms used broadly throughout the document to remove ambiguity, but a general purpose glossary feels more appropriate as an appendix). To codify my thoughts, I worked up a TOC that I think exhibits a more understandable organizational structure. I’ve attached it not because I want or expect the entire structure of the document to be modified, but as a “jumping off” point for discussion.

- The standard references RFC 2119 for defining certain terms such as MUST, SHALL, etc. In most other standards I’ve seen, emphasis is placed on these terms when their meaning is to be taken from RFC 2119 – I would suggest that DFDL do the same. It also appears as though the terms aren’t being used throughout the document as regularly as they could or should be. I would suggest that at some point in the final revision process we scrub the document for requirements concepts and make sure to use the appropriate RFC 2119 terms where possible. This should remove ambiguity about what’s expected from implementations.

DFDL Information Set

- I’m sure this has already been discussed at length, but I wonder if it would be possible to define the DFDL Information Set as an extension to the XPath Data Model (XDM) as XSLT does for its data model. This would have many advantages. The XDM is compatible with both the XML Schema PSVI and the XML Information Set (and the XDM standard explicitly explains the conversion process to and from each). This therefore provides interoperability with the alternate representations and uses of a DFDL Schema as an XML Schema and as plain XML content. Additionally, an XDM (or some reasonable facsimile) will have to be constructed from the DFDL Schema anyway to support the XPath capabilities of DFDL – basing the DFDL Infoset on XDM to begin with would ensure seamless (or at least easier) use with XPath libraries and infrastructure. Also, XDM and DFDL both use the XML Schema type system (after all, DFDL is a subset of XML Schema) and as such XDM already supports the DFDL types.

- If the above is infeasible or too big of a change for this late in the process, would it at least be possible to define the DFDL Information Set in terms of the XML Information Set standard? The DFDL Information Set already appears to be loosely based on it, and may actually be compatible (I don’t know) but the relationship is not explicit. Without such a statement and the satisfaction of the requirements of extending the XML Information Set as defined in that standard, implementations can’t rely on the compatibility. If the relationship was made explicit and we ensured that the DFDL Information Set was indeed compatible with and extended the XML Information Set, then the XDM needed to process DFDL expressions could be generated using the Infoset to XDM process described in the XDM standard. If we went this route, I would also make sure that we maintain compatibility with the PSVI – that is, we don’t want to introduce concepts or information set members that conflict with the PSVI. This will make it easier on implementers because they could potentially reuse the same internal infoset representation for both the DFDL Information Set and the PSVI during validation processes.

- I imagine it will take some investigation to determine if either of these options is possible and compatible with DFDL concepts – I don’t mind taking on the task if there is interest in modifying the DFDL infoset. It just seems a shame to me to forgo an opportunity to establish some synergy with related XML standards.

- In any case, the concept of simple element information items and complex element information items seems contrary to established convention. The concept of using character information items (and groupings of them as explicitly allowed in the XML Information Set standard) to represent child simple content has already been established through the XML Information Set standard and other related XML standards. It is especially confusing given that the same terms as the XML Information Set are used.

- Should everything in the DFDL Information Set have a corresponding representation in an XML document generated from or used to generate it (not the DFDL Schema, but the result or input to parsing or unparsing)? This question occurred to me based on the discussion in the most recent teleconference about treating and representing comments as separate kinds of content. It was suggested that the infoset would need to handle comments in a special way as to differentiate them from non-commented content. My concern is that there may not be an appropriate XML representation of such an infoset item. Creating a special element in the result document would break the property that the result of DFDL processing can be validated by the DFDL Schema (because the commented element wouldn’t have been declared in the original DFDL Schema – it couldn’t be a declared element because comments can appear anywhere in the source content). The only other option I can see would be to treat source content identified as comments and indicated as such in the infoset as XML comments in the result document. This brings up the interesting complication during unparsing of differentiating between “real” XML comments (those that should truly be ignored) and “output” XML comments (those that should be output to the result stream as commented content). The solution might be to treat all XML comments in a document used for unparsing as available for output as commented content, but it seems unreasonable to redefine XML comments in that way. This brings us back to the original question: if there is no way to adequately describe commented content in a resultant XML document, does everything in the infoset need a representation in an XML result document? 
What are the implications for upholding the ability to round-trip (if the resultant XML document doesn’t contain everything in the infoset, and everything in the infoset is needed to fully describe the source content, then unparsing the resultant XML document will not result in the original source content)?

Annotation Elements and Representation Properties

- There seem to be inconsistencies throughout the document, specifically in the descriptions of annotation elements and representation properties. This is to be expected in a document that’s been under heavy revision over such a long time span, but an effort will need to be made to scrub out all inconsistencies before the final version. To this end, I’ve found creating a table of all annotation elements and their properties helpful. I’ve attached what I have so far. It has all annotation elements and their attributes and a notional start for the representation properties. I intend to complete it as I go and hopefully make sure everything matches up in the process.

- I’m not sure I understand the value in having the specialized annotation elements. From the DFDL user/developer perspective it seems more difficult because they need to recognize additional syntax. For example, when they see a dfdl:choice annotation element they need to understand that it’s really a dfdl:format with a subset of allowed representation properties appropriate to xs:choice elements. They still must refer to the standard document to find out which representation properties are allowed, and the alternate syntax doesn’t necessarily help in validation because a standard dfdl:format annotation element would also have been valid (and the DFDL XML Schema can’t determine which representation properties are valid on a dfdl:format based only on usage location). It also makes the document more confusing because representation properties are referred to as being valid for specific dfdl:* annotation elements as opposed to the real meaning, which is that they’re valid for dfdl:format elements that annotate specific XML elements. To put it another way, a representation property that is valid for dfdl:choice is also only valid for dfdl:format when used as an annotation of xs:choice or as a short form property on xs:choice elements – but this isn’t necessarily clear from the property descriptions since they only refer to the dfdl:* special annotation elements. From an implementation perspective, it adds complexity because the extra element names must be accepted. The DFDL parser will still have to validate representation properties and their validity as applied to the parent schema element regardless of whether the annotation element is a dfdl:format or a special annotation element. Not to mention, wouldn’t short form be used most frequently anyway, in which case there are no annotation elements?
In any case, I see very little value for a disproportionate amount of added complexity and potential confusion, and I suggest the concept of special annotation elements that restrict dfdl:format be removed.

- The standard isn’t totally clear and unambiguous on the behavior with respect to the dfdl:format selector property. It is mentioned that the selector is externally identified, but no additional information is given. Are the selectors implementation specific? If so, does that break compatibility with alternate DFDL parsers if the selector property is used? What if a selector is referred to but doesn’t exist in the parser? Is it a schema definition error or a parsing error (when are external selectors resolved)? What if there is no “default” dfdl:format block and they all contain non-matching selectors? Is that a processing error (should be explicit)?

- When a defined format is put into use, how/when are the representation properties checked for validity with respect to the schema element (such as xs:choice) that put the defined format into use? To put it another way, is it an error (and what kind) if a defined format specifies representation properties that aren’t valid for the schema element that uses it? Can a defined format contain the special format annotation elements (dfdl:sequence, dfdl:choice, etc.)?

- The standard says that a dfdl:defineFormat can contain any of the other annotation elements. How are the other annotation elements contained within a dfdl:defineFormat (such as dfdl:assert or dfdl:hidden) applied when a named format definition is referenced by a dfdl:format ref attribute? Can named format definitions be referenced anywhere other than where a dfdl:format is expected? Do all other annotation elements make sense, and are they valid, wherever a defined format would be referenced? If not, suggest explicitly stating which annotation elements are allowed within a dfdl:defineFormat as opposed to saying any are allowed.

- The descriptions for dfdl:assert and dfdl:discriminator read very similarly (and probably for good reason) but it’s not clear how they’re different. If the failure of a dfdl:discriminator results in a processing error, doesn’t that make it equivalent to an assert? In other words, how can it be used for control when one of the two possible outcomes results in an error that (potentially) halts processing? May want to refine the description of dfdl:discriminator.

- What about the positioning of hidden elements relative to siblings? Can they appear anywhere within the parent - in which case, is relative position important? May want to address this one way or the other.

- The properties for dfdl:textNumberFormat are defined in the representation property section. Granted, they may be representation properties from the conceptual level, but syntactically, they would appear to be different. Can the text number format properties be used in a dfdl:format or dfdl:property element? If not, then suggest treating them more as attributes of the dfdl:textNumberFormat element and defining them there. If so, then I wonder what the purpose of the dfdl:textNumberFormat element is…it would seem to fall into the same category as the other special dfdl:format annotation elements that restrict the set of valid representation properties. Also seems to apply to dfdl:defineEscapeScheme and its representation properties.

- The document isn’t clear on how the position of a variable declaration impacts its scope. Does it apply to all children of the element to which the definition belongs (regardless of position relative to the definition), to all siblings following the definition (but not preceding), or to all elements following the definition (regardless of hierarchy)? I assume the first, but more clarification would be helpful in order to make it unambiguous.
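
To make the three readings concrete, here is a rough Python sketch. Nothing in it comes from the draft: the Element class, the function names, and the example tree are all made up for illustration, and "children" is assumed to include nested descendants.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """A bare-bones stand-in for a schema element (hypothetical)."""
    name: str
    children: List["Element"] = field(default_factory=list)


def subtree(node: Element) -> List[str]:
    """Names of node and everything beneath it, in document order."""
    names = [node.name]
    for child in node.children:
        names.extend(subtree(child))
    return names


def scope_all_children(owner: Element) -> List[str]:
    """Reading 1: everything beneath the owning element, wherever the definition sits."""
    return [n for child in owner.children for n in subtree(child)]


def scope_following_siblings(owner: Element, pos: int) -> List[str]:
    """Reading 2: only the children from index pos onward (definition sits just before child pos)."""
    return [n for child in owner.children[pos:] for n in subtree(child)]


def scope_following_elements(root: Element, owner: Element, pos: int) -> List[str]:
    """Reading 3: everything after the definition in document order, ignoring hierarchy."""
    order = subtree(root)
    # Element names are assumed unique so that index() finds the owner.
    end_of_owner = order.index(owner.name) + len(subtree(owner))
    return scope_following_siblings(owner, pos) + order[end_of_owner:]


# Example tree; the variable definition is assumed to sit between "a" and "b".
record = Element("record", [Element("a"), Element("b", [Element("b1")]), Element("c")])
root = Element("root", [Element("header"), record, Element("trailer")])

print(scope_all_children(record))
print(scope_following_siblings(record, 1))
print(scope_following_elements(root, record, 1))
```

With this tree, the first reading covers a, b, b1 and c; the second drops a; the third drops a but picks up trailer, which lies outside the owning element. The three readings give different answers even on a tiny schema, which is why the text should pick one.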

- The value type for representation properties is listed in conceptual terms, but shouldn’t all properties actually accept one or more specific XML Schema (within the DFDL subset) types (usually atomic)? Making this explicit would remove confusion on the part of implementers. For example, several are defined as ‘Enum’ – though the value may logically be an enumerated type, the actual atomic type is something else like xs:string or xs:token – with additional validation to ensure it’s one of the allowed enumerated values. The normative standard is first and foremost a reference for implementations and as such should be totally unambiguous with regard to typing information.
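
As a rough illustration of what "Enum" could mean at the atomic-type level, the Python sketch below treats the value as an xs:token (whitespace collapsed) and then checks it against an allowed set. The function names are hypothetical, and the byte-order value set is only an assumed example of an enumerated property, not a quotation from the draft.

```python
import re
from typing import FrozenSet


def collapse_whitespace(raw: str) -> str:
    """Collapse whitespace the way xs:token does: runs of space, tab, CR, LF
    become a single space, and leading/trailing spaces are removed."""
    return re.sub(r"[ \t\r\n]+", " ", raw).strip(" ")


def validate_enum(raw: str, allowed: FrozenSet[str]) -> str:
    """Return the normalized token, or raise ValueError if it is not an allowed value."""
    value = collapse_whitespace(raw)
    if value not in allowed:
        raise ValueError(f"{value!r} is not one of {sorted(allowed)}")
    return value


# Assumed example value set for an enumerated property (illustrative only).
BYTE_ORDER_VALUES = frozenset({"bigEndian", "littleEndian"})

print(validate_enum("  bigEndian\n", BYTE_ORDER_VALUES))
```

Stating the atomic type and the allowed values separately, as above, is the level of detail an implementer would need from the property tables.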

- There appear to be cases (such as alignment) where multiple types can be accepted unnecessarily. In the alignment case, a specific xs:string or a positive integer type is valid. Wouldn’t this be easier on both the DFDL XML Schema and the implementations if, wherever possible, only one atomic type was accepted? In the alignment case, an xs:nonNegativeInteger could be used where ‘0’ means ‘implicit’.
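
The difference for an implementation can be sketched in Python as follows. This assumes the "specific xs:string" is the literal value 'implicit'. The function names are hypothetical, and the digit check is a simplification of the xs:nonNegativeInteger lexical rules.

```python
import re
from typing import Optional

_DIGITS = re.compile(r"[0-9]+")


def parse_alignment_union(raw: str) -> Optional[int]:
    """Two-type form: the literal 'implicit' or a positive integer.
    None stands for implicit alignment in this sketch."""
    text = raw.strip()
    if text == "implicit":
        return None
    if not _DIGITS.fullmatch(text) or int(text) == 0:
        raise ValueError(f"alignment must be 'implicit' or a positive integer: {raw!r}")
    return int(text)


def parse_alignment_single(raw: str) -> Optional[int]:
    """Single-type form: one non-negative integer where 0 means implicit."""
    text = raw.strip()
    if not _DIGITS.fullmatch(text):
        raise ValueError(f"alignment must be a non-negative integer: {raw!r}")
    value = int(text)
    return None if value == 0 else value


print(parse_alignment_union("implicit"), parse_alignment_union("4"))
print(parse_alignment_single("0"), parse_alignment_single("4"))
```

Both forms carry the same information; the single-type form just removes the string branch from the schema type and from the parser.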

- With regard to case sensitivity, how is the case equivalency defined for different character sets? Is this (or should it be) related to XPath collations? Perhaps instead of an ignore case switch the user should be allowed to specify a collation for initiator/terminator comparison and the DFDL standard would require implementations include a case-insensitive collation for common character sets. This would open the door to using more general character/string comparison operations and could be important in certain settings – for example, the XPath standard has an example that ‘v’ and ‘w’ are equivalent in Swedish. This may have some other advantages – if collations are needed for this kind of thing, then we could probably support fn:compare and fn:codepoint-equal in the DFDL XPath subset.
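
To show why "ignore case" is underspecified on its own, here is a small Python sketch in which the comparison rule is a named, pluggable choice. The collation names and the delimiters_equal function are invented for illustration. They are not XPath collation URIs or anything defined by DFDL, and Python's built-in lowercasing and case folding stand in for real collations.

```python
from typing import Callable, Dict

# Hypothetical "collations": each maps a string to a comparison key.
COLLATIONS: Dict[str, Callable[[str], str]] = {
    "codepoint": lambda s: s,             # exact code point comparison
    "lowercase": lambda s: s.lower(),     # simple per-character lowercasing
    "casefold": lambda s: s.casefold(),   # Unicode full case folding
}


def delimiters_equal(found: str, declared: str, collation: str) -> bool:
    """Compare a delimiter found in the data with the declared one under the named rule."""
    key = COLLATIONS[collation]
    return key(found) == key(declared)


for name in ("codepoint", "lowercase", "casefold"):
    print(name, delimiters_equal("STRASSE", "Straße", name))
```

Under simple lowercasing "STRASSE" and "Straße" differ, while under full case folding they match, so two implementations that both claim to ignore case could disagree on the same data unless the standard names the rule.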

If you’ve made it this far, congrats :) Hopefully this list will spur some discussion.

Thanks,

Dave

---
David Glick | dglick@dracorp.com | 703.299.0700 x212
Data Research and Analysis Corp. | www.dracorp.com