[DFDL-WG] DFDL Revision 033 Comments

20 Feb 2009

All,

I have completed my review of draft 33 of the standard. I've read
through the document as a whole several times and spent a
considerable amount of time digesting each of the sections. I've
included nearly all my comments in the document (attached) but because
there are so many, and reconciling them with ongoing revisions of the
document may be difficult, I've included what I feel are the important
points in this email (I realize there are a lot of points here, but I
guess that's what happens when someone takes a totally fresh look at
things). I would like to note that these are all just suggestions - I
commented wherever I had a question or concern for completeness' sake,
but certainly don't expect all of my feedback to be incorporated -
especially since many of the concerns may have already been discussed
and addressed in previous iterations of the standard. My only goal and
motivation is to help make the best standard possible, and hopefully
some of these suggestions will be food for thought.

General

- The document feels overly verbose and explanatory to me. There are
many whole sections and blocks of text that, while very valuable, don't
really seem appropriate in a normative standards document. The document
should explain "what is," not necessarily "why it is." I understand that
it was previously discussed whether portions of the document
should be extracted and instead included in a separate non-normative
"DFDL Primer" similar to the way W3C structured the XML Schema standard.
My reaction is that doing so would help clean up the document. Using
technical books as a metaphor, my own feeling is that a normative
standard should be more like "The Definitive UNIX Reference" and less
like "Introduction to UNIX" or even "Expert-Level UNIX". I think the
standard falls a little too far into the latter category right now.

- Related to the previous comment, the section that seemed the most out
of place to me was the discussion on the parsing and unparsing processes
and their relationship to grammar and general parsing concepts ("DFDL
Properties Introduction"). Though the discussion was extremely valuable
from the standpoint of a potential implementer and may actually be the
only way to implement the standard, I think it may fall too far into the
"how to implement" category and might be more appropriate in an appendix
(marked as non-normative) or a primer if one is created. 

- There are certain sections that seemed a little misplaced to me. In
every case, it became clear over time why the document was organized the
way it was, but some revision may make it easier to digest. Generally,
it seemed that some concepts were "spread out" and not organized under
encompassing umbrella sections. Though the way it's currently structured
may make the document more componentized, it makes it harder to
understand from a "where do I look for all the information on X"
standpoint. The most obvious example is all the sections dealing with
representation properties such as the list of representation property
precedence and the sections on sequences, choices, etc. My thought is
that since all those sections are really discussing different
representation properties and aspects thereof, it seems reasonable
to group them into one overarching section. There are other areas where
I thought the organization could be improved including the discussion on
element vs. attribute vs. short binding forms (seemed misplaced given
that several other non-representation property annotation element
attributes such as setVariableName can use the alternate forms, and also
broke up the flow of annotation element descriptions) and the glossary
(I like the idea of defining specific terms used broadly throughout the
document to remove ambiguity, but a general purpose glossary feels more
appropriate as an appendix). To codify my thoughts, I worked up a TOC
that I think exhibits a more understandable organizational structure.
I've attached it not because I want or expect the entire structure of
the document to be modified, but as a "jumping off" point for
discussion.

- The standard references RFC 2119 for defining certain terms such as
MUST, SHALL, etc. In most other standards I've seen, emphasis is placed
on these terms when their meaning is to be taken from RFC 2119 - I would
suggest that DFDL do the same. It also appears as though the terms
aren't being used throughout the document as regularly as they could or
should be. I would suggest that at some point in the final revision
process we scrub the document for requirements concepts and make sure to
use the appropriate RFC 2119 terms where possible. This should remove
ambiguity about what's expected from implementations.
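
As a rough aid for that scrubbing pass, here is a minimal Python sketch that lists RFC 2119 keywords appearing without uppercase emphasis. The function name and the sample text are hypothetical, and the hits are only candidates for manual review (for example, the ordinary English word "may" will also be reported).

```python
import re

# The keywords defined by RFC 2119. Two-word forms come first so that
# "MUST NOT" is matched as a unit rather than as "MUST".
KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "SHALL", "SHOULD",
    "REQUIRED", "RECOMMENDED", "MAY", "OPTIONAL",
]
PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in KEYWORDS) + r")\b",
    re.IGNORECASE,
)


def find_unemphasized(text):
    """Return (line number, word) pairs for keywords not written in uppercase."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PATTERN.finditer(line):
            word = match.group(1)
            if word != word.upper():
                hits.append((number, word))
    return hits


# Hypothetical sample text, not taken from the standard.
SAMPLE = (
    "Implementations must accept X.\n"
    "A parser MUST NOT fail.\n"
    "It may be ignored."
)
print(find_unemphasized(SAMPLE))
```

Each hit still needs a human decision: either the sentence states a requirement and the keyword should be emphasized, or the word should be replaced with a non-RFC 2119 synonym.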

DFDL Information Set

- I'm sure this has already been discussed at length, but I wonder if it
would be possible to define the DFDL Information Set as an extension to
the XPath Data Model (XDM) as XSLT does for its data model. This would
have many advantages. The XDM is compatible with both the XML Schema
PSVI and the XML Information Set (and the XDM standard explicitly
explains the conversion process to and from each). This therefore
provides interoperability with the alternate representations and uses of
a DFDL Schema as an XML Schema and as plain XML content. Additionally,
an XDM (or some reasonable facsimile) will have to be constructed from
the DFDL Schema anyway to support the XPath capabilities of DFDL -
basing the DFDL Infoset on XDM to begin with would ensure seamless (or
at least easier) use with XPath libraries and infrastructure. Also, XDM
and DFDL both use the XML Schema type system (after all, DFDL is a
subset of XML Schema) and as such XDM already supports the DFDL types.

- If the above is infeasible or too big of a change for this late in the
process, would it at least be possible to define the DFDL Information
Set in terms of the XML Information Set standard? The DFDL Information
Set already appears to be loosely based on it, and may actually be
compatible (I don't know) but the relationship is not explicit. Without
such a statement and the satisfaction of the requirements of extending
the XML Information Set as defined in that standard, implementations
can't rely on the compatibility. If the relationship was made explicit
and we ensured that the DFDL Information Set was indeed compatible with
and extended the XML Information Set, then the XDM needed to process
DFDL expressions could be generated using the Infoset to XDM process
described in the XDM standard. If we went this route, I would also make
sure that we maintain compatibility with the PSVI - that is, we don't
want to introduce concepts or information set members that conflict with
the PSVI. This will make it easier on implementers because they could
potentially reuse the same internal infoset representation for both the
DFDL Information Set and the PSVI during validation processes.

- I imagine it will take some investigation to determine if either of
these options is possible and compatible with DFDL concepts - I don't
mind taking on the task if there is interest in modifying the DFDL
infoset. It just seems a shame to me to forgo an opportunity to
establish some synergy with related XML standards.

- In any case, the concept of simple element information items and
complex element information items seems contrary to established
convention. The concept of using character information items (and
groupings of them as explicitly allowed in the XML Information Set
standard) to represent child simple content has already been established
through the XML Information Set standard and other related XML
standards. It is especially confusing given that the same terms as the
XML Information Set are used.

- Should everything in the DFDL Information Set have a corresponding
representation in an XML document generated from or used to generate it
(not the DFDL Schema, but the result or input to parsing or unparsing)?
This question occurred to me based on the discussion in the most recent
teleconference about treating and representing comments as separate
kinds of content. It was suggested that the infoset would need to handle
comments in a special way so as to differentiate them from non-commented
content. My concern is that there may not be an appropriate XML
representation of such an infoset item. Creating a special element in
the result document would break the property that the result of DFDL
processing can be validated by the DFDL Schema (because the commented
element wouldn't have been declared in the original DFDL Schema - it
couldn't be a declared element because comments can appear anywhere in
the source content). The only other option I can see would be to treat
source content identified as comments and indicated as such in the
infoset as XML comments in the result document. This brings up the
interesting complication during unparsing of differentiating between
"real" XML comments (those that should truly be ignored) and "output"
XML comments (those that should be output to the result stream as
commented content). The solution might be to treat all XML comments in a
document used for unparsing as available for output as commented
content, but it seems unreasonable to redefine XML comments in that way.
This brings us back to the original question: if there is no way to
adequately describe commented content in a resultant XML document, does
everything in the infoset need a representation in an XML result
document? What are the implications to upholding the ability to
round-trip (if the resultant XML document doesn't contain everything in
the infoset, and everything in the infoset is needed to fully describe
the source content, then unparsing the resultant XML document will not
result in the original source content)?

Annotation Elements and Representation Properties

- There seem to be inconsistencies throughout the document,
specifically in the descriptions of annotation elements and
representation properties. This is to be expected in a document that's
been under heavy revision over such a long time span, but an effort will
need to be made to scrub out all inconsistencies before the final
version. To this end, I've found creating a table of all annotation
elements and their properties helpful. I've attached what I have so far.
It has all annotation elements and their attributes and a notional start
for the representation properties. I intend to complete it as I go and
hopefully make sure everything matches up in the process.

- I'm not sure I understand the value in having the specialized
annotation elements. From the DFDL user/developer perspective it seems
more difficult because they need to recognize additional syntax. For
example, when they see a dfdl:choice annotation element they need to
understand that it's really a dfdl:format with a subset of allowed
representation properties appropriate to xs:choice elements. They still
must refer to the standard document to find out which representation
properties are allowed, and the alternate syntax doesn't necessarily
help in validation because a standard dfdl:format annotation element
would also have been valid (and the DFDL XML Schema can't determine
which representation properties are valid on a dfdl:format based only on
usage location). It also makes the document more confusing because
representation properties are referred to as being valid for specific
dfdl:* annotation elements as opposed to the real meaning which is that
they're valid for dfdl:format elements that annotate specific XML
elements. To put it another way, a representation property that is valid
for dfdl:choice is also only valid for dfdl:format when used as an
annotation of xs:choice or as a short form property on xs:choice
elements - but this isn't necessarily clear from the property
descriptions since they only refer to the dfdl:* special annotation
elements. From an implementation perspective, it adds complexity because
the extra element names must be accepted. The DFDL parser will still
have to validate representation properties and their validity as applied
to the parent schema element regardless of whether the annotation
element is a dfdl:format or a special annotation element. Not to
mention, wouldn't short form be used most frequently anyway, in which
case there are no annotation elements? In any case, I see very little
value for a disproportionate amount of added complexity and potential
confusion and I suggest the concept of special annotation elements
that restrict dfdl:format be removed.
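
To make the implementation argument concrete, here is a minimal Python sketch of the check a DFDL processor would need either way. The property names and the tables are placeholders I made up for illustration and are not the lists from the standard. The point it tries to show is that the property check depends only on the annotated schema element, so the specialized element name contributes one extra test and nothing else.

```python
# Placeholder tables, for illustration only. The real lists of allowed
# properties would come from the standard.
ALLOWED_PROPERTIES = {
    "xs:choice": {"choiceKind", "initiator", "terminator"},
    "xs:sequence": {"separator", "initiator", "terminator"},
}

# Specialized annotation elements and the schema element each one implies.
SPECIALIZED = {"dfdl:choice": "xs:choice", "dfdl:sequence": "xs:sequence"}


def validate_annotation(schema_element, annotation_element, properties):
    """Return a list of problems found; an empty list means valid."""
    problems = []
    implied = SPECIALIZED.get(annotation_element)
    if annotation_element != "dfdl:format" and implied is None:
        problems.append(f"unknown annotation element {annotation_element}")
    elif implied is not None and implied != schema_element:
        # The only check that exists because of the specialized names.
        problems.append(f"{annotation_element} is not allowed on {schema_element}")
    # The property check is driven by the annotated schema element alone.
    allowed = ALLOWED_PROPERTIES.get(schema_element, set())
    for name in sorted(properties):
        if name not in allowed:
            problems.append(f"{name} is not valid on {schema_element}")
    return problems
```

Under these assumptions, `validate_annotation("xs:choice", "dfdl:format", {"separator": ","})` and the same call with `"dfdl:choice"` report the identical problem, which is the redundancy described above.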

- The standard isn't totally clear and unambiguous on the behavior with
respect to the dfdl:format selector property. It is mentioned that the
selector is externally identified, but no additional information is
given. Are the selectors implementation specific? If so, does that break
compatibility with alternate DFDL parsers if the selector property is
used? What if a selector is referred to but doesn't exist in the parser?
Is it a schema definition error or a parsing error (when are external
selectors resolved)? What if there is no "default" dfdl:format block and
they all contain non-matching selectors? Is that a processing error
(should be explicit)?

- When a defined format is put into use, how/when are the representation
properties checked for validity with respect to the schema element (such
as xs:choice) that put the defined format into use? To put it another
way, is it an error (and what kind) if a defined format specifies
representation properties that aren't valid for the schema element that
uses it? Can a defined format contain the special format annotation
elements (dfdl:sequence, dfdl:choice, etc.)?

- The standard says that a dfdl:defineFormat can contain any of the
other annotation elements. How are the other annotation elements
contained within a dfdl:defineFormat (such as dfdl:assert or
dfdl:hidden) applied when a named format definition is referenced by a
dfdl:format ref attribute? Can named format definitions be referenced
anywhere else other than where a dfdl:format is expected? Are all other
annotation elements sensible and valid wherever a defined format
would be referenced? If not, I suggest explicitly stating what annotation
elements are allowed within a dfdl:defineFormat as opposed to saying any
are allowed.

- The descriptions for dfdl:assert and dfdl:discriminator read very
similarly (and probably for good reason) but it's not clear how they're
different. If the failure of a dfdl:discriminator results in a
processing error, doesn't that make it equivalent to an assert? In other
words, how can it be used for control when one of the two possible
outcomes results in an error that (potentially) halts processing? May
want to refine the description of dfdl:discriminator.

- What about the positioning of hidden elements relative to siblings?
Can they appear anywhere within the parent - in which case, is relative
position important? May want to address this one way or the other.

- The properties for dfdl:textNumberFormat are defined in the
representation property section. Granted, they may be representation
properties from the conceptual level, but syntactically, they would
appear to be different. Can the text number format properties be used in
a dfdl:format or dfdl:property element? If not, then suggest treating
them more as attributes of the dfdl:textNumberFormat element and
defining them there. If so, then I wonder what the purpose of the
dfdl:textNumberFormat element is...it would seem to fall into the same
category as the other special dfdl:format annotation elements that
restrict the set of valid representation properties. Also seems to apply
to dfdl:defineEscapeScheme and its representation properties.

- The document isn't clear on how the position of a variable declaration
impacts its scope. Does it apply to all children of the element to which
the definition belongs (regardless of position relative to the
definition), to all siblings following the definition (but not
preceding), or to all elements following the definition (regardless of
hierarchy)? I assume the first, but more clarification would be helpful
in order to make it unambiguous.
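
To show how much the three readings can differ, here is a small Python sketch that computes the elements in scope under each one. The tree, the element names, and the idea of a definition sitting at a child index are all assumptions made for illustration. I also assume that the second reading includes the content of the following siblings and that element names are unique.

```python
from dataclasses import dataclass, field


@dataclass
class Element:
    name: str
    children: list = field(default_factory=list)


def subtree(element):
    """Names of `element` and everything below it, in document order."""
    names = [element.name]
    for child in element.children:
        names.extend(subtree(child))
    return names


def scope_all_children(owner):
    # Reading 1: everything below the owning element, regardless of position.
    return subtree(owner)[1:]


def scope_following_siblings(owner, position):
    # Reading 2: children at or after the definition's position, with their content.
    names = []
    for child in owner.children[position:]:
        names.extend(subtree(child))
    return names


def scope_following_elements(root, owner, position):
    # Reading 3: everything after the definition in document order,
    # ignoring hierarchy. Skip the owner itself plus the children before it.
    order = subtree(root)
    preceding = 1 + sum(len(subtree(c)) for c in owner.children[:position])
    return order[order.index(owner.name) + preceding:]


# Hypothetical tree: the definition belongs to "record" and sits
# between its children "x" and "y" (child index 1).
root = Element("root", [
    Element("record", [Element("x"), Element("y", [Element("y1")]), Element("z")]),
    Element("trailer"),
])
record = root.children[0]

print(scope_all_children(record))                 # ['x', 'y', 'y1', 'z']
print(scope_following_siblings(record, 1))        # ['y', 'y1', 'z']
print(scope_following_elements(root, record, 1))  # ['y', 'y1', 'z', 'trailer']
```

Even on this tiny tree the three readings disagree about "x" and "trailer", which is why I think the standard needs to pick one explicitly.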

- The value type for representation properties is listed in conceptual
terms, but shouldn't all properties actually accept one or more specific
XML Schema (within the DFDL subset) types (usually atomic)? Making this
explicit would remove confusion on the part of implementers. For
example, several are defined as 'Enum' - though the value may logically
be an enumerated type, the actual atomic type is something else like
xs:string or xs:token - with additional validation to ensure it's one of
the allowed enumerated values. The normative standard is first and
foremost a reference for implementations and as such should be totally
unambiguous with regard to typing information.
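
As an illustration of what explicit typing buys an implementer, here is a minimal Python sketch that treats an 'Enum' property as an xs:token plus a membership check. The property name and its allowed values are hypothetical examples, not taken from the standard.

```python
import re

# Hypothetical property values, for illustration only.
BYTE_ORDER_VALUES = frozenset({"bigEndian", "littleEndian"})


def parse_enum_property(name, lexical, allowed):
    """Validate an enumerated property given as an xs:token lexical value."""
    # xs:token whitespace handling: tab, LF and CR become spaces, runs of
    # spaces collapse to one, and leading/trailing spaces are removed.
    token = re.sub(r"[ \t\n\r]+", " ", lexical).strip(" ")
    if token not in allowed:
        raise ValueError(f"{name}: {token!r} is not one of {sorted(allowed)}")
    return token
```

With the atomic type stated, an implementation knows that `"  bigEndian\n"` is acceptable (it collapses to `bigEndian`), whereas a bare 'Enum' label leaves that open.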

- There appear to be cases (such as alignment) where multiple types can
be accepted unnecessarily. In the alignment case, a specific xs:string
or a positive integer type is valid. Wouldn't this be easier on both the
DFDL XML Schema and the implementations if, wherever possible, only one
atomic type was accepted? In the alignment case, an xs:nonNegativeInteger
could be used where '0' means 'implicit'.
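
The difference for an implementation can be sketched in Python as follows. Both functions are hypothetical, and both simplify the lexical space to plain ASCII digits (no sign, no surrounding whitespace). `None` stands for implicit alignment.

```python
def parse_alignment_union(lexical):
    """Union typing: the literal 'implicit' or a positive integer."""
    if lexical == "implicit":
        return None
    if lexical.isascii() and lexical.isdigit() and int(lexical) > 0:
        return int(lexical)
    raise ValueError(f"alignment: {lexical!r} is neither 'implicit' nor a positive integer")


def parse_alignment_single(lexical):
    """Single typing: a non-negative integer, where 0 means implicit."""
    if not (lexical.isascii() and lexical.isdigit()):
        raise ValueError(f"alignment: {lexical!r} is not a non-negative integer")
    value = int(lexical)
    return None if value == 0 else value
```

The second form needs only one type in the DFDL XML Schema and one branch in the parser, at the cost of a magic value that has to be documented.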

- With regard to case sensitivity, how is the case equivalency defined
for different character sets? Is this (or should it be) related to XPath
collations? Perhaps instead of an ignore case switch the user should be
allowed to specify a collation for initiator/terminator comparison and
the DFDL standard would require implementations include a
case-insensitive collation for common character sets. This would open
the door to using more general character/string comparison operations
and could be important in certain settings - for example, the XPath
standard has an example that 'v' and 'w' are equivalent in Swedish. This
may have some other advantages - if collations are needed for this kind
of thing, then we could probably support fn:compare and
fn:codepoint-equal in the DFDL XPath subset.
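
To sketch what a collation-based comparison might look like in place of an ignore-case switch, here is a minimal Python example. The collation names are made up (a real design would presumably use collation URIs as XPath does), and the 'toy-swedish' collation is only a stand-in that folds 'w' to 'v'. It is not a real Swedish collation.

```python
import unicodedata


def case_insensitive_key(text):
    # Normalize first so that composed and decomposed forms compare equal.
    return unicodedata.normalize("NFC", text).casefold()


def toy_swedish_key(text):
    # Toy rule only: treat 'w' as equivalent to 'v' on top of case folding.
    return case_insensitive_key(text).replace("w", "v")


# Hypothetical collation registry keyed by made-up names.
COLLATIONS = {
    "codepoint": lambda text: text,
    "case-insensitive": case_insensitive_key,
    "toy-swedish": toy_swedish_key,
}


def matches_initiator(data, initiator, collation="codepoint"):
    """Check whether `data` starts with `initiator` under the given collation."""
    key = COLLATIONS[collation]
    return key(data).startswith(key(initiator))
```

Under these assumptions, `matches_initiator("Wasa", "vasa", "toy-swedish")` is true while the same call with `"case-insensitive"` is false, which is the kind of distinction a plain ignore-case switch cannot express.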

If you've made it this far, congrats :) Hopefully this list will spur
some discussion.

Thanks,

Dave

---
David Glick  |  dglick@dracorp.com  |  703.299.0700 x212
Data Research and Analysis Corp.  |  www.dracorp.com
