All,
I have completed my review of draft 33 of the standard.
I’ve read through the document as a whole several times and spent a
considerable amount of time digesting each of the sections. I’ve included
nearly all my comments in the document (attached) but because there are so
many, and reconciling them with ongoing revisions of the document may be
difficult, I’ve included what I feel are the important points in this
email (I realize there are a lot of points here, but I guess that’s what
happens when someone takes a totally fresh look at things). I would like to note
that these are all just suggestions – I commented wherever I had a
question or concern for completeness’ sake, but certainly don’t expect all
of my feedback to be incorporated – especially since many of the concerns
may have already been discussed and addressed in previous iterations of the
standard. My only goal and motivation is to help make the best standard
possible, and hopefully some of these suggestions will be food for thought.
General
- The document feels overly verbose and explanatory to me.
There are many whole sections and blocks of text that, while very valuable,
don’t really seem appropriate in a normative standards document. The
document should explain “what is,” not necessarily “why it
is.” I understand that there was a previous discussion about whether portions
of the document should be extracted and instead included in a separate
non-normative “DFDL Primer” similar to the way W3C structured the
XML Schema standard. My reaction is that doing so would help clean up the
document. Using technical books as a metaphor, my own feeling is that a
normative standard should be more like “The Definitive UNIX
Reference” and less like “Introduction to UNIX” or even
“Expert-Level UNIX”. I think the standard falls a little too far
into the latter category right now.
- Related to the previous comment, the section that seemed
the most out of place to me was the discussion on the parsing and unparsing
processes and their relationship to grammar and general parsing concepts
(“DFDL Properties Introduction”). Though the discussion was
extremely valuable from the standpoint of a potential implementer and may
actually be the only way to implement the standard, I think it may fall too far
into the “how to implement” category and might be more appropriate
in an appendix (marked as non-normative) or a primer if one is created.
- There are certain sections that seemed a little misplaced
to me. In every case, it became clear over time why the document was organized
the way it was, but some revision may make it easier to digest. Generally, it
seemed that some concepts were “spread out” and not organized under
encompassing umbrella sections. Though the way it’s currently structured
may make the document more componentized, it makes it harder to understand from
a “where do I look for all the information on X” standpoint. The
most obvious example is all the sections dealing with representation properties
such as the list of representation property precedence and the sections on
sequences, choices, etc. My thought is that since all those sections are really
discussing different representation properties and aspects thereof, it
seems reasonable to group them into one overarching section. There are other
areas where I thought the organization could be improved including the
discussion on element vs. attribute vs. short binding forms (seemed misplaced
given that several other non-representation property annotation element
attributes such as setVariableName can use the alternate forms, and also broke
up the flow of annotation element descriptions) and the glossary (I like the
idea of defining specific terms used broadly throughout the document to remove
ambiguity, but a general purpose glossary feels more appropriate as an
appendix). To codify my thoughts, I worked up a TOC that I think exhibits a
more understandable organizational structure. I’ve attached it not
because I want or expect the entire structure of the document to be modified,
but as a “jumping off” point for discussion.
- The standard references RFC 2119 for defining certain
terms such as MUST, SHALL, etc. In most other standards I’ve seen,
emphasis is placed on these terms when their meaning is to be taken from RFC
2119 – I would suggest that DFDL do the same. It also appears as though
the terms aren’t being used throughout the document as regularly as they
could or should be. I would suggest that at some point in the final revision
process we scrub the document for requirements concepts and make sure to use
the appropriate RFC 2119 terms where possible. This should remove ambiguity
about what’s expected from implementations.
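
To make the scrub suggested in the last point a bit more concrete, here is a minimal Python sketch that flags lines where an RFC 2119 keyword appears without emphasis (that is, not in all capitals), so that an editor can decide case by case whether the normative meaning was intended. The keyword list comes from RFC 2119; the function name and the sample text are invented purely for illustration.

```python
import re

# Keywords defined by RFC 2119; two-word phrases come first so that
# "must not" is matched as one keyword rather than as "must".
RFC2119_KEYWORDS = [
    "MUST NOT", "SHALL NOT", "SHOULD NOT",
    "MUST", "REQUIRED", "SHALL", "SHOULD",
    "RECOMMENDED", "MAY", "OPTIONAL",
]

_PATTERN = re.compile(
    r"\b(" + "|".join(k.replace(" ", r"\s+") for k in RFC2119_KEYWORDS) + r")\b",
    re.IGNORECASE,
)

def find_unemphasized_keywords(text):
    """Return (line_number, keyword) pairs where a keyword is not in all caps."""
    hits = []
    for lineno, line in enumerate(text.splitlines(), start=1):
        for match in _PATTERN.finditer(line):
            word = match.group(1)
            if word != word.upper():
                hits.append((lineno, word))
    return hits

# Invented sample text, not taken from the draft.
sample = (
    "An implementation MUST reject an unknown property.\n"
    "A processor should report the error and must not continue.\n"
)
print(find_unemphasized_keywords(sample))
```

A tool like this can only produce candidates: words such as "may" are often used in their ordinary English sense, so each hit would still need a human decision.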
DFDL Information Set
- I’m sure this has already been discussed at length,
but I wonder if it would be possible to define the DFDL Information Set as an
extension to the XPath Data Model (XDM) as XSLT does for its data model. This
would have many advantages. The XDM is compatible with both the XML Schema PSVI
and the XML Information Set (and the XDM standard explicitly explains the
conversion process to and from each). This therefore provides interoperability
with the alternate representations and uses of a DFDL Schema as an XML Schema
and as plain XML content. Additionally, an XDM (or some reasonable facsimile)
will have to be constructed from the DFDL Schema anyway to support the XPath
capabilities of DFDL – basing the DFDL Infoset on XDM to begin with would
ensure seamless (or at least easier) use with XPath libraries and infrastructure.
Also, XDM and DFDL both use the XML Schema type system (after all, DFDL is a
subset of XML Schema) and as such XDM already supports the DFDL types.
- If the above is infeasible or too big of a change for this
late in the process, would it at least be possible to define the DFDL
Information Set in terms of the XML Information Set standard? The DFDL
Information Set already appears to be loosely based on it, and may actually be
compatible (I don’t know) but the relationship is not explicit. Without
such a statement and the satisfaction of the requirements of extending the XML
Information Set as defined in that standard, implementations can’t rely
on the compatibility. If the relationship was made explicit and we ensured that
the DFDL Information Set was indeed compatible with and extended the XML
Information Set, then the XDM needed to process DFDL expressions could be
generated using the Infoset to XDM process described in the XDM standard. If we
went this route, I would also make sure that we maintain compatibility with the
PSVI – that is, we don’t want to introduce concepts or information
set members that conflict with the PSVI. This will make it easier on
implementers because they could potentially reuse the same internal infoset
representation for both the DFDL Information Set and the PSVI during validation
processes.
- I imagine it will take some investigation to determine if either
of these options is possible and compatible with DFDL concepts – I
don’t mind taking on the task if there is interest in modifying the DFDL
infoset. It just seems a shame to me to forgo an opportunity to establish some synergy
with related XML standards.
- In any case, the concept of simple element information
items and complex element information items seems contrary to established
convention. The concept of using character information items (and groupings of
them as explicitly allowed in the XML Information Set standard) to represent
child simple content has already been established through the XML Information
Set standard and other related XML standards. It is especially confusing given
that the same terms as the XML Information Set are used.
- Should everything in the DFDL Information Set have a
corresponding representation in an XML document generated from or used to
generate it (not the DFDL Schema, but the result or input to parsing or
unparsing)? This question occurred to me based on the discussion in the most
recent teleconference about treating and representing comments as separate
kinds of content. It was suggested that the infoset would need to handle
comments in a special way so as to differentiate them from non-commented content.
My concern is that there may not be an appropriate XML representation of such
an infoset item. Creating a special element in the result document would break
the property that the result of DFDL processing can be validated by the DFDL
Schema (because the commented element wouldn’t have been declared in the
original DFDL Schema – it couldn’t be a declared element because
comments can appear anywhere in the source content). The only other option I
can see would be to treat source content identified as comments and indicated
as such in the infoset as XML comments in the result document. This brings up
the interesting complication during unparsing of differentiating between
“real” XML comments (those that should truly be ignored) and
“output” XML comments (those that should be output to the result
stream as commented content). The solution might be to treat all XML comments
in a document used for unparsing as available for output as commented content,
but it seems unreasonable to redefine XML comments in that way. This brings us
back to the original question: if there is no way to adequately describe
commented content in a resultant XML document, does everything in the infoset
need a representation in an XML result document? What are the implications for
upholding the ability to round-trip (if the resultant XML document
doesn’t contain everything in the infoset, and everything in the infoset
is needed to fully describe the source content, then unparsing the resultant
XML document will not result in the original source content)?
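
The round-trip concern in the last point can be seen with an ordinary XML toolkit. The Python sketch below assumes a hypothetical result document in which commented source content was mapped to an XML comment; the element names are invented and nothing here reflects actual DFDL behavior. With default settings the standard-library parser discards comments, so that content would never reach an unparser. When comments are explicitly retained (Python 3.8 or later), the content survives, but nothing in the tree distinguishes a "real" XML comment from an "output" one.

```python
import xml.etree.ElementTree as ET

# Hypothetical parse result: commented source content carried as an XML comment.
result_doc = "<record><!-- source comment --><field>42</field></record>"

# Default parsing: comments are discarded, so the commented content is lost.
plain = ET.fromstring(result_doc)
plain_text = ET.tostring(plain, encoding="unicode")

# Opt-in parsing: comments are kept as nodes in the tree.
parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
kept = ET.fromstring(result_doc, parser=parser)
kept_text = ET.tostring(kept, encoding="unicode")

print(plain_text)
print(kept_text)
```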
Annotation Elements and Representation Properties
- There seem to be inconsistencies throughout the document,
specifically in the descriptions of annotation elements and representation
properties. This is to be expected in a document that’s been under heavy
revision over such a long time span, but an effort will need to be made to
scrub out all inconsistencies before the final version. To this end, I’ve
found creating a table of all annotation elements and their properties helpful.
I’ve attached what I have so far. It has all annotation elements and
their attributes and a notional start for the representation properties. I
intend to complete it as I go and hopefully make sure everything matches up in
the process.
- I’m not sure I understand the value in having the
specialized annotation elements. From the DFDL user/developer perspective it
seems more difficult because they need to recognize additional syntax. For
example, when they see a dfdl:choice annotation element they need to understand
that it’s really a dfdl:format with a subset of allowed representation
properties appropriate to xs:choice elements. They still must refer to the
standard document to find out which representation properties are allowed, and
the alternate syntax doesn’t necessarily help in validation because a
standard dfdl:format annotation element would also have been valid (and the
DFDL XML Schema can’t determine which representation properties are valid
on a dfdl:format based only on usage location). It also makes the document more
confusing because representation properties are referred to as being valid for
specific dfdl:* annotation elements as opposed to the real meaning which is
that they’re valid for dfdl:format elements that annotate specific XML
elements. To put it another way, a representation property that is valid for
dfdl:choice is also only valid for dfdl:format when used as an annotation of
xs:choice or as a short form property on xs:choice elements – but this
isn’t necessarily clear from the property descriptions since they only
refer to the dfdl:* special annotation elements. From an implementation perspective,
it adds complexity because the extra element names must be accepted. The DFDL
parser will still have to validate representation properties and their validity
as applied to the parent schema element regardless of whether the annotation
element is a dfdl:format or a special annotation element. Not to mention,
wouldn’t short form be used most frequently anyway, in which case there
are no annotation elements? In any case, I see very little value for a disproportionate
amount of added complexity and potential confusion and I suggest the concept of
special annotation elements that restrict dfdl:format be removed.
- The standard isn’t totally clear and unambiguous on
the behavior with respect to the dfdl:format selector property. It is mentioned
that the selector is externally identified, but no additional information is
given. Are the selectors implementation specific? If so, does that break
compatibility with alternate DFDL parsers if the selector property is used?
What if a selector is referred to but doesn’t exist in the parser? Is it
a schema definition error or a parsing error (when are external selectors
resolved)? What if there is no “default” dfdl:format block and they
all contain non-matching selectors? Is that a processing error (should be
explicit)?
- When a defined format is put into use, how/when are the
representation properties checked for validity with respect to the schema
element (such as xs:choice) that put the defined format into use? To put it
another way, is it an error (and what kind) if a defined format specifies
representation properties that aren’t valid for the schema element that uses
it? Can a defined format contain the special format annotation elements
(dfdl:sequence, dfdl:choice, etc.)?
- The standard says that a dfdl:defineFormat can contain any
of the other annotation elements. How are the other annotation elements
contained within a dfdl:defineFormat (such as dfdl:assert or dfdl:hidden)
applied when a named format definition is referenced by a dfdl:format ref
attribute? Can named format definitions be referenced anywhere else other than
where a dfdl:format is expected? Would all other annotation elements make sense or
be valid wherever a defined format would be referenced? If not, suggest
explicitly stating what annotation elements are allowed within a
dfdl:defineFormat as opposed to saying any are allowed.
- The descriptions for dfdl:assert and dfdl:discriminator
read very similarly (and probably for good reason) but it’s not clear how
they’re different. If the failure of a dfdl:discriminator results in a
processing error, doesn’t that make it equivalent to an assert? In other
words, how can it be used for control when one of the two possible outcomes
results in an error that (potentially) halts processing? May want to refine the
description of dfdl:discriminator.
- What about the positioning of hidden elements relative to
siblings? Can they appear anywhere within the parent - in which case, is
relative position important? May want to address this one way or the other.
- The properties for dfdl:textNumberFormat are defined in
the representation property section. Granted, they may be representation
properties from the conceptual level, but syntactically, they would appear to
be different. Can the text number format properties be used in a dfdl:format or
dfdl:property element? If not, then suggest treating them more as attributes of
the dfdl:textNumberFormat element and defining them there. If so, then I wonder
what the purpose of the dfdl:textNumberFormat element is…it would seem to
fall into the same category as the other special dfdl:format annotation
elements that restrict the set of valid representation properties. Also seems
to apply to dfdl:defineEscapeScheme and its representation properties.
- The document isn’t clear on how the position of a
variable declaration impacts its scope. Does it apply to all children of the
element to which the definition belongs (regardless of position relative to the
definition), to all siblings following the definition (but not preceding), or
to all elements following the definition (regardless of hierarchy)? I assume
the first, but more clarification would be helpful in order to make it
unambiguous.
- The value type for representation properties is listed in
conceptual terms, but shouldn’t all properties actually accept one or
more specific XML Schema (within the DFDL subset) types (usually atomic)?
Making this explicit would remove confusion on the part of implementers. For
example, several are defined as ‘Enum’ – though the value may
logically be an enumerated type, the actual atomic type is something else like xs:string
or xs:token – with additional validation to ensure it’s one of the allowed
enumerated values. The normative standard is first and foremost a reference for
implementations and as such should be totally unambiguous with regard to typing
information.
- There appear to be cases (such as alignment) where
multiple types can be accepted unnecessarily. In the alignment case, a specific
xs:string or a positive integer type is valid. Wouldn’t this be easier on
both the DFDL XML Schema and the implementations if, wherever possible, only
one atomic type was accepted? In the alignment case, an xs:nonNegativeInteger
could be used where ‘0’ means ‘implicit’.
- With regard to case sensitivity, how is the case
equivalency defined for different character sets? Is this (or should it be)
related to XPath collations? Perhaps instead of an ignore case switch the user
should be allowed to specify a collation for initiator/terminator comparison
and the DFDL standard would require that implementations include a case-insensitive
collation for common character sets. This would open the door to using more
general character/string comparison operations and could be important in
certain settings – for example, the XPath standard has an example that
‘v’ and ‘w’ are equivalent in Swedish. This may have
some other advantages – if collations are needed for this kind of thing,
then we could probably support fn:compare and fn:codepoint-equal in the DFDL
XPath subset.
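
As a small illustration of why "ignore case" in the last point needs a precise definition, the Python sketch below compares strings by lower-casing and by Unicode case folding. The helper name is made up, and this is only one possible definition of caseless matching, not something the draft prescribes. The two approaches already disagree for the German sharp s, and neither captures a language-specific equivalence such as the Swedish v/w example, which is the kind of comparison a collation would be needed for.

```python
import unicodedata

def caseless_equal(a, b):
    """Compare two strings using NFC normalization followed by case folding."""
    def fold(s):
        return unicodedata.normalize("NFC", s).casefold()
    return fold(a) == fold(b)

# ASCII delimiters: lower-casing and case folding agree.
ascii_match = caseless_equal("END", "end")

# German sharp s: case folding maps it to "ss", lower-casing leaves it alone.
lower_match = "STRASSE".lower() == "straße".lower()
folded_match = caseless_equal("STRASSE", "straße")

# A language-specific equivalence is not covered by case folding at all.
swedish_match = caseless_equal("v", "w")

print(ascii_match, lower_match, folded_match, swedish_match)
```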
If you’ve made it this far, congrats :) Hopefully this
list will spur some discussion.
Thanks,
Dave
---
David Glick | dglick@dracorp.com
| 703.299.0700 x212
Data Research and Analysis Corp. | www.dracorp.com