[dfdl-wg] Some things that need discussion

11 Mar 2005

Here is a list of things that are important to resolve for DFDL but which I
don't recall seeing discussed or in a spec.

1) Rules for application of default values. XML Schema has rules for
default value application when handling XML instance documents (they are
different for elements and attributes). These rules could/should apply in
some non-XML circumstances but are not applicable to others. I think we
need to agree the rules that apply to different non-XML circumstances on
both input and output. For example,
- Fixed length data (eg, COBOL) - here the data must be present on input,
default values could be added if missing on output
- When a separator is present - this changes things as a double separator
can occur - but does this indicate missing or empty? Customers use both
semantics, and Schema rules distinguish the two cases.
- When an initiator is present - here missing is observably different from
empty, so we could probably use Schema rules here.
I have a draft proposal I am working on for this which I could share.
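
To make the separator case concrete, here is a rough Python sketch (it is not
the draft proposal, just an illustration). It reads the record "A,,C" under
both customer semantics and then applies the XML Schema element rule, where a
default is applied to an element that is present but empty, and a missing
element stays missing. The function names, the MISSING marker and the "dflt"
value are all made up for the example.

```python
from typing import List, Optional

MISSING = None  # hypothetical marker meaning "field is not present at all"


def parse_separated(record: str, sep: str, empty_is_missing: bool) -> List[Optional[str]]:
    """Split a record on sep; a zero-length field is either empty ("") or MISSING."""
    return [
        MISSING if (field == "" and empty_is_missing) else field
        for field in record.split(sep)
    ]


def apply_element_default(value: Optional[str], default: str) -> Optional[str]:
    """Schema-style element rule: default an empty value, leave a missing one alone."""
    return default if value == "" else value


record = "A,,C"

# Semantics 1: a double separator means the field is present but empty.
as_empty = [apply_element_default(v, "dflt") for v in parse_separated(record, ",", False)]

# Semantics 2: a double separator means the field is missing.
as_missing = [apply_element_default(v, "dflt") for v in parse_separated(record, ",", True)]

print(as_empty)
print(as_missing)
```

The same bitstream gives a defaulted value under one reading and no value
under the other, which is why I think the rule has to be stated explicitly.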

2) Properties - on object or on object inclusion. I didn't see anywhere in
Mike's properties spec that said a property occurred on an
element/attribute per se, or on its use in a structure. Eg, offset is
clearly something that is only applicable to a local element or element
reference; you would not put it on a global element. But some other
properties are perhaps not so clear cut.

3) Mike's properties spec does not impose any restrictions on the use of
different ways of identifying an element in the bitstream. Examples:
- Optionality - are, eg, COBOL fixed length elements allowed to be optional?
If so, how can you identify that one is missing? Here the IBM model mandates
that maxOccurs must appear.
- Unordered content - what should a DFDL parser do when faced with an
xsd:all group? Unless an initiator is present, should xsd:all be treated
as an xsd:sequence when parsing?
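
As a sketch of the unordered content question, the Python below shows one
possible behaviour, not an agreed rule: when every member of the group has an
initiator, members are matched by initiator in whatever order they arrive, and
when there are no initiators the parser falls back to declared order, as if
the group were a sequence. The function name, the member names and the "A="
style initiators are hypothetical, and the initiators are assumed not to be
prefixes of one another.

```python
from typing import Dict, List


def parse_all_group(tokens: List[str], members: List[str],
                    initiators: Dict[str, str]) -> Dict[str, str]:
    """Assign already-split tokens to the members of an unordered group."""
    if not initiators:
        # Nothing in the data identifies a member, so use declared order.
        return dict(zip(members, tokens))

    result: Dict[str, str] = {}
    for token in tokens:
        for name, initiator in initiators.items():
            if token.startswith(initiator):
                result[name] = token[len(initiator):]
                break
        else:
            raise ValueError(f"no initiator matches token {token!r}")
    return result


# With initiators, arrival order does not matter.
tagged = parse_all_group(["B=2", "A=1"], ["a", "b"], {"a": "A=", "b": "B="})

# Without initiators, the group is read as a sequence.
untagged = parse_all_group(["1", "2"], ["a", "b"], {})

print(tagged)
print(untagged)
```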

4) Wildcards and 'self-defining' content - what are the rules that apply
here?  While this might seem unimportant if your starting point is a fixed
file, in the messaging world this is frequently encountered - eg HL7 or X12
users will agree their own private extensions to the standard and add extra
data. We must be able to model and parse/write this.

5) Truncation/omission rules when separators are involved - we have some
extra options in the IBM model and several parsing rules, which we find
necessary to cope with our users' CSV style messages.

I should note at this point that the IBM model has the concept of
'separation type' which is a property of a group. It stipulates that all
members of that group follow a certain pattern - examples are 'fixed
length', 'separated', 'tagged and separated', 'use a regular expression'.
We have found this a convenient way to define rules for default value,
unordered content, wildcard and open content, etc, processing.  These
'separation types' can be considered as specializations of a general case
where the members of a group do not all follow a pattern. Clearly we need
to define rules for the general case, but I think we should also consider
whether such specializations are a useful addition to the DFDL model. I
would observe that the majority of customers' element groups do indeed
follow one pattern or another (eg, COBOL - fixed length, CSV - separated).
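
To show what I mean by a group-level 'separation type', here is a minimal
Python sketch. It is not how the IBM model is implemented; the enum, the
parse_group function and its parameters are invented for illustration, and
the 'use a regular expression' type is left out to keep it short. The point
is only that one property on the group selects the rule applied to all of
its members.

```python
from enum import Enum
from typing import Dict, List, Sequence


class SeparationType(Enum):
    FIXED_LENGTH = "fixed length"
    SEPARATED = "separated"
    TAGGED_SEPARATED = "tagged and separated"


def parse_group(data: str, kind: SeparationType, names: List[str],
                lengths: Sequence[int] = (), separator: str = ",",
                tag_separator: str = "=") -> Dict[str, str]:
    """Parse all members of a group using the single pattern named by kind."""
    if kind is SeparationType.FIXED_LENGTH:
        out: Dict[str, str] = {}
        pos = 0
        for name, length in zip(names, lengths):
            out[name] = data[pos:pos + length]
            pos += length
        return out
    if kind is SeparationType.SEPARATED:
        # Members are identified purely by position between separators.
        return dict(zip(names, data.split(separator)))
    if kind is SeparationType.TAGGED_SEPARATED:
        # Each member carries its own tag, so order is not significant.
        out = {}
        for item in data.split(separator):
            tag, _, value = item.partition(tag_separator)
            out[tag] = value
        return out
    raise ValueError(f"unsupported separation type: {kind}")


names = ["code", "num"]
fixed = parse_group("AB123", SeparationType.FIXED_LENGTH, names, lengths=[2, 3])
separated = parse_group("AB,123", SeparationType.SEPARATED, names)
tagged = parse_group("num=123,code=AB", SeparationType.TAGGED_SEPARATED, names)
print(fixed, separated, tagged)
```

All three calls describe the same logical group; only the group property
changes, which is what makes it a convenient place to hang the default value
and unordered content rules.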

Apologies if any of these have been discussed already, eg at the f-2-f, or
prior to GGF12, and I have missed it.

If it is more convenient we could start up a series of discussion documents
on the forum rather than the usual e-mail chain. I was certainly having
trouble tracking the various mails about multi-dimensional arrays.

Regards, Steve

Steve Hanson
WebSphere Business Integration Brokers,
IBM Hursley, England
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848