I'm going to send this and then duck -
we've discussed the subject of missing-ness and defaulting at considerable
length already. However, I genuinely do have some new information for your
consideration so please hear me out.
I'm seeking the opinion of the working
group on the following questions:
a) can an element reliably be categorised
as 'missing' when separatorPolicy='suppressed'?
b) is it possible for an element to
be 'missing' if it has lengthKind='explicit' and its length is a static,
non-zero value?
c) is it possible for an element to
be 'missing' if it has a discriminator that has already evaluated to 'true'?
For reference, the specification ( v0.42
) says this concerning missing elements:
Definition 'missing element'
On parsing, an element is missing if its
content region in the data stream is empty. The initiator and terminator
regions of a missing element may, or may not, also be empty as controlled
by the dfdl:emptyValueDelimiterPolicy property (simple and complex element),
or dfdl:nilValueDelimiterPolicy property (simple element).
Question a)
Compare the following data streams.
In both cases, assume that
- separator is comma and separatorPosition
is 'infix'
- missingValueDelimiterPolicy is set
to 'none' so a 'missing' value should not have an initiator.
- the initiators are A:, B: and C:
- values are a,b,c.
separatorPolicy='required' : A:a,,C:c
separatorPolicy='suppressed' : A:a,C:c
In the 'required' case, the parser detects
that the initiator is missing, then looks to see whether the content region
is zero-length. It is, so the element is 'missing'.
In the 'suppressed' case, the parser
detects that the initiator is missing, then looks to see whether the content
region is zero-length. It looks for a delimiter at the current position
and finds 'C'. 'C' is not a delimiter, so the content region is not zero-length.
So the parser throws a processing error - "initiator for element B
was not found in the data".
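
To make the two cases concrete, here is a minimal Python sketch of the check as I have described it. It is a toy model, not a real DFDL parser: the function name classify_b, the offset argument and the single-delimiter tuple are all assumptions made for illustration.

```python
# Toy model of the rule described above; not a real DFDL parser.
# Assumption: the infix separator is the only delimiter in scope.
DELIMITERS = (",",)


def classify_b(data: str, pos: int) -> str:
    """Classify element B at offset pos using the rule as currently written."""
    if data.startswith("B:", pos):
        return "present"
    # Initiator absent: the content region counts as empty only if a
    # delimiter (or the end of the data) sits at the current position.
    if pos == len(data) or data.startswith(DELIMITERS, pos):
        return "missing"
    raise ValueError("initiator for element B was not found in the data")
```

In both streams, B is expected at offset 4 (after "A:a" and one separator). With "A:a,,C:c" the function finds a separator there and reports 'missing'. With "A:a,C:c" it finds 'C' and raises the processing error.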
I don't think the 'suppressed' behaviour
is what a user will expect, nor what the WG intended when these rules were
drawn up. The problem is that the parser cannot reliably determine the
length of the content region when separatorPolicy='suppressed'. It
can, however, reliably detect whether the element is present - the initiator
gives a strong hint about that.
Somebody may say "well duh! Of
course the content region is empty if the initiator is not present".
That may be a reasonable rule, but it is not the rule currently given in
the specification. Note that the content region has not been looked at,
so that rule relies on the parser speculatively parsing the element
and then backtracking because the initiator is not found. If we allow that,
then why not allow default values to be applied after other types of processing
error ( even for cases where no initiator was defined )? There are good
reasons for not applying defaults after normal backtracking ( hence
the current rule ) so any such 'missing initiator implies empty content'
rule would have to be made explicit in the specification.
Possible refinements of the rules:
a) IF the length of the content region
cannot reliably be determined ( lengthKind='delimited' and separatorPolicy='suppressed'
) AND emptyValueDelimiterPolicy does not include the initiator AND the
element has an initiator AND the initiator was not found THEN assume that
the content length is zero and treat the element as missing.
or
b) IF (the element has an initiator
AND the initiator was not found ) THEN IF the parent group has initiatedContent='yes'
THEN the element is missing else apply the existing rules.
b) would provide a way to get defaults
applied in situations where the content region's length is either fixed
or undefined. Quite a lot of users might assume this behaviour anyway.
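
The two candidate rules can be written as predicates. This is only a sketch under stated assumptions: the ElementState class and all of its field names are hypothetical, and emptyValueDelimiterPolicy is reduced to a single boolean saying whether it includes the initiator.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ElementState:
    # Hypothetical field names; they mirror the conditions in the text.
    has_initiator: bool
    initiator_found: bool
    length_kind: str = "delimited"
    separator_policy: str = "suppressed"
    evdp_includes_initiator: bool = False
    parent_initiated_content: bool = False


def missing_by_rule_a(e: ElementState) -> bool:
    """Refinement a): treat as missing when the content length is unreliable."""
    length_unreliable = (e.length_kind == "delimited"
                         and e.separator_policy == "suppressed")
    return (length_unreliable
            and not e.evdp_includes_initiator
            and e.has_initiator
            and not e.initiator_found)


def missing_by_rule_b(e: ElementState) -> Optional[bool]:
    """Refinement b): True means missing, None means use the existing rules."""
    if (e.has_initiator and not e.initiator_found
            and e.parent_initiated_content):
        return True
    return None
```

For element B in the 'suppressed' stream above, ElementState(has_initiator=True, initiator_found=False) satisfies rule a). Rule b) only applies once parent_initiated_content is True; otherwise it returns None and the existing rules decide.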
Question b)
A similar situation can arise when lengthKind='explicit'
and the length is fixed ( i.e. is not a DFDL expression ). Unless the missing
field occurs at the end of a known-length structure, the length of the content
region will never be zero. I think a similar rule
is required for this case also:
- IF the length of the content region
is fixed ( lengthKind='explicit' and length is a static, non-zero value
) AND emptyValueDelimiterPolicy does not include the initiator AND the
element has an initiator AND the initiator was not found THEN assume that
the content length is zero and treat the element as missing.
...or apply suggestion b) above.
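
A small Python illustration of why the content region is never empty in the fixed-length case. The sample data, the fixed length of 3 and the function name content_region are all invented for the example; this is a toy model, not how any DFDL processor is implemented.

```python
def content_region(data: str, pos: int, length: int) -> str:
    """Content region of a fixed-length element: the next `length` characters."""
    return data[pos:pos + length]


# Assumed layout: initiators A:, B:, C: and every value exactly 3 characters.
# B has been omitted from this stream, so B would start at offset 5.
mid_stream = content_region("A:aaaC:ccc", 5, 3)

# Same layout, but B is the last element and the data ends after A.
at_end = content_region("A:aaa", 5, 3)
```

In the first call the region swallows the start of C ("C:c"), so it is not zero-length and the existing rule cannot classify B as missing. Only in the second call, where the data runs out, does the region come back empty.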
Question c)
Suppose that an element has a discriminator,
and it has already evaluated to 'true' ( it must have been a backward reference
to some previously-parsed field ). The discriminator has unambiguously
stated that the element *is* present in the data. If it is subsequently
found to have a zero-length content region, should the parser treat it
as 'missing' and attempt to apply a default? I don't think so.
Please tell me that I'm missing something
obvious here - it's starting to sound complicated again.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742