IBM has continued its review of the proposed additions to lengthKind and
occursCountKind to simplify the modelling of MIL-STD-2045 formats. The
email below carries on from an earlier email but has removed everything to
do with bitOrder etc. New stuff is in blue.
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 28/07/2014 11:31 -----
1) Proposed new dfdl:lengthKind 'fixedLengthOrTerminated'.
A new enum implies that it can be used in any scenario, so the following
need to be specified.
dfdl:terminator must be set and can not be empty string or contain ES on
its own
If xs:string or xs:hexBinary, can maxLength facet be used instead of
dfdl:length? (Suggest no - this is variable length data so min/maxLength
are for validation only).
Can dfdl:length be an expression? (Suggest no unless specific use case
identified)
My use case needs only constants as the maximum, hence enum name contains
"fixed" prefix, not "explicit".
Any special rules for emptyValueDelimiterPolicy and
nilValueDelimiterPolicy ?
Since a terminator must be set, then these cannot be "none" or
"initiator".
SMH: Doesn't follow. Today, if I specify a terminator, it must be present,
modulo EVDP/NVDP. So why is the same not true for the new enum? If we add
a new enum, it has to work in a way that is consistent with other
lengthKinds and not just for MIL-STD-2045 use cases.
Use on complex element. Presumably dfdl:length is first used to extract a
'box' but within that box does parser immediately scan for the
dfdl:terminator or does it descend into the complex type and parse the
children, expecting to either consume all the box or to find the
terminator at the end? (Suggest the latter).
I have no use case that requires this for complex types at all.
Perhaps we can dodge this by having it be simpleFixedLengthOrTerminated,
and restricting it to simple types only?
SMH: Perhaps, but that makes this lengthKind enum different from all the
others, and that doesn't seem right.
Use on complex element. Last child can not be dfdl:lengthKind
'endOfParent'.
Scanning rules: Use of this new dfdl:lengthKind switches off any in-scope
stack of terminating markup in force at that point. Put another way, when
we are scanning for the dfdl:terminator, we are not looking for any markup
from an outer scope.
So there's plenty to think about with this new dfdl:lengthKind. A good
rule for deciding whether a new dfdl:lengthKind or dfdl:occursCountKind enum
should be added is whether it bends some other part of the spec out of shape. The
new dfdl:lengthKind looks ok so far.
However we *think* we have come up with an alternative model which is
simpler than the one you state in the document. Example for field 'varstr'
with max length 100:
<xs:sequence dfdl:terminator="{if (fn:str-len(varstr) eq 100) then '%ES;'
else '%DEL'}" ...>
<xs:element name="varstr" type="xs:string"
dfdl:lengthKind="pattern" dfdl:pattern="([^\x7F].\x7F)|(.{100})" ... />
</xs:sequence>
Can't put dfdl:terminator with a self-referencing expression on the
element. Might need fn:exists in the dfdl:terminator expression to handle
optionality. Does that work?
I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that
one is not scanning for delimiters (same restriction as for WSP*). Let's
assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is in the
latest spec draft.
One beauty of your idea here is that unparsing will "just work", so that's
nice.
But I think your pattern has a bug: I think it should be
dfdl:pattern="[^\x7F]{0,99}(?=\x7F)|.{100}"
This will not capture more than 99 characters prior to the DEL, and will
not include the DEL as part of the string in the case where a DEL is found
(uses lookahead in regex). Hence, the DEL will be available to be picked
off as the terminator. Without this you end up with the DEL in the
payload.
With that I think your approach would work. So thanks for that idea.
SMH: Yes my pattern was wrong, thanks for correcting.
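To make the difference between the two patterns concrete, here is a minimal Python sketch. It assumes Python's re module behaves like the DFDL regex engine for these particular constructs (character class, bounded repeat, lookahead), which I have not verified against a DFDL processor. The helper name consumed is hypothetical.

```python
import re

# Patterns as read from the thread, for a field with max length 100.
ORIGINAL = re.compile(r"([^\x7F].\x7F)|(.{100})")
CORRECTED = re.compile(r"[^\x7F]{0,99}(?=\x7F)|.{100}")


def consumed(pattern, data):
    """Return how many characters the pattern takes from the start of data.

    Returns None when the pattern does not match at position 0, which
    stands in for a parse failure of the lengthKind 'pattern' element.
    """
    match = pattern.match(data)
    return None if match is None else match.end()
```

For the input 'ab' followed by DEL, the original pattern consumes 3 characters (so the DEL lands in the payload) while the corrected one consumes 2 and leaves the DEL for the terminator. For 'abc' followed by DEL the original pattern does not match at all, because it only allows exactly two characters before the DEL.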
SMH: Also realised that the dfdl:terminator expression is illegal, as it
looks downwards. The correct DFDL is:
<xs:sequence ...>
  <xs:element name="varstr" type="xs:string"
    dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}" ... />
  <xs:sequence dfdl:terminator="{if (fn:string-length(./varstr) eq 10)
    then '%ES;' else '%DEL;'}" .../>
</xs:sequence>
I have tested this (using {if (fn:string-length(./varstr) eq 10) then
'%WSP*;' else '%DEL;'} as %ES; is not yet allowed in terminator) and it works
for both parse and unparse.
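The parse and unparse behaviour this model is meant to give can be sketched in Python as below. This is only an illustration of the intended semantics for max length 10, not of how a DFDL processor is implemented. The function names parse_varstr and unparse_varstr are hypothetical, and Python's re is again assumed to match the DFDL regex engine for this pattern.

```python
import re

MAX_LEN = 10
DEL = "\x7f"
PATTERN = re.compile(r"[^\x7F]{0,9}(?=\x7F)|.{10}")


def parse_varstr(data):
    """Return (value, remaining data) for one 'varstr' field."""
    match = PATTERN.match(data)
    if match is None:
        raise ValueError("varstr: pattern did not match")
    value = match.group(0)
    rest = data[match.end():]
    if len(value) == MAX_LEN:
        # Terminator evaluates to %ES;, so nothing more is consumed.
        return value, rest
    # Terminator evaluates to %DEL;. The lookahead already guarantees the
    # DEL is there, so this check is purely defensive.
    if not rest.startswith(DEL):
        raise ValueError("varstr: expected DEL terminator")
    return value, rest[len(DEL):]


def unparse_varstr(value):
    """Write the value, adding DEL only when it is shorter than MAX_LEN."""
    if len(value) > MAX_LEN or DEL in value:
        raise ValueError("varstr: value cannot be represented")
    return value if len(value) == MAX_LEN else value + DEL
```

A short value such as 'abc' round-trips as 'abc' plus DEL, while a full 10-character value is written with no terminator at all, which is the case that needs %ES; to be legal in dfdl:terminator.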
It was noted that if the terminator expression was allowed to refer to the
value of its own element then this could be simplified to:
<xs:element name="varstr" type="xs:string"
  dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}"
  dfdl:terminator="{if (fn:string-length(.) eq 10)
  then '%ES;' else '%DEL;'}" .../>
Clearly this relaxation could only occur when lengthKind was not
delimited. (That is, the same condition that we have proposed allowing
%ES; for terminator/separator). But I think it also violates the
known-to-exist rules? Certainly IBM DFDL says it can't find '.' in the
infoset when I tried this. So perhaps this is not a good idea.
2) Proposed new dfdl:occursCountKind 'prefixed'.
The motivation here is to avoid the explosion of global groups needed for
the hidden presence indicators. It was observed that a single global group
could be used if the expression used a predicate when referring to the FPI
element, though obviously that makes the schema very fragile.
At first glance the new enum would appear to be symmetric with lengthKind
'prefixed', but on closer examination this is not true:
Presumably the new enum would apply to optional elements and arrays. It
would have to fit into the grammar thus:
Array = [ [PrefixOccursCount Separator] EnclosedElement
          [ Separator EnclosedElement ]* [ Separator StopValue] ]
PrefixOccursCount = SimpleNormalRep
It would be wrong to couple the prefix more tightly to the first
occurrence (by more tightly I mean like prefix length where the length
occurs after the element's left framing region). When parsing, if the
value is 0 then nothing else is expected in the data - zero occurrences,
so no other DFDL properties are even examined. It must therefore occur
ahead of all occurrences. If it is doing that, then it may as well have
its own left and right framing, hence use of SimpleNormalRep rather than
SimpleContent, and work with delimiters.
However IBM questions the need for the enum as it can also be modelled
using a choice of two sequences which, if you put the discriminator on the
hidden FPI element itself, means you can get away with just two global
groups. And you don't need outputValueCalc as you can just use defaults.
...
<!-- Element unit_name -->
<xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_name" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice>
<!-- Element unit_type -->
<xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_type" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice>
...
<xs:group name="gh_mil_std_2045_FPI_true" >
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="true" ... >
      <dfdl:discriminator test="{. eq fn:true()}" />
    </xs:element>
  </xs:sequence>
</xs:group>
<xs:group name="gh_mil_std_2045_FPI_false" >
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="false" ... >
    </xs:element>
  </xs:sequence>
</xs:group>
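The behaviour of the choice-of-two-sequences model can be sketched in Python. The encoding here is a toy assumption of mine, not the MIL-STD-2045 bit layout: the FPI is a single '1' or '0' character and the optional field is a fixed 4 characters. FIELD_LEN and the function names are hypothetical.

```python
FIELD_LEN = 4  # hypothetical fixed width of the optional field


def parse_optional(data):
    """Return (field value or None, remaining data).

    Mirrors the choice: branch 1 is hidden FPI=true followed by the
    element, branch 2 is hidden FPI=false on its own.
    """
    fpi = data[:1]
    if fpi == "1":
        # The discriminator on FPI has fired, so this branch is committed:
        # a failure on the field is an error, not a cue to try branch 2.
        field = data[1:1 + FIELD_LEN]
        if len(field) != FIELD_LEN:
            raise ValueError("FPI is true but the field is truncated")
        return field, data[1 + FIELD_LEN:]
    if fpi == "0":
        return None, data[1:]
    raise ValueError("neither choice branch matched")


def unparse_optional(value):
    """The FPI is hidden, so it is never supplied by the caller.

    Each branch fills it in from its own default, which is why no
    outputValueCalc is needed in this model.
    """
    if value is None:
        return "0"  # default="false" from the FPI_false group
    if len(value) != FIELD_LEN:
        raise ValueError("field has the wrong length")
    return "1" + value  # default="true" from the FPI_true group
```

The point of the sketch is that the same two hidden groups serve every optional element, since neither group refers to the element it guards.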
3) Proposed new dfdl:occursCountKind 'repeatUntil'.
It seems to IBM that the only practical effect of the new enum
'repeatUntil' is to simplify the discriminator. It doesn't remove it nor
does it remove the need for the hidden FRI element. IBM does not see the
benefit of the new enum in its proposed form. Further...
If the above proposal is used for the FPI, the dfdl:occursIndex() branch
of the discriminator simplifies to fn:true().
The FRI is local to the array element so, when parsing at least, there is
no need for a globally unique group for each array.
That simplifies the discriminator to the following and means you only need
one global group for FRI.
<dfdl:discriminator>
{ if (dfdl:occursIndex() eq 1) then fn:true() else
../<array>[dfdl:occursIndex()-1]/vmdfdl:gh_mil_std_2045_FRI }
</dfdl:discriminator>
For that to work on unparsing there needs to be a generic way to set the
(Boolean) FRI from within the hidden group. Something like
dfdl:outputValueCalc="{dfdl:occursIndex() eq fn:count(..)}"
There is a problem with this though. The property is on the FRI element so
what does dfdl:occursIndex() return? The spec says it returns "the
position of the current item within an array" but also says "this function
may be used on non-array elements". I'm not clear what it would return for
the latter case - does it return 1 or does it look back to its parent or
... ? Here we want the index of the parent. Perhaps this function needs to
take an argument to be unambiguous, eg, . or .. or ../.., ie, it can only
refer back up to the root. (In fact this problem applies whether or not
there is a single FRI or one per array).
A counter proposal...
One way to really simplify this type of occurrence indicator is to
consider it as part of the element, in the same way as a length prefix.
This tight binding makes sense here, because there is an indicator per
occurrence.
dfdl:occursCountKind="stopIndicator"
dfdl:occursStopIndicatorType="<type>"
The stop indicator type must be derived from xs:boolean. True means the
occurrence is the last. False means it is not. (Or we can do it the other
way round.) The DFDL Boolean properties of the type can always be used to
compensate. The parser would work a bit like it does for 'stopValue' - it
keeps parsing speculatively until it finds an occurrence which indicates
the end of the array - the difference being that in this case it is added
to the infoset. The oddity about this is that it applies to arrays only
and does not work with optional elements, so it can not be used with
minOccurs = '0'.
Grammar becomes:
SimpleNormalRep = LeftFraming StopIndicator PrefixLength SimpleContent
                  RightFraming
ComplexNormalRep = LeftFraming StopIndicator PrefixLength ComplexContent
                   ElementUnused RightFraming
StopIndicator = SimpleContent
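A rough Python sketch of how the proposed 'stopIndicator' kind would behave, under a toy encoding that is my assumption only: each occurrence is a one-character indicator ('1' means last, '0' means more follow) and then a fixed 3-character value. Framing and PrefixLength are left out. ITEM_LEN and the function names are hypothetical.

```python
ITEM_LEN = 3  # hypothetical fixed width of each occurrence's content


def parse_array(data):
    """Return (list of occurrences, remaining data).

    Keeps parsing occurrences until one carries a true stop indicator.
    Unlike 'stopValue', that final occurrence is kept in the result.
    """
    items = []
    pos = 0
    while True:
        indicator = data[pos:pos + 1]
        if indicator not in ("0", "1"):
            raise ValueError("missing or invalid stop indicator")
        item = data[pos + 1:pos + 1 + ITEM_LEN]
        if len(item) != ITEM_LEN:
            raise ValueError("occurrence is truncated")
        items.append(item)
        pos += 1 + ITEM_LEN
        if indicator == "1":
            return items, data[pos:]


def unparse_array(items):
    """Write each occurrence, setting the indicator only on the last one."""
    if not items:
        # There is no indicator to carry "zero occurrences", which is why
        # this kind cannot be used with minOccurs = '0'.
        raise ValueError("at least one occurrence is required")
    last = len(items) - 1
    return "".join(("1" if i == last else "0") + item
                   for i, item in enumerate(items))
```

Because the indicator travels with each occurrence, the unparser can derive it from the position in the array, with no hidden FRI element and no discriminator.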