IBM has continued its review of the proposed
additions to lengthKind and occursCountKind to simplify the modelling of
MIL-STD-2045 formats. The email below carries on from an earlier
email but has removed everything to do with bitOrder etc. New
stuff is in blue.
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 28/07/2014 11:31 -----
1) Proposed new dfdl:lengthKind 'fixedLengthOrTerminated'.
A new enum implies that it can be used in any scenario, so the following
need to be specified.
- dfdl:terminator must be set and cannot be the empty string or contain
%ES; on its own.
- If xs:string or xs:hexBinary, can the maxLength facet be used instead of
dfdl:length? (Suggest no - this is variable length data so min/maxLength
are for validation only.)
- Can dfdl:length be an expression? (Suggest no unless a specific use case
is identified.)
My use case needs only constants as the maximum, hence the enum name
contains the "fixed" prefix, not "explicit".
- Any special rules for emptyValueDelimiterPolicy and
nilValueDelimiterPolicy?
Since a terminator must be set, these cannot be "none" or "initiator".
SMH: Doesn't follow. Today, if I specify a terminator,
it must be present, modulo EVDP/NVDP. So why is the same not true for the
new enum? If we add a new enum, it has to work in a way that is consistent
with other lengthKinds and not just for MIL-STD-2045 use cases.
- Use on complex element. Presumably dfdl:length is first used to extract
a 'box', but within that box does the parser immediately scan for the
dfdl:terminator, or does it descend into the complex type and parse the
children, expecting either to consume all of the box or to find the
terminator at the end? (Suggest the latter.)
I have no use case that requires this for complex types at all. Perhaps we
can dodge this by having it be simpleFixedLengthOrTerminated, and
restricting it to simple types only?
SMH: Perhaps, but that makes this lengthKind
enum different from all the others, and that doesn't seem right.
- Use on complex element. The last child cannot be dfdl:lengthKind
'endOfParent'.
- Scanning rules: Use of this new dfdl:lengthKind
switches off any in-scope stack of terminating markup in force at that
point. Put another way, when we are scanning for the dfdl:terminator, we
are not looking for any markup from an outer scope.
So there's plenty to think about with this new dfdl:lengthKind. A good rule
for deciding whether a new dfdl:lengthKind or dfdl:occursCountKind should be
added is whether it bends some other part of the spec out of shape. The new
dfdl:lengthKind looks ok so far. However we *think* we have come up with an
alternative model which is simpler than the one you state in the document.
Example for field 'varstr' with max length 100:
<xs:sequence dfdl:terminator="{if (fn:str-len(varstr) eq 100) then
'%ES;' else '%DEL'}" ...>
<xs:element name="varstr" type="xs:string"
dfdl:lengthKind="pattern" dfdl:pattern="([^\x7F].\x7F)|(.{100})"
... />
</xs:sequence>
Can't put dfdl:terminator with a self-referencing expression on the element.
Might need fn:exists in the dfdl:terminator expression to handle optionality.
Does that work?
I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that
one is not scanning for delimiters (same restriction as for WSP*). Let's
assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is in the
latest spec draft.
One beauty of your idea here is that unparsing will "just
work", so that's nice.
But I think your pattern has a bug: I think it should be
dfdl:pattern="[^\x7F]{0,99}(?=\x7F)|.{100}"
This will not capture more than 99 characters prior to
the DEL, and will not include the DEL as part of the string in the case
where a DEL is found (uses lookahead in regex). Hence, the DEL will be
available to be picked off as the terminator. Without this you end up with
the DEL in the payload.
With that I think your approach would work. So thanks for that idea.
SMH: Yes my pattern was wrong, thanks for correcting.
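The difference between the two patterns can be checked outside DFDL. The following is only a sketch using Python's re module as a stand-in for the DFDL regex engine (an assumption; the two dialects agree on the constructs used here). The helper name extract and the sample data are made up for illustration.

```python
import re

DEL = "\x7f"

# Patterns transcribed from the thread (maximum length 100).
ORIGINAL = re.compile(r"([^\x7F].\x7F)|(.{100})")
CORRECTED = re.compile(r"[^\x7F]{0,99}(?=\x7F)|.{100}")


def extract(pattern, data):
    """Return (value, rest) for a match anchored at the start of data, else None."""
    m = pattern.match(data)
    if m is None:
        return None
    return m.group(0), data[m.end():]


short = "ABC" + DEL + "next"   # short value followed by DEL
full = "X" * 100 + "next"      # maximum-length value, no DEL

# Corrected pattern: the lookahead leaves DEL in the stream for the terminator.
short_result = extract(CORRECTED, short)
# Corrected pattern: a full-length value is taken by the second alternative.
full_result = extract(CORRECTED, full)
# Original pattern: only matches exactly two characters before DEL ...
original_short = extract(ORIGINAL, short)
# ... and when it does match, DEL ends up inside the value.
original_two = extract(ORIGINAL, "AB" + DEL + "next")
```

With the corrected pattern the short value comes back without the DEL, which is what lets the terminator pick it up afterwards.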
SMH: Also realised that the dfdl:terminator expression is illegal, as it
looks downwards. The correct DFDL is:
<xs:sequence ...>
  <xs:element name="varstr" type="xs:string"
      dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}"
      ... />
  <xs:sequence dfdl:terminator="{if (fn:string-length(./varstr) eq 10) then '%ES;' else '%DEL;'}" .../>
</xs:sequence>
I have tested this (using
{if (fn:string-length(./varstr) eq 10) then '%WSP*;' else '%DEL;'} as %ES;
not yet allowed in terminator) and it works ok both parse and unparse.
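To make the intended parse and unparse behaviour concrete, here is a rough Python simulation of the pattern-plus-conditional-terminator model with the same test maximum of 10. It is not how any DFDL processor is implemented; parse_varstr, unparse_varstr and MAX_LEN are names invented for this sketch, and Python's re module again stands in for the DFDL regex engine.

```python
import re

DEL = "\x7f"
MAX_LEN = 10  # same test maximum as the schema above

# Same shape as the dfdl:pattern in the thread, built from MAX_LEN.
PATTERN = re.compile(r"[^\x7F]{0,%d}(?=\x7F)|.{%d}" % (MAX_LEN - 1, MAX_LEN))


def parse_varstr(data, pos=0):
    """Return (value, new_pos); a DEL is consumed only after a short value."""
    m = PATTERN.match(data, pos)
    if m is None:
        raise ValueError("no DEL found and fewer than %d characters left" % MAX_LEN)
    value, pos = m.group(0), m.end()
    if len(value) < MAX_LEN:
        # Terminator expression yields DEL; the lookahead guarantees it is next.
        pos += 1
    # Otherwise the terminator is the empty string and nothing more is consumed.
    return value, pos


def unparse_varstr(value):
    """Emit the value, followed by DEL only when it is shorter than MAX_LEN."""
    if len(value) > MAX_LEN or DEL in value:
        raise ValueError("value too long or contains DEL")
    return value if len(value) == MAX_LEN else value + DEL
```

For example, parse_varstr(unparse_varstr("HELLO") + "tail") gives the value "HELLO" and leaves the position at the start of "tail", while a 10-character value round-trips with no terminator written at all.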
It was noted that if the terminator expression was allowed to refer to the
value of its own element then this could be simplified to:
<xs:element name="varstr" type="xs:string" dfdl:lengthKind="pattern"
    dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}"
    dfdl:terminator="{if (fn:string-length(.) eq 10) then '%ES;' else '%DEL;'}" .../>
Clearly this relaxation could only occur when lengthKind was not delimited.
(That is, the same condition under which we have proposed allowing %ES; for
terminator/separator.) But I think it also violates the known-to-exist
rules? Certainly IBM DFDL says it can't find '.' in the infoset when I tried
this. So perhaps this is not a good idea.
2) Proposed new dfdl:occursCountKind 'prefixed'.
The motivation here is to
avoid the explosion of global groups needed for the hidden presence indicators.
It was observed that a single global group could be used if the expression
used a predicate when referring to the FPI element, though obviously that
makes the schema very fragile.
At first glance the new enum
would appear to be symmetric with lengthKind 'prefixed', but on closer
examination this is not true:
- Presumably the new enum would
apply to optional elements and arrays. It would have to fit into
the grammar thus:
Array = [ [PrefixOccursCount Separator] EnclosedElement [ Separator EnclosedElement ]* [ Separator StopValue] ]
PrefixOccursCount = SimpleNormalRep
It would be wrong to couple
the prefix more tightly to the first occurrence (by more tightly I mean
like prefix length where the length occurs after the element's left
framing region). When parsing, if the value is 0 then nothing else
is expected in the data - zero occurrences, so no other DFDL properties
are even examined. It must therefore occur ahead of all occurrences. If
it is doing that, then it may as well have its own left and right framing,
hence use of SimpleNormalRep rather than SimpleContent, and work with delimiters.
However IBM questions the
need for the enum as it can also be modelled using a choice of two sequences
which, if you put the discriminator on the hidden FPI element itself, means
you can get away with just two global groups. And you don't need
outputValueCalc as you can just use defaults.
...
<!-- Element unit_name -->
<xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_name" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice>
<!-- Element unit_type -->
<xs:choice>
  <xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
    <xs:element name="unit_type" type="..." ... />
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
</xs:choice>
...
<xs:group name="gh_mil_std_2045_FPI_true">
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="true" ... >
      <dfdl:discriminator test="{. eq fn:true()}" />
    </xs:element>
  </xs:sequence>
</xs:group>
<xs:group name="gh_mil_std_2045_FPI_false">
  <xs:sequence>
    <xs:element name="FPI" type="xs:boolean" default="false" ... >
    </xs:element>
  </xs:sequence>
</xs:group>
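The effect of the two-branch choice can be sketched in Python as below. This is an illustration only: the FPI is modelled as a single character '1' or '0' rather than the real bit-level representation, and parse_fpi, parse_optional, unparse_optional and parse_fixed4 are hypothetical helpers, not part of any DFDL implementation.

```python
def parse_fpi(data, pos):
    """Parse a one-character presence indicator: '1' is true, '0' is false."""
    flag = data[pos]
    if flag not in ("0", "1"):
        raise ValueError("bad FPI %r at offset %d" % (flag, pos))
    return flag == "1", pos + 1


def parse_optional(data, pos, parse_field):
    """Mimic the choice: branch 1 needs FPI true, branch 2 is FPI only."""
    fpi, after_fpi = parse_fpi(data, pos)
    if fpi:
        # Branch 1: the discriminator on the hidden FPI holds, so the
        # element must follow.
        return parse_field(data, after_fpi)
    # Branch 2: only the hidden FPI is present; the element is absent.
    return None, after_fpi


def unparse_optional(value, unparse_field):
    """The hidden FPI takes its default from whichever branch is chosen."""
    if value is not None:
        return "1" + unparse_field(value)  # default="true" in branch 1
    return "0"                             # default="false" in branch 2


def parse_fixed4(data, pos):
    """Stand-in field parser for the demo: a fixed 4-character string."""
    return data[pos:pos + 4], pos + 4
```

On unparse no calculation is needed: which branch the infoset selects already determines the FPI value, which is the point about defaults replacing outputValueCalc.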
3) Proposed new dfdl:occursCountKind 'repeatUntil'.
It seems to IBM that the only practical
effect of the new enum 'repeatUntil' is to simplify the discriminator.
It doesn't remove it nor does it remove the need for the hidden FRI element.
IBM does not see the benefit of the new enum in its proposed form. Further...
- If the above proposal is
used for the FPI, the dfdl:occursIndex() branch of the discriminator simplifies
to fn:true().
- The FRI is local to the array
element so, when parsing at least, there is no need for a globally unique
group for each array.
That simplifies the discriminator to the following and means you only need
one global group for FRI.
<dfdl:discriminator>
{ if (dfdl:occursIndex() eq 1) then fn:true() else ../<array>[dfdl:occursIndex()-1]/vmdfdl:gh_mil_std_2045_FRI }
</dfdl:discriminator>
For that to work on unparsing there
needs to be a generic way to set the (Boolean) FRI from within the hidden
group. Something like
dfdl:outputValueCalc="{dfdl:occursIndex() eq fn:count(..)}"
There is a problem with this though.
The property is on the FRI element so what does dfdl:occursIndex() return?
The spec says it returns "the position of the current item within
an array" but also says "this function may be used on non-array
elements". I'm not clear what it would return for the latter case
- does it return 1 or does it look back to its parent or ... ? Here
we want the index of the parent. Perhaps this function needs to
take an argument to be unambiguous, eg, . or .. or ../.., ie, it can only
refer back up to the root. (In fact this problem applies whether
or not there is a single FRI or one per array).
A counter proposal...
One way to really simplify
this type of occurrence indicator is to consider it as part of the element,
in the same way as a length prefix. This tight binding makes sense here,
because there is an indicator per occurrence.
dfdl:occursCountKind="stopIndicator" dfdl:occursStopIndicatorType="<type>"
The stop indicator type must be derived from xs:boolean. True means the
occurrence is the last. False means it is not. (Or we can do it the other
way round.) The DFDL Boolean properties of the type can always be used to
compensate. The parser would work a bit like it does for 'stopValue' - it
keeps parsing speculatively until it finds an occurrence which indicates
the end of the array - the difference being that in this case it is added
to the infoset. The oddity about this is that it applies to arrays only and
does not work with optional elements, so it cannot be used with
minOccurs = '0'.
Grammar becomes:
SimpleNormalRep = LeftFraming StopIndicator PrefixLength SimpleContent RightFraming
ComplexNormalRep = LeftFraming StopIndicator PrefixLength ComplexContent ElementUnused RightFraming
StopIndicator = SimpleContent
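As a rough illustration of how the proposed 'stopIndicator' would behave, the Python sketch below reads a per-occurrence indicator ahead of each occurrence's content. The representation is assumed for the example only: a one-character indicator where '1' means last and '0' means more follow, and fixed-width content. parse_array and unparse_array are made-up names.

```python
def parse_array(data, pos, width):
    """Read occurrences until one is flagged as last; return (items, new_pos)."""
    items = []
    while True:
        if pos + 1 + width > len(data):
            raise ValueError("data ended before an occurrence flagged as last")
        flag = data[pos]
        if flag not in ("0", "1"):
            raise ValueError("bad stop indicator %r at offset %d" % (flag, pos))
        # Unlike 'stopValue', the flagged occurrence is kept as a real item.
        items.append(data[pos + 1:pos + 1 + width])
        pos += 1 + width
        if flag == "1":
            return items, pos


def unparse_array(items):
    """Write each occurrence with its indicator; only the final one gets '1'."""
    if not items:
        # At least one occurrence is needed to carry the indicator,
        # which is why minOccurs = '0' cannot be supported.
        raise ValueError("at least one occurrence is required")
    out = []
    for index, item in enumerate(items, start=1):
        out.append(("1" if index == len(items) else "0") + item)
    return "".join(out)
```

The unparse side sets the indicator by comparing the occurrence index with the occurrence count, which is the same comparison as the outputValueCalc suggested earlier, but done by the processor rather than by a schema expression.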