IBM has continued its review of the proposed additions to lengthKind and occursCountKind to simplify the modelling of MIL-STD-2045 formats.  The email below carries on from an earlier email but has removed everything to do with bitOrder etc. New stuff is in blue.

Regards
 
Steve Hanson
Architect,
IBM DFDL
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 28/07/2014 11:31 -----


1) Proposed new dfdl:lengthKind 'fixedLengthOrTerminated'.  


A new enum implies that it can be used in any scenario, so the following need to be specified.

My use case needs only constants as the maximum, hence enum name contains "fixed" prefix, not "explicit". Since a terminator must be set, then these cannot be "none" or "initiator". 
SMH: Doesn't follow. Today, if I specify a terminator, it must be present, modulo EVDP/NVDP. So why is the same not true for the new enum? If we add a new enum, it has to work in a way that is consistent with other lengthKinds and not just for MIL-STD-2045 use cases. I have no use case that requires this for complex types at all.
Perhaps we can dodge this by calling it simpleFixedLengthOrTerminated and restricting it to simple types only?

SMH: Perhaps, but that makes this lengthKind enum different from all the others, and that doesn't seem right. So there's plenty to think about with this new dfdl:lengthKind. A good rule for deciding whether a new dfdl:lengthKind or dfdl:occursCountKind should be added is whether it bends some other part of the spec out of shape. The new dfdl:lengthKind looks ok so far.

However we *think* we have come up with an alternative model which is simpler than the one you state in the document. Example for field 'varstr' with max length 100:


<xs:sequence dfdl:terminator="{if (fn:str-len(varstr) eq 100) then '%ES;' else '%DEL'}" ...>

        <xs:element name="varstr" type="xs:string" dfdl:lengthKind="pattern" dfdl:pattern="([^\x7F].\x7F)|(.{100})" ... />

</xs:sequence>


Can't put dfdl:terminator with a self-referencing expression on the element. Might need fn:exists in the dfdl:terminator expression to handle optionality. Does that work?


I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that one is not scanning for delimiters (same restriction as for WSP*). Let's assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is in the latest spec draft.

One beauty of your idea here is that unparsing will "just work", so that's nice.

But I think your pattern has a bug: I think it should be dfdl:pattern="[^\x7F]{0,99}(?=\x7F)|.{100}"

This will not capture more than 99 characters prior to the DEL, and will not include the DEL as part of the string in the case where a DEL is found (uses lookahead in regex). Hence, the DEL will be available to be picked off as the terminator. Without this you end up with the DEL in the payload.
With that I think your approach would work. So thanks for that idea.
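
To illustrate the difference between the two patterns, here is a minimal Python sketch. It is only an approximation: it uses Python's re module rather than the regex engine a DFDL processor would use, it scales the maximum length down from 100 to 10 to keep the test data short, and the helper name field is made up for the illustration.

```python
import re

DEL = "\x7f"

# Scaled-down versions of the two patterns (maximum length 10 instead of 100).
ORIGINAL = re.compile(r"([^\x7F].\x7F)|(.{10})")
CORRECTED = re.compile(r"[^\x7F]{0,9}(?=\x7F)|.{10}")


def field(pattern, data):
    """Return the text the pattern claims from the start of data, or None."""
    m = pattern.match(data)
    return m.group(0) if m else None


short = "ab" + DEL + "next"   # value shorter than the maximum, ended by DEL
full = "abcdefghij" + "next"  # value of exactly the maximum length, no DEL

# The original pattern swallows the DEL into the payload.
print(repr(field(ORIGINAL, short)))
# The corrected pattern stops before the DEL thanks to the lookahead.
print(repr(field(CORRECTED, short)))
# With no DEL in the first ten characters, the fixed-length branch applies.
print(repr(field(CORRECTED, full)))
# The original pattern only handles exactly two characters before the DEL.
print(repr(field(ORIGINAL, "abc" + DEL)))
```

In the corrected form the DEL is left in the data, so it is still there to be consumed as the terminator.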

SMH: Yes my pattern was wrong, thanks for correcting.

SMH: Also realised that the dfdl:terminator expression is illegal, as it looks downwards. The correct DFDL is:

<xs:sequence ...>
        <xs:element name="varstr" type="xs:string" dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}" ... />
        <xs:sequence dfdl:terminator="{if (fn:string-length(./varstr) eq 10) then '%ES;' else '%DEL;'}" .../>
</xs:sequence>

I have tested this (using {if (fn:string-length(./varstr) eq 10) then '%WSP*;' else '%DEL;'} as %ES; not yet allowed in terminator) and it works ok both parse and unparse.
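
For anyone who wants to reason about the round trip without a DFDL processor, the behaviour of this model can be approximated in a few lines of Python. This is a sketch under assumptions, not how any DFDL implementation works: the names parse_varstr and unparse_varstr are invented, the maximum length is 10 as in the schema above, and %ES; is modelled as simply consuming nothing.

```python
import re

DEL = "\x7f"
MAX_LEN = 10
# Same pattern as on the varstr element: up to 9 non-DEL characters followed
# by a DEL (not consumed), otherwise exactly 10 characters.
FIELD = re.compile(r"[^\x7F]{0,9}(?=\x7F)|.{10}")


def parse_varstr(data):
    """Return (value, remaining data) for one varstr field."""
    m = FIELD.match(data)
    if m is None:
        raise ValueError("data does not match the varstr pattern")
    value = m.group(0)
    rest = data[m.end():]
    if len(value) < MAX_LEN:
        # Short value: the terminator is DEL and must be consumed.
        if not rest.startswith(DEL):
            raise ValueError("expected DEL terminator")
        rest = rest[1:]
    # Full-length value: the terminator is the empty string, nothing to consume.
    return value, rest


def unparse_varstr(value):
    """Write one varstr field, adding DEL only when the value is short."""
    if len(value) > MAX_LEN or DEL in value:
        raise ValueError("value too long or contains DEL")
    return value if len(value) == MAX_LEN else value + DEL


print(parse_varstr("abc" + DEL + "XYZ"))
print(parse_varstr("abcdefghijXYZ"))
print(repr(unparse_varstr("abc")))
```

The point of the sketch is that the terminator choice depends only on the length of the value already obtained, which is why the expression has to sit on a sequence that follows the element.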

It was noted that if the terminator expression was allowed to refer to the value of its own element then this could be simplified to:

        <xs:element name="varstr" type="xs:string" dfdl:lengthKind="pattern" dfdl:pattern="[^\x7F]{0,9}(?=\x7F)|.{10}"  
                         dfdl:terminator="{if (fn:string-length(.) eq 10) then '%ES;' else '%DEL;'}" .../>

Clearly this relaxation could only occur when lengthKind was not delimited. (That is, the same condition that we have proposed for allowing %ES; in a terminator/separator.) But I think it also violates the known-to-exist rules? Certainly IBM DFDL says it can't find '.' in the infoset when I tried this. So perhaps this is not a good idea.


2) Proposed new dfdl:occursCountKind 'prefixed'.  

The motivation here is to avoid the explosion of global groups needed for the hidden presence indicators. It was observed that a single global group could be used if the expression used a predicate when referring to the FPI element, though obviously that makes the schema very fragile.

At first glance the new enum would appear to be symmetric with lengthKind 'prefixed', but on closer examination this is not true:
        Array = [ [PrefixOccursCount Separator] EnclosedElement [ Separator EnclosedElement ]*  [ Separator StopValue] ]
        PrefixOccursCount = SimpleNormalRep
       
It would be wrong to couple the prefix more tightly to the first occurrence (by more tightly I mean like prefix length where the length occurs after the element's left framing region). When parsing, if the value is 0 then nothing else is expected in the data - zero occurrences, so no other DFDL properties are even examined. It must therefore occur ahead of all occurrences. If it is doing that, then it may as well have its own left and right framing, hence use of SimpleNormalRep rather than SimpleContent, and work with delimiters.
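
As a rough illustration of that ordering, here is a small Python sketch of a parser for a prefixed occurs count. Everything concrete in it is an assumption made for the example: a one-digit text count, a comma separator, fixed-length two-character occurrences, and the function name parse_prefixed_array. The only behaviour taken from the text above is that the count comes ahead of all occurrences and that a count of 0 means nothing further is consumed.

```python
def parse_prefixed_array(data, sep=",", item_len=2):
    """Parse '<count>[<sep><item>]...' and return (items, remaining data).

    Assumed layout (hypothetical): one decimal digit for the count, then each
    occurrence preceded by the separator and exactly item_len characters long.
    """
    count = int(data[0])  # the prefix is read before any occurrence
    pos = 1
    items = []
    for _ in range(count):
        if data[pos:pos + len(sep)] != sep:
            raise ValueError("expected separator at offset %d" % pos)
        pos += len(sep)
        item = data[pos:pos + item_len]
        if len(item) != item_len:
            raise ValueError("truncated occurrence at offset %d" % pos)
        items.append(item)
        pos += item_len
    # With a count of 0 the loop never runs, so only the prefix is consumed.
    return items, data[pos:]


print(parse_prefixed_array("3,aa,bb,ccREST"))
print(parse_prefixed_array("0REST"))
```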

However IBM questions the need for the enum as it can also be modelled using a choice of two sequences which, if you put the discriminator on the hidden FPI element itself, means you can get away with just two global groups.  And you don't need outputValueCalc as you can just use defaults.
  ...
  <!-- Element unit_name -->
  <xs:choice>
    <xs:sequence>
      <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
      <xs:element name="unit_name" type="..." ... />
    </xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
  </xs:choice>
  <!-- Element unit_type -->
  <xs:choice>
    <xs:sequence>
      <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_true" />
      <xs:element name="unit_type" type="..." ... />
    </xs:sequence>
    <xs:sequence dfdl:hiddenGroupRef="vmdfdl:gh_mil_std_2045_FPI_false" />
  </xs:choice>
  ...
 
  <xs:group name="vmdfdl:gh_mil_std_2045_FPI_true" >
    <xs:sequence>
      <xs:element name="FPI" type="xs:boolean" default="true" ... >
        <dfdl:discriminator test="{. eq fn:true()}" />
      </xs:element>
    </xs:sequence>
  </xs:group>

  <xs:group name="vmdfdl:gh_mil_std_2045_FPI_false" >
    <xs:sequence>
      <xs:element name="FPI" type="xs:boolean" default="false" ... >
      </xs:element>
    </xs:sequence>
  </xs:group>


  3) Proposed new dfdl:occursCountKind 'repeatUntil'.

It seems to IBM that the only practical effect of the new enum 'repeatUntil' is to simplify the discriminator. It doesn't remove it nor does it remove the need for the hidden FRI element. IBM does not see the benefit of the new enum in its proposed form. Further...

That simplifies the discriminator to the following and means you only need one global group for FRI.

<dfdl:discriminator>
        if (dfdl:occursIndex() eq 1) then fn:true() else ../<array>[dfdl:occursIndex()-1]/vmfdfdl:gh_mil_std_2045_FRI
</dfdl:discriminator>

For that to work on unparsing there needs to be a generic way to set the (Boolean) FRI from within the hidden group.  Something like

        dfdl:outputValueCalc="{dfdl:occursIndex() eq fn:count(..)}"

There is a problem with this though. The property is on the FRI element, so what does dfdl:occursIndex() return? The spec says it returns "the position of the current item within an array" but also says "this function may be used on non-array elements". I'm not clear what it would return in the latter case: does it return 1, or does it look back to its parent, or ...? Here we want the index of the parent. Perhaps this function needs to take an argument to be unambiguous, e.g. . or .. or ../.., i.e. it can only refer back up to the root. (In fact this problem applies whether there is a single FRI or one per array.)

A counter proposal...

One way to really simplify this type of occurrence indicator is to consider it as part of the element, in the same way as a length prefix. This tight binding makes sense here, because there is an indicator per occurrence.

        dfdl:occursCountKind="stopIndicator" dfdl:occursStopIndicatorType="<type>"

The stop indicator type must be derived from xs:boolean. True means the occurrence is the last. False means it is not. (Or we can do it the other way round.) The DFDL Boolean properties of the type can always be used to compensate. The parser would work a bit like it does for 'stopValue': it keeps parsing speculatively until it finds an occurrence which indicates the end of the array, the difference being that in this case the occurrence is added to the infoset. The oddity about this is that it applies to arrays only and does not work with optional elements, so it cannot be used with minOccurs = '0'.

Grammar becomes:

SimpleNormalRep = LeftFraming StopIndicator PrefixLength SimpleContent RightFraming
ComplexNormalRep = LeftFraming StopIndicator PrefixLength ComplexContent ElementUnused RightFraming

StopIndicator = SimpleContent
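
To make the counter proposal concrete, here is a hedged Python sketch of how parsing and unparsing with a per-occurrence stop indicator might behave. The representation is invented for the example: the indicator is a single character, '1' meaning "this is the last occurrence" and '0' meaning "more follow", and each occurrence's content is a fixed two characters. The function names are hypothetical too. What is taken from the proposal is that the indicator precedes the content of each occurrence, that the final occurrence is kept in the result, and that at least one occurrence must exist.

```python
def parse_stop_indicator_array(data, item_len=2):
    """Parse occurrences of '<indicator><content>' until indicator is '1'.

    Returns (items, remaining data). The occurrence carrying the '1'
    indicator is part of the array, unlike a stopValue.
    """
    items = []
    pos = 0
    while True:
        flag = data[pos:pos + 1]
        if flag not in ("0", "1"):
            raise ValueError("missing or bad stop indicator at offset %d" % pos)
        item = data[pos + 1:pos + 1 + item_len]
        if len(item) != item_len:
            raise ValueError("truncated occurrence at offset %d" % pos)
        items.append(item)
        pos += 1 + item_len
        if flag == "1":
            return items, data[pos:]


def unparse_stop_indicator_array(items):
    """Write the array, setting the indicator to '1' on the last occurrence only."""
    if not items:
        raise ValueError("a stop-indicator array needs at least one occurrence")
    last = len(items) - 1
    return "".join(("1" if i == last else "0") + item
                   for i, item in enumerate(items))


print(parse_stop_indicator_array("0aa0bb1ccREST"))
print(unparse_stop_indicator_array(["aa", "bb", "cc"]))
```

On unparsing the indicator is derived purely from the position of the occurrence, which is the effect the outputValueCalc expression earlier was trying to achieve.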
