[DFDL-WG] output value and length (was Re: Fw: Notes from 2007-09-12 call)

20 Sep 2007

We have many use cases to work out for the output direction.

E.g., consider a string of UTF-8 characters, stored in a box which must be 
N "words" long, i.e., its length will be a multiple of 4 bytes. 

Now suppose we have to store the length of the box, measured in number of 
words, in a field L1. The string is S1.

Some of this stuff might want to be hidden in a real schema, but let's 
ignore that for now. So, one might model this without DFDL as:

<sequence>
   <element name="L1" type="int"  />
   <element name="box">
         <complexType>
              <sequence id="box">
                    <element name="S1" type="string"  />
              </sequence>
          </complexType>
    </element>
</sequence>

So we have the length, a box surrounding the string, and the string S1 
itself.

Now we want to annotate this for input parsing. I'm going to leave off all 
the dfdl:applies properties to save space:

<sequence>
   <element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes" 
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
      />
   <element name="box">
         <complexType>
              <sequence dfdl:length="{ ../L1 * 4 }" 
dfdl:lengthUnits="bytes">
                    <element name="S1" type="string" dfdl:encoding="utf-8" 
dfdl:length="fillAvailableSpace" />
              </sequence>
          </complexType>
    </element>
</sequence>

So far so good. The sequence's length is L1 * 4, and the string fills the 
space in that sequence.

Now we want to annotate it for output/unparse. First we put in 
outputValueCalc on L1. This seems ok.

<sequence>
   <element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes" 
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
     dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
   <element name="box">
         <complexType>
              <sequence dfdl:length="{ ../L1 * 4 }" 
dfdl:lengthUnits="bytes">
                    <element name="S1" type="string" dfdl:encoding="utf-8" 
dfdl:length="fillAvailableSpace" />
              </sequence>
          </complexType>
    </element>
</sequence>

The above however appears to be circularly defined. The length of the 
sequence inside the box element is defined in terms of the value of L1, 
and the output value of L1 is defined in terms of the length of element 
box. So really we need to distinguish input length and output length 
calculations. 
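
To make the circularity concrete, here is a minimal sketch in Python (not 
DFDL) that treats each computed quantity as a node in a dependency graph. 
The node names, including "box.outputLength" and "S1.byteLength", are 
hypothetical labels chosen for illustration and are not property names 
from any spec. It only shows that a single shared length cannot be ordered 
for evaluation, while a separate output-side length can.

```python
from graphlib import TopologicalSorter, CycleError


def evaluation_order(depends_on):
    """Return an order in which the quantities can be computed,
    or None if the definitions are circular."""
    try:
        return list(TopologicalSorter(depends_on).static_order())
    except CycleError:
        return None


# Each key depends on the quantities listed in its set.
# One shared length: L1's output value needs the box length,
# and the box length needs L1's value.
circular = {
    "L1.value": {"box.length"},
    "box.length": {"L1.value"},
}

# Hypothetical split: the output length of the box is computed
# from the string's byte length, not from L1.
with_output_length = {
    "L1.value": {"box.outputLength"},
    "box.outputLength": {"S1.byteLength"},
    "S1.byteLength": set(),
}

print(evaluation_order(circular))            # None
print(evaluation_order(with_output_length))
```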

So it seems we need dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'), 
4) * 4 }" as an additional representation property on the box sequence. 
Notice how we've had to ask for the length to be presented in a particular 
kind of units, and the ceiling-and-multiply trick rounds the size up to a 
multiple of 4. 

<sequence>
   <element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes" 
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
     dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
   <element name="box">
         <complexType>
              <sequence dfdl:length="{ ../L1 * 4 }" 
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{ 
ceiling(S1.length('bytes'), 4) * 4 }">
                    <element name="S1" type="string" dfdl:encoding="utf-8" 
dfdl:length="fillAvailableSpace" />
              </sequence>
          </complexType>
    </element>
</sequence>
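
The rounding arithmetic behind the output length can be checked with a 
small Python sketch. It assumes that the two-argument ceiling in the 
expressions above means "divide by the second argument and round up", 
which is an interpretation of the example and not a defined function. The 
function names here are hypothetical.

```python
def utf8_length(s):
    """Length of the string's UTF-8 representation, in bytes."""
    return len(s.encode("utf-8"))


def words_needed(n_bytes, word_size=4):
    """Number of whole words needed to hold n_bytes (rounds up)."""
    return -(-n_bytes // word_size)


def box_output_length(s, word_size=4):
    """Byte length of the box: the string's byte length rounded up
    to a multiple of word_size."""
    return words_needed(utf8_length(s), word_size) * word_size


# "hello" is 5 bytes, so the box is 8 bytes and L1 would be 2.
print(box_output_length("hello"), box_output_length("hello") // 4)
```

Under this reading, a 5-byte string gives an 8-byte box, which leaves 3 
bytes that the string itself does not fill.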

But now we still have an issue, which is that the length of S1 on output 
might need to be enlarged with padding characters because the output 
length of the box is being rounded up to a multiple of 4 bytes.

One idea for how to solve this is to use layers. I.e., we need another 
string S2 because we can't get all the description we need onto just the 
string S1.

<sequence>
   <element name="L1" type="int" dfdl:length="4" dfdl:lengthUnits="bytes" 
dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
     dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
   <element name="box">
         <complexType>
              <sequence dfdl:length="{ ../L1 * 4 }" 
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{ 
ceiling(S1.length('bytes'), 4) * 4 }">
                    <element name="S2" type="string" dfdl:encoding="utf-8" 
dfdl:length="fillAvailableSpace" 
                                        dfdl:outputValueCalc="{ ../../S1 
}"  dfdl:padCharacter=" " />
              </sequence>
          </complexType>
    </element>
   <element name="S1" type="string" dfdl:inputValueCalc="{ ../box/S2 }" />
</sequence>

In the above, S2 is the string that really lives in the representation, 
and S1 is the logical string whose value is computed from it.
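
As a rough illustration of the bytes this layered description is meant to 
produce and consume, here is a Python sketch of the round trip. It is not 
a DFDL processor. It assumes L1 is a signed 32-bit big-endian integer, 
that padding is appended on the right with the space pad character, and 
that parsing hands back the whole box contents (padding included) as the 
string, since the example does not say anything about trimming.

```python
import struct

PAD = b" "   # dfdl:padCharacter in the example
WORD = 4     # bytes per "word"


def unparse(s1):
    """Write L1 (word count) followed by the padded box."""
    payload = s1.encode("utf-8")
    n_words = -(-len(payload) // WORD)          # round up to whole words
    box = payload.ljust(n_words * WORD, PAD)    # this is S2's representation
    return struct.pack(">i", n_words) + box


def parse(data):
    """Read L1, then take L1 * 4 bytes as the string (S2, exposed as S1)."""
    (n_words,) = struct.unpack_from(">i", data, 0)
    box = data[4:4 + n_words * WORD]
    return box.decode("utf-8")


data = unparse("hello")
print(data)          # b'\x00\x00\x00\x02hello   '
print(repr(parse(data)))
```

Note that in this sketch the parsed value carries the trailing pad 
characters, so the round trip is not the identity on S1 unless the 
application trims them.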

Now hiding the rep stuff and making it into a reusable type definition:

<complexType name="wordLengthStringType">
  <sequence>
    <annotation><appinfo><dfdl:hidden>
      <element name="rep">
        <complexType>
          <sequence>
             <element name="L1" type="int" dfdl:length="4" 
dfdl:lengthUnits="bytes" 
 dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
 dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
             <element name="box">
               <complexType>
                  <sequence dfdl:length="{ ../L1 * 4 }" 
dfdl:lengthUnits="bytes" dfdl:outputLengthCalc="{ 
ceiling(../../../S1.length('bytes'), 4) * 4 }">
                    <element name="S2" type="string" dfdl:encoding="utf-8" 
dfdl:length="fillAvailableSpace" 
                                        dfdl:outputValueCalc="{ 
../../../S1 }"  dfdl:padCharacter=" " />
                  </sequence>
                </complexType>
             </element>
          </sequence>
        </complexType>
      </element>
    </dfdl:hidden></appinfo></annotation>
    <element name="S1" type="string" dfdl:inputValueCalc="{ ../rep/box/S2 
}" />
  </sequence>
</complexType>

Now to use it:

<element name="myString" type="wordLengthStringType"/>

The logical expression myString/S1 is the string's value. (We should 
probably rename the element "S1" to "value", so this would be 
myString/value.)

In DFDL v1.0 as currently defined, we do not have any way to make this 
into a "real string type", because we don't provide a way to define a 
complex type as the representation of a simple type. That's ok. We can 
consider that later. 

Conclusion:

It does appear that we need outputLengthCalc, which supports Steve's 
concern that we need input and output variants of many properties. We need 
to distinguish input length and output length. In the above example, 
dfdl:length is the input length, and dfdl:outputLengthCalc is the property 
name I'm using for the output length. 

Perhaps a better naming convention would be:

Use dfdl:length when it's symmetric, and dfdl:inputLength and 
dfdl:outputLength when it's asymmetric.

Logical value comes from the representation when parsing unless 
dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which case 
the logical value comes from that expression.

Representation comes from the logical value when unparsing unless 
dfdl:outputValue is provided (formerly dfdl:outputValueCalc), in which 
case representation comes from that computed value instead.

We also need the expression language to be able to ask what the length of 
the representation of an element is, measured in whatever units we need. 
We may need to be able to ask for the inputLength and the outputLength 
separately.
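
A small Python sketch of what "length in whatever units we need" could 
mean for a UTF-8 string. The function name and the unit names 
"characters", "bytes", and "words" are assumptions for illustration, not 
proposed expression-language syntax.

```python
def representation_length(s, units, encoding="utf-8", word_size=4):
    """Length of the encoded string in the requested units."""
    n_bytes = len(s.encode(encoding))
    if units == "characters":
        return len(s)
    if units == "bytes":
        return n_bytes
    if units == "words":
        return -(-n_bytes // word_size)   # whole words, rounded up
    raise ValueError("unknown units: " + units)


# One non-ASCII character makes the byte count differ from the
# character count.
for units in ("characters", "bytes", "words"):
    print(units, representation_length("h\u00e9llo", units))
```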

Mike Beckerle