We have many use cases to work out for
the output direction.
E.g., consider a string of UTF-8 characters,
stored in a box that must be a whole number N of "words" long, i.e., its length
must be a multiple of 4 bytes.
Now suppose we have to store the length
of the box, measured in words, in a field L1. The string is S1.
Some of this stuff might want to be
hidden in a real schema, but let's ignore that for now. So, one might model
this without DFDL as:
<sequence>
  <element name="L1" type="int" />
  <element name="box">
    <complexType>
      <sequence id="box">
        <element name="S1" type="string" />
      </sequence>
    </complexType>
  </element>
</sequence>
So we have the length, a box surrounding
the string, and the string S1 itself.
Now we want to annotate this for input
parsing. I'm going to leave off all the dfdl:applies properties to save
space:
<sequence>
  <element name="L1" type="int"
           dfdl:length="4" dfdl:lengthUnits="bytes"
           dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger" />
  <element name="box">
    <complexType>
      <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes">
        <element name="S1" type="string"
                 dfdl:encoding="utf-8" dfdl:length="fillAvailableSpace" />
      </sequence>
    </complexType>
  </element>
</sequence>
So far so good. The sequence's length
is L1 * 4, and the string fills the space in that sequence.
Now we want to annotate it for output/unparse.
First we put in outputValueCalc on L1. This seems ok.
<sequence>
  <element name="L1" type="int"
           dfdl:length="4" dfdl:lengthUnits="bytes"
           dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
           dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
  <element name="box">
    <complexType>
      <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes">
        <element name="S1" type="string"
                 dfdl:encoding="utf-8" dfdl:length="fillAvailableSpace" />
      </sequence>
    </complexType>
  </element>
</sequence>
The above however appears to be circularly
defined. The length of the sequence inside the box element is defined in
terms of the value of L1, and the output value of L1 is defined in terms
of the length of element box. So really we need to distinguish input length
and output length calculations.
So it seems we need dfdl:outputLengthCalc="{
ceiling(S1.length('bytes'), 4) * 4 }" as an additional representation property on
the box sequence. Notice how we've had to ask for the length to be
presented in a particular kind of units, and the ceiling-and-multiply trick
rounds the size up to a multiple of 4.
<sequence>
  <element name="L1" type="int"
           dfdl:length="4" dfdl:lengthUnits="bytes"
           dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
           dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
  <element name="box">
    <complexType>
      <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'), 4) * 4 }">
        <element name="S1" type="string"
                 dfdl:encoding="utf-8" dfdl:length="fillAvailableSpace" />
      </sequence>
    </complexType>
  </element>
</sequence>
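To make the rounding concrete, here is a small Python sketch of the arithmetic. It assumes the two-argument ceiling(n, 4) used in the expressions means "n divided by 4, rounded up", which is how the text reads; the function names ceiling and box_lengths are just for illustration and are not part of DFDL.

```python
def ceiling(n, unit):
    """Round-up division: how many unit-sized chunks are needed to hold n."""
    return -(-n // unit)


def box_lengths(s1):
    """Return (value for L1 in words, output length of the box in bytes)."""
    s1_bytes = len(s1.encode("utf-8"))  # stands in for S1.length('bytes')
    words = ceiling(s1_bytes, 4)        # what outputValueCalc would put in L1
    return words, words * 4             # ceiling(...) * 4 rounds up to a multiple of 4


print(box_lengths("héllo"))  # 6 bytes of UTF-8, so 2 words and an 8-byte box
```

Note that "héllo" is 5 characters but 6 bytes, which is why the expression has to ask for the length in bytes.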
But now we still have an issue, which
is that the length of S1 on output might need to be enlarged with padding
characters because the output length of the box is being rounded up to
a multiple of 4 bytes.
One idea for how to solve this is to
use layers. I.e., we need another string S2 because we can't get all the
description we need onto just the string S1.
<sequence>
  <element name="L1" type="int"
           dfdl:length="4" dfdl:lengthUnits="bytes"
           dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
           dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
  <element name="box">
    <complexType>
      <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                dfdl:outputLengthCalc="{ ceiling(../S1.length('bytes'), 4) * 4 }">
        <element name="S2" type="string"
                 dfdl:encoding="utf-8" dfdl:length="fillAvailableSpace"
                 dfdl:outputValueCalc="{ ../../S1 }" dfdl:padCharacter=" " />
      </sequence>
    </complexType>
  </element>
  <element name="S1" type="string" dfdl:inputValueCalc="{ ../box/S2 }" />
</sequence>
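In the above we have S2, which is the string
that really lives in the representation. S1 is now purely the logical value:
it is computed from S2 on input and supplies S2's value on output.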
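A rough Python sketch of what this layered description is meant to do in each direction, under some assumptions: the pad character is a single space, L1 is a 4-byte big-endian signed integer, and nothing trims the padding on input (the schema above doesn't say anything about trimming). The names unparse and parse are made up for the sketch.

```python
import struct

PAD = b" "  # assumed pad character, as in dfdl:padCharacter=" "


def unparse(s1):
    """Output direction: lengths are derived from S1, not from L1."""
    payload = s1.encode("utf-8")
    words = -(-len(payload) // 4)         # output length in words, rounded up
    s2 = payload.ljust(words * 4, PAD)    # S2 is S1 padded out to fill the box
    return struct.pack(">i", words) + s2  # L1, then the box


def parse(data):
    """Input direction: the box length is derived from L1."""
    (words,) = struct.unpack(">i", data[:4])
    s2 = data[4:4 + words * 4]            # S2 fills the available space
    return s2.decode("utf-8")             # S1 takes S2's value; padding is kept


wire = unparse("héllo")
print(wire)         # 4 bytes of L1 followed by an 8-byte box
print(parse(wire))  # the string comes back with two trailing pad spaces
```

The round trip shows the asymmetry: on output the lengths flow from the string to L1, on input from L1 to the string, and the parsed value still carries the padding unless something strips it.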
Now hiding the rep stuff and making
it into a reusable type definition:
<complexType name="wordLengthStringType">
  <sequence>
    <annotation><appinfo><dfdl:hidden>
      <element name="rep">
        <complexType>
          <sequence>
            <element name="L1" type="int"
                     dfdl:length="4" dfdl:lengthUnits="bytes"
                     dfdl:byteOrder="bigEndian" dfdl:representation="binaryInteger"
                     dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
            <element name="box">
              <complexType>
                <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                          dfdl:outputLengthCalc="{ ceiling(../../S1.length('bytes'), 4) * 4 }">
                  <element name="S2" type="string"
                           dfdl:encoding="utf-8" dfdl:length="fillAvailableSpace"
                           dfdl:outputValueCalc="{ ../../../S1 }" dfdl:padCharacter=" " />
                </sequence>
              </complexType>
            </element>
          </sequence>
        </complexType>
      </element>
    </dfdl:hidden></appinfo></annotation>
    <element name="S1" type="string" dfdl:inputValueCalc="{ ../rep/box/S2 }" />
  </sequence>
</complexType>
Now to use it:
<element name="myString" type="wordLengthStringType"/>
The logical expression myString/S1 is the
string's value. (We should probably rename the element "S1" to "value"
so this would be myString/value.)
In DFDL v1.0 as currently defined, we
do not have any way to make this into a "real string type", because
we don't provide a way to define a complex type as the representation of
a simple type. That's ok. We can consider that later.
Conclusion:
It does appear that we need outputLengthCalc,
which amounts to Steve's concern that we need input and output variants
of many properties. We need to distinguish input length and output length.
In the above example, dfdl:length is the input length, and dfdl:outputLengthCalc
is the property name I'm using for the output length.
Perhaps better naming conventions would
be:
Use dfdl:length when it's symmetric, and
dfdl:inputLength and dfdl:outputLength when it's asymmetric.
The logical value comes from the representation
when parsing, unless dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which
case the logical value comes from that expression.
The representation comes from the logical
value when unparsing, unless dfdl:outputValue (formerly dfdl:outputValueCalc) is provided,
in which case the representation comes from that computed value instead.
We also need the expression language
to be able to ask what the length of the representation of an element is,
measured in whatever units we need.
We may need to be able to ask for the
inputLength and the outputLength separately.
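To illustrate why the units matter when asking for a representation length, here is a small Python sketch. The function rep_length and its units names ('characters', 'bytes', 'words') are hypothetical stand-ins for whatever the expression language ends up providing; a word is assumed to be 4 bytes, as in the example above.

```python
def rep_length(value, units, encoding="utf-8"):
    """Length of a string's representation in the requested units (hypothetical)."""
    if units == "characters":
        return len(value)                # counts code points, not bytes
    n_bytes = len(value.encode(encoding))
    if units == "bytes":
        return n_bytes
    if units == "words":                 # assumed 4-byte words, rounded up
        return -(-n_bytes // 4)
    raise ValueError("unknown units: " + units)


for units in ("characters", "bytes", "words"):
    print(units, rep_length("héllo", units))
```

The same string gives three different answers depending on the units, so an expression like the one on L1 can't leave the units implicit.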