
Remember that our call today is at 1pm Eastern time (US and Canada) because of the daylight saving time change in the UK.

----------------------------

We have two topics suggested for discussion today.

1) Separator treatment in the latest circulated proposal-to-simplify-nulls.... doc/memo. Specifically, it is both buggy and unclear. How do we fix it? Who is to fix it?

2) Boxed data, i.e., sequences with a length as a way to describe a box of some size that is a container for the data held in the box. This is in response to a note circulated to this WG before. Excerpts below. Specifically: should we punt on this for v1.0? It is, after all, a kind of layering, and we largely punted on layering for v1.0.

--------------------------

Conclusion: It does appear that we need outputLengthCalc, which is tantamount to Steve's concern that we need input and output variants of many properties.

We need to distinguish input length and output length. In the above example, dfdl:length is the input length, and dfdl:outputLengthCalc is the property name I'm using for an output length.

Perhaps a better naming convention would be: use dfdl:length when it's symmetric, and dfdl:inputLength and dfdl:outputLength when it's asymmetric.

The logical value comes from the representation when parsing unless dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which case the logical value comes from that expression.

The representation comes from the logical value when unparsing unless dfdl:outputValue (formerly dfdl:outputValueCalc) is provided, in which case the representation comes from that computed value instead.

---------------------------

Also, please could you illustrate the issue using a much simpler example than the box array, e.g., a variable-length string whose length is given by a preceding integer field. The length on input is given by the integer; the length on output is given by the actual data. The value of the integer is as supplied on input; its value on output is the length of the string.
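The simpler scenario just described can be sketched outside DFDL in a few lines of Python. This is only an illustration under stated assumptions: a 4-byte big-endian integer prefix counting bytes and UTF-8 string data are both assumed (the note fixes neither), and the function names are hypothetical.

```python
import struct


def parse_prefixed(data: bytes) -> str:
    """Parse: the string's length is taken from the preceding integer."""
    (n,) = struct.unpack_from(">i", data, 0)  # assumed 4-byte big-endian prefix
    if n < 0:
        raise ValueError("negative length prefix")
    body = data[4:4 + n]
    if len(body) != n:
        raise ValueError("not enough data for the declared length")
    return body.decode("utf-8")


def unparse_prefixed(value: str) -> bytes:
    """Unparse: the integer is computed from the actual data."""
    body = value.encode("utf-8")
    return struct.pack(">i", len(body)) + body
```

In this sketch the integer drives the length on input, while on output the data drives the integer, which is the asymmetry under discussion.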
I want to see whether we really need input and output length properties for a more typical scenario.

Personally I think we should drop the use of dfdl:length on a sequence for 1.0, period. That precludes support of box arrays, but I don't have a problem with that as I don't have a real-life use case. (I would say, however, that if the only way to model a box array is as below, then we are asking an awful lot of our audience to be able to create such a model.)

Regards, Steve

Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

Mike Beckerle <beckerle@us.ibm.com>
Sent by: dfdl-wg-bounces@ogf.org
20/09/2007 15:02
To: Alan Powell/UK/IBM@IBMGB
cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Subject: [DFDL-WG] output value and length (was Re: Fw: Notes from 2007-09-12 call)

We have many use cases to work out for the output direction.

E.g., consider a string of utf-8 characters, stored in a box which must be N "words" long, i.e., its length will be a multiple of 4 bytes. Now suppose we have to store the length of the box, measured in number of words, in a field L1. The string is S1.

Some of this stuff might want to be hidden in a real schema, but let's ignore that for now.

So, one might model this without DFDL as:

  <sequence>
    <element name="L1" type="int" />
    <element name="box">
      <complexType>
        <sequence id="box">
          <element name="S1" type="string" />
        </sequence>
      </complexType>
    </element>
  </sequence>

So we have the length, a box surrounding the string, and the string S1 itself.

Now we want to annotate this for input parsing.
I'm going to leave off all the dfdl:applies properties to save space:

  <sequence>
    <element name="L1" type="int"
             dfdl:length="4" dfdl:lengthUnits="bytes"
             dfdl:byteOrder="bigEndian"
             dfdl:representation="binaryInteger" />
    <element name="box">
      <complexType>
        <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes">
          <element name="S1" type="string"
                   dfdl:encoding="utf-8"
                   dfdl:length="fillAvailableSpace" />
        </sequence>
      </complexType>
    </element>
  </sequence>

So far so good. The sequence's length is L1 * 4, and the string fills the space in that sequence.

Now we want to annotate it for output/unparse. First we put outputValueCalc on L1. This seems ok.

  <sequence>
    <element name="L1" type="int"
             dfdl:length="4" dfdl:lengthUnits="bytes"
             dfdl:byteOrder="bigEndian"
             dfdl:representation="binaryInteger"
             dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
    <element name="box">
      <complexType>
        <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes">
          <element name="S1" type="string"
                   dfdl:encoding="utf-8"
                   dfdl:length="fillAvailableSpace" />
        </sequence>
      </complexType>
    </element>
  </sequence>

The above, however, appears to be circularly defined. The length of the sequence inside the box element is defined in terms of the value of L1, and the output value of L1 is defined in terms of the length of element box.

So really we need to distinguish input length and output length calculations. It seems we need

  dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'), 4) * 4 }"

as an additional rep prop on the box sequence. Notice how we've had to ask for the length to be presented in a particular kind of units, and how the ceiling-and-multiply trick rounds the size up to a multiple of 4.
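The rounding arithmetic can be checked with a few lines of Python. This is only an arithmetic sketch: it assumes the two-argument ceiling(n, 4) in the expressions above means "n divided by 4, rounded up", which the message does not spell out, and the helper names are hypothetical.

```python
WORD = 4  # bytes per "word", as in the example


def words_needed(n_bytes: int, word: int = WORD) -> int:
    """Assumed meaning of ceiling(n, 4): n / 4 rounded up to an integer."""
    return -(-n_bytes // word)  # integer ceiling division


def box_output_length(s1_bytes: int, word: int = WORD) -> int:
    """Output length of the box in bytes: round up to a multiple of the word size."""
    return words_needed(s1_bytes, word) * word


if __name__ == "__main__":
    for n in (0, 1, 4, 5, 8, 9):
        print(n, words_needed(n), box_output_length(n))
```

On output the box length is derived from the length of S1 first, and L1 is then derived from the box length, so neither value depends on the other in a cycle.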
  <sequence>
    <element name="L1" type="int"
             dfdl:length="4" dfdl:lengthUnits="bytes"
             dfdl:byteOrder="bigEndian"
             dfdl:representation="binaryInteger"
             dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
    <element name="box">
      <complexType>
        <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                  dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'), 4) * 4 }">
          <element name="S1" type="string"
                   dfdl:encoding="utf-8"
                   dfdl:length="fillAvailableSpace" />
        </sequence>
      </complexType>
    </element>
  </sequence>

But now we still have an issue, which is that the length of S1 on output might need to be enlarged with padding characters, because the output length of the box is being rounded up to a multiple of 4 bytes.

One idea for how to solve this is to use layers. I.e., we need another string S2, because we can't get all the description we need onto just the string S1.

  <sequence>
    <element name="L1" type="int"
             dfdl:length="4" dfdl:lengthUnits="bytes"
             dfdl:byteOrder="bigEndian"
             dfdl:representation="binaryInteger"
             dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
    <element name="box">
      <complexType>
        <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                  dfdl:outputLengthCalc="{ ceiling(S1.length('bytes'), 4) * 4 }">
          <element name="S2" type="string"
                   dfdl:encoding="utf-8"
                   dfdl:length="fillAvailableSpace"
                   dfdl:outputValueCalc="{ ../../S1 }"
                   dfdl:padCharacter=" " />
        </sequence>
      </complexType>
    </element>
    <element name="S1" type="string" dfdl:inputValueCalc="{ ../box/S2 }" />
  </sequence>

In the above we have S2, which is the string that really lives in the representation.
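To make the roles of L1, S2, and S1 concrete, here is a rough Python sketch of the unparse and parse directions for this layout. It is not DFDL and not a proposed implementation: the function names are hypothetical, the space padding mirrors dfdl:padCharacter=" ", and the sketch leaves the padding in place on parse because the message does not say whether it is trimmed.

```python
import struct

WORD = 4    # bytes per word
PAD = b" "  # mirrors dfdl:padCharacter=" "


def unparse_box(s1: str) -> bytes:
    """Unparse: S1 -> padded S2 in the box, with L1 = box length in words."""
    payload = s1.encode("utf-8")
    n_words = -(-len(payload) // WORD)       # round up to whole words
    s2 = payload.ljust(n_words * WORD, PAD)  # pad the box out to a word boundary
    return struct.pack(">i", n_words) + s2


def parse_box(data: bytes) -> str:
    """Parse: L1 gives the box length; S2 fills the box; S1 takes S2's value."""
    (n_words,) = struct.unpack_from(">i", data, 0)
    if n_words < 0:
        raise ValueError("negative word count")
    box_len = n_words * WORD
    s2 = data[4:4 + box_len]
    if len(s2) != box_len:
        raise ValueError("box is shorter than L1 declares")
    return s2.decode("utf-8")  # padding is kept; trimming is left unspecified
```

Note that in this sketch a value such as "hello" comes back from a round trip with its pad characters attached, so whether and where padding is stripped on input remains an open point.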
Now, hiding the rep stuff and making it into a reusable type definition:

  <complexType name="wordLengthStringType">
    <sequence>
      <annotation><appinfo><dfdl:hidden>
        <element name="rep">
          <complexType>
            <sequence>
              <element name="L1" type="int"
                       dfdl:length="4" dfdl:lengthUnits="bytes"
                       dfdl:byteOrder="bigEndian"
                       dfdl:representation="binaryInteger"
                       dfdl:outputValueCalc="{ ceiling(../box.length(), 4) }" />
              <element name="box">
                <complexType>
                  <sequence dfdl:length="{ ../L1 * 4 }" dfdl:lengthUnits="bytes"
                            dfdl:outputLengthCalc="{ ceiling(../../../S1.length('bytes'), 4) * 4 }">
                    <element name="S2" type="string"
                             dfdl:encoding="utf-8"
                             dfdl:length="fillAvailableSpace"
                             dfdl:outputValueCalc="{ ../../../S1 }"
                             dfdl:padCharacter=" " />
                  </sequence>
                </complexType>
              </element>
            </sequence>
          </complexType>
        </element>
      </dfdl:hidden></appinfo></annotation>
      <element name="S1" type="string" dfdl:inputValueCalc="{ ../rep/box/S2 }" />
    </sequence>
  </complexType>

Now to use it:

  <element name="myString" type="wordLengthStringType"/>

The logical expression myString/S1 is the string's value. (We probably should rename the element "S1" to "value" so this would be myString/value.)

In DFDL v1.0 as currently defined, we do not have any way to make this into a "real string type", because we don't provide a way to define a complex type as the representation of a simple type. That's ok. We can consider that later.

Conclusion: It does appear that we need outputLengthCalc, which is tantamount to Steve's concern that we need input and output variants of many properties.

We need to distinguish input length and output length. In the above example, dfdl:length is the input length, and dfdl:outputLengthCalc is the property name I'm using for an output length.

Perhaps a better naming convention would be: use dfdl:length when it's symmetric, and dfdl:inputLength and dfdl:outputLength when it's asymmetric.
The logical value comes from the representation when parsing unless dfdl:inputValue (formerly dfdl:inputValueCalc) is provided, in which case the logical value comes from that expression.

The representation comes from the logical value when unparsing unless dfdl:outputValue (formerly dfdl:outputValueCalc) is provided, in which case the representation comes from that computed value instead.

We also need the expression language to be able to ask what the length of the representation of an element is, measured in whatever units we need. We may need to be able to ask for the inputLength and the outputLength separately.

--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg

Comments on lengths <awp> below </awp>

Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM
email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898

Steve Hanson/UK/IBM@IBMGB
Sent by: dfdl-wg-bounces@ogf.org
19/09/2007 14:30
To: dfdl-wg@ogf.org
cc:
Subject: [DFDL-WG] Fw: Notes from 2007-09-12 call

More on expressions, <smh>below</smh>

Regards, Steve

Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 19/09/2007 14:14 -----

Mike Beckerle <beckerle@us.ibm.com>
19/09/2007 13:43
To: Steve Hanson/UK/IBM@IBMGB
cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Notes from 2007-09-12 call

Comments below in BLUE

Steve Hanson <smh@uk.ibm.com>
Sent by: dfdl-wg-bounces@ogf.org
09/19/2007 06:04 AM
To: dfdl-wg@ogf.org
cc:
Subject: Re: [DFDL-WG] Notes from 2007-09-12 call

Some thoughts since last week's call:

1) Expression language

We've not thought much about how expressions will work on output. It's fine to say something like dfdl:length="..\count+1" when parsing, but what happens on output? I think we should not try to reverse engineer expressions, and should rely on the user to set output fields correctly.
So, taking my example, on output we would assume count had been set by the user, apply the expression to calculate the intended length of the data, then apply padding etc. rules as needed. Can we generalise that philosophy across all our uses of expressions? If we can't, then perhaps that places a bound on the actual uses of expressions that we permit.

Inverting will generally not be possible. Just make the example

  dfdl:length="{ ../count * ../scale + 1 }"

How do you split up the length into count and scale?

<smh>Agree</smh>

In your example, I would expect the count field to have an

  outputValueCalc="{ ../x.length() - 1 }"

(I'm assuming the field with the length calculation formula is named "x".)

<smh>Doesn't outputValueCalc mean that we are deriving the count from the length of the data supplied for x? That forces the user to pad x to the correct value in order to derive count, which is not how we want things to work. We want count to define the length of x, so the DFDL serialiser can pad x according to other DFDL properties. Maybe I'm missing something about input/outputValueCalc?</smh>

<awp> There are multiple cases to consider. Some formats require the length field to be the physical length of a structure, i.e., including padding, code page considerations, etc., which it is impossible for the user to know. For example, the IMS transaction header has LLbbHeaderData. DFDL should fill in these lengths. I would assume that in most cases a field with its length in another field is variable length, but again it may need to be the physical length. I tend to agree with Mike that outputValueCalc can be used to set the length field, but how do we get to ignore the dfdl:length specification on the data field? I hope this doesn't mean that we need to distinguish between logical and physical lengths in the expression language.
</awp>

In general, when something uses something else in its calculation (length or just the value - inputValueCalc), then the inverse is outputValueCalc on the contributing parts.

2) dfdl:length for sequences

We have three cases here:

a) Empty sequence - we agreed to disallow this

b) Non-empty normal sequence - what does the length mean here?

It means the box is potentially larger than the contents. If it isn't at least as big, it's an error. If these lengths are data dependent, it could be a processing error; otherwise it is a schema definition error. Draft 025 discusses this in the part on sequences with length.

We also could disallow this case for now if we want, knowing we could add it back later.

One can always convert this case into the one below by wrapping the sequence's child elements in an array element, with array occurrences determined by a "fillAvailableSpace" policy. If we allow this at all, I think this should be the way we explain its semantics. (Though with the inserted array the paths would all change, which is undesirable - so we would say it works like this, but without the paths being changed...)

c) Non-empty sequence used as box array - the motivating scenario

I think we should also disallow b). If we are disallowing a) on the grounds of not using a sequence with a length to model opaque data, then we should also disallow b).

Regards, Steve

Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
participants (1)
- Mike Beckerle