Re: [DFDL-WG] endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

I agree that "end" or whatever we decide to call it should be reserved for the last object in a sequence. I prefer "endOfParent".

I have a general unease around the lengthKind enum "implicit". It originally meant something quite specific: the length was derived from the underlying xsd. That's now been extended for text decimals to mean derived from the textNumberPattern pattern length, and for a sequence to mean derived from the length of its children. I think we are overloading it.

I think that "implicit" should be reserved for simple elements only, with its current semantic. And we should come up with a new enum, reserved for complex elements or sequences only. I suggest "children" (given I have also suggested "endOfParent") or maybe "content".

Regards
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
28/10/2008 00:48
Please respond to <mbeckerle.dfdl@gmail.com>
To Steve Hanson/UK/IBM@IBMGB
cc Alan Powell/UK/IBM@IBMGB
Subject RE: endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

Not sure where this leaves us. It is ok to reserve lengthKind="end" or "parent" or whatever for the last element of a sequence.

...mike

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2100 | 504 Totten Pond Road, Waltham MA 02451 | mbeckerle.dfdl@gmail.com

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, October 22, 2008 12:56 PM
To: mbeckerle.dfdl@gmail.com
Cc: Alan Powell
Subject: Re: endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

Mike - sorry but I think users will find this baffling. lengthKind="implicit" was intended to mean that the logical xsd provided the length. lengthKind="delimited" means that markup provides the length. We are overloading the word "implicit" and we are wrong to do so. Trying to wrap these together, and include "endOfData" (as "parent") as well, is taking the abstraction too far.
It is not how people view their data.

Regards
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
22/10/2008 15:29
Please respond to <mbeckerle.dfdl@gmail.com>
To Alan Powell/UK/IBM@IBMGB, Steve Hanson/UK/IBM@IBMGB
cc
Subject endOfData - was: RE: FW: MIke's notes from call on 2008-08-13

First: apologies for missing the call today without notice. I've been solid on a rather urgent customer-related matter since before our meeting time and unable to break away.

Now: w.r.t. end of data email from Steve.

In the example you highlight, the reason both children of the sequence have lengthKind="endOfData" is that the parent is providing the way of determining the length, in this case using delimiters. Conceptually, the parser can carve out the box of data bytes for the first child by scanning for the separator, and the box for the 2nd child by scanning for the terminator. Then it can present those finite size boxes to the parser to parse each child, and each child consumes the entire box, i.e., to the end of (its box of) data.

However, I agree the notion of "endOfData" is confusing as I have just explained it above.

Perhaps the right lengthKind for a child to have when the enclosing parent has a terminator or separator is lengthKind="parent" which you can read conceptually as: "length kind for this child is determined by something specified in the parent. So you'll find nothing here about length."

We could then drop the whole "endOfData" concept entirely.

So in the example, both children would still have lengthKind="parent".

The implied "parent" of the top level is the real true "end of the data", so a top-level element could have lengthKind="parent" also. This is an important composition property. It allows you to take a well specified format and drop it in as the description of an MQ message payload, for example.
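
The box-carving idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any DFDL processor is implemented: `carve_boxes` is a hypothetical helper, the separator "," and terminator ";" are taken from the example under discussion, and escaping is ignored.

```python
def carve_boxes(data, child_count, separator=",", terminator=";"):
    """Carve one finite 'box' of data per child of a delimited sequence.

    Hypothetical helper for illustration only. Every child except the last
    is bounded by scanning for the separator; the last child is bounded by
    scanning for the terminator. Returns the boxes and the number of
    characters consumed, including the terminator.
    """
    boxes = []
    pos = 0
    for index in range(child_count):
        is_last = index == child_count - 1
        delimiter = terminator if is_last else separator
        end = data.find(delimiter, pos)
        if end == -1:
            raise ValueError(f"delimiter {delimiter!r} not found for child {index}")
        # The child sees only its own box and consumes all of it.
        boxes.append(data[pos:end])
        pos = end + len(delimiter)
    return boxes, pos


print(carve_boxes("string1,string2;", 2))  # (['string1', 'string2'], 16)
```

Each child is handed a box whose extent was decided entirely by the parent, which is the sense in which the child's length comes "from the parent".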
Now, lengthKind="parent" is kind of the opposite of lengthKind="implicit". "parent" is top down, i.e., from the enclosing structure. "implicit" is bottom up, i.e., length implied by the contents of the element.

Here's a trick that can make this all more palatable. For certain kinds of child elements, lengthKind="implicit" will behave as lengthKind="parent". This would happen for variable length children without any way of determining the variable length "bottom up". Examples of this are: variable length text strings, variable occurrences of anything (with no way to determine how many occurrences), or ordered sequences whose final element is a variable length child without any way of determining the variable length. (This definition is recursive intentionally.)

Given this, I think the DFDL fragment could be:

  <complexType dfdl:lengthKind="implicit" dfdl:representation="text">  <!-- these are in the scope -->
    ....
    <sequence dfdl:separator="," dfdl:terminator=";" dfdl:lengthKind="delimited">
      <element name="f1" type="string" />
      <element name="f2" type="string" />
    </sequence>
    ....
  </complexType>

Which I claim is what we want to have to write to capture the simple thing this is trying to express, which is the format of "string1,string2;" after all.

Comments?

BTW: notice my use of an enclosing complexType and ellipsis in order to achieve the notion that certain property bindings surround the example. This is one of the reasons I think we don't need a full up 2-level semantic model as Sandy suggested. I think examples like the above are sufficiently clear, particularly given the simplified scoping.

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2100 | 504 Totten Pond Road, Waltham MA 02451 | mbeckerle.dfdl@gmail.com

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, October 22, 2008 9:09 AM
To: Mike Beckerle
Cc: Alan Powell
Subject: Re: FW: MIke's notes from call on 2008-08-13

Hi Mike

I owe a review of the "EndOfData Semantics" discussion below.
The only thing that looks slightly odd in the examples below is this:

  <sequence dfdl:separator="," dfdl:terminator=";">
    <element name="f1" type="string" dfdl:lengthKind="endOfData"/>
    <element name="f2" type="string" dfdl:lengthKind="endOfData"/>
  </sequence>

It doesn't seem right for f1 to have "endOfData". Should we have a rule that says "endOfData" is only allowed on the last object in a sequence? After all, that was its original purpose - a way of the last thing saying it is bounded by the end of its parent. Would "endOfParent" be better than "endOfData"?

Regards
Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
10/09/2008 14:10
Please respond to <mbeckerle.dfdl@gmail.com>
To Steve Hanson/UK/IBM@IBMGB
cc
Subject FW: MIke's notes from call on 2008-08-13

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2100 | 504 Totten Pond Road, Waltham MA 02451 | mbeckerle.dfdl@gmail.com

From: Mike Beckerle [mailto:mbeckerle.dfdl@gmail.com]
Sent: Friday, August 15, 2008 11:53 AM
To: dfdl-wg@ogf.com
Subject: MIke's notes from call on 2008-08-13

Only Alan Powell and myself were on the call. These are my notes.

TOPIC: Decimal Calendar - idea: should behave as if decimal to text then text to date/time. I.e., use same date/time pattern language, but a subset of it since decimal can express nothing but digits.

TOPIC: Notes to authors (at start of spec) add that we don't do scalar type coercions/conversions generally. I.e., if the representation is a floating point, then the logical must be a floating point. If the representation is decimal, the logical must be decimal. We don't allow you to have a logical int whose rep is decimal or vice versa. Rationale: it adds complexity that we can avoid. Doesn't provide anything you can't easily do another way (layering), etc.
TOPIC: EndOfData Semantics: We discussed that currently we were overloading the delimited concept to include the end-of-data concept, and that was unsatisfactory and was resulting in attempts to reinject end-of-data as "end-of-bitstream" and the like.

Points:

- Distinguish delimited to mean we positively ARE scanning for a text pattern delimiter, and not confusing this with the end-of-data case which is fundamentally different.
- Avoid special-case keyword only for the "top level" end of the data stream. This has really bad composition properties.
- lengthKind="endOfData" applies to both binary and text representations. For text it means there is no terminator for this element. The enclosing construct's length, however determined (separator, terminator, fixed, prefix, etc.) will bound length of this contained element.

Case:

  <sequence dfdl:separator="," dfdl:terminator=";">
    <element name="f1" type="string" dfdl:lengthKind="endOfData"/>
    <element name="f2" type="string" dfdl:lengthKind="endOfData"/>
  </sequence>

The above seems ok to me.

Case:

  <sequence dfdl:lengthKind="prefixed" dfdl:representation="binary">
    <element name="f1" type="int" dfdl:length="4" dfdl:lengthKind="explicit"/>
    <element name="f2" type="hexBinary" dfdl:lengthKind="endOfData"/>
  </sequence>

The above seems ok to me.

Important use cases:

Case 1: binary element at the end of a top-level sequence.

  <schema ...>
    <element name="theTop">
      <complexType>
        <sequence dfdl:lengthKind="implicit">
          <element name="f1" type="int" dfdl:length="4" dfdl:lengthKind="explicit"/>
          <element name="f2" type="hexBinary" dfdl:lengthKind="endOfData"/>
        </sequence>
      </complexType>
    </element>
  </schema>

In the above, the top level sequence has implicit length kind. This is ok, because the top level is assumed to be in an "end of data" context.

Case 2: deeper nesting, same implicit-length sequence.

  <schema ...>
    <element name="NestedInside">
      <complexType>
        <sequence dfdl:lengthKind="implicit">
          <element name="f1" type="int" dfdl:length="4" dfdl:lengthKind="explicit"/>
          <element name="f2" type="hexBinary" dfdl:lengthKind="endOfData"/>
        </sequence>
      </complexType>
    </element>
    <element name="stillNotTheTop">
      <complexType>
        <sequence dfdl:lengthKind="implicit">
          ...
          <element ref="NestedInside"/>
        </sequence>
      </complexType>
    </element>
    <element name="hasFixedLength">
      <complexType>
        <sequence dfdl:lengthKind="explicit" dfdl:length="100">
          ...
          <element ref="stillNotTheTop"/>
        </sequence>
      </complexType>
    </element>
    ....
  </schema>

This case illustrates how the composition properties work for explicit/implicit lengths. The definition of how this works goes something like this.

When the last element of a sequence is binary with lengthKind="endOfData" this implies that the enclosing sequence is:

  (a) length kind explicit or prefixed or endOfData
  (b) length kind implicit - in this case recursively this enclosing sequence must itself be enclosed in a sequence similarly constrained on length kind (cases a, b, c here)
  (c) the top-level sequence

Note: We need to revisit whether the name "endOfData" is desirable or not. There's a list of alternatives from the F2F meeting. Problem is that a naïve user will be thinking "top level" but the concept actually needs to be compositional/nestable.

TOPIC: float/double - we concluded that until XML has floating point types that can handle extended precisions, DFDL can't handle extended precisions in any reasonable way, so we should simply say DFDL v1.0 supports only 64-bit floating point precision and 32-bit floating point precision. This narrows down float types to IEEE (single and double), and IBM390 (single and double), and maybe AS400 if that's different and still within 64 bits precision.

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2100 | 504 Totten Pond Road, Waltham MA 02451 | mbeckerle.dfdl@gmail.com

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
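
For reference, the recursive rule in the EndOfData Semantics notes (cases a, b, c) can be sketched as a small check. This is a hedged illustration only: the `Seq` class and `bounds_end_of_data` function are hypothetical names, not part of DFDL or any implementation, and "delimited" appears only as an example of a length kind that the stated rule does not list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Seq:
    """Hypothetical model of a sequence: its lengthKind and its enclosing sequence."""
    length_kind: str
    parent: Optional["Seq"] = None  # None marks the top-level sequence


# Case (a): length kinds that bound the sequence on their own.
SELF_BOUNDED = {"explicit", "prefixed", "endOfData"}


def bounds_end_of_data(seq: Seq) -> bool:
    """Return True if a binary last child with lengthKind="endOfData" is bounded by seq."""
    if seq.parent is None:
        return True  # case (c): the top-level sequence
    if seq.length_kind in SELF_BOUNDED:
        return True  # case (a)
    if seq.length_kind == "implicit":
        return bounds_end_of_data(seq.parent)  # case (b): defer to the enclosing sequence
    return False  # a length kind the rule as stated does not cover


# Mirrors Case 2: NestedInside inside stillNotTheTop inside hasFixedLength.
top = Seq("implicit")
has_fixed_length = Seq("explicit", parent=top)
still_not_the_top = Seq("implicit", parent=has_fixed_length)
nested_inside = Seq("implicit", parent=still_not_the_top)
print(bounds_end_of_data(nested_inside))  # True
```

The recursion stops at the first enclosing sequence that is explicit, prefixed, endOfData, or top-level, which is what makes the property nestable rather than a top-level special case.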