June 2011 - dfdl-wg - lists.ogf.org

Re: [DFDL-WG] Issues to add to work items list
by Steve Hanson 15 Jun '11

Thanks Mike. I have added a couple of comments below...we can discuss fully on the calls.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/06/2011 01:59
Subject: Issues to add to work items list

Steve,

Below is the list I have so far of items in the spec that go beyond just typos - where re-wording is required or advisable.

All but the last of these are tied up with the length & delimiters issue, but separate from the "known to exist" vs. "missing" topic. I've not bothered to provide anything about the section on "known to exist" and such. That section is already the discussion of a work item/topic.

The last one is just a nit about BOMs.

...mikeb

--------------------------------------------------------------------

5.2.2 MinLength, MaxLength

These facets are used:

  When dfdl:lengthKind=”implicit”. In that case the length is given by the value of xs:maxLength. In this case minLength if specified is required to be equal to maxLength (schema definition error otherwise).

  For validation of variable length string elements.

It is a processing error when a fixed-length string is found to have a number of characters not equal to the fixed number. For example, if a fixed-length string also has delimiters we might be able to successfully separate it from the surrounding elements depending on the delimiter specifications; however, if the length of the fixed-length string is not equal to the number specified as the fixed length then it is a processing error (not simply a validation error).[MB1]

[MB1]Contradicts statement that scanning for delimiters is off. (Discussed where dfdl:lengthKind=’explicit’ is described.) What is a fixed length string? Clearly if it has lengthKind=”explicit” it is fixed length. What if it has lengthKind=”implicit” and maxLength=”10”? Is that a fixed length string which shuts off delimiter scanning also? If so then this paragraph is erroneous and misleading.

<SMH>Yes, Tim and I have noted that this paragraph needs revising depending on the outcome of action 139 <SMH>

--------------------------------------------------------------------

9.2 DFDL Syntax Grammar

Change to introduce concept of EnclosedItem or ChildItem (I used EnclosedItem below):

Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedItem [ Separator EnclosedItem ]* PostfixSeparator ] FinalUnusedRegion
EnclosedItem = Element | Array | ComplexContent[MB1]
Choice = LeftFraming ChoiceContent RightFraming
ChoiceContent = [ EnclosedItem ] FinalUnusedRegion[MB2]

[MB1]Refactored to share the EnclosedItem concept to Choices also. Should perhaps be named ChildItem. This is useful when discussing how parsing, defaulting, etc. work as well.

<SMH>Seems fine, it is effectively just a renaming<SMH>

--------------------------------------------------------------------

Table 14 Implicit Alignment in bits

Note: Specifying the implicit alignment in bits does not imply that dfdl:lengthUnits 'bits' can be specified for all simple types.[MB1]

[MB1]I do not understand this comment. What exactly is the restriction?

<SMH>It is really saying that alignmentUnits and lengthUnits are independent and have their own rules for when they are applicable. <SMH>

--------------------------------------------------------------------

lengthKind Enum

Controls how the representation length of the component is determined.

Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit', 'pattern', 'endOfParent'

A full description of each enumeration is given in the later sections.

'explicit' means the length of the item is given by the dfdl:length property
'delimited' means the item is delimited by a terminator or separator[MB1]
‘prefixed’ means the length of the item is given by an immediately preceding prefix field specified using prefixLengthType.
‘implicit’ means the length is to be determined in terms of the type of the element and its schema-specified properties if any.
‘pattern’ means the length of the item is given by a regular expression specified using the dfdl:lengthPattern property.
‘endOfParent’ means that the item is terminated by the termination of the containing construct.

Annotation: dfdl:element, dfdl:simpleType

[MB1]To me this is a very strong statement. It means that an outside-in parse is allowed where for a sequence, we can scan and determine its end, and then parse the children. It requires that “scan” is a well-defined concept for the contents of the sequence. It means there can be nothing inside which requires the suspension of scanning. It means contained elements that have length explicit and representation binary, are simply not allowed.

<SMH>Not entirely true, we allow some binary types to have delimited lengthKind (eg, BCD and Packed Decimals). There are formats out there that require this. <SMH>

To me it also means you cannot change the character set encoding, or have a contained element that itself uses an overlapping set of delimiters with the enclosing group’s delimiters. – Is that going too far? – Delimited is like pattern. It restricts what is in the data stream substantially.

--------------------------------------------------------------------

The rules for resolving ambiguity between delimiters are:

1. When two delimiters have a common prefix, the longest delimiter has precedence.
2. When two delimiters have exactly the same value, the innermost (most deeply nested) delimiter has precedence.
3. When the separator and terminator on a group have the same value, the separator has precedence.[MB1]

[MB1]By precedence, this must mean it is tried first, but the parser might backtrack and assume it to be a terminator instead. This seems problematic to me. I’d like to either rule this out and say they can’t be the same, or see a use case where this is needed, a backtracking parser can parse it, and there’s no more reasonable way to structure the schema.

--------------------------------------------------------------------

12.3.5.1 Pattern-Based Lengths - Scanability

Any element (complex, simple text, simple binary) may have a dfdl:lengthKind 'pattern' as long as the bytes in the content region of the element are legal in the stated encoding of that element. Where a complex element has children with binary representation in practice this means an 8-bit ASCII encoding. [MB1]

[MB1]Not necessarily ASCII. An 8-bit encoding such that any byte value is valid is the real requirement. The point is no single byte value is invalid, and no combinations of adjacent byte values are invalid so that any binary data won’t trip up a character conversion and subsequent scan.

Hmmm. The 8-bit character set used must have a transformation into Unicode which is bijective and information preserving. I.e., a unique Unicode character for each code point, and no “invalid” characters which have no corresponding Unicode value. However, ASCII-based sets like 8859-1 are not, strictly speaking, required.

--------------------------------------------------------------------

12.3.7.1.3 Byte Order Mark

If a byte-order mark codepoint appears at the start of a UTF-8, [MB1] UTF-16 or UTF-32 encoded string then the byte-order mark will be included as part of the string payload[1]. That is, for the UTF-8, UTF-16 and UTF-32 character encodings, a byte-order-mark codepoint is treated as a character of the string in DFDL and contributes to the length.

A way of eliminating the byte-order mark so that it does not end up in the infoset is that the byte-order mark can be modeled as a separate element before the string. This BOM element can be either required or optional depending on whether one is expected or optional at the beginning of the string.

[1] Byte-order marks are explicitly stated to be “not characters” in the Unicode standard.

[MB1]No such thing as a BOM codepoint in a UTF-8 string. A UTF-8 byte sequence might encode the character code for a BOM, but this would be a meaningless inclusion of a BOM character code in a context where it will never be interpreted.

I suggest that we drop the term UTF-8 here, and that BOMs which get encoded when they are interpreted as character codes, and translated by the UTF-8 encoding algorithm into a multi-byte UTF-8 byte sequence, are handled the same way as other non-characters, i.e., what do we do when a high or low surrogate codepoint is present and we’re to encode as UTF-8.

I think the answer is we run the UTF-8 encode/decode algorithm, and whatever Unicode character code it creates is what it creates, and if that happens to come out as any of the non-characters (BOMs, surrogates, others perhaps), so be it.

The topic is about Unicode non-characters, not specifically BOMs. The general topic is encoding/decoding our infoset Unicode character codes which have no real representation in the specified encoding.

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
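
The requirement stated in the comment on 12.3.5.1 above, that every byte value must decode to its own Unicode character with nothing invalid, can be checked mechanically. What follows is only a minimal sketch in Python: the function name is_byte_transparent is hypothetical, and it relies on Python's built-in codec tables, which may differ in detail from those a DFDL processor would use. It tests single bytes only, which is assumed to be sufficient for stateless single-byte encodings.

```python
def is_byte_transparent(encoding: str) -> bool:
    """Return True if each of the 256 byte values decodes to a distinct
    single Unicode character and encodes back to the same byte.

    Hypothetical helper for illustration; it checks single bytes only,
    so it is meaningful for stateless single-byte encodings.
    """
    seen = set()
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(encoding)
            round_trip = text.encode(encoding)
        except UnicodeError:
            # Some byte value has no mapping in this encoding.
            return False
        if len(text) != 1 or text in seen or round_trip != raw:
            # Not one-to-one, or not information preserving.
            return False
        seen.add(text)
    return True


if __name__ == "__main__":
    for name in ("iso-8859-1", "ascii", "utf-8"):
        print(name, is_byte_transparent(name))
```

With Python's codecs, ISO-8859-1 passes this check, while 7-bit ASCII and UTF-8 fail it because byte values from 0x80 upwards are not valid on their own.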

OGF DFDL WG Call Agenda 2011-06-15
by Steve Hanson 14 Jun '11

Please find agenda for the above call on GridForge at: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

OGF DFDL WG Call Minutes 2011-06-08
by Steve Hanson 08 Jun '11

Please find minutes of the above meeting on GridForge at: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

OGF DFDL WG Call Agenda 2011-06-08
by Steve Hanson 08 Jun '11

Please find agenda for the above call on GridForge at: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

Rescheduled: OGF DFDL Working Group weekly call (8 Jun 15:00 GDT in Hursley DE2J18/UK/IBM)
by Steve Hanson 08 Jun '11

Note new dial-in details.

Passcode for Participants: 5381214

Canada Toll-Free 888-426-6840
China Toll-Free 10-800-711-1071 CHINA NETCOM GROUP USERS
China Toll-Free 10-800-110-0996 CHINA TELECOM SOUTH USERS
France Toll-Free 0800-94-0558
Germany Toll-Free 0800-000-1018
India Toll-Free 000-800-100-1176
Ireland Toll-Free 1-800-943-427
Israel Toll-Free 1-809-417-783
United Kingdom Caller Paid 0-20-30596451
United Kingdom Toll-Free 0800-368-0638
USA Caller Paid 215-861-6239
USA Toll-Free 888-426-6840

Other international numbers available - e-mail smh(a)uk.ibm.com.

Sorry to muck you around. My clashes have all been cancelled at short notice, so we can go ahead with the DFDL WG call at the usual time.

Rescheduled: OGF DFDL Working Group weekly call (9 Jun 16:00 GDT in Hursley DE2J18/UK/IBM)
by Steve Hanson 07 Jun '11

Note new dial-in details.

Passcode for Participants: 5381214

Canada Toll-Free 888-426-6840
China Toll-Free 10-800-711-1071 CHINA NETCOM GROUP USERS
China Toll-Free 10-800-110-0996 CHINA TELECOM SOUTH USERS
France Toll-Free 0800-94-0558
Germany Toll-Free 0800-000-1018
India Toll-Free 000-800-100-1176
Ireland Toll-Free 1-800-943-427
Israel Toll-Free 1-809-417-783
United Kingdom Caller Paid 0-20-30596451
United Kingdom Toll-Free 0800-368-0638
USA Caller Paid 215-861-6239
USA Toll-Free 888-426-6840

Other international numbers available - e-mail smh(a)uk.ibm.com.

Have to reschedule this week. Please let me know if this causes a problem.

Fw: Mapping from DFDL 1.0 to XDM
by Steve Hanson 03 Jun '11

The OGF editors have put the document into 'Public Comment' stage. Please can I encourage you to download and review. http://forge.ogf.org/sf/go/artf6480.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 03/06/2011 09:10 -----

From: Steve Hanson/UK/IBM
To: dfdl-wg(a)ogf.org
Date: 01/06/2011 10:51
Subject: Mapping from DFDL 1.0 to XDM

DFDL WG has published a document on this topic here: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

OGF DFDL WG Call Minutes 2011-06-01
by Steve Hanson 01 Jun '11

Please find minutes of the above meeting on GridForge at: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848

Mapping from DFDL 1.0 to XDM
by Steve Hanson 01 Jun '11

DFDL WG has published a document on this topic here: http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.dfdl-wg/d…

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
