----- Forwarded by Steve Hanson/UK/IBM on 19/07/2011 15:43 -----

From:	Steve Hanson/UK/IBM
To:	dfdl-wg@ogf.org
Date:	27/06/2011 17:27
Subject:	Spec issues from Mike Beckerle

For discussion on DFDL WG calls.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 27/06/2011 17:26 -----

From:	Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:	Steve Hanson/UK/IBM@IBMGB
Date:	15/06/2011 01:59
Subject:	Issues to add to work items list

Steve,
Below is the list I have so far of items in the spec that go beyond just typos - where re-wording is required or advisable. All but the last of these are tied up with the length & delimiters issue, but are separate from the "known to exist" vs. "missing" topic.

I've not bothered to provide anything about the section on "known to exist" and such. That section is already the subject of a work item/topic.

The last one is just a nit about BOMs.

...mikeb

--------------------------------------------------------------------------------------------

5.2.2 MinLength, MaxLength

These facets are used:

When dfdl:lengthKind=”implicit”. In that case the length is given by the value of xs:maxLength. In this case minLength if specified is required to be equal to maxLength (schema definition error otherwise).
For validation of variable length string elements.

It is a processing error when a fixed-length string is found to have a number of characters not equal to the fixed number. For example, if a fixed-length string also has delimiters we might be able to successfully separate it from the surrounding elements depending on the delimiter specifications; however, if the length of the fixed-length string is not equal to the number specified as the fixed length then it is a processing error (not simply a validation error).[MB1]

[MB1] Contradicts the statement that scanning for delimiters is off (discussed where dfdl:lengthKind='explicit' is described).

What is a fixed length string?

Clearly if it has lengthKind=”explicit” it is fixed length.

What if it has lengthKind="implicit" and maxLength="10"? Is that a fixed length string which shuts off delimiter scanning also? If so then this paragraph is erroneous and misleading.

<SMH>Yes, Tim and I have noted that this paragraph needs revising depending on the outcome of action 139 <SMH>
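
For reference on the call, the implicit-length rule quoted at the top of this item can be sketched as below. This is only an illustration: the names `implicit_string_length` and `SchemaDefinitionError` are hypothetical and do not come from the spec or from any DFDL implementation. It does not settle whether such a string counts as "fixed length" for delimiter scanning.

```python
class SchemaDefinitionError(Exception):
    """Hypothetical stand-in for a DFDL schema definition error."""


def implicit_string_length(max_length, min_length=None):
    """Length of a string with dfdl:lengthKind="implicit".

    The length is taken from xs:maxLength. If xs:minLength is also
    specified it must be equal to xs:maxLength.
    """
    if min_length is not None and min_length != max_length:
        raise SchemaDefinitionError("minLength must equal maxLength")
    return max_length
```

With maxLength="10" and no minLength this returns 10, which is the case asked about above.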

--------------------------------------------------------------------------------------

9.2 DFDL Syntax Grammar

Change to introduce concept of EnclosedItem or ChildItem (I used EnclosedItem below):

Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedItem [ Separator EnclosedItem ]* PostfixSeparator ] FinalUnusedRegion
EnclosedItem = Element | Array | ComplexContent[MB1]

Choice = LeftFraming ChoiceContent RightFraming
ChoiceContent = [ EnclosedItem ] FinalUnusedRegion[MB2]

[MB1] Refactored to share the EnclosedItem concept with Choices also.

Should perhaps be named ChildItem.

This is useful when discussing how parsing, defaulting, etc. work as well.

<SMH>Seems fine, it is effectively just a renaming<SMH>

---------------------------------------------------------------------------------------------------------------------------------------------------------------------

Table 14 Implicit Alignment in bits

Note: Specifying the implicit alignment in bits does not imply that dfdl:lengthUnits 'bits' can be specified for all simple types.[MB1]

[MB1]I do not understand this comment. What exactly is the restriction?

<SMH>It is really saying that alignmentUnits and lengthUnits are independent and have their own rules for when they are applicable. <SMH>

-----------------------------------------------------------------------------------------------------------------------------------------------------------------------

lengthKind Enum
Controls how the representation length of the component is determined.
Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit', 'pattern', 'endOfParent'
A full description of each enumeration is given in the later sections.
'explicit' means the length of the item is given by the dfdl:length property
'delimited' means the item is delimited by a terminator or separator[MB1]
‘prefixed’ means the length of the item is given by an immediately preceding prefix field specified using prefixLengthType.
‘implicit’ means the length is to be determined in terms of the type of the element and its schema-specified properties if any.
‘pattern’ means the length of the item is given by a regular expression specified using the dfdl:lengthPattern property.
‘endOfParent’ means that the item is terminated by the termination of the containing construct.
Annotation: dfdl:element, dfdl:simpleType

[MB1]To me this is a very strong statement.

It means that an outside-in parse is allowed where for a sequence, we can scan and determine its end, and then parse the children.

It requires that “scan” is a well-defined concept for the contents of the sequence.

It means there can be nothing inside which requires the suspension of scanning.

It means contained elements that have length explicit and representation binary are simply not allowed.

<SMH>Not entirely true, we allow some binary types to have delimited lengthKind (eg, BCD and Packed Decimals). There are formats out there that require this. <SMH>

To me it also means you cannot change the character set encoding, or have a contained element that itself uses an overlapping set of delimiters with the enclosing group’s delimiters. – Is that going too far? –

Delimited is like pattern. It restricts what is in the data stream substantially.
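
To make the concern concrete, here is a rough sketch of an outside-in parse, assuming a hypothetical helper `outside_in_parse` and made-up sample data. It is not how any DFDL processor is known to work; it only shows what goes wrong if a binary child is scanned like text.

```python
def outside_in_parse(data: bytes, separator: bytes, terminator: bytes):
    """Scan for the terminator first, then split the enclosed bytes on the separator."""
    end = data.find(terminator)
    if end < 0:
        raise ValueError("terminator not found")
    return data[:end].split(separator)


# Text-only children: the scan recovers the three fields.
text_fields = outside_in_parse(b"AB,CD,EF;rest", b",", b";")

# A two-byte binary child whose value happens to contain 0x2C (","):
# the scan splits it in the middle instead of returning two fields.
binary_fields = outside_in_parse(b"AB," + bytes([0x2C, 0x01]) + b";", b",", b";")
```

In the second call the intended content is two fields (the text "AB" and a two-byte binary value), but the scan yields three pieces because one byte of the binary value looks like the separator.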

-------------------------------------------------------------------------------------------------------------------------------------------------------------------------

The rules for resolving ambiguity between delimiters are:

1. When two delimiters have a common prefix, the longest delimiter has precedence.
2. When two delimiters have exactly the same value, the innermost (most deeply nested) delimiter has precedence.
3. When the separator and terminator on a group have the same value, the separator has precedence.[MB1]

[MB1]By precedence, this must mean it is tried first, but the parser might backtrack and assume it to be a terminator instead.

This seems problematic to me. I’d like to either rule this out and say they can’t be the same, or see a use case where this is needed, a backtracking parser can parse it, and there’s no more reasonable way to structure the schema.
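
A minimal sketch of the three rules as a single ordering, assuming hypothetical names (`Delimiter`, `pick_delimiter`) and assuming that "innermost" can be represented as a nesting depth. It picks one candidate and never backtracks, so it does not answer the question of whether a parser would later have to retry the same text as a terminator.

```python
from typing import List, NamedTuple, Optional


class Delimiter(NamedTuple):
    value: str
    depth: int  # nesting depth of the owning group; larger is more deeply nested
    kind: str   # "separator" or "terminator"


def pick_delimiter(data: str, pos: int, in_scope: List[Delimiter]) -> Optional[Delimiter]:
    """Choose among the in-scope delimiters that match at data[pos:]."""
    matches = [d for d in in_scope if data.startswith(d.value, pos)]
    if not matches:
        return None
    # Rule 1: longest match; rule 2: innermost; rule 3: separator before terminator.
    return max(matches, key=lambda d: (len(d.value), d.depth, d.kind == "separator"))
```

For example, with a separator and a terminator that are both "|" on the same group, the function returns the separator every time, which is exactly the situation questioned above.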

------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------

12.3.5.1 Pattern-Based Lengths - Scanability
Any element (complex, simple text, simple binary) may have a dfdl:lengthKind 'pattern' as long as the bytes in the content region of the element are legal in the stated encoding of that element. Where a complex element has children with binary representation in practice this means an 8-bit ASCII encoding. [MB1]

[MB1]Not necessarily ASCII. An 8-bit encoding such that any byte value is valid is the real requirement. The point is no single byte value is invalid, and no combinations of adjacent byte values are invalid so that any binary data won’t trip up a character conversion and subsequent scan.

Hmmm. The 8-bit character set used must have a transformation into Unicode which is bijective and information preserving, i.e., a unique Unicode character for each code point, and no “invalid” characters which have no corresponding Unicode value. However, ASCII-based sets like 8859-1 are not, strictly speaking, required.
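
The requirement can be checked mechanically. The sketch below uses Python's built-in codecs as a stand-in for whatever character set tables a DFDL processor would use; the function name `decodes_every_byte` is made up for illustration.

```python
def decodes_every_byte(encoding: str) -> bool:
    """True if all 256 byte values decode to distinct characters and round-trip."""
    all_bytes = bytes(range(256))
    try:
        text = all_bytes.decode(encoding)
    except UnicodeDecodeError:
        # At least one byte value has no character in this encoding.
        return False
    return len(set(text)) == 256 and text.encode(encoding) == all_bytes


latin1_ok = decodes_every_byte("latin-1")  # ISO 8859-1
ascii_ok = decodes_every_byte("ascii")     # 7-bit ASCII
```

ISO 8859-1 passes and 7-bit ASCII does not, since bytes 0x80 and above are invalid in ASCII. Any other single-byte code page, ASCII-based or not, can be tested the same way.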

-----------------------------------------------------------------------------------------------------------------------------------------------

12.3.7.1.3 Byte Order Mark

If a byte-order mark codepoint appears at the start of a UTF-8, [MB1] UTF-16 or UTF-32 encoded string then the byte-order mark will be included as part of the string payload[1]. That is, for the UTF-8, UTF-16 and UTF-32 character encodings, a byte-order-mark codepoint is treated as a character of the string in DFDL and contributes to the length.

A way of eliminating the byte-order mark so that it does not end up in the infoset is that the byte-order mark can be modeled as a separate element before the string. This BOM element can be either required or optional depending on whether one is expected or optional at the beginning of the string.

[1] Byte-order marks are explicitly stated to be “not characters” in the Unicode standard.

[MB1]No such thing as a BOM codepoint in a UTF-8 string. A UTF-8 byte sequence might encode the character code for a BOM, but this would be a meaningless inclusion of a BOM character code in a context where it will never be interpreted.

I suggest that we drop the term UTF-8 here. A BOM that is interpreted as a character code and translated by the UTF-8 encoding algorithm into a multi-byte UTF-8 byte sequence should be handled the same way as other non-characters. It is the same question as what we do when a high or low surrogate codepoint is present and we're to encode as UTF-8. I think the answer is we run the UTF-8 encode/decode algorithm, and whatever Unicode character code it creates is what it creates, and if that happens to come out as any of the non-characters (BOMs, surrogates, others perhaps), so be it.

The topic is about Unicode non-characters, not specifically BOMs.

The general topic is encoding/decoding our infoset Unicode character codes which have no real representation in the specified encoding.
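
As a data point only, here is how one existing codec library (Python's standard codecs) behaves. This is not a proposal for DFDL behaviour; it just shows that both policies discussed above exist in practice.

```python
BOM = "\ufeff"

# The plain UTF-8 codec treats U+FEFF like any other character code:
# it is encoded to three bytes and comes back as part of the string.
utf8_bytes = (BOM + "abc").encode("utf-8")
decoded = utf8_bytes.decode("utf-8")

# A lone high surrogate is rejected by the strict UTF-8 codec ...
try:
    "\ud800".encode("utf-8")
    surrogate_encodes = True
except UnicodeEncodeError:
    surrogate_encodes = False

# ... unless the caller opts in to running the algorithm regardless.
surrogate_bytes = "\ud800".encode("utf-8", "surrogatepass")
```

Here `decoded` still starts with U+FEFF and has length 4, matching the "contributes to the length" wording. The strict codec refuses the surrogate, while the `surrogatepass` error handler corresponds to the "run the algorithm, and so be it" option.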

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

-- dfdl-wg mailing list dfdl-wg@ogf.org http://www.ogf.org/mailman/listinfo/dfdl-wg
