From: Steve Hanson/UK/IBM
To: dfdl-wg@ogf.org
Date: 27/06/2011 17:27
Subject: Spec issues from Mike Beckerle
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/06/2011 01:59
Subject: Issues to add to work items list
[MB1]Contradicts the statement that scanning for delimiters is off. (Discussed where dfdl:lengthKind=’explicit’ is described.)
What is a fixed length string?
Clearly if it has lengthKind=”explicit” it is fixed length.
What if it has lengthKind=”implicit” and maxLength=”10”? Is that a fixed length string which also shuts off delimiter scanning? If so, then this paragraph is erroneous and misleading.
<SMH>Yes, Tim and I have noted that this paragraph needs revising depending on the outcome of action 139 <SMH>
--------------------------------------------------------------------------------------
9.2 DFDL Syntax Grammar
Change to introduce concept of EnclosedItem or ChildItem (I used EnclosedItem below):
Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedItem [ Separator EnclosedItem ]* PostfixSeparator ] FinalUnusedRegion
EnclosedItem = Element | Array | ComplexContent[MB1]
Choice = LeftFraming ChoiceContent RightFraming
ChoiceContent = [ EnclosedItem ] FinalUnusedRegion[MB2]
[MB1]Refactored to share the EnclosedItem concept to Choices also.
Should perhaps be named ChildItem.
This is useful when discussing how parsing, defaulting, etc. work as well.
<SMH>Seems fine, it is effectively just a renaming<SMH>
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Table 14 Implicit Alignment in bits
Note: Specifying the implicit alignment in bits does not imply that dfdl:lengthUnits 'bits' can be specified for all simple types.[MB1]
[MB1]I do not understand this comment. What exactly is the restriction?
<SMH>It is really saying that alignmentUnits and lengthUnits are independent and have their own rules for when they are applicable. <SMH>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
lengthKind | Enum
Controls how the representation length of the component is determined.
Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit', 'pattern', 'endOfParent'.
A full description of each enumeration is given in the later sections.
'explicit' means the length of the item is given by the dfdl:length property.
'delimited' means the item is delimited by a terminator or separator.[MB1]
'prefixed' means the length of the item is given by an immediately preceding prefix field specified using prefixLengthType.
'implicit' means the length is to be determined in terms of the type of the element and its schema-specified properties, if any.
'pattern' means the length of the item is given by a regular expression specified using the dfdl:lengthPattern property.
'endOfParent' means that the item is terminated by the termination of the containing construct.
Annotation: dfdl:element, dfdl:simpleType
[MB1]To me this is a very strong statement.
It means that an outside-in parse is allowed where for a sequence, we can scan and determine its end, and then parse the children.
It requires that “scan” is a well-defined concept for the contents of the sequence.
It means there can be nothing inside which requires the suspension of scanning.
It means contained elements that have length explicit and representation binary, are simply not allowed.
<SMH>Not entirely true, we allow some binary types to have delimited lengthKind (eg, BCD and Packed Decimals). There are formats out there that require this. <SMH>
To me it also means you cannot change the character set encoding, or have a contained element that itself uses an overlapping set of delimiters with the enclosing group’s delimiters. – Is that going too far? –
Delimited is like pattern. It restricts what is in the data stream substantially.
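To make the outside-in concern concrete, here is a minimal Python sketch. It is not DFDL processor behaviour: the function name scan_to_terminator and the sample data are made up for illustration. It scans for a terminator with no knowledge of the children, and shows how a fixed-length binary child whose bytes happen to contain the terminator value would cut the sequence short.

```python
def scan_to_terminator(data: bytes, terminator: bytes) -> bytes:
    """Return the bytes before the first occurrence of terminator.

    Hypothetical helper: a naive outside-in scan that knows nothing
    about the elements contained in the sequence.
    """
    end = data.find(terminator)
    if end < 0:
        raise ValueError("terminator not found")
    return data[:end]


# Text-only content: the scan finds the true end of the sequence.
text_seq = b"abc,def;"
text_content = scan_to_terminator(text_seq, b";")

# A 4-byte big-endian binary integer child whose last byte is 0x3B (';'),
# followed by the real terminator.
binary_child = (0x0000003B).to_bytes(4, "big")
mixed_seq = b"abc," + binary_child + b";"
mixed_content = scan_to_terminator(mixed_seq, b";")

# The scan stops inside the binary child, one byte short of the real content.
real_content = mixed_seq[:-1]
```

In this sketch mixed_content is one byte shorter than real_content, which is the kind of failure that a restriction on the contents of a delimited sequence would be meant to prevent.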
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The rules for resolving ambiguity between delimiters are:
1. When two delimiters have a common prefix, the longest delimiter has precedence.
2. When two delimiters have exactly the same value, the innermost (most deeply nested) delimiter has precedence.
3. When the separator and terminator on a group have the same value, the separator has precedence.[MB1]
[MB1]By precedence, this must mean it is tried first, but the parser might backtrack and assume it to be a terminator instead.
This seems problematic to me. I’d like to either rule this out and say they can’t be the same, or see a use case where this is needed, a backtracking parser can parse it, and there’s no more reasonable way to structure the schema.
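As a rough illustration of the three rules read as "what is tried first", here is a minimal Python sketch. The Delimiter class, its fields, and pick_delimiter are hypothetical names, and combining the rules into one ordering (length, then nesting depth, then separator before terminator) is my assumption, not something the spec text states. It deliberately says nothing about whether the parser may later backtrack, which is the open question.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Delimiter:
    value: str
    depth: int  # nesting depth of the declaring group; larger = more deeply nested
    kind: str   # "separator" or "terminator"


def pick_delimiter(data: str, pos: int,
                   candidates: Sequence[Delimiter]) -> Optional[Delimiter]:
    """Return the in-scope delimiter that would be tried first at pos, or None."""
    matches = [d for d in candidates if data.startswith(d.value, pos)]
    if not matches:
        return None
    # Rule 1: longest match. Rule 2: innermost. Rule 3: separator over terminator.
    return max(matches,
               key=lambda d: (len(d.value), d.depth, d.kind == "separator"))


outer_term = Delimiter(";", 0, "terminator")
long_term = Delimiter(";;", 0, "terminator")
inner_sep = Delimiter(";", 1, "separator")
inner_term = Delimiter(";", 1, "terminator")

rule1 = pick_delimiter("a;;b", 1, [outer_term, long_term])   # longest wins
rule2 = pick_delimiter("a;b", 1, [outer_term, inner_term])   # innermost wins
rule3 = pick_delimiter("a;b", 1, [inner_term, inner_sep])    # separator wins
none_found = pick_delimiter("ab", 1, [outer_term, inner_sep])
```

Under rule 3 the sketch always reports the separator, so a parser built this way could only ever recognise the terminator by backtracking after the separator interpretation fails.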
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12.3.5.1 Pattern-Based Lengths - Scanability
Any element (complex, simple text, simple binary) may have a dfdl:lengthKind
'pattern' as long as the bytes in the content region of the element
are legal in the stated encoding of that element. Where a complex element
has children with binary representation in practice this means an 8-bit
ASCII encoding. [MB1]
[MB1]Not necessarily ASCII. An 8-bit encoding such that any byte value is valid is the real requirement. The point is no single byte value is invalid, and no combinations of adjacent byte values are invalid so that any binary data won’t trip up a character conversion and subsequent scan.
Hmmm. The 8-bit character set used must have a transformation into Unicode which is bijective and information preserving, i.e., a unique Unicode character for each code point, and no “invalid” characters which have no corresponding Unicode value. However, ASCII-based sets like 8859-1 are not, strictly speaking, required.
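The requirement above can be checked mechanically. The following is a minimal Python sketch using Python's built-in codecs as stand-ins for whatever encodings a DFDL processor supports; the function name is hypothetical, and the results reflect Python's codec tables rather than anything in the DFDL spec.

```python
def is_total_single_byte_encoding(name: str) -> bool:
    """True if each of the 256 byte values decodes to its own distinct
    character and encodes back to the same byte."""
    chars = set()
    for b in range(256):
        raw = bytes([b])
        try:
            ch = raw.decode(name)
            if len(ch) != 1 or ch.encode(name) != raw:
                return False
        except UnicodeError:
            # Some byte value (or its round trip) is invalid in this encoding.
            return False
        chars.add(ch)
    return len(chars) == 256


latin1_ok = is_total_single_byte_encoding("iso-8859-1")  # every byte is valid
ascii_ok = is_total_single_byte_encoding("ascii")        # bytes >= 0x80 are invalid
cp1252_ok = is_total_single_byte_encoding("cp1252")      # a few bytes are undefined in Python's table
utf8_ok = is_total_single_byte_encoding("utf-8")         # lone bytes >= 0x80 are invalid
```

Of the four encodings tried here, only iso-8859-1 passes, which matches the point that 7-bit ASCII itself is not sufficient and that the property that matters is totality over all byte values.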
-----------------------------------------------------------------------------------------------------------------------------------------------
12.3.7.1.3 Byte Order Mark
If a byte-order mark codepoint appears at the start of a UTF-8, [MB1] UTF-16 or UTF-32 encoded string then the byte-order mark will be included as part of the string payload[1]. That is, for the UTF-8, UTF-16 and UTF-32 character encodings, a byte-order-mark codepoint is treated as a character of the string in DFDL and contributes to the length.
A way of eliminating the byte-order mark so that it does not end up in the infoset is that the byte-order mark can be modeled as a separate element before the string. This BOM element can be either required or optional depending on whether one is expected or optional at the beginning of the string.
[1] Byte-order marks are explicitly stated to be “not characters” in the Unicode standard.
[MB1]No such thing as a BOM codepoint in a UTF-8 string. A UTF-8 byte sequence might encode the character code for a BOM, but this would be a meaningless inclusion of a BOM character code in a context where it will never be interpreted.
I suggest that we drop the term UTF-8 here. A BOM that is interpreted as a character code and translated by the UTF-8 encoding algorithm into a multi-byte UTF-8 byte sequence should be handled the same way as other non-characters, i.e., the same way as a high or low surrogate codepoint that is present when we are to encode as UTF-8. I think the answer is that we run the UTF-8 encode/decode algorithm, and whatever Unicode character code it creates is what it creates; if that happens to come out as any of the non-characters (BOMs, surrogates, others perhaps), so be it.
The topic is about Unicode non-characters, not specifically BOMs.
The general topic is encoding/decoding our infoset Unicode character codes which have no real representation in the specified encoding.
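For comparison, this is how one existing codec library treats these code points. The sketch below uses only Python's standard codecs and shows Python's behaviour, not a proposal for DFDL; the variable names are illustrative.

```python
BOM = "\ufeff"

# Running the UTF-8 algorithm on U+FEFF yields an ordinary 3-byte sequence.
bom_utf8 = BOM.encode("utf-8")

# The plain "utf-8" codec keeps the code point as part of the string,
# so it contributes to the length.
kept = (bom_utf8 + b"abc").decode("utf-8")

# The separate "utf-8-sig" codec strips one leading BOM instead.
stripped = (bom_utf8 + b"abc").decode("utf-8-sig")

# A lone high surrogate is rejected by the strict codec...
lone_surrogate = "\ud800"
try:
    lone_surrogate.encode("utf-8")
    strict_accepts_surrogate = True
except UnicodeEncodeError:
    strict_accepts_surrogate = False

# ...but the "surrogatepass" error handler just runs the algorithm on it.
surrogate_bytes = lone_surrogate.encode("utf-8", "surrogatepass")
```

The "surrogatepass" case corresponds to the "run the algorithm and so be it" position; the strict case corresponds to treating such code points as errors.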
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg