Thanks Mike. I have added a couple of comments below...we can discuss
fully on the calls.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 15/06/2011 01:59
Subject: Issues to add to work items list
Steve,
Below is the list I have so far of items in the spec that go beyond just
typos - where re-wording is required or advisable. All but the last of
these are tied up with the length & delimiters issue, but are separate from
the "known to exist" vs. "missing" topic.
I've not bothered to provide anything about the section on "known to
exist" and such. That section is already the subject of a work
item/topic.
The last one is just a nit about BOMs.
...mikeb
--------------------------------------------------------------------------------------------
5.2.2 MinLength, MaxLength
These facets are used:
When dfdl:lengthKind=”implicit”. In that case the length is given by the
value of xs:maxLength. In this case minLength if specified is required to
be equal to maxLength (schema definition error otherwise).
For validation of variable length string elements.
It is a processing error when a fixed-length string is found to have a
number of characters not equal to the fixed number. For example, if a
fixed-length string also has delimiters we might be able to successfully
separate it from the surrounding elements depending on the delimiter
specifications; however, if the length of the fixed-length string is not
equal to the number specified as the fixed length then it is a processing
error (not simply a validation error).[MB1]
[MB1]Contradicts statement that scanning for delimiters is off. (Discussed
where dfdl:lengthKind=’explicit’ is described)
What is a fixed length string?
Clearly if it has lengthKind=”explicit” it is fixed length.
What if it has lengthKind=”implicit” and maxLength=”10”. Is that a fixed
length string which shuts off delimiter scanning also? If so then this
paragraph is erroneous and misleading.
<SMH>Yes, Tim and I have noted that this paragraph needs revising
depending on the outcome of action 139 <SMH>
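To make the disputed paragraph concrete, here is a minimal Python sketch of the rule as it is currently worded: the element is located by scanning for a terminator, and a length mismatch is then raised as a processing error. The names ProcessingError and parse_fixed_string are hypothetical, and the sketch takes no position on whether scanning should be switched off for fixed-length strings.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def parse_fixed_string(data: str, terminator: str, fixed_length: int) -> str:
    # Locate the element by scanning for its terminator, as the quoted
    # paragraph imagines for a fixed-length string that also has delimiters.
    end = data.find(terminator)
    if end == -1:
        raise ProcessingError("terminator not found")
    content = data[:end]
    # Per the quoted wording, a length mismatch is a processing error,
    # not merely a validation error.
    if len(content) != fixed_length:
        raise ProcessingError(
            f"expected {fixed_length} characters, found {len(content)}"
        )
    return content
```

With a fixed length of 5 and terminator ";", the input "abcde;rest" yields "abcde", while "abcd;rest" raises ProcessingError.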
--------------------------------------------------------------------------------------
9.2 DFDL Syntax Grammar
Change to introduce concept of EnclosedItem or ChildItem (I used
EnclosedItem below):
Sequence = LeftFraming SequenceContent RightFraming
SequenceContent = [ PrefixSeparator EnclosedItem [ Separator EnclosedItem
]* PostfixSeparator ] FinalUnusedRegion
EnclosedItem = Element | Array | ComplexContent[MB1]
Choice = LeftFraming ChoiceContent RightFraming
ChoiceContent = [ EnclosedItem ] FinalUnusedRegion[MB2]
[MB1]Refactored to share the EnclosedItem concept to Choices also.
Should perhaps be named ChildItem.
This is useful when discussing how parsing, defaulting, etc. work as well.
<SMH>Seems fine, it is effectively just a renaming<SMH>
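As a quick check on the refactoring, the sketch below records which symbols each proposed production mentions and confirms that Sequence and Choice both reach EnclosedItem. The REFERENCES table and the reachable function are illustrative names only; repetition and optionality in the productions are deliberately ignored.

```python
# Symbols mentioned on the right-hand side of each proposed production.
REFERENCES = {
    "Sequence": {"LeftFraming", "SequenceContent", "RightFraming"},
    "SequenceContent": {
        "PrefixSeparator", "EnclosedItem", "Separator",
        "PostfixSeparator", "FinalUnusedRegion",
    },
    "EnclosedItem": {"Element", "Array", "ComplexContent"},
    "Choice": {"LeftFraming", "ChoiceContent", "RightFraming"},
    "ChoiceContent": {"EnclosedItem", "FinalUnusedRegion"},
}


def reachable(start: str) -> set:
    """Return every symbol reachable from `start` through the productions."""
    seen = set()
    stack = [start]
    while stack:
        symbol = stack.pop()
        for ref in REFERENCES.get(symbol, ()):
            if ref not in seen:
                seen.add(ref)
                stack.append(ref)
    return seen
```

Both reachable("Sequence") and reachable("Choice") contain EnclosedItem and, through it, Element, Array and ComplexContent.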
---------------------------------------------------------------------------------------------------------------------------------------------------------------------
Table 14 Implicit Alignment in bits
Note: Specifying the implicit alignment in bits does not imply that
dfdl:lengthUnits 'bits' can be specified for all simple types.[MB1]
[MB1]I do not understand this comment. What exactly is the restriction?
<SMH>It is really saying that alignmentUnits and lengthUnits are
independent and have their own rules for when they are applicable. <SMH>
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------
lengthKind
Enum
Controls how the representation length of the component is determined.
Valid values are: 'explicit', 'delimited', 'prefixed', 'implicit',
'pattern', 'endOfParent'
A full description of each enumeration is given in the later sections.
'explicit' means the length of the item is given by the dfdl:length
property
'delimited' means the item is delimited by a terminator or separator[MB1]
‘prefixed’ means the length of the item is given by an immediately
preceding prefix field specified using prefixLengthType.
‘implicit’ means the length is to be determined in terms of the type of the
element and its schema-specified properties if any.
‘pattern’ means the length of the item is given by a regular expression
specified using the dfdl:lengthPattern property.
‘endOfParent’ means that the item is terminated by the termination of the
containing construct.
Annotation: dfdl:element, dfdl:simpleType
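The enumeration above can be read as a dispatch on how the content region is found. The following Python sketch is a heavily simplified illustration, not the spec's algorithm: it works in bytes only, assumes a one-byte unsigned prefix for 'prefixed', takes the type-derived length for 'implicit' as a parameter, and treats the buffer passed in as the parent's remaining content. LengthKind, content_of and all parameter names are hypothetical.

```python
import re
from enum import Enum


class LengthKind(Enum):
    EXPLICIT = "explicit"
    DELIMITED = "delimited"
    PREFIXED = "prefixed"
    IMPLICIT = "implicit"
    PATTERN = "pattern"
    END_OF_PARENT = "endOfParent"


def content_of(data: bytes, kind: LengthKind, *, length=None,
               terminator=None, pattern=None, implicit_length=None) -> bytes:
    """Return the content region at the start of `data` for one element."""
    if kind is LengthKind.EXPLICIT:
        return data[:length]                 # length given by dfdl:length
    if kind is LengthKind.DELIMITED:
        end = data.find(terminator)          # scan for the delimiter
        if end == -1:
            raise ValueError("delimiter not found")
        return data[:end]
    if kind is LengthKind.PREFIXED:
        n = data[0]                          # assumed 1-byte length prefix
        return data[1:1 + n]
    if kind is LengthKind.IMPLICIT:
        return data[:implicit_length]        # length derived from the type
    if kind is LengthKind.PATTERN:
        match = re.match(pattern, data)      # regex anchored at the start
        if match is None:
            raise ValueError("pattern did not match")
        return match.group(0)
    if kind is LengthKind.END_OF_PARENT:
        return data                          # runs to the end of the parent
    raise ValueError(f"unknown lengthKind: {kind}")
```

For example, content_of(b"abc,def", LengthKind.DELIMITED, terminator=b",") returns b"abc", and content_of(b"\x03abcdef", LengthKind.PREFIXED) returns b"abc".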
[MB1]To me this is a very strong statement.
It means that an outside-in parse is allowed where for a sequence, we can
scan and determine its end, and then parse the children.
It requires that “scan” is a well-defined concept for the contents of the
sequence.
It means there can be nothing inside which requires the suspension of
scanning.
It means contained elements that have length explicit and representation
binary, are simply not allowed.
<SMH>Not entirely true, we allow some binary types to have delimited
lengthKind (eg, BCD and Packed Decimals). There are formats out there that
require this. <SMH>
To me it also means you cannot change the character set encoding, or have
a contained element that itself uses an overlapping set of delimiters with
the enclosing group’s delimiters. – Is that going too far? –
Delimited is like pattern. It restricts what is in the data stream
substantially.
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------
The rules for resolving ambiguity between delimiters are:
1. When two delimiters have a common prefix, the longest delimiter
has precedence.
2. When two delimiters have exactly the same value, the innermost
(most deeply nested) delimiter has precedence.
3. When the separator and terminator on a group have the same value,
the separator has precedence.[MB1]
[MB1]By precedence, this must mean it is tried first, but the parser
might backtrack and assume it to be a terminator instead.
This seems problematic to me. I’d like to either rule this out and say
they can’t be the same, or see a use case where this is needed, a
backtracking parser can parse it, and there’s no more reasonable way to
structure the schema.
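The three rules can be sketched as a single ordering over the delimiters that match at the current position. This Python sketch only picks which delimiter is tried first; it does not model the backtracking discussed in the comment. The Delimiter class, the depth convention (larger means more deeply nested) and the choose function are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Delimiter:
    value: str
    depth: int   # nesting depth of the owning group; larger is more nested
    kind: str    # "separator" or "terminator"


def choose(data: str, pos: int,
           in_scope: Sequence[Delimiter]) -> Optional[Delimiter]:
    """Pick the delimiter that takes precedence at `pos`, or None."""
    candidates = [d for d in in_scope if data.startswith(d.value, pos)]
    if not candidates:
        return None
    # Rule 1: longest match wins. Rule 2: then the innermost group.
    # Rule 3: then separator over terminator.
    return max(
        candidates,
        key=lambda d: (len(d.value), d.depth, d.kind == "separator"),
    )
```

For instance, with a separator "," and a terminator ",," in scope, choose("a,,b", 1, ...) returns the terminator because it is the longer match.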
------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
12.3.5.1 Pattern-Based Lengths - Scanability
Any element (complex, simple text, simple binary) may have a
dfdl:lengthKind 'pattern' as long as the bytes in the content region of
the element are legal in the stated encoding of that element. Where a
complex element has children with binary representation in practice this
means an 8-bit ASCII encoding. [MB1]
[MB1]Not necessarily ASCII. An 8-bit encoding such that any byte value is
valid is the real requirement. The point is no single byte value is
invalid, and no combinations of adjacent byte values are invalid so that
any binary data won’t trip up a character conversion and subsequent scan.
Hmmm. The 8-bit character set used must have a transformation into Unicode
which is bijective and information preserving. I.e., a unique Unicode
character for each code point, and no “invalid” characters which have no
corresponding Unicode value. However, ASCII-based sets like 8859-1 are
not, strictly speaking, required.
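The requirement can be tested mechanically. The sketch below uses Python's built-in codecs to check that every one of the 256 byte values decodes to exactly one distinct character and encodes back to the same byte. The function name is_byte_transparent is hypothetical, and the results reflect Python's codec tables rather than any DFDL implementation.

```python
def is_byte_transparent(encoding: str) -> bool:
    """True if every single byte maps to a unique character and back."""
    seen = set()
    for value in range(256):
        raw = bytes([value])
        try:
            text = raw.decode(encoding)
        except UnicodeDecodeError:
            return False          # some byte value has no character
        if len(text) != 1 or text in seen:
            return False          # not one-to-one
        if text.encode(encoding) != raw:
            return False          # round trip loses information
        seen.add(text)
    return True
```

Under Python's codecs, "latin-1" (ISO 8859-1) and the EBCDIC code page "cp037" pass, while "ascii", "utf-8" and "cp1252" fail because some byte values are undefined or invalid on their own.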
-----------------------------------------------------------------------------------------------------------------------------------------------
12.3.7.1.3 Byte Order Mark
If a byte-order mark codepoint appears at the start of a UTF-8, [MB1]
UTF-16 or UTF-32 encoded string then the byte-order mark will be included
as part of the string payload[1]. That is, for the UTF-8, UTF-16 and
UTF-32 character encodings, a byte-order-mark codepoint is treated as a
character of the string in DFDL and contributes to the length.
A way of eliminating the byte-order mark so that it does not end up in the
infoset is that the byte-order mark can be modeled as a separate element
before the string. This BOM element can be either required or optional
depending on whether one is expected or optional at the beginning of the
string.
[1] Byte-order marks are explicitly stated to be “not characters” in the
Unicode standard.
[MB1]No such thing as a BOM codepoint in a UTF-8 string. A UTF-8 byte
sequence might encode the character code for a BOM, but this would be a
meaningless inclusion of a BOM character code in a context where it will
never be interpreted.
I suggest that we drop the term UTF-8 here. A BOM that is interpreted as
a character code, and translated by the UTF-8 encoding algorithm into a
multi-byte UTF-8 byte sequence, should be handled the same way as other
non-characters, i.e., the same way as when a high or low surrogate
codepoint is present and we’re to encode as UTF-8. I think the answer is
we run the UTF-8 encode/decode algorithm, and whatever Unicode character
code it creates is what it creates, and if that happens to come out as any
of the non-characters (BOMs, surrogates, others perhaps), so be it.
The topic is about Unicode non-characters, not specifically BOMs.
The general topic is encoding/decoding our infoset Unicode character codes
which have no real representation in the specified encoding.
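For comparison only, this is how Python's standard codecs treat the same cases. It illustrates the two behaviours under discussion (keep the BOM code as a character that counts toward length, or consume it separately) and one possible treatment of surrogates. It is not a statement of what DFDL does or should do.

```python
data = b"\xef\xbb\xbfabc"            # U+FEFF encoded as UTF-8, then "abc"

# Plain UTF-8 decoding keeps U+FEFF as the first character of the string,
# so it contributes to the length.
kept = data.decode("utf-8")

# The "utf-8-sig" codec consumes a leading BOM, much like modelling it as
# a separate element in front of the string.
stripped = data.decode("utf-8-sig")

# A lone surrogate is rejected by the strict UTF-8 encoder...
try:
    "\ud800".encode("utf-8")
    surrogate_rejected = False
except UnicodeEncodeError:
    surrogate_rejected = True

# ...but the "surrogatepass" error handler simply runs the algorithm.
passed_through = "\ud800".encode("utf-8", "surrogatepass")
```

Here kept has four characters and starts with U+FEFF, stripped is "abc", and passed_through is the three-byte sequence ED A0 80.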