For discussion on next DFDL WG call.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh(a)uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 07/01/2013 17:32 -----
From: Mike Beckerle <mbeckerle.dfdl(a)gmail.com>
To: Steve Hanson/UK/IBM@IBMGB,
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 11/12/2012 17:15
Subject: Re: Editorial improvements for section 14.2
Some added discussion on top of Steve's on the 14.2 separator property.
From: Tim Kimber/UK/IBM
To: mbeckerle.dfdl(a)gmail.com, Steve Hanson/UK/IBM@IBMGB,
Date: 10/12/2012 15:14
Subject: Editorial improvements for section 14.2
A couple of things that I noticed while looking through the specification
today:
14.2 Title
Section title should really be 'Sequence groups with separators'.
SMH: Agree
14.2 Description of 'separator' property
"Specifies a whitespace separated list of alternative literal strings that
are the possible separators between a sequence of elements or multiple
occurrences of an element."
A separator applies to all members of a group, but this only talks about
elements.
Suggestion: "Specifies a list of alternative separator values for the
group. Each separator value is a DFDL string literal. If there is more
than one separator in the list then the values are separated by white
space."
I purposely omitted the point about multiple occurrences; I think it needs
a separate description, unless we think that the tables make it clear
enough.
SMH: The wording here is very like that for initiator and terminator. The
property type already has said that the strings are DFDL string literals.
So I would say:
"Specifies a whitespace separated list of alternative literal strings that
are the possible separators for the sequence. Separators occur in the data
either before, between or after all occurrences of the elements or groups
that are the children of the sequence."
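To make "before, between or after" concrete, here is a rough Python sketch of how an unparser might place one separator around the children of a sequence. It assumes the placement is driven by the separatorPosition property (prefix, infix, postfix); the function name join_children is made up for illustration, and suppression policies and optional children are ignored.

```python
def join_children(children, separator, position="infix"):
    """Place `separator` around already-unparsed child strings.

    Hypothetical helper: `position` mirrors the three cases in the
    proposed wording (before / between / after the children).
    """
    if position == "prefix":
        # A separator before every child.
        return "".join(separator + child for child in children)
    if position == "infix":
        # A separator only between adjacent children.
        return separator.join(children)
    if position == "postfix":
        # A separator after every child.
        return "".join(child + separator for child in children)
    raise ValueError("unknown position: " + position)


if __name__ == "__main__":
    for pos in ("prefix", "infix", "postfix"):
        print(pos, repr(join_children(["a", "b", "c"], ",", pos)))
```

The children here could equally be elements or nested groups, which is the point of the reworded description.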
14.2 Description of 'separator' property
"This property can be computed by way of an expression which returns a
string of whitespace separated values.
It is a Schema Definition Error if the expression returns an empty string.
The expression must not contain forward references to elements which have
not yet been processed."
The later sentence about expressions that return an empty string could
then be removed - I think it belongs in this paragraph.
Also, there is a change in the text style midway through the paragraph.
14.2 Description of 'separator' property
"When parsing, the list of values is processed in a greedy manner, meaning
it takes all the separators, that is, each of the string literals in the
white space separated list, and matches them each against the data. In
each case the longest possible match is found. The separator with the
longest match as the one that is selected as having been ‘found’, with
length-ties being resolved so that the matching separator is selected that
is first in the order written in the schema. Once a matching separator is
found, no other shorter matches will be subsequently attempted (ie, there
is no backtracking to try parsing based on shorter separator matches)."
I don't know what the correct wording is, but this is not it :-)
This is a very complex piece of logic to describe, but it is fairly
central to the parsing algorithms. If we don't get it right then we will
end up with divergent DFDL implementations. I honestly don't know where or
how we should be describing the delimiter parsing logic - can we discuss
on the next WG call?
SMH: This paragraph is solely describing how the matching works, not
anything else. It is independent of lengthKind. This wording was agreed
under errata 2.70 and is used for initiator and terminator as well. What
specifically is the issue?
MB: It's really unfortunate that there's this ambiguity about length-ties.
But those can come up due to the character class entities. I.e., I can
write separator="%SP;|%SP; %WSP+;|%WSP+;" and both those would match as a
separator of a and b in "a | b".
MB: However, I'm not sure the above purple wording about length-ties is
really needed. If a separator longest-matches, we're done. We don't really
care if there are two separator patterns that are ambiguous and can match
the same thing. If they both match the same 'longest' match, then the
separator was found.
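For the WG discussion, the matching rule under debate can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the spec's algorithm: the entity table and the helper names (ENTITY_REGEX, literal_to_regex, match_separator) are invented here, only the entities used in this thread are handled, %WSP+; is approximated as spaces and tabs, and Python's greedy regex match is taken as the longest match for each literal, which holds for simple patterns like these.

```python
import re

# Hypothetical mapping from a few DFDL entities to regex fragments.
# Assumption: whitespace classes are approximated by space and tab only.
ENTITY_REGEX = {
    "%SP;": " ",
    "%WSP;": r"[ \t]",
    "%WSP+;": r"[ \t]+",
    "%WSP*;": r"[ \t]*",
}


def literal_to_regex(literal):
    """Translate one separator literal into a regex source string."""
    out = []
    i = 0
    while i < len(literal):
        for entity, fragment in ENTITY_REGEX.items():
            if literal.startswith(entity, i):
                out.append(fragment)
                i += len(entity)
                break
        else:
            # Not an entity we know: treat the character literally.
            out.append(re.escape(literal[i]))
            i += 1
    return "".join(out)


def match_separator(separator_property, data, pos):
    """Return (literal, matched_text) for the longest match at pos, or None."""
    best = None
    for literal in separator_property.split():
        m = re.compile(literal_to_regex(literal)).match(data, pos)
        if m is None or m.end() == pos:
            continue  # no match, or a zero-length match
        # Strict '>' keeps the literal written first in the schema on a tie.
        if best is None or len(m.group()) > len(best[1]):
            best = (literal, m.group())
    return best


if __name__ == "__main__":
    prop = "%SP;|%SP; %WSP+;|%WSP+;"
    print(match_separator(prop, "a | b", 1))
    print(match_separator(prop, "a  |  b", 1))
```

In the "a | b" case both literals match the same three characters, so the tie-break only changes which literal is reported, not how much data is consumed. That seems to support the view that the length-tie wording has no observable effect on parsing.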
14.2 Description of 'separator' property
"If a child element uses an escape scheme, then the escape scheme also
applies to any separator."
What does this mean? Can we remove it?
SMH: It means that when unparsing a child element then an occurrence of
the separator in the value will be escaped.
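As a small illustration of that reading, the Python sketch below escapes separator occurrences inside a value during unparsing. It assumes an escape-character style scheme with backslash as the escape character, and it doubles the escape character to escape itself; both are assumptions for the example, and the function name escape_value is hypothetical.

```python
def escape_value(value, separators, escape_char="\\"):
    """Escape any separator (and the escape character) found in `value`.

    Assumptions: escape-character scheme, backslash by default, and the
    escape character is escaped by doubling it.
    """
    out = []
    i = 0
    # Try longer separators first so a short one does not split a long one.
    ordered = sorted(separators, key=len, reverse=True)
    while i < len(value):
        if value[i] == escape_char:
            out.append(escape_char + escape_char)
            i += 1
            continue
        for sep in ordered:
            if sep and value.startswith(sep, i):
                # Prefix the separator occurrence with the escape character.
                out.append(escape_char + sep)
                i += len(sep)
                break
        else:
            out.append(value[i])
            i += 1
    return "".join(out)


if __name__ == "__main__":
    print(escape_value("Smith, John", [","]))
```

The separators passed in would be those of the enclosing sequence, while the escape scheme itself comes from the child element.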
14.1 Empty Sequences
Doesn't seem right to have this as the very first sub-section. Can we make
it the last, and move the other sections up by one? Or at least swap it
with 14.2?
SMH: I don't see it makes much difference where it goes. So on the grounds
of spec renumbering I'd prefer if it stayed where it is.
regards,
Tim Kimber, DFDL Team,
Hursley, UK
Internet: kimbert(a)uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
Mike Beckerle | OGF DFDL WG Co-Chair | Tresys Technologies
Tel: 781-330-0412