Mike,
Comments in-line.
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 17/11/2014 23:11
Subject: Re: [DFDL-WG] Fw: Fw: Action 260
So I looked into what would be changed in the spec to
make the adjustment suggested in this email thread, in intent.
Rewriting the section is very undesirable at this stage
for DFDL, so I was looking for incremental changes.
What I came up with is this. In section 14.2.2 simply
drop the sentences that mention this idea of an implied SSP, because they
suggest the need to reconcile conflicting behaviors in the mixed case.
I.e., drop all "The dfdl:separatorSuppressionPolicy is not applicable
and the implied behaviour is '....'."
With the omission of these sentences, the notion of an
implied separator suppression policy that conflicts with one on the sequence
goes away. There is only the SSP property of the sequence, and it is either
relevant to the decision of whether to expect an item and its separator,
or it isn't.
The current descriptions in 14.2.2 of the behavior around suppression for
each occursCountKind seem to be correct and match the table in this email
thread, except for parsed.
I'm not sure I agree with the assertion in this email thread that OCK 'parsed'
only makes sense with 'anyEmpty' behavior. We currently say in
the spec that it has 'anyEmpty' behavior, but if we drop that sentence (per
the suggestion above), then we would be loosening the behavior to allow empty
elements in some cases.
SMH: The intent of the table is that it is
a schema definition error if 'parsed' and not 'anyEmpty'
Suppose:
<sequence dfdl:separatorSuppressionPolicy="trailingEmpty"
          dfdl:separator="|" dfdl:separatorPosition="infix"
          dfdl:terminator="%NL;">
  <element name="a" type="xs:string"
           maxOccurs="unbounded" dfdl:occursCountKind="parsed"/>
</sequence>
In this case, the array is declared last in its sequence.
The occurrences will all be element a, so this is positional. The number
of them is determined by OCK parsed, and if enabled, validation will check,
in this case, that at least 1 occurrence (the default for minOccurs) appears.
In this case, empty string is a legal value, and we're
not strict about trailing separators, so data like:
5|6|7||
is fine, isn't it? With 'parsed' you'll get 2 more empty string elements
for the array when parsing, and these will not be re-created
when unparsing, as they would be suppressed. I believe that is ok. There
are many formats that can have that sort of asymmetry.
SMH: To be clear, 'anyEmpty' already allows
empty content when parsing, it has a lax semantic
Change the above example to SSP trailingEmptyStrict, and now:
5|6|7||9
makes sense, and you get one empty string in position 4. On unparsing
this empty string would even get written out.
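
The parse/unparse asymmetry in the two examples above can be sketched in a few lines of Python. This is only an illustration, not a DFDL processor: the function names are hypothetical, '|' is assumed to be the separator throughout (as in the schema fragment), and the record terminator is assumed to have been stripped already.

```python
SEPARATOR = "|"  # assumed to match dfdl:separator in the schema fragment


def parse_occurrences(line: str) -> list[str]:
    """Split one record on the infix separator, keeping empty strings."""
    return line.split(SEPARATOR)


def unparse_trailing_suppressed(values: list[str]) -> str:
    """Join values with the separator after dropping trailing empty strings."""
    kept = list(values)
    while kept and kept[-1] == "":
        kept.pop()
    return SEPARATOR.join(kept)


# Trailing empties are parsed but not written back out.
parsed = parse_occurrences("5|6|7||")
print(parsed)
print(unparse_trailing_suppressed(parsed))

# An empty string in the middle survives the round trip.
print(unparse_trailing_suppressed(parse_occurrences("5|6|7||9")))
```

The first record comes back shorter than it went in, the second is unchanged, which is the asymmetry being discussed.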
I agree 'parsed' and SSP 'never' don't make sense together (as they don't
for OCK implicit [SMH: only if maxOccurs 'unbounded' - see Table 18 in
14.2.2]), but the other 3 SSPs seem ok to me for declared-last elements.
OCK parsed behaves (w.r.t. suppression, and ignoring defaulting and
validation) just like OCK implicit with minOccurs 0 and maxOccurs 'unbounded'.
(SMH: To be clear, defaulting and validation are independent of OCK).
If we want to preserve the current restriction that parsed
behaves like 'anyEmpty', then we can stipulate that when OCK is 'parsed',
any number of non-empty occurrences and their separators
are expected. (I would not be in favor of this.)
SMH: So what you are saying is that 'parsed'
is the same as 'implicit' (0..unbounded). There are circumstances when
this combination gives a schema definition error, as per Table 18 in 14.2.2
- this must therefore be the same for 'parsed'. It turns out that you only
end up adding one extra legal behaviour for 'parsed', namely where the element
is declared last and SSP is 'trailingEmpty/Strict'. That's the example
you quote above, but it's the only one. So is it buying much?
SMH: When IBM put the table below together
and said 'parsed' and SSP <> 'anyEmpty' is an error, it was to prevent
a subtle change in behaviour for existing schemas with 'trailingEmpty/Strict'.
Today the behaviour is 'anyEmpty', in the future it would be 'trailingEmpty/Strict'.
Hence making it an error.
As already noted in this thread, we also should add a sentence to stopValue:
"The dfdl:stopValue property must not include empty string."
The net result of these changes still isn't all that great,
but it does remove one source of confusion - the 'implied' SSP conflict.
SMH: If your 'parsed' proposal is adopted,
you would need to document the 'parsed' behaviour as some combinations
are schema definition errors. Arguably you would need the equivalent of
Tables 18 and 19.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
On Mon, Oct 13, 2014 at 10:08 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
IBM has discussed this issue at some
length internally, and come to the conclusion that the optional and array
elements in a sequence should follow the separator suppression policy (SSP),
and not have an implied SSP which is at odds with the sequence's SSP. In
the table below, cells marked with a cross imply a schema definition error,
and cells marked ok imply that there is a behaviour of the element for
that OCK which is in keeping with the SSP of the sequence.
SSP (1) \ OCK                       | fixed  | implicit | expression | parsed | stopValue (2)
never                               | ok     | ok       | ok         | x (5)  | ok (6)
trailingEmpty / trailingEmptyStrict | ok (3) | ok       | ok (4)     | x (5)  | ok (6)
anyEmpty                            | ok (3) | ok       | ok (4)     | ok     | ok
Notes:
(1) SSP property applies only to an ordered sequence. An unordered sequence
assumes 'anyEmpty' (as all optional/array elements must be 'parsed')
(2) Missing restriction - for 'stopValue' the dfdl:stopValue property must
not include empty string.
(3) maxOccurs provides count, so nothing is eligible for suppression, so
SSP has no practical effect (same as a required element)
(4) infoset provides count, so nothing is eligible for suppression, so
SSP has no practical effect (same as a required element)
(5) 'parsed' only makes sense with 'anyEmpty'
(6) Because a stop value must appear, and from (2) empty string is not
allowed, SSP has no practical effect.
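
As a rough illustration, the table above can be encoded as a lookup that a schema checker might consult. This is a sketch only: the names ALLOWED and is_schema_definition_error are hypothetical, and the footnoted conditions (1) to (6) are not modelled, only the ok / x cells.

```python
OCKS = ("fixed", "implicit", "expression", "parsed", "stopValue")

# True = "ok", False = "x" (schema definition error), row by row as proposed.
ALLOWED = {
    "never":               dict(zip(OCKS, (True, True, True, False, True))),
    "trailingEmpty":       dict(zip(OCKS, (True, True, True, False, True))),
    "trailingEmptyStrict": dict(zip(OCKS, (True, True, True, False, True))),
    "anyEmpty":            dict(zip(OCKS, (True, True, True, True, True))),
}


def is_schema_definition_error(ssp: str, ock: str) -> bool:
    """Return True when the SSP/OCK pair is marked 'x' in the table."""
    return not ALLOWED[ssp][ock]


for ssp in ALLOWED:
    errors = [ock for ock in OCKS if is_schema_definition_error(ssp, ock)]
    print(ssp, "->", errors or "no errors")
```

Under this encoding the only error cells are 'parsed' combined with any SSP other than 'anyEmpty'.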
The issue of maxOccurs = '0' is discussed in a separate email.
Regards
Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 13/10/2014 14:40 -----
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 29/08/2014 21:21
Subject: Re: [DFDL-WG] Fw: Action 260
I reviewed this. It looks good to me.
The note at the bottom, that we don't say what happens on a zero-trip (i.e.,
a represented element where occursCount evaluates to 0), is also a useful
clarification.
Do we want to create an erratum for this?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
On Thu, Aug 28, 2014 at 10:13 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
Please review for Tuesday's WG call ...
Regards
Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 28/08/2014 15:02 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg@ogf.org
Date: 06/08/2014 13:50
Subject: Fw: [DFDL-WG] Action 260
So my suggestion below, to wrap the array in a sequence, does not work;
it just moves the problem down into the new sequence.
After much deliberation, we think that the definitions of Positional sequence
and Non-positional sequence should not be viewed as driving the behaviour
of a sequence, but simply as the resultant characteristics of a sequence
that has certain properties. That leaves modellers free to mix occursCountKinds,
as in Tim's example. No need for any new SDE scenarios.
Positional sequence - Each occurrence in the sequence can be
identified by its position in the data. Typically the components of such
a sequence do not have an initiator. In some such sequences, the separators
for optional zero-length occurrences may or must be omitted when at the
end of the group. In DFDL, a sequence is considered positional if it
contains only required elements and/or optional and array elements that
have dfdl:occursCountKind 'implicit', 'fixed' or 'expression', and it has
dfdl:separatorSuppressionPolicy 'never', 'trailingEmptyStrict' or
'trailingEmpty'.
Non-positional sequence - Occurrences
in the sequence cannot be identified by their position in the data alone.
Often the components of such a sequence have an initiator. Such sequences
sometimes allow the separator to be omitted for optional zero-length
occurrences anywhere in the sequence. Speculative parsing might need to be
employed by the parser to identify each occurrence. In DFDL, a sequence is
non-positional if it contains any optional or array elements that have
dfdl:occursCountKind 'parsed' or 'stopValue', and/or it has
dfdl:separatorSuppressionPolicy 'anyEmpty'.
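
Read as a classification rule, the two definitions above could be sketched as follows. The function name and its arguments are hypothetical; the sequence is described only by its SSP and by the occursCountKind of each optional or array child (required elements contribute nothing to the list).

```python
POSITIONAL_OCKS = {"implicit", "fixed", "expression"}
POSITIONAL_SSPS = {"never", "trailingEmptyStrict", "trailingEmpty"}


def is_positional(ssp: str, optional_or_array_ocks: list[str]) -> bool:
    """True if the sequence is positional under the proposed definition.

    Anything that is not positional is non-positional: it has SSP 'anyEmpty'
    and/or a child with occursCountKind 'parsed' or 'stopValue'.
    """
    return ssp in POSITIONAL_SSPS and all(
        ock in POSITIONAL_OCKS for ock in optional_or_array_ocks
    )


print(is_positional("trailingEmpty", ["expression", "fixed"]))
print(is_positional("trailingEmpty", ["expression", "parsed"]))
print(is_positional("anyEmpty", []))
```

Note that the function only reports a characteristic of the sequence; in line with the message above, it does not reject any combination.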
See parallel email for action 261 that ensures 'expression' behaves itself.
One behaviour is missing from the spec. For a sequence with separators,
what is expected in the data stream if occursCountKind = 'fixed' / 'implicit'
and maxOccurs = '0', or occursCountKind = 'expression' and occursCount
evaluates to 0? We believe that no separator should be expected
when parsing and none output when unparsing (same behaviour as inputValueCalc).
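
A minimal sketch of the suggested unparse behaviour, assuming an infix separator and representing each child of the sequence as a list of its occurrence values (a hypothetical model, not how a DFDL processor is structured): a child with zero occurrences contributes neither content nor a separator.

```python
def unparse_sequence(children: list[list[str]], separator: str = ",") -> str:
    """Join all occurrences with an infix separator.

    A child with zero occurrences adds nothing, so no separator is
    written on its behalf.
    """
    occurrences = [value for child in children for value in child]
    return separator.join(occurrences)


# The middle child is an array whose count evaluated to 0.
print(unparse_sequence([["a"], [], ["b", "c"]]))
```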
Regards
Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 06/08/2014 12:42 -----
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 30/06/2014 10:04
Subject: Re: [DFDL-WG] Action 260
You would wrap the array and its count in a sequence. Then the 'count+array'
is treated as a single entity as far as the parent sequence is concerned.
Regards
Steve Hanson
From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org
Date: 26/06/2014 20:06
Subject: Re: [DFDL-WG] Action 260
Sent by: dfdl-wg-bounces@ogf.org
Before we settle one way or the other, I would like the following data
format to be taken into consideration.
chars,5,A,B,C,D,E,integers,1,2,3
chars,3,C,,,integers,2,10,11
I am assuming that the occursCountKind for the arrays is 'expression' and
the occursCount refers to the integer field that precedes the array. In
order to represent the empty strings on the second line it is essential
to specify SSP as 'trailingEmpty' or 'never'. If we disallow the combination
of ock='expression' and SSP='trailingEmpty' then how would this format
be modelled?
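
To make the point concrete, here is a small sketch of reading the second sample line with count-prefixed arrays. It is an illustration only: the function name is hypothetical, and it assumes every occurrence named by the count has its own separator (nothing is suppressed), which is what lets the empty strings be recovered.

```python
def parse_counted_arrays(line: str, separator: str = ",") -> dict[str, list[str]]:
    """Read repeated groups of: name, count, then `count` array occurrences."""
    fields = line.split(separator)
    result: dict[str, list[str]] = {}
    pos = 0
    while pos < len(fields):
        name = fields[pos]
        count = int(fields[pos + 1])  # plays the role of dfdl:occursCount
        start = pos + 2
        result[name] = fields[start:start + count]
        pos = start + count
    return result


print(parse_counted_arrays("chars,3,C,,,integers,2,10,11"))
```

If separators for the empty occurrences could be dropped, the count of 3 would swallow the 'integers' tag and the line could no longer be read this way.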
regards,
Tim Kimber,
Technical Lead for IBM Integration Bus Healthcare Pack
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 25/06/2014 16:25
Subject: Re: [DFDL-WG] Action 260
Sent by: dfdl-wg-bounces@ogf.org
I prefer choice (a) for two reasons:
* It is more restrictive and therefore more conservative (preserving freedom
to change in future if needed)
* If a user has a positional data format, you don't want them to even have
to understand the concept of speculation in order to model their data.
So choice (a) allows a simpler description that doesn't need to introduce
the notion that the parser might be speculating.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
On Wed, Jun 25, 2014 at 5:20 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
260 | Positional and non-positional sequences (All)
    | 10/6: Spec defines the above but also allows different occursCountKinds
    | within the same sequence which may have different (implied)
    | separatorSuppressionPolicy, which results in a sequence which is a
    | mixture of both. Should this be allowed? If so what are the rules? Can
    | certain combinations be disallowed?
    | 17/6: IBM have discussed internally and will submit a proposal.
In the spec we define Positional Sequence
and Non-Positional Sequence:
Positional sequence - Each
occurrence in the sequence can be identified by its position in the data.
Typically the components of such a sequence do not have an initiator. In
some such sequences, the separators for optional zero-length occurrences
may or must be omitted when at the end of the group. A positional sequence
can be modelled by setting dfdl:separatorSuppressionPolicy to 'never',
'trailingEmptyStrict' or 'trailingEmpty'.
Non-positional sequence - Occurrences
in the sequence cannot be identified by their position in the data alone.
Typically the components of such a sequence have an initiator. Such sequences
allow the separator to be omitted for optional zero-length occurrences
anywhere in the sequence. Speculative parsing is employed by the parser
to identify each occurrence. A non-positional sequence can be modelled
by setting dfdl:separatorSuppressionPolicy to 'anyEmpty'.
The problem is that the setting of dfdl:separatorSuppressionPolicy
is only examined for child elements with dfdl:occursCountKind 'implicit'.
For other dfdl:occursCountKinds, there is the concept of an 'implied'
dfdl:separatorSuppressionPolicy:
When dfdl:occursCountKind is 'fixed' then ... the implied behaviour is 'never'.
When dfdl:occursCountKind is 'expression' ... the implied behaviour is 'never'.
When dfdl:occursCountKind is 'parsed' ... the implied behaviour is 'anyEmpty'.
When dfdl:occursCountKind is 'stopValue' ... the implied behaviour is 'anyEmpty'.
So if a Positional sequence as defined
above contains children with dfdl:occursCountKind 'parsed' or 'stopValue'
then surely it is no longer a Positional sequence.
A solution to this is to prevent the
appearance of certain values of dfdl:occursCountKind within a Positional
sequence. However, precisely which values to outlaw is subject to interpretation
of the phrase "Each occurrence
in the sequence can be identified by its position in the data".
Is this intended to mean:
a) an observer of the raw data can
identify an occurrence of an element in the sequence solely by counting
separators
=> SDE if 'parsed', 'stopValue' or
'expression' ** appeared in a Positional sequence;
** Although 'expression' would appear
to be like 'fixed' it actually breaks a) so must be included in the SDE
list.
or
b) a parser does not have to speculate
to identify an occurrence of an element in the sequence
=> SDE only if 'parsed' appeared in
a Positional sequence.
Note that it is possible to wrap a 'parsed'
etc element in a local sequence or another element to avoid an SDE. But
this could still be seen as a violation of a) if the separators of both
are the same, as the observer can not count the separators. So should the
rule be applied recursively, ie, a Positional sequence can not contain
a non-Positional sequence unless the separators are different?
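
For comparison, the two interpretations (a) and (b) could be expressed as a simple check over the occursCountKind values of the children of a Positional sequence. This is a sketch with hypothetical names; it does not attempt the recursive rule for nested sequences raised in the last paragraph.

```python
# OCK values that would give an SDE inside a Positional sequence, keyed by
# the interpretation label used in the message above.
SDE_OCKS = {
    "a": {"parsed", "stopValue", "expression"},
    "b": {"parsed"},
}


def offending_ocks(interpretation: str, child_ocks: list[str]) -> list[str]:
    """Return the child OCK values that would trigger an SDE, sorted."""
    banned = SDE_OCKS[interpretation]
    return sorted({ock for ock in child_ocks if ock in banned})


children = ["fixed", "expression", "parsed"]
print(offending_ocks("a", children))
print(offending_ocks("b", children))
```

Interpretation (a) rejects strictly more schemas than (b), since its banned set is a superset.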
Regards
Steve Hanson
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg