Closed. Agreed that the statement in 16.1.4 that occursCountKind 'expression'
behaves like 'parsed' when unparsing is misleading and needs rewording. The
count is the number of occurrences in the augmented infoset. Separators are
never suppressed, as already stated in 14.2.3. Erratum 3.11 updated. To be
included in draft r23.
Proposal to evaluate the occursCount
expression during unparsing as an extra 'validation' check was not accepted.
However, Mike has noticed that the spec does not explicitly state that
asserts and discriminators are not evaluated during unparsing. Erratum
2.166 created. To be included in draft r23.
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 03/09/2014 13:15 -----
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB,
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 29/08/2014 22:20
Subject: Re: [DFDL-WG] Fw: Action 261
I also reviewed this.
I like the symmetry improvement of OCK 'expression' unparsing using the 'never'
suppression policy and taking the occurs count from the number of items
in the augmented infoset.
I don't like the evaluation of occurs expression on unparsing
as a check, and length expression evaluation on unparsing as a check. This
allows unparsing of data that cannot be parsed again. But there are many
many such holes, and we can't plug them all.
We don't evaluate assert/discriminator tests when unparsing,
so why these other checks?
(Note: I just did a quick search through the spec for
all uses of "assert", and didn't find a statement that says they
are only evaluated when parsing.)
If we want to give the option for unparsing to evaluate occurs and length
expressions (and asserts/discrims too?) that's worth consideration, but
I'd prefer that this be an implementation-specific flag/mode and not part
of the standard (for now.)
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology
| www.tresys.com
Please note: Contributions to the DFDL Workgroup's email
discussions are subject to the OGF
Intellectual Property Policy
On Thu, Aug 28, 2014 at 10:13 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
Please review for Tuesday's WG call...
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 28/08/2014 15:01 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg@ogf.org,
Date: 06/08/2014 13:50
Subject: Fw: [DFDL-WG] Action 261
In the spec as it stands, occursCountKind 'expression' has behaviour like
occursCountKind 'fixed' when parsing (ie, there is a count that is fixed
before the occurrences are parsed), but is also stated to have behaviour
like occursCountKind 'parsed' when unparsing (ie, there is no count that
is fixed before occurrences are unparsed). In terms of expected separators,
it behaves like 'never' when parsing, but like 'anyEmpty' when unparsing.
This leads to undesirable asymmetric behaviour, such as this example.
Data stream is 5,A,B,,D,E. To preserve indexes in the infoset after parsing,
I make the array element nillable with ES as nil value. But on unparsing
I will get 5,A,B,D,E because the 'anyEmpty' behaviour has suppressed the
output of the separator for the zero-length nil value. Not only will modellers
find this behaviour odd, it breaks dfdl:outputValueCalc on the count element
if fn:count() is used. (Same happens if not nillable but minOccurs >
2 so empty string ends up in the infoset).
The proposal is that 'expression' behaves like 'never' on unparsing, the
count being taken as the number of items in the augmented infoset.
That way the application or outputValueCalc can be certain that no separators
will be suppressed, and outputValueCalc will work.
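The asymmetry can be illustrated with a small Python sketch. It is not DFDL processor code: the function name `unparse_array` and the way the two policies are modelled are assumptions made for illustration, and a nil or empty occurrence is represented as an empty string.

```python
def unparse_array(count_value, occurrences, policy, sep=","):
    """Toy model of unparsing a count element followed by a separated array.

    policy 'never'    : every occurrence is written, even a zero-length one.
    policy 'anyEmpty' : a zero-length occurrence is dropped with its separator.
    """
    if policy == "never":
        written = list(occurrences)
    elif policy == "anyEmpty":
        written = [occ for occ in occurrences if occ != ""]
    else:
        raise ValueError("unknown policy: " + policy)
    return sep.join([str(count_value)] + written)


# Infoset obtained by parsing 5,A,B,,D,E with a nillable array element.
infoset = ["A", "B", "", "D", "E"]
count = len(infoset)  # what fn:count() in an outputValueCalc would give

print(unparse_array(count, infoset, "anyEmpty"))  # 5,A,B,D,E
print(unparse_array(count, infoset, "never"))     # 5,A,B,,D,E
```

With 'anyEmpty' the count field still says 5 but only four occurrences are written, so the output no longer parses back to the same infoset; with 'never' the original data stream is reproduced.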
It was proposed further down this email thread that the expression is evaluated
at the start of unparsing the element and used to obtain the count. There
are scenarios where that causes problems though, hence the above is preferred.
However, to ensure that the output data stream can be re-parsed, it is
proposed that the expression is evaluated at the end of the unparsing
of the element (and any elements with expressions dependent on it), and
if it fails to match the number that was output, it is a processing error.
Similarly for lengthKind 'explicit' where length is an expression. The
element continues to be considered variable length on unparsing as the
spec says today, but the expression is evaluated at the end of the unparsing
of the element, and if it fails to match the length that was output, it
is a processing error.
It would be great to do the analogous thing for lengthKind 'pattern' but
it is not always possible as some patterns look ahead beyond the element
content in order to match. So it could be done for some patterns
but not all. Implementation-dependent?
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 06/08/2014 13:08 -----
From: Steve Hanson/UK/IBM
To: dfdl-wg@ogf.org,
Date: 25/06/2014 11:33
Subject: Fw: [DFDL-WG] Action 261
Action 261: Implied separatorSuppressionPolicy for occursCountKind 'expression' (All)
10/6: Spec says it is 'never' (positional sequence) but you have to parse
to identify the position, so isn't that non-positional?
17/6: Some other issues noted around 'expression' as per email thread.
IBM have discussed this internally and will submit a proposal.
As was noted in the email for Action
260, if it is decided that the meaning of "Each
occurrence in the sequence can be identified by its position in the data"
is more strictly that an observer of the raw data can identify an occurrence
of an element in the sequence solely by counting separators then that
would appear to make dfdl:occursCountKind 'expression' more like 'parsed',
and not eligible to be in a Positional sequence. But if the meaning is
that a parser does not have to speculate to identify an occurrence of an
element in the sequence then it can be in a Positional sequence.
While discussing the nature of 'expression'
it was noted that it is very easy for a DFDL user to create a data stream
from an infoset and for that data stream to be un-parse-able. If dfdl:outputValueCalc
is not used, then the element(s) in the infoset must be correctly set manually
to match the number of occurrences. The same observation applies to dfdl:lengthKind
'explicit' where dfdl:length is an expression.
To address this, IBM proposes the following
changes to the DFDL specification for occursCountKind 'expression' when
unparsing:
- When all occurrences have been obtained
from the infoset (and defaulting applied if needed), the occursCount expression
is evaluated
- If any element that is referenced by
the expression has dfdl:outputValueCalc, then that expression is (re-)evaluated
as part of the above; given that the number of occurrences is now known,
the outputValueCalc expression should now evaluate successfully
- If the result of the occursCount does
not match the number of occurrences it is a processing error
This ensures the integrity of the data stream (the non-outputValueCalc use case),
while still allowing the infoset to dictate the count (the outputValueCalc
use case).
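The three steps above can be sketched in Python as follows. This is an illustrative model only: the infoset is a plain dictionary, the element names `count` and `item` are hypothetical, the two expressions are modelled as Python callables, and `ProcessingError` stands in for a DFDL processing error.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error."""


def unparse_with_count_check(infoset, occurs_count, output_value_calc=None, sep=","):
    """Model of the proposed unparse-time handling of occursCountKind 'expression'."""
    # Step 1: all occurrences have been obtained from the infoset.
    occurrences = list(infoset["item"])

    # Step 2: (re-)evaluate outputValueCalc now that the number is known.
    if output_value_calc is not None:
        infoset = dict(infoset, count=output_value_calc(infoset))

    # Step 3: evaluate occursCount; a mismatch is a processing error.
    expected = occurs_count(infoset)
    if expected != len(occurrences):
        raise ProcessingError(
            f"occursCount gave {expected} but {len(occurrences)} occurrences were found"
        )
    return sep.join([str(infoset["count"])] + occurrences)


def occurs_count(infoset):
    # Models dfdl:occursCount="{ ../count }"
    return infoset["count"]


def count_items(infoset):
    # Models dfdl:outputValueCalc="{ fn:count(../item) }"
    return len(infoset["item"])


# outputValueCalc use case: the infoset dictates the count.
print(unparse_with_count_check({"item": ["A", "B"]}, occurs_count, count_items))  # 2,A,B

# Manual use case: a wrong count is rejected instead of being written out.
try:
    unparse_with_count_check({"count": 3, "item": ["A", "B"]}, occurs_count)
except ProcessingError as err:
    print("processing error:", err)
```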
Similarly, IBM proposes the following
changes to the DFDL specification for lengthKind 'explicit', where length
is an expression, when unparsing:
- When an element has been obtained from
the infoset (and defaulting applied if needed) and unparsed but before
padding is applied, the expression is evaluated to give a length.
- If any element that is referenced by
the expression has dfdl:outputValueCalc, then that expression is (re-)evaluated;
given that the unpadded length is now known, the outputValueCalc expression
should now evaluate successfully
- Now that we have a length, the unparser
behaviour is the same as if the length was obtained from a fixed dfdl:length
value. (That is, truncation, padding, filling or error depending on other
property settings).
This ensures
the integrity of the data stream (the non-outputValueCalc use case), while
still allowing the infoset to dictate the length (the outputValueCalc use
case). Note that it means the 'awkward' behaviour whereby lengthKind 'explicit'
(expression) is a specified length when parsing but variable length when
unparsing is avoided - it is now always specified length.
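A rough Python sketch of these steps for a text element follows. The assumptions are that lengths are counted in characters, that padding is on the right with a space, and that truncation is controlled by a single flag; the names `unparse_explicit_length`, `length_field`, `allow_truncation` and `ProcessingError` are hypothetical and not taken from the specification.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error."""


def unparse_explicit_length(value, length_expr, length_field=None,
                            output_value_calc=None, pad_char=" ",
                            allow_truncation=False):
    """Model of unparsing lengthKind 'explicit' where dfdl:length is an expression.

    length_expr       : callable modelling the dfdl:length expression
    length_field      : value of the referenced length element, if set manually
    output_value_calc : callable modelling dfdl:outputValueCalc on that element
    """
    unpadded = len(value)                      # element unparsed, before padding

    if output_value_calc is not None:          # (re-)evaluate outputValueCalc
        length_field = output_value_calc(unpadded)

    target = length_expr(length_field)         # evaluate the length expression

    # From here on, behave as for a fixed dfdl:length value.
    if unpadded > target:
        if not allow_truncation:
            raise ProcessingError(f"value of length {unpadded} exceeds length {target}")
        value = value[:target]
    return length_field, value.ljust(target, pad_char)


# outputValueCalc use case: the infoset value dictates the length.
print(unparse_explicit_length("ABC", lambda n: n, output_value_calc=lambda n: n))
# Manual use case: a longer declared length results in padding.
print(unparse_explicit_length("ABC", lambda n: n, length_field=5))
```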
IBM believes that none of the above should
impose additional burden on implementers, as no brand new behaviour is
being added.
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 25/06/2014 10:24 -----
From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB, Alex Wood1/UK/IBM@IBMGB, Andrew Edwards/UK/IBM@IBMGB, Mark Frost/UK/IBM@IBMGB,
Date: 16/06/2014 22:08
Subject: Re: [DFDL-WG] Action 261
I think this needs to be discussed before the meeting tomorrow. I wanted
to avoid turning this thread into a monster - but I don't think it's possible
to discuss without the context so I've continued the thread as before with
<tk> tags.
I'm clear in my own mind that I understand the issues now. We should
give priority to the separatorSuppressionPolicy question because the current
rules are close to being unimplementable. The occursCountKind and lengthKind
questions are important but are at least implementable as they stand.
regards,
Tim Kimber,
IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM@IBMGB,
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 11/06/2014 16:12
Subject: Re: [DFDL-WG] Action 261
Replies in <smh> tags
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org,
Date: 11/06/2014 13:58
Subject: Re: [DFDL-WG] Action 261
Sent by: dfdl-wg-bounces@ogf.org
Comments in <tk> tags
regards,
Tim Kimber,
IBM Integration Bus Development (Industry Packs)
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM@IBMGB,
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 11/06/2014 10:47
Subject: Re: [DFDL-WG] Action 261
Some thoughts on this...
I agree that the definition of positional sequence in the spec needs tightening
as it is ambiguous as it stands and could be interpreted as a) or b). If
we adopted b) then that would appear to allow 'expression' to appear in
a positional sequence, but wouldn't it also allow 'stopValue'?
<tk>Yes - according to definition b) stopValue would be allowable
in a positional sequence. We could still disallow it if we do not believe
there is any benefit in allowing it. I don't believe it introduces any
particular complexities for an implementer.</tk>
occursCountKind 'expression' is analogous to lengthKind 'explicit' with
an expression and to lengthKind 'prefixed'. Both these lengthKinds are
classified as 'specified length' when parsing but 'variable length' when
unparsing. We are observing that occursCountKind 'expression' is like 'fixed'
when parsing but not quite so like 'fixed' when unparsing - which is why
section 16 groups 'expression' with 'parsed' for unparsing.
<tk>Yes - we took a decision that the unparser should ignore the
expression in lengthKind/occursCountKind, and just output whatever data
happens to be in the info set. I'm
not sure that it saves a lot of effort in the implementation and it certainly
is not easy to justify as a consistent behaviour. For me, the unparser
should treat lengthKind='explicit' the same way whether the value is static
or calculated. And the unparser should treat occursCountKind='expression'
the same way as occursCountKind='fixed'.
</tk>
When unparsing occursCountKind 'expression' you don't always have the calculated
array length N. If the infoset was derived from XML, there is likely no
'count' element, just a bunch of elements with the same name that make
up the 'array'. DFDL gives you the choice whether to a) manually set the
count element, or b) have the unparser set it automatically via
outputValueCalc. In the former case, you can create a document that cannot
be parsed; the unparser could check the 'count' element matches the infoset,
but that would involve reverse engineering an arbitrarily complex expression
and is why the specification does not say that.
<tk>It would involve evaluating the expression. In most cases, that
will not require any lookahead because the Length/Count field will precede
the array or element. Not sure where the reverse engineering comes in?</tk>
<smh>I see what you are saying. Just evaluate the expression and
see what it gives for N. That handles case a) but not b) where I explicitly
want the unparser to set the count via outputValueCalc - which is presumably
referring to the number of elements in the array, which is not known. For
case b) N has to be the number in the infoset. Given that we have to support
case b) the unparser can not treat occursCountKind 'expression' exactly
the same as 'fixed' when unparsing.</smh>
<smh>Similarly with lengthKind 'explicit' with an expression. For
the equivalent to case a) the length is known which makes the length fixed,
but for the equivalent of case b) with outputValueCalc the length is not
known so it is variable. When this was discussed in the past, it was decided
not to bifurcate the expression scenario. Hence the spec is the way it
is. </smh>
<tk>
That helps. So your belief is that case a) is workable but case b) is not
because the number of elements in the array is not known. I don't think
that holds up under scrutiny.
In case b), the outputValueCalc expression cannot be evaluated until all
of the array has been received. So if the info set is received as an event
stream then the unparser must wait until the array completes before evaluating
the expression. Note that the 'count' ( or 'length' ) field cannot be serialized
until its value is known. But the array ( or variable-length field ) comes
*after* the count/length field. By the time unparsing of the array/field
begins, the value *is* known.
The implementation of case a) is actually less straightforward. If the
field is an array length and the value is less than the number of items
in the info set then I think the unparser must issue an error. If greater
then the unparser could output default values ( if available ) or delimiters
( if the parent sequence is a positional sequence with a delimiter ) or
else an error if neither are possible. More simply, the unparser could
simply insist that the value must correctly describe the data, and I think
that's a reasonable rule.
Similarly for the length. If the length of the *unparsed* value is greater
than the value in the info set then the unparser should issue an error.
If it is shorter and the field is simple then pad characters could be added.
But again, I think real-world usage of length counts suggests that padding
is unlikely to be wanted, and the unparser should simply insist on the
length field being correct.
</tk>
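The ordering argument for case b) above can be pictured with a short Python sketch. It assumes an infoset delivered as a stream of events; the event tuples and the name `unparse_events` are invented for illustration and do not correspond to any real DFDL API.

```python
def unparse_events(events, sep=","):
    """Serialize a count field followed by its array from an infoset event stream.

    events yields ('item', value) tuples and finally ('end_array', None).
    The count field precedes the array in the data, so nothing is written
    until the array is complete and the count can be calculated.
    """
    buffered = []
    for kind, value in events:
        if kind == "item":
            buffered.append(value)      # hold occurrences back for now
        elif kind == "end_array":
            break                       # the array is complete
    count = len(buffered)               # models an outputValueCalc using fn:count()
    # By the time the first occurrence is written, the count is already known.
    return sep.join([str(count)] + buffered)


stream = iter([("item", "A"), ("item", "B"), ("item", "C"), ("end_array", None)])
print(unparse_events(stream))  # 3,A,B,C
```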
Here's a real example of such an expression (albeit with lengthKind 'explicit'
but the principle is the same):
dfdl:length="{xs:nonNegativeInteger(fn:floor((../Length
+ 1) div 2))}"
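For readers less familiar with DFDL expressions, a Python equivalent of this length calculation is sketched below. The function name `dfdl_length` is hypothetical, and `../Length` is modelled as a plain integer argument.

```python
import math


def dfdl_length(length_field):
    """Python equivalent of xs:nonNegativeInteger(fn:floor((../Length + 1) div 2))."""
    result = math.floor((length_field + 1) / 2)
    if result < 0:
        raise ValueError("xs:nonNegativeInteger cannot hold a negative value")
    return result


for length_field in (4, 5, 6):
    print(length_field, dfdl_length(length_field))
# 4 2
# 5 3
# 6 3
```

Evaluating the expression forwards is cheap. Note that the mapping is many-to-one (5 and 6 both give 3), so a length cannot in general be turned back into a unique ../Length value.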
Alex brought up the case where the expression evaluates to 0. In a positional
sequence, would you still expect a delimiter for this case?
<tk>Yes, unless it is in the trailing optional region of the group
and SSP='trailingEmpty'. In a positional sequence, every delimiter must
be present until suppression begins ( if allowed )</tk>
If 'yes' then the resultant zero length string must be treated as the 'absent
representation' and ignored. If 'no' then is the sequence still positional?
<tk>I don't understand the point. Why would it not be the 'empty
representation'? Why must it be 'ignored' if it does happen to be the 'absent
representation'? What does 'ignored' mean?</tk>
<smh>The point is that the parser has been told there are 0 occurrences.
So it would be odd if the infoset ended up containing an occurrence, which
can happen if the normal nil/empty rules are followed. (Eg, nilValue=%ES;).
<tk>
If the occursCount expression evaluates to zero then the parser will not
attempt to parse even one occurrence of the array. That's the natural meaning
of 'zero occurrences'. So nothing would go into the info set apart from
the 'count' field. This is entirely consistent with my definition b) of
'positional' ( the identity of every delimited field is known before
parsing of the field begins ).
</tk>
Hence the 0 occurrence case must treat it as absent which means nothing
is added to the infoset. Take
the ISO8583 bitmap use case - if the bit is 0 we must not try to parse
anything at all for that element - it is totally absent.</smh>
<tk>Yes - that's exactly my point. The fact that there is ( or could
be ) a delimiter after the 'count' field is irrelevant.</tk>
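The zero-occurrence case can be sketched in Python, loosely modelled on the bitmap use case mentioned above. The layout is invented for illustration: each bit acts as the occurs count (0 or 1) of one fixed-length text field, and the names `parse_bitmap_fields` and `fieldN` are hypothetical rather than taken from ISO8583.

```python
def parse_bitmap_fields(bitmap, data, field_lengths):
    """Parse fields whose presence is controlled by a bitmap.

    bitmap        : string of '0'/'1' characters, one per field
    data          : the text that follows the bitmap
    field_lengths : fixed length of each field, in characters
    """
    infoset = {}
    pos = 0
    for index, (bit, length) in enumerate(zip(bitmap, field_lengths), start=1):
        occurs = int(bit)
        if occurs == 0:
            # Zero occurrences: no parse is attempted and nothing is added
            # to the infoset for this field.
            continue
        infoset[f"field{index}"] = data[pos:pos + length]
        pos += length
    return infoset


print(parse_bitmap_fields("101", "ABCDE", [2, 4, 3]))
# {'field1': 'AB', 'field3': 'CDE'}
```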
Regards
Steve Hanson
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org,
Date: 10/06/2014 21:22
Subject: [DFDL-WG] Action 261
Sent by: dfdl-wg-bounces@ogf.org
Implied separatorSuppressionPolicy for occursCountKind 'expression' (All)
10/6: Spec says it is 'never' (positional sequence) but you have to parse
to identify the position, so isn't that non-positional?
I think there are two alternative definitions of 'positional':
a) the identity of every delimited field is known before parsing of the
sequence group begins
b) the identity of every delimited field is known before parsing of the
field begins
As an implementer, b) is sufficient because it means that the parser never
needs to backtrack while parsing the group.
a) allows the field identities to be statically known, but that is less
important - it does not allow optimised extraction of a particular field
as would be the case for a fixed-length group ( the possibility of escaped
separators/terminators means that every character will need to be scanned
anyway ).
It may sound like a small point, but it affects two decisions
1. whether ock='expression' should be allowed within a positional sequence
group ( action 261 )
2. what the behaviour of the unparser should be w.r.t. ock='expression'.
My own feeling is that ock='expression' should be treated almost exactly
like ock='fixed', except that the calculated array length N is used instead
of maxOccurs.
- When parsing a positional sequence group it should cause N delimiters
to be expected for the array.
- When unparsing a positional sequence group it should cause N delimiters
to be written.
These rules are consistent and straightforward to describe and implement.
The current rule ( unparser outputs the occurrences that are in the info
set only ) allows the unparser to write a document that cannot be parsed
using the same schema.
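The symmetric rule suggested here can be sketched in Python as a parse/unparse pair. This is a toy model under stated assumptions: the count is the first separated field, separators are never escaped, and the function names are hypothetical.

```python
def parse_counted_array(data, sep=","):
    """Parse '<count>,<occ1>,...,<occN>' where N comes from the count field.

    N is known before the first occurrence is parsed (definition b), so
    exactly N occurrences are taken and no backtracking is needed.
    """
    fields = data.split(sep)
    n = int(fields[0])
    occurrences = fields[1:1 + n]
    if len(occurrences) != n:
        raise ValueError(f"expected {n} occurrences, found {len(occurrences)}")
    return n, occurrences


def unparse_counted_array(n, occurrences, sep=","):
    """Write the count and exactly N occurrences, each with its separator."""
    if len(occurrences) != n:
        raise ValueError(f"count is {n} but infoset has {len(occurrences)} occurrences")
    return sep.join([str(n)] + list(occurrences))


n, items = parse_counted_array("5,A,B,,D,E")
print(n, items)                         # 5 ['A', 'B', '', 'D', 'E']
print(unparse_counted_array(n, items))  # 5,A,B,,D,E
```

Because both directions use the same N, whatever the unparser writes can be parsed again with the same rule.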
regards,
Tim Kimber,
----- Forwarded by Tim Kimber/UK/IBM on 10/06/2014 20:34 -----
From: Steve Hanson/UK/IBM@IBMGB
To: dfdl-wg@ogf.org,
Date: 10/06/2014 17:57
Subject: [DFDL-WG] OGF DFDL WG Call Minutes 2014-06-10
Sent by: dfdl-wg-bounces@ogf.org
Please find minutes from the above call at http://redmine.ogf.org/dmsf_files/13263?download=
Regards
Steve Hanson
Architect, IBM DFDL,
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU