Agree with the rewording for the 'isolated ES or WSP*' cases.
I'm not convinced that we need to allow ES and WSP* in isolation for separators.
When I present on DFDL, I get asked when to use a separator or terminator. My answer is: use a separator
when the delimiter is always the same between occurrences, which implies
it is a property of the sequence rather than each element. Allowing ES
or WSP* is breaking that, to my mind. I'm also not sure what effect it
has on separator suppression, which is complicated enough as it is. So
I'd prefer to leave separator wording as it is.
I'd rather not introduce the concept
of scanning for initiators. The parser does not really scan for initiators,
it expects to find an initiator at the current offset. The initiatedContent
property was added to allow a) the schema to be checked to ensure all children
had an initiator, and b) initiators to be discriminators. (Scanning for
separator/terminator is different as the parser really does have to scan
through bytes to find a separator/terminator.)
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 09/10/2018 18:25
Subject: Re: [DFDL-WG] Clarification needed: sequence terminator that exists or not depending on expression
Comments inline.
...mikeb
On Tue, Oct 9, 2018 at 9:41 AM Steve Hanson <smh@uk.ibm.com> wrote:
"%ES;%ES;" is already disallowed, as ES can only appear once - see the entity syntax table.
"%ES; %ES;" is also disallowed; it contravenes the first sentence "ES
must not appear as the only DFDL string literal in the property."
It appears twice, but it is still the only DFDL string literal :) The wording is clearly ambiguous,
as we interpreted it differently.
Suggest rewording as: "Neither '%ES;' nor '%WSP*;'
may appear as an isolated string literal in the property value, or in the
value returned from an expression when scanning for delimiters."
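As a rough illustration only (not taken from the spec or from any implementation), the suggested wording could be checked along these lines. The function names are hypothetical, and the sketch assumes the "when scanning for delimiters" clause applies to both literal property values and values returned from expressions, which is one possible reading of the sentence above.

```python
from typing import List

# Hypothetical helper names; the value is treated as a
# whitespace-separated list of DFDL string literals.
ISOLATED_FORBIDDEN = ("%ES;", "%WSP*;")


def isolated_empty_literals(property_value: str) -> List[str]:
    """Return the list members that are exactly %ES; or %WSP*;."""
    return [lit for lit in property_value.split() if lit in ISOLATED_FORBIDDEN]


def delimiter_value_ok(property_value: str, scanning_for_delimiters: bool) -> bool:
    """Apply the suggested rule: no isolated %ES; or %WSP*; when scanning."""
    if not scanning_for_delimiters:
        return True
    return not isolated_empty_literals(property_value)
```

Under this reading, "%NUL; %ES;" is rejected when scanning for delimiters but accepted otherwise, while ", %WSP*;," passes in both cases because %WSP*; is not isolated in either literal.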
Note that IBM DFDL has not yet implemented the erratum (2.148) that allows
ES to appear anywhere other than dfdl:nilValue. (It all started from this
public comment: https://redmine.ogf.org/boards/15/topics/40)
IBM has also encountered this type of "variable-length-with-max"
string. I'm sure I raised it in the WG a long time ago, and we discussed
(and presumably rejected) whether it should be a new lengthKind, e.g., "delimitedMax",
for convenience. I can't find anything in my email logs though, and I'm not sure
what we did to model it. My memory could be playing tricks.
I don't want to add a length kind for this. I want to
be able to use delimiters both when scanning for terminating markup, and
when not doing so, and have what is allowed in terminating markup be different
for the two cases, based on whether lengthKind='delimited' applies anywhere
the delimiters are in scope.
We already have this language in the DFDL spec, i.e., it was
designed to work this way; it's just not complete and consistent.
Whatever we decide, each of initiator, terminator and separator needs to
be considered separately. Note that ES is currently allowed (with
stated restrictions) for initiator and terminator only, not for separator
- which makes sense to me but is contrary to 2.148 ??
Also must be wary of EVDP.
And NVDP also.
Separators can also be used when NOT scanning for terminating
markup. E.g., a sequence of 10 fixed-length strings can have comma separators.
No scanning is used for them, as each child is just 10 long exactly, and
then the separator must be found. In this case having %ES; as one of the
string literals just means that a separator may or may not be found,
i.e., the separators are optional.
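A minimal sketch of that non-scanning case, assuming infix separators and using the empty string to stand in for %ES;. The function name and parameters are hypothetical; this is not how any particular DFDL processor is implemented.

```python
def parse_fixed_fields(data, count=10, width=10, separators=(",", "")):
    """Parse `count` fields of exactly `width` characters each.

    Between fields, one of `separators` must match at the current
    position. An empty string in `separators` plays the role of %ES;.
    """
    fields = []
    pos = 0
    for i in range(count):
        field = data[pos:pos + width]
        if len(field) < width:
            raise ValueError(f"field {i} is truncated")
        fields.append(field)
        pos += width
        if i < count - 1:
            # No scanning happens here: the separator is matched at `pos` only.
            # Longer candidates are tried first so "" matches only as a last resort.
            for sep in sorted(separators, key=len, reverse=True):
                if data.startswith(sep, pos):
                    pos += len(sep)
                    break
            else:
                raise ValueError(f"no separator found after field {i}")
    return fields
```

With `separators=(",", "")` the same ten fields are recovered whether or not the commas are present; with `separators=(",",)` a missing comma is an error. One consequence of the optional form is that a field beginning with "," after an omitted separator would be misread, since the comma would be taken as the separator.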
I went and re-read 2.148, the trackers for the public
comment, etc.
We just need a crisp and complete definition of what scanning
for delimiters means.
There are two cases:
Case 1: Scanning for initiators
We are scanning for an initiator when initiatedContent="yes"
and we are parsing the
* children of a choice group
* children of an unordered sequence group
* children of a sequence group having floating="yes"
When scanning for an initiator, an initiator must be defined
and in-effect.
This means when the child (per above) is
* an element where the value can be empty, EVDP must be
'initiator' or 'both', along with an initiator being defined.
* a nillable element, NVDP must be 'initiator' or 'both', along
with an initiator being defined.
This whole EVDP/NVDP discussion is probably unnecessary
if we just say "initiator must be in-effect".
In other cases we're not scanning for initiators.
Case 2: Scanning for length
We are scanning for length when lengthKind='delimited'
and we are parsing an element.
Section 12.3.2 describes this, though it doesn't discuss
details of determining length of the nil representation. This section could
be improved, but I'm not really worried about that right now.
So scanning for delimiters is either scanning for initiators
or scanning for length. In that case, none of the in-scope terminating
delimiters can be %ES; or %WSP*; in isolation.
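The two cases above can be sketched as simple predicates. This is only an illustration of the proposed definition; the `Context` class and its field names are hypothetical and do not correspond to any processor's internals.

```python
from dataclasses import dataclass


@dataclass
class Context:
    """Hypothetical summary of the properties the two cases depend on."""
    initiated_content: bool = False        # initiatedContent="yes" is in effect
    parent_kind: str = "ordered-sequence"  # or "choice", "unordered-sequence"
    floating: bool = False                 # the floating="yes" case
    length_kind: str = "explicit"          # lengthKind of the element being parsed


def scanning_for_initiators(ctx: Context) -> bool:
    """Case 1: initiatedContent="yes" and one of the three listed situations."""
    if not ctx.initiated_content:
        return False
    return ctx.parent_kind in ("choice", "unordered-sequence") or ctx.floating


def scanning_for_length(ctx: Context) -> bool:
    """Case 2: lengthKind='delimited' while parsing an element."""
    return ctx.length_kind == "delimited"


def scanning_for_delimiters(ctx: Context) -> bool:
    """Scanning for delimiters is either of the two cases."""
    return scanning_for_initiators(ctx) or scanning_for_length(ctx)
```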
So this suggests in summary:
* Section 12.2 phrasing of constraints on %ES; and %WSP*;
must be improved to be clearer and less ambiguous for initiator and terminator.
* Section 14.2 (definition of separator property) needs
updating to match that of terminator. The terminator property specifies
that both the %ES; and %WSP*; entities are not allowed if scanning for delimiters.
Separator needs to be the same.
* Section 12.2 description of initiator needs to say that
%ES; and %WSP*; in isolation are not allowed if scanning for initiators.
* A new 12.2 sub-section should be added that defines
"scanning for initiators", and that section should be referenced from
the description of the initiator property.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 01/10/2018 20:31
Subject: [DFDL-WG] Clarification needed: sequence terminator that exists or not depending on expression
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
Consider the following:
<element name="value" type="xs:string" ...../>
<sequence dfdl:terminator="{ if (fn:string-length(./value) eq 32)
then '%ES;' else '%NUL;' }"/>
This is used to add a NUL at the end of a string, if the string length
is less than the max length of 32. This comes up often in fixed length
or variable-length-with-max data we've seen. I've put this terminator on
a separate sequence after the element to emphasize that we're not scanning
for terminating markup here. This has nothing to do with lengthKind 'delimited'.
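To make the intent of the terminator expression concrete, here is a small sketch of what it computes when writing data out. The names are hypothetical, and the rejection of values longer than 32 is an added assumption based on "max length of 32"; the expression itself only tests for equality with 32.

```python
MAX_LENGTH = 32  # the maximum length used in the example above


def terminator_for(value: str) -> str:
    """Mirror the expression: %ES; (empty) at full length, otherwise %NUL;."""
    return "" if len(value) == MAX_LENGTH else "\x00"


def unparse_value(value: str) -> str:
    """Emit the string followed by its conditional terminator."""
    if len(value) > MAX_LENGTH:
        # Assumption: over-long values are rejected rather than truncated.
        raise ValueError("value exceeds maximum length")
    return value + terminator_for(value)
```

A 3-character value is written as 4 characters (the value plus NUL), while a 32-character value is written as exactly 32 characters with nothing after it.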
However, the DFDL spec says (for terminator property)
· ES must not appear as the only DFDL string literal
in the property. It can only appear as a member of a list.
· Neither the ES entity nor the WSP* entity may appear
on their own as one of the string literals in the list when the parser
is determining the length of a component by scanning for delimiters.
The second bullet doesn't apply to my example.
Re: first bullet, I think my terminator expression is illegal... because
the '%ES;' is a list of literals containing ES as the only DFDL string
literal.
But this is a really flawed constraint, as "%ES;%ES;" and "%ES;
%ES;" both skirt the constraint, but mean the same thing as just "%ES;"
which is illegal.
So, if we don't want to allow these hack workarounds, we need a statement
that says runs of %ES; adjacent mean the same thing as one %ES;, and that
more than one identical-meaning delimiter specified in a list of string
literals means the same as just one. Or we can make these hack workarounds
illegal.
However, why are we disallowing these?
The above construct in my example is very useful, and really hard to work
around unless we can have a terminator that is '%ES;' as the only string
literal. Actually, I have no workaround for this really. I am guessing
I could come up with something, but the various things I've guessed at
don't pan out, or prevent the string named 'value' above from being modeled
as a simple type.
I know we don't want lengthKind='delimited' with terminator="%ES;"
as that is most likely just a schema-definition error, but when we're not
dealing with a lengthKind, we really do seem to need to specify situations
where conditionally the terminator region will be empty.
So I think we need to do:
1) clarify that %ES; cannot be used in combination with any other character
or entity as a member of a list of string literals.
1a) At the same time I would also disallow combinations of
WSP* that are misleading and unnecessary i.e., disallow %WSP*; adjacent
to any other WSP, WSP+, or WSP*.
2) clarify that the constraint that %ES; for terminator and separator cannot
appear as the only string literal in a list of string literals... applies
only when the parser is determining the length of a component by scanning
for delimiters. This is just rephrasing the two bullets above so the clause
about scanning applies to both, not just the second.
I believe this preserves the intent that when lengthKind="delimited"
and we are scanning for delimiters, there must be *some* delimiter that
is potentially not zero length. You still have to cope with the possible
match being zero length due to %ES; being in the list of terminating markup,
or WSP* similarly, with no whitespace found. But the notion that there
is NO scanning to be done can't happen. That is, the notion that the schema
specifies lengthKind delimited, but also specifies no delimiters at all,
is still ruled out.
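Proposals 1) and 1a) amount to a per-literal check, which could be sketched roughly as below. The function name and messages are hypothetical, and the sketch deliberately ignores the %% escape, so a literal such as "%%ES;" would be flagged incorrectly.

```python
import re

# Any WSP-family entity: %WSP; %WSP+; or %WSP*;
_WSP_ENTITY = r"%WSP[*+]?;"
# %WSP*; immediately followed or preceded by a WSP-family entity.
_ADJACENT_WSP = re.compile(rf"%WSP\*;{_WSP_ENTITY}|{_WSP_ENTITY}%WSP\*;")


def literal_problems(literal: str) -> list:
    """Return a list of problems found in one DFDL string literal."""
    problems = []
    # Proposal 1: %ES; may not be combined with any other character or entity.
    if "%ES;" in literal and literal != "%ES;":
        problems.append("%ES; combined with other content")
    # Proposal 1a: %WSP*; may not sit next to WSP, WSP+ or WSP*.
    if _ADJACENT_WSP.search(literal):
        problems.append("%WSP*; adjacent to another WSP entity")
    return problems
```

Under this check "%ES;%ES;" is reported, so that workaround no longer passes, while a lone "%ES;" is accepted as a list member. Proposal 2) then decides separately whether a list consisting only of "%ES;" is permitted, depending on whether the parser is scanning for delimiters.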
Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF
Intellectual Property Policy
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU