Problem: simple format that is impossible to model
I have a dead-simple little format:
data/data/data/data
data/data/data/data
It is lines of "/" separated strings. All elements are optional.
I simply want this:
data//data
to round trip. For that to happen I need it to parse into
<field>data</field><field></field><field>data</field>
That is, I require that empty field element in the middle to be created and put into the infoset.
I can find no way to do this.
The strings have no initiator/terminator, so dfdl:emptyValueDelimiterPolicy is not relevant. All the elements are optional, so default values aren't relevant.
The spec states:
9.4.2.2 Simple element (xs:string or xs:hexBinary)
Required occurrence: If the element has a default value then an item is added to the infoset using the default value, otherwise an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none'[12] https://daffodil.apache.org/docs/dfdl/#_ftn12 then an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value, *otherwise nothing is added to the Infoset*.
There are errata/actions to clarify wording here around dfdl:emptyValueDelimiterPolicy being in effect or not (because there is no initiator/terminator for it to use as opposed to the property in isolation just being 'none'). But that doesn't change anything about this issue.
If this very simple format is not possible, then we need a property or new property enum value that makes it possible.
Thoughts?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy http://www.ogf.org/About/abt_policies.php
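The round-trip requirement can be made concrete with a minimal Python sketch. This is not DFDL and not how any DFDL processor is implemented; the helper names parse_keep_empty, parse_drop_empty, and unparse are hypothetical, and the only assumption taken from the format is that "/" is the separator.

```python
SEPARATOR = "/"  # the only delimiter in the format


def parse_keep_empty(line: str) -> list:
    """Every position between separators becomes a field, even when it is empty."""
    return line.split(SEPARATOR)


def parse_drop_empty(line: str) -> list:
    """Zero-length positions add nothing to the result (nothing in the infoset)."""
    return [field for field in line.split(SEPARATOR) if field != ""]


def unparse(fields: list) -> str:
    """Write the fields back out, separated by "/"."""
    return SEPARATOR.join(fields)


line = "data//data"
kept = parse_keep_empty(line)      # ['data', '', 'data']
dropped = parse_drop_empty(line)   # ['data', 'data']
print(unparse(kept) == line)       # True: the empty middle field survives
print(unparse(dropped) == line)    # False: the output is data/data
```

Only the variant that keeps the zero-length field reproduces the original line, which is why the empty field has to appear in the infoset.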
To start discussion on my own issue.....
The problem here may be that for a string (or hexBinary), if there is no
initiator/terminator, there is no way to distinguish EmptyRep from
NormalRep. I.e., an empty string is a "normal" value for a string.
Sections 9.2.3 and 9.2.4 seem to define EmptyRep and NormalRep such that an
empty string will be an EmptyRep, not a NormalRep.
However section 9.2.5 on zero-length says:
"The normal representation can be a zero-length representation if the
type is xs:string or xs:hexBinary and there is no framing."
That suggests that when there is no framing, a zero-length string is
NormalRep, not EmptyRep, which is the opposite conclusion from what is in
sections 9.2.3 and 9.2.4.
If this latter clarification is correct, then my format *should* work as I
expect, because the empty string elements will be considered NormalRep and
infoset values will be created for them.
It simply doesn't work because of a bug in Daffodil, which has not
interpreted this correctly.
...mikeb
As there are no initiators or terminators, and your example infoset calls
everything 'field', I am assuming that the element looks logically like:
Yes, there is the nillable technique, but my simplified example data format
was too simplified.
In the real format that I derived my simple example from, nillability is
used for other purposes. In that format generally elements are nillable
with nilValue="%WSP*;-%WSP*;". That is, the format needs to distinguish
explicitly nilled values from string values, including empty string values.
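The three-way distinction described here (nil, empty string, ordinary string) can be sketched in Python. This is only an approximation under stated assumptions: the regular expression \s*-\s* stands in for nilValue="%WSP*;-%WSP*;", and Python's \s may not cover exactly the same characters as the DFDL WSP class. The function name classify_field is hypothetical.

```python
import re

# Rough stand-in for nilValue="%WSP*;-%WSP*;":
# optional whitespace, a single dash, optional whitespace.
NIL_PATTERN = re.compile(r"\s*-\s*")


def classify_field(text: str) -> str:
    """Classify one separated field as nil, empty string, or ordinary string."""
    if NIL_PATTERN.fullmatch(text):
        return "nil"
    if text == "":
        return "empty string"
    return "string"


for raw in ["data", "", " - ", "-"]:
    print(repr(raw), "->", classify_field(raw))
```

Because the dash already marks nil, a zero-length field has to map to something else, which is the empty string value the format needs to keep.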
I know I'm not the only user who thinks one should be able to model this
simple data format without needing to use nillability.
For example, if you look at the CSV schema on DFDLSchemas on github, the
elements for rows of data are not nillable, even though adjacent commas are
routine in CSV files.
Of course a CSV schema for a fixed-number-of-columns doesn't have a
variable number of elements in the rows, so the elements in the rows are
all required, not optional. Still I think you can't tolerate adjacent
commas without using nillability if you want the data to both parse and
unparse. I have wanted to enhance the CSV schema on github to show more
variations on the CSV-like theme for a while, because I have recently
created many CSV-like data schemas, and a common theme to them is that
there are a variety of representations of nilled such as "N/A none -"
(these were human-created spreadsheet 'documents' exported as CSV, not
machine-generated CSV data sets), and in some of these empty strings are
legit "normal" values. I have had the good fortune that these formats were
parse-only, as they would not have faithfully unparsed.
The problem ultimately boils down to this: there is no way in DFDL to say "treat
empty strings as just normal strings". The use of initiators/terminators and
dfdl:emptyValueDelimiterPolicy="both" doesn't fix this, because that
doesn't give you a NormalRep, it gives you EmptyRep.
As well there is ambiguity in the spec between the sections 9.2.5 and 9.2.3
- 9.2.4, as to whether zero-length string/hexBinary with no framing is
NormalRep or EmptyRep.
The fact that we have a property named dfdl:emptyValueDelimiterPolicy
suggests that an element, regardless of type, is EmptyRep if the content is
zero length and the initiators/terminators match the EVDP policy.
That suggests that section 9.2.5 is simply incorrect - a NormalRep cannot
be zero-length for string or hexBinary if there is no framing. Such would
always be an EmptyRep.
That would leave the nillable mechanism as the only way to deal with
zero-length strings that need to be retained in the infoset.
While it is good to fix that ambiguity, I find this not really an adequate
solution. I can't deal with my slash-delimited format that uses nillable
for other purposes in any reasonable way. I need a way to say "treat
zero-length strings as normal values".
I suggest we modify the recently proposed dfdlx:emptyElementParsePolicy
property to encompass the added variation we need. So the values of the
property would be:
- treat zero-length for all types as AbsentRep always (we were calling
this "treatAsMissing", or "treatAsAbsent" - this is the IBM DFDL behavior
today as I understand it.)
- treat zero-length for all types as EmptyRep always (we were calling
this "treatAsEmpty" - this is the DFDL Spec behavior as written today as
revised by current errata and with the correction mentioned above to remove
the ambiguity.)
- treat zero-length for string/hexBinary as NormalRep, all other types
as EmptyRep (Suggest "treatAsNormalOrEmpty". The rationale for this enum
name is that types other than string/hexBinary can't have a zero-length
NormalRep, so they must be EmptyRep. I read the enum name as "treat as
NormalRep when possible otherwise treat as EmptyRep". Another possible
enum name might be "preferNormalToEmpty".)
Section 9.2.5 would be clarified to say that zero-length NormalRep is
possible for string/hexBinary if there is no framing and
dfdlx:emptyElementParsePolicy is 'treatAsNormalOrEmpty'.
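The three proposed values can be summarized as a small decision function. The following Python sketch is illustrative only: it assumes no framing and ignores nil representations entirely, it uses "treatAsAbsent" for the first value (the text also mentions "treatAsMissing"), and the names Policy and classify_zero_length are hypothetical.

```python
from enum import Enum


class Policy(Enum):
    TREAT_AS_ABSENT = "treatAsAbsent"
    TREAT_AS_EMPTY = "treatAsEmpty"
    TREAT_AS_NORMAL_OR_EMPTY = "treatAsNormalOrEmpty"


# Only these types can have a zero-length NormalRep.
ZERO_LENGTH_NORMAL_TYPES = {"xs:string", "xs:hexBinary"}


def classify_zero_length(policy: Policy, xsd_type: str) -> str:
    """Representation assigned to a zero-length occurrence with no framing."""
    if policy is Policy.TREAT_AS_ABSENT:
        return "AbsentRep"
    if policy is Policy.TREAT_AS_EMPTY:
        return "EmptyRep"
    # treatAsNormalOrEmpty: NormalRep when the type allows it, otherwise EmptyRep.
    if xsd_type in ZERO_LENGTH_NORMAL_TYPES:
        return "NormalRep"
    return "EmptyRep"


for policy in Policy:
    for xsd_type in ("xs:string", "xs:int"):
        print(policy.value, xsd_type, "->", classify_zero_length(policy, xsd_type))
```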
Sections 14.2.2 and 14.2.3 may need a one-line clarification added that
when zero-length string/hexBinary is being treated as NormalRep, then they
are "normal" not "empty", and since they are not EmptyRep suppression of
zero-length and separators would not occur for trailingEmpty,
trailingEmptyStrict, or anyEmpty. (Which should be intuitive given the enum
names use the word "Empty")
It would/could be an SDE (or maybe warning) if this latter
"treatAsNormalOrEmpty" was specified for a potentially required element
(scalar or minOccurs > 0) of type string or hexBinary of variable length
(so possibly zero) with a default specified other than default="", because
such a default value could never be used, as zero length would be
considered NormalRep and so would not trigger use of the default value.
I.e., SDE like "Default value for element X can never be used because...."
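The condition for that diagnostic can be written down as a predicate. This Python sketch is a hypothetical illustration of the check described above, not an existing Daffodil or IBM DFDL check; the function name and its parameters are assumptions, and a scalar element is modeled as min_occurs=1.

```python
from typing import Optional


def default_can_never_be_used(
    xsd_type: str,
    min_occurs: int,
    variable_length: bool,
    default: Optional[str],
    policy: str,
) -> bool:
    """True when a declared default is unreachable under treatAsNormalOrEmpty."""
    return (
        policy == "treatAsNormalOrEmpty"
        and min_occurs > 0                              # potentially required
        and xsd_type in ("xs:string", "xs:hexBinary")   # zero-length is NormalRep
        and variable_length                             # so the length can be zero
        and default is not None
        and default != ""                               # default="" is harmless
    )


# A required variable-length string with default "N/A": the default is unreachable.
print(default_can_never_be_used("xs:string", 1, True, "N/A", "treatAsNormalOrEmpty"))
```

Whether a true result should be an SDE or only a warning is left open, as in the text.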
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
On Fri, Sep 27, 2019 at 4:39 AM Steve Hanson wrote:
As there are no initiators or terminators, and your example infoset calls everything 'field', I am assuming that the element looks logically like:
You want to preserve the position of the occurrences in the infoset so that they re-appear on output. The agreed way to do this is:
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL http://www.ibm.com/developerworks/library/se-dfdl/index.html
Co-Chair, OGF DFDL Working Group http://www.ogf.org/dfdl/
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
OK so I think the motivating example can be described as follows:
1) CSV style format
2) Only delimiters are separators
3) There are optional fields that occur beyond the last required field *
4) Empty string is considered a normal value that needs preserving for
such an optional field
5) Nil value is already being used for something else **
* Otherwise you just make all fields required and use a default value of
empty string
** Otherwise you use a nil default value of empty string.
IBM DFDL has been operating in a world of CSV and other delimited formats
for nearly 8 years, and I've not come across this requirement in reality.
There is usually no distinction between an omitted value and empty string
in CSV style formats where the field is optional.
I would prefer that this was deferred until DFDL 2.0. Meanwhile we can
design the proposed new dfdlx:emptyElementParsePolicy so it can be easily
extended.
Regards
Steve Hanson
To be clear, the example really is real. It actually comes from a format
called USMTF which is US mil-std-6040, NATO STANAG 5500.
I am ok to leave this until DFDL 2.0 and do experimental implementations in
the meantime.
We still have a bug then in DFDL spec section 9.2.5 where it suggests
normal rep for string/hexBinary can be zero-length if there is no framing.
This is simply false I believe. ZL for a string or hexBinary has to be
empty rep or nil rep.
-mike beckerle
Firstly, it is possible in general for a zero-length value (empty string)
to be a normal rep. Example. I have an initiator and a terminator. My
nilValueDelimiterPolicy and emptyValueDelimiterPolicy state that only the
terminator is present. The data contains initiator, empty string,
terminator. This rep with empty string can therefore only be normal rep. A
technicality, but it is allowed by the spec.
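This technicality can be traced with a small Python sketch. It is a simplification under stated assumptions: "[" and "]" are hypothetical initiator and terminator values, nil representations are ignored, and the functions expected_empty_rep and classify are illustrative names rather than anything from the spec or an implementation.

```python
INITIATOR = "["   # hypothetical initiator
TERMINATOR = "]"  # hypothetical terminator


def expected_empty_rep(evdp: str) -> str:
    """Delimiters that make up an EmptyRep under a given emptyValueDelimiterPolicy."""
    init = INITIATOR if evdp in ("initiator", "both") else ""
    term = TERMINATOR if evdp in ("terminator", "both") else ""
    return init + term


def classify(data: str, evdp: str) -> str:
    """Decide whether the data is an EmptyRep or a NormalRep (nil is ignored here)."""
    if data == expected_empty_rep(evdp):
        return "EmptyRep"
    if data.startswith(INITIATOR) and data.endswith(TERMINATOR):
        content = data[len(INITIATOR):len(data) - len(TERMINATOR)]
        return f"NormalRep with value {content!r}"
    return "no match"


# With the policy set to 'terminator', the EmptyRep is the terminator alone,
# so initiator + nothing + terminator can only be a NormalRep.
print(classify("]", "terminator"))   # EmptyRep
print(classify("[]", "terminator"))  # NormalRep with value ''
```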
The text you are questioning says:
"If the nil and empty representations can not be zero-length, but the
normal representation may be zero length then the absent representation
cannot occur because zero length will be interpreted as a normal
representation."
This is citing a specific example of the general case from my first
paragraph. However, is it actually possible? Zero-length normal rep
implies no framing present in the data so no initiator or terminator in
the data. If nil rep and empty rep are not zero-length reps, then they
must have initiator and/or terminator defined. So is it possible to define
an element so that it has an initiator and/or terminator, yet zero-length
rep is legal, given that initiator and terminator must appear? I think
yes it is, for example, WSP* is allowable as a terminator value, or
documentFinalTerminatorCanBeMissing can be 'yes'. So, a technicality
again, but allowed.
Regards
Steve Hanson
Those are interesting technicalities. I'll try to keep them in mind.
But the sentence in 9.2.5 that I am trying to reconcile is actually prior
to the one you listed:
The normal representation can be a zero-length representation if the
type is xs:string or xs:hexBinary and there is no framing.
So, how to interpret "there is no framing"?
I was interpreting it as no framing *possible*, as in the schema is statically
defined so that there is no framing, i.e., dfdl:initiator=""
dfdl:terminator="", no alignment regions, etc.
By this interpretation I believe the statement from 9.2.5 is just
incorrect. If no framing is defined, there is no way to distinguish Empty
from Normal for Strings, so we go with Empty.
The other interpretation of "there is no framing" is like this: "while
non-zero framing is possible, no framing is actually present in this
particular occurrence in the data stream".
This covers cases like dfdl:terminator="%WSP*;" where it can match zero or
more characters.
But I still am unable to construct an example where
* the framing is declared in the schema and it can be zero-length framing
or greater than zero, so it is acceptable that none is present
* no framing is present
* the value is also zero-length
* where it is clear we should get NormalRep not EmptyRep.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
http://www.ogf.org/About/abt_policies.php
On Wed, Oct 2, 2019 at 4:14 AM Steve Hanson
Firstly, it is possible in general for a zero-length *value* (empty string) to be a normal rep. Example. I have an initiator and a terminator. My nilValueDelimiterPolicy and emptyValueDelimiterPolicy state that only the terminator is present. The data contains initiator, empty string, terminator. This rep with empty string can therefore only be normal rep. A technicality, but it is allowed by the spec.
The text you are questioning says:
"If the nil and empty representations can not be zero-length, but the normal representation may be zero length then the absent representation cannot occur because zero length will be interpreted as a normal representation."
This is citing a specific example of the general case from my first paragraph. However, is it actually possible? Zero-length normal rep implies no framing present in the data so no initiator or terminator in the data. If nil rep and empty rep are not zero-length reps, then they must have initiator and/or terminator defined. So is it possible to define an element so that it has an initiator and/or terminator, yet zero-length rep is legal, given that initiator and terminator must appear? I think yes it is, for example, WSP* is allowable as a terminator value, or documentFinalTerminatorCanBeMissing can be 'yes'. So, a technicality again, but allowed.
Regards Steve Hanson
IBM Hybrid Integration, Hursley, UK Architect, *IBM DFDL* http://www.ibm.com/developerworks/library/se-dfdl/index.html Co-Chair, *OGF DFDL Working Group* http://www.ogf.org/dfdl/ *smh@uk.ibm.com*
tel:+44-1962-815848 mob:+44-7717-378890 Note: I work Tuesday to Friday From: Mike Beckerle
To: Steve Hanson Cc: DFDL-WG Date: 01/10/2019 23:36 Subject: Re: [DFDL-WG] Problem: simple format that is impossible to model ------------------------------ To be clear, the example really is real. It actually comes from a format called USMTF which is US mil-std-6040, NATO STANAG 5500.
I am OK to leave this until DFDL 2.0 and do experimental implementations in the meantime.
We still have a bug then in DFDL spec section 9.2.5, where it suggests that normal rep for string/hexBinary can be zero-length if there is no framing. I believe this is simply false. Zero length (ZL) for a string or hexBinary has to be empty rep or nil rep.
-mike beckerle
On Tue, Oct 1, 2019 at 10:14 AM Steve Hanson <smh@uk.ibm.com> wrote:
OK so I think the motivating example can be described as follows:
1) CSV style format
2) Only delimiters are separators
3) There are optional fields that occur beyond the last required field *
4) Empty string is considered a normal value that needs preserving for such an optional field
5) Nil value is already being used for something else **
* Otherwise you just make all fields required and use a default value of empty string.
** Otherwise you use a nil default value of empty string.
IBM DFDL has been operating in a world of CSV and other delimited formats for nearly 8 years, and I've not come across this requirement in reality. There is usually no distinction between an omitted value and empty string in CSV style formats where the field is optional.
I would prefer that this was deferred until DFDL 2.0. Meanwhile we can design the proposed new dfdlx:emptyElementParsePolicy so it can be easily extended.
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 27/09/2019 19:20
Subject: Re: [DFDL-WG] Problem: simple format that is impossible to model
------------------------------
Yes there is the nillable technique, but my simplified example data format was too simplified.
In the real format that I derived my simple example from, nillability is used for other purposes. In that format, elements are generally nillable with nilValue="%WSP*;-%WSP*;". That is, the format needs to distinguish explicitly nilled values from string values, including empty string values.
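A minimal sketch of the three-way distinction described above, written in Python rather than DFDL. It assumes that Python's `\s` is an acceptable stand-in for the DFDL `%WSP*;` class, and `classify_field` is a hypothetical helper name, not part of any DFDL implementation.

```python
import re

# Stand-in for nilValue="%WSP*;-%WSP*;": optional whitespace, a dash,
# optional whitespace. Assumption: \s approximates the DFDL WSP class.
NIL_PATTERN = re.compile(r"\s*-\s*")


def classify_field(text):
    """Classify one separator-delimited field (hypothetical helper)."""
    if NIL_PATTERN.fullmatch(text):
        return "nil"            # explicitly nilled value
    if text == "":
        return "empty string"   # zero-length, but still a real value
    return "string"             # ordinary non-empty value


for field in ["data", "", " - "]:
    print(repr(field), "->", classify_field(field))
```

The point of the sketch is only that the nil representation is already taken by the dash, so a zero-length field needs its own outcome.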
I know I'm not the only user who thinks one should be able to model this simple data format without needing to use nillability.
For example, if you look at the CSV schema on DFDLSchemas on github, the elements for rows of data are not nillable, even though adjacent commas are routine in CSV files. Of course, a CSV schema for a fixed number of columns doesn't have a variable number of elements in the rows, so the elements in the rows are all required, not optional. Still, I think you can't tolerate adjacent commas without using nillability if you want the data to both parse and unparse.
I have wanted to enhance the CSV schema on github to show more variations on the CSV-like theme for a while, because I have recently created many CSV-like data schemas. A common theme in them is that there are a variety of representations of nilled, such as "N/A none -" (these were human-created spreadsheet 'documents' exported as CSV, not machine-generated CSV data sets), and in some of them empty strings are legitimate "normal" values. I have had the good fortune that these formats were parse-only, as they would not have faithfully unparsed.
The problem ultimately boils down to this: there is no way in DFDL to say "treat empty strings as just normal strings". The use of initiators/terminators and dfdl:emptyValueDelimiterPolicy="both" doesn't fix this, because that doesn't give you a NormalRep, it gives you EmptyRep.
As well there is ambiguity in the spec between the sections 9.2.5 and 9.2.3 - 9.2.4, as to whether zero-length string/hexBinary with no framing is NormalRep or EmptyRep.
The fact that we have a property named dfdl:emptyValueDelimiterPolicy suggests that an element, regardless of type, is EmptyRep if the content is zero length and the initiators/terminators match the EVDP policy. That suggests that section 9.2.5 is simply incorrect - a NormalRep cannot be zero-length for string or hexBinary if there is no framing. Such would always be an EmptyRep. That would leave the nillable mechanism as the only way to deal with zero-length strings that need to be retained in the infoset.
While it is good to fix that ambiguity, I find this not really an adequate solution. I can't deal with my slash-delimited format that uses nillable for other purposes in any reasonable way. I need a way to say "treat zero-length strings as normal values".
I suggest we modify the recently proposed dfdlx:emptyElementParsePolicy property to encompass the added variation we need. So the values of the property would be:
- treat zero-length for all types as AbsentRep always (we were calling this "treatAsMissing", or "treatAsAbsent" - this is the IBM DFDL behavior today as I understand it.)
- treat zero-length for all types as EmptyRep always (we were calling this "treatAsEmpty" - this is the DFDL Spec behavior as written today as revised by current errata and with the correction mentioned above to remove the ambiguity.)
- treat zero-length for string/hexBinary as NormalRep, all other types as EmptyRep (Suggest "treatAsNormalOrEmpty". The rationale for this enum name is that since the types other than string/hexBinary can't have zero-length NormalRep, they must be EmptyRep. I read the enum name as "treat as NormalRep when possible otherwise treat as EmptyRep". Another possible enum name might be "preferNormalToEmpty".)
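The three proposed values can be summarized as a small decision function. This is only a sketch of the proposal as worded in this message, in Python; `zero_length_rep` is a hypothetical name, and nothing here describes what any DFDL processor actually implements.

```python
# Types for which a zero-length NormalRep is conceivable at all.
ZERO_LENGTH_CAPABLE = {"xs:string", "xs:hexBinary"}


def zero_length_rep(policy, simple_type):
    """Representation chosen for zero-length content under each proposed
    dfdlx:emptyElementParsePolicy value (hypothetical model)."""
    if policy == "treatAsAbsent":       # also discussed as "treatAsMissing"
        return "AbsentRep"
    if policy == "treatAsEmpty":
        return "EmptyRep"
    if policy == "treatAsNormalOrEmpty":
        # NormalRep when the type allows it, otherwise fall back to EmptyRep.
        return "NormalRep" if simple_type in ZERO_LENGTH_CAPABLE else "EmptyRep"
    raise ValueError(f"unknown policy: {policy}")


for policy in ("treatAsAbsent", "treatAsEmpty", "treatAsNormalOrEmpty"):
    print(policy, zero_length_rep(policy, "xs:string"), zero_length_rep(policy, "xs:int"))
```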
Section 9.2.5 would be clarified to say that zero-length NormalRep is possible for string/hexBinary if there is no framing and dfdlx:emptyElementParsePolicy is 'treatAsNormalOrEmpty'.
Sections 14.2.2 and 14.2.3 may need a one-line clarification added that when zero-length string/hexBinary is being treated as NormalRep, then they are "normal" not "empty", and since they are not EmptyRep suppression of zero-length and separators would not occur for trailingEmpty, trailingEmptyStrict, or anyEmpty. (Which should be intuitive given the enum names use the word "Empty")
It would/could be an SDE (or maybe warning) if this latter "treatAsNormalOrEmpty" was specified for a potentially required element (scalar or minOccurs > 0) of type string or hexBinary of variable length (so possibly zero) with a default specified other than default="", because such a default value could never be used, as zero length would be considered NormalRep and so would not trigger use of the default value. I.e., SDE like "Default value for element X can never be used because...."
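As a rough illustration of that check, here is a Python sketch. It assumes the element is variable length, uses `None` to mean "no default specified", and `unusable_default_message` is a hypothetical name; the message text is a placeholder, not wording from any implementation.

```python
def unusable_default_message(name, policy, simple_type, min_occurs, default):
    """Return a diagnostic if the default value could never be used,
    else None. Assumes a variable-length element (hypothetical model)."""
    zero_length_is_normal = (
        policy == "treatAsNormalOrEmpty"
        and simple_type in ("xs:string", "xs:hexBinary")
    )
    potentially_required = min_occurs > 0   # scalar elements have minOccurs 1
    has_nonempty_default = default is not None and default != ""
    if zero_length_is_normal and potentially_required and has_nonempty_default:
        return (f"Default value for element {name} can never be used because "
                "zero length is treated as NormalRep")
    return None


print(unusable_default_message("field", "treatAsNormalOrEmpty", "xs:string", 1, "n/a"))
```

Whether this should be an SDE or a warning is left open above; the sketch only returns a message and leaves that decision to the caller.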
On Fri, Sep 27, 2019 at 4:39 AM Steve Hanson <smh@uk.ibm.com> wrote:
As there are no initiators or terminators, and your example infoset calls everything 'field', I am assuming that the element looks logically like:
You want to preserve the position of the occurrences in the infoset so that they re-appear on output. The agreed way to do this is:
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 26/09/2019 19:11
Subject: Re: [DFDL-WG] Problem: simple format that is impossible to model
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
------------------------------
To start discussion on my own issue.....
The problem here may be that for a string (or hexBinary), if there is no initiator/terminator, there is no way to distinguish EmptyRep from NormalRep. I.e., an empty string is a "normal" value for a string.
Sections 9.2.3 and 9.2.4 seem to define EmptyRep and NormalRep such that an empty string will be an EmptyRep, not a NormalRep.
However section 9.2.5 on zero-length says:
"The normal representation can be a zero-length representation if the type is xs:string or xs:hexBinary and there is no framing."
That suggests that when there is no framing, a zero-length string is NormalRep, not EmptyRep, which is the opposite conclusion from what is in sections 9.2.3 and 9.2.4.
If this latter clarification is correct, then my format *should* work as I expect, because the empty string elements will be considered NormalRep and infoset values will be created for them. It simply doesn't work because of a bug in Daffodil, which has not interpreted this correctly.
...mikeb
On Thu, Sep 26, 2019 at 1:47 PM Mike Beckerle <mbeckerle.dfdl@gmail.com> wrote:
I have a dead-simple little format:
data/data/data/data
data/data/data/data
It is lines of "/" separated strings. All elements are optional.
I simply want this:
data//data
to round trip. For that to happen I need it to parse into
<field>data</field><field></field><field>data</field>
That is, I require that empty field element in the middle to be created and put into the infoset.
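To make the round-trip requirement concrete, here is a small Python sketch that stands in for parse and unparse. It is not DFDL; `parse_line` and `unparse_line` are hypothetical helpers, and the `keep_empty` flag simply models whether a zero-length field gets an infoset item.

```python
def parse_line(line, keep_empty):
    """Split one line into fields; optionally drop zero-length fields,
    modeling a parser that adds nothing to the infoset for them."""
    fields = line.split("/")
    if not keep_empty:
        fields = [f for f in fields if f != ""]
    return fields


def unparse_line(fields):
    """Write the fields back out with '/' separators."""
    return "/".join(fields)


line = "data//data"
print(unparse_line(parse_line(line, keep_empty=True)))   # data//data
print(unparse_line(parse_line(line, keep_empty=False)))  # data/data
```

Once the empty middle field is dropped, its position is lost and the third field is written out in the second position.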
I can find no way to do this.
The strings have no initiator/terminator, so dfdl:emptyValueDelimiterPolicy is not relevant. All the elements are optional, so default values aren't relevant.
The spec states:
9.4.2.2 Simple element (xs:string or xs:hexBinary)
Required occurrence: If the element has a default value then an item is added to the infoset using the default value, otherwise an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' [12] (https://daffodil.apache.org/docs/dfdl/#_ftn12) then an item is added to the Infoset using empty string (type xs:string) or empty hexBinary (type xs:hexBinary) as the value, *otherwise nothing is added to the Infoset*.
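Read literally, the quoted rule is a small decision table. The Python sketch below models it for xs:string only; `empty_rep_infoset_value` is a hypothetical name, `None` stands for "nothing added to the Infoset", and the pending errata about whether dfdl:emptyValueDelimiterPolicy is "in effect" are not modeled.

```python
def empty_rep_infoset_value(required, default=None, evdp="none"):
    """Infoset value produced for an empty representation of an xs:string
    element, following the quoted 9.4.2.2 text (hypothetical model).
    Returns None when nothing is added to the infoset."""
    if required:
        # Required occurrence: default value if there is one, else empty string.
        return default if default is not None else ""
    # Optional occurrence: an item is added only if the policy is not 'none'.
    return "" if evdp != "none" else None


# The format in this thread: optional elements, no initiator/terminator.
print(empty_rep_infoset_value(required=False, evdp="none"))
```

Under these assumptions the last call returns `None`, which is the case the message is complaining about: the empty field never reaches the infoset.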
There are errata/actions to clarify wording here around dfdl:emptyValueDelimiterPolicy being in effect or not (because there is no initiator/terminator for it to use as opposed to the property in isolation just being 'none'). But that doesn't change anything about this issue.
If this very simple format is not possible, then we need a property or new property enum value that makes it possible.
Thoughts?
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Ah I see - what does 'no framing' mean? I had assumed it referred to the
data stream, as per the glossary definition ...
Framing - The term used to describe the delimiters, length fields, and
other parts of the data stream which are present, and may be necessary to
determine the length or position of the content of an element.
If it had said 'no framing properties' then I would have gone with your
interpretation.
But I still am unable to construct an example where
* the framing is declared in the schema and it can be zero-length framing
or greater than zero, so it is acceptable that none is present
* no framing is present
* the value is also zero-length
* where it is clear we should get NormalRep not EmptyRep.
The example I had in mind is this.
dfdl:terminator="xxx %WSP*;" dfdl:initiator="aaa"
dfdl:emptyValueDelimiterPolicy="both".
So I was thinking that data "aaa" would not match empty rep as "xxx" not
present. But of course "WSP*" is present, so we do have a terminator, so
get a match on empty rep. Normal rep not possible.
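The reasoning for this first example can be sketched in Python with regular expressions. The assumptions are that dfdl:terminator="xxx %WSP*;" is read as two alternatives ("xxx", or zero or more whitespace characters), that `\s*` approximates `%WSP*;`, and that the element occupies the whole input; `matches_empty_rep` is a hypothetical helper.

```python
import re

INITIATOR = "aaa"
# Two terminator alternatives: the literal "xxx", or zero or more
# whitespace characters (approximating %WSP*; with \s*).
TERMINATOR_ALTERNATIVES = [r"xxx", r"\s*"]


def matches_empty_rep(data):
    """True if data is initiator + zero-length content + some terminator,
    i.e. EmptyRep under emptyValueDelimiterPolicy="both"."""
    if not data.startswith(INITIATOR):
        return False
    rest = data[len(INITIATOR):]
    return any(re.fullmatch(alt, rest) for alt in TERMINATOR_ALTERNATIVES)


for data in ["aaa", "aaaxxx", "aaa  ", "aaadataxxx"]:
    print(repr(data), matches_empty_rep(data))
```

Because the whitespace alternative matches zero characters, the data "aaa" still satisfies the terminator, so it is EmptyRep rather than NormalRep, as the message concludes.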
If I change the example to...
dfdl:terminator="xxx" dfdl:initiator="aaa"
dfdl:emptyValueDelimiterPolicy="both"
dfdl:documentFinalTerminatorCanBeMissing="yes".
and the data is "aaa<eof>" then as the spec stands I think that would be
normal rep. But I actually think that is wrong and the spec should be
clarified to say that the final terminator is logically present. Otherwise
the (arbitrary) presence or not of the final terminator changes the rep,
and that does not seem right.
I'd need some more time to think on further examples.
Regards
Steve Hanson
From: Mike Beckerle