OK so I think the motivating example can
be described as follows:
1) CSV-style format
2) Only delimiters are separators
3) There are optional fields that occur beyond the last required field *
4) Empty string is considered a normal value that needs preserving for
such an optional field
5) Nil value is already being used for something else **
* Otherwise you just make all fields required and use a default value
of empty string
** Otherwise you use a nil value of empty string.
IBM DFDL has been operating in a world
of CSV and other delimited formats for nearly 8 years, and I've not come
across this requirement in reality. There is usually no distinction
between an omitted value and empty string in CSV style formats where the
field is optional.
I would prefer that this was deferred
until DFDL 2.0. Meanwhile we can design the proposed new dfdlx:emptyElementParsePolicy
so it can be easily extended.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: DFDL-WG <dfdl-wg@ogf.org>
Date: 27/09/2019 19:20
Subject: Re: [DFDL-WG] Problem: simple format that is impossible to model
Yes, there is the nillable technique, but my simplified
example data format was too simplified.
In the real format that I derived my simple example from,
nillability is used for other purposes. In that format, elements are
generally nillable with nilValue="%WSP*;-%WSP*;". That is, the
format needs to distinguish explicitly nilled values from string values,
including empty string values.
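
A minimal Python sketch of the three-way distinction described above, assuming a regular expression as a rough stand-in for nilValue="%WSP*;-%WSP*;" (Python's \s only approximates DFDL's WSP class, and the name classify is hypothetical):

```python
import re

# Rough stand-in for dfdl:nilValue="%WSP*;-%WSP*;": optional whitespace,
# one hyphen, optional whitespace. \s is only an approximation of WSP.
NIL_PATTERN = re.compile(r"\s*-\s*")

def classify(field_text):
    """Return (kind, value) for the raw text of one field."""
    if NIL_PATTERN.fullmatch(field_text):
        return ("nil", None)           # explicitly nilled
    if field_text == "":
        return ("empty-string", "")    # a normal value that happens to be empty
    return ("string", field_text)      # any other normal value

for raw in "data/ - //x".split("/"):
    print(repr(raw), classify(raw))
```

The point is only that nil, empty string, and non-empty string are three separate outcomes, so the nil mechanism is not free to stand in for the empty string.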
I know I'm not the only user who thinks one should be
able to model this simple data format without needing to use nillability.
For example, if you look at the CSV schema on DFDLSchemas
on github, the elements for rows of data are not nillable, even though
adjacent commas are routine in CSV files.
Of course a CSV schema for a fixed-number-of-columns doesn't
have a variable number of elements in the rows, so the elements in the
rows are all required, not optional. Still I think you can't tolerate
adjacent commas without using nillability if you want the data to both
parse and unparse. I have wanted to enhance the CSV schema on github
to show more variations on the CSV-like theme for a while, because I have
recently created many CSV-like data schemas, and a common theme to them
is that there are a variety of representations of nilled such as "N/A
none -" (these were human-created spreadsheet 'documents' exported
as CSV, not machine-generated CSV data sets), and in some of these
empty strings are legit "normal" values. I have had the good
fortune that these formats were parse-only, as they would not have faithfully
unparsed.
The problem ultimately boils down to this: there is no way in
DFDL to say "treat empty strings as just normal strings". The
use of initiators/terminators and dfdl:emptyValueDelimiterPolicy="both"
doesn't fix this, because that doesn't give you a NormalRep, it gives you
an EmptyRep.
There is also an ambiguity in the spec between section 9.2.5 and
sections 9.2.3 - 9.2.4, as to whether zero-length string/hexBinary with
no framing is NormalRep or EmptyRep.
The fact that we have a property named dfdl:emptyValueDelimiterPolicy
suggests that an element, regardless of type, is EmptyRep if the content
is zero length and the initiators/terminators match the EVDP policy.
That suggests that section 9.2.5 is simply incorrect -
a NormalRep cannot be zero-length for string or hexBinary if there is no
framing. Such would always be an EmptyRep.
That would leave the nillable mechanism as the only way
to deal with zero-length strings that need to be retained in the infoset.
While it is good to fix that ambiguity, I find this not
really an adequate solution. I can't deal with my slash-delimited format
that uses nillable for other purposes in any reasonable way. I need a way
to say "treat zero-length strings as normal values".
I suggest we modify the recently proposed dfdlx:emptyElementParsePolicy
property to encompass the added variation we need. So the values of the
property would be:
- treat zero-length for all types as AbsentRep always (we
were calling this "treatAsMissing", or "treatAsAbsent";
this is the IBM DFDL behavior today as I understand it.)
- treat zero-length for all types as EmptyRep always (we
were calling this "treatAsEmpty"; this is the DFDL Spec behavior
as written today, as revised by current errata and with the correction
mentioned above to remove the ambiguity.)
- treat zero-length for string/hexBinary as NormalRep, all
other types as EmptyRep (Suggest "treatAsNormalOrEmpty".
The rationale for this enum name is that since types other than
string/hexBinary can't have a zero-length NormalRep, they must be EmptyRep.
I read the enum name as "treat as NormalRep when possible, otherwise
treat as EmptyRep". Another possible enum name might be
"preferNormalToEmpty".)
Section 9.2.5 would be clarified to say that zero-length NormalRep is
possible for string/hexBinary if there is no framing and
dfdlx:emptyElementParsePolicy is 'treatAsNormalOrEmpty'.
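
The three proposed values can be summarised as a small lookup. This is a toy Python model for the no-framing case only, not an implementation of any DFDL processor; the function name zero_length_rep and the short type names are hypothetical:

```python
# Toy model: which representation a zero-length occurrence (no framing)
# is given under each proposed dfdlx:emptyElementParsePolicy value.
ZERO_LENGTH_NORMAL_TYPES = {"string", "hexBinary"}

def zero_length_rep(policy, type_name):
    if policy == "treatAsAbsent":
        return "AbsentRep"
    if policy == "treatAsEmpty":
        return "EmptyRep"
    if policy == "treatAsNormalOrEmpty":
        # Only string/hexBinary can have a zero-length NormalRep.
        if type_name in ZERO_LENGTH_NORMAL_TYPES:
            return "NormalRep"
        return "EmptyRep"
    raise ValueError(f"unknown policy: {policy}")

for policy in ("treatAsAbsent", "treatAsEmpty", "treatAsNormalOrEmpty"):
    print(policy, [zero_length_rep(policy, t) for t in ("string", "hexBinary", "int")])
```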
Sections 14.2.2 and 14.2.3 may need a one-line clarification
added that when zero-length string/hexBinary is being treated as NormalRep,
then they are "normal" not "empty", and since they
are not EmptyRep suppression of zero-length and separators would not occur
for trailingEmpty, trailingEmptyStrict, or anyEmpty. (Which should be intuitive
given the enum names use the word "Empty")
It would/could be an SDE (or maybe warning) if this latter
"treatAsNormalOrEmpty" was specified for a potentially required
element (scalar or minOccurs > 0) of type string or hexBinary of variable
length (so possibly zero) with a default specified other than default="",
because such a default value could never be used, as zero length would
be considered NormalRep and so would not trigger use of the default value.
I.e., SDE like "Default value for element X can never be used because...."
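
A sketch of that check in Python, under the assumption that a schema compiler already knows these facts about the element; the function and parameter names are hypothetical, and default=None stands for "no default specified":

```python
def default_is_unreachable(policy, type_name, min_occurs, variable_length, default):
    """True when a specified default could never be used, per the rule above."""
    return (
        policy == "treatAsNormalOrEmpty"
        and type_name in ("string", "hexBinary")
        and min_occurs > 0            # potentially required (a scalar has minOccurs 1)
        and variable_length           # so the value could be zero length
        and default is not None       # a default was specified...
        and default != ""             # ...and it is not default=""
    )

if default_is_unreachable("treatAsNormalOrEmpty", "string", 1, True, "n/a"):
    print("SDE (or warning): default value for element can never be used")
```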
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology
| www.tresys.com
Please note: Contributions to the DFDL Workgroup's email
discussions are subject to the OGF
Intellectual Property Policy
On Fri, Sep 27, 2019 at 4:39 AM Steve Hanson <smh@uk.ibm.com>
wrote:
As there are no initiators or terminators,
and your example infoset calls everything 'field', I am assuming that the
element looks logically like:
<xs:element name="field" type="xs:string" minOccurs="0"
maxOccurs="unbounded" />
You want to preserve the position of the occurrences in the infoset so
that they re-appear on output. The agreed way to do this is:
<xs:element name="field" type="xs:string" minOccurs="0"
maxOccurs="unbounded" nillable="true" dfdl:nilKind="literalValue"
dfdl:nilValue="%ES;" />
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 26/09/2019 19:11
Subject: Re: [DFDL-WG] Problem: simple format that is impossible to model
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
To start discussion on my own issue.....
The problem here may be that for a string (or hexBinary), if there is no
initiator/terminator, there is no way to distinguish EmptyRep from NormalRep.
I.e., an empty string is a "normal" value for a string.
Sections 9.2.3 and 9.2.4 seem to define EmptyRep and NormalRep such that
an empty string will be an EmptyRep, not a NormalRep.
However section 9.2.5 on zero-length says:
"The normal representation can be a zero-length representation
if the type is xs:string or xs:hexBinary and there is no framing."
That suggests that when there is no framing, a zero-length string is NormalRep,
not EmptyRep, which is the opposite conclusion from what is in sections
9.2.3 and 9.2.4.
If this latter clarification is correct, then my format *should* work as
I expect, because the empty string elements will be considered NormalRep
and infoset values will be created for them.
It simply doesn't work because of a bug in Daffodil, which has not interpreted
this correctly.
...mikeb
On Thu, Sep 26, 2019 at 1:47 PM Mike Beckerle <mbeckerle.dfdl@gmail.com>
wrote:
I have a dead-simple little format:
data/data/data/data
data/data/data/data
It is lines of "/"-separated strings. All elements are optional.
I simply want this:
data//data
to round trip. For that to happen I need it to parse into
<field>data</field><field></field><field>data</field>
That is, I require that empty field element in the middle to be created
and put into the infoset.
I can find no way to do this.
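
To make the round-trip requirement concrete, here is a minimal Python sketch. It is a toy split/join model, not DFDL or Daffodil, and the function names are hypothetical:

```python
def parse_dropping_empties(line, sep="/"):
    """Model a parse where zero-length optional fields add nothing to the infoset."""
    return [field for field in line.split(sep) if field != ""]

def parse_keeping_empties(line, sep="/"):
    """Model a parse where a zero-length field becomes an empty-string element."""
    return line.split(sep)

def unparse(fields, sep="/"):
    """Write the fields back out with the separator between them."""
    return sep.join(fields)

line = "data//data"
print(unparse(parse_dropping_empties(line)))  # the middle field is lost
print(unparse(parse_keeping_empties(line)))   # the original line is reproduced
```

Only the second parse keeps enough information in the infoset for the unparser to reproduce the adjacent separators.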
The strings have no initiator/terminator, so dfdl:emptyValueDelimiterPolicy
is not relevant. All the elements are optional, so default values aren't
relevant.
The spec states:
9.4.2.2 Simple element (xs:string or xs:hexBinary)
Required occurrence: If the element has a default value then an item is
added to the infoset using the default value, otherwise an item is added
to the Infoset using empty string (type xs:string) or empty hexBinary (type
xs:hexBinary) as the value.
Optional occurrence: If dfdl:emptyValueDelimiterPolicy is not 'none' then
an item is added to the Infoset using empty string (type xs:string) or
empty hexBinary (type xs:hexBinary) as the value, otherwise nothing
is added to the Infoset.
There are errata/actions to clarify wording here around dfdl:emptyValueDelimiterPolicy
being in effect or not (because there is no initiator/terminator for it
to use as opposed to the property in isolation just being 'none').
But that doesn't change anything about this issue.
If this very simple format is not possible, then we need a property or
new property enum value that makes it possible.
Thoughts?
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU