On the last DFDL WG call we decided to leave the behaviour as currently
specified, since the current behaviour can be implemented.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 26/10/2012 13:39
Subject: Re: [DFDL-WG] Clarification needed: pad/trim and delimited length
Mike
For escape blocks, the escape start/end
character must be the first/last character in the text. The order of processing
stated in the spec was an attempt to handle the situation where the escape
start/end character was not the first/last character in the text
due to padding, as in examples like these:
Variable length: "aaa,aaaa" ,"bbbbbbb" ,"ccccccc"
Fixed length: "aaa,aaaa" "bbbbbbb" "ccccccc"
From an email discussion several years ago, in answer to Alan's question
"Should we only look for escapeStartString at the beginning of the data",
Mike, you said: "I'd prefer
that we respect them anywhere, but canonical form when generated is at
the beginning of the data. However, if we want to be more restrictive/conservative
for v1.0 I'm fine with that."
I tested IBM DFDL's implementation.
The delimited example above (left justified) worked ok - the parser recognised
the start quote and switched on escaping, correctly escaped the comma,
then found the end quote and switched off escaping, then found the delimiter.
With trimKind 'none' it issued an error to the effect that there was
text between quote and next delimiter. With trimKind 'padChar' it worked
ok and trimmed off the pad before going on to remove the quotes. However
when the scenario was right-justified, it got it wrong, which I think is
your point.
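The left-justified behaviour described above can be sketched roughly as follows. This is only an illustration in Python, not IBM DFDL's actual code: the names scan_field and scan_record are hypothetical, and it assumes a single-character delimiter, a double quote as both escape block start and end, a space as the pad character, and no escape-escape handling.

```python
def scan_field(data, start, delimiter=",", quote='"', pad=" ", trim_kind="none"):
    """Return (value, next_start); next_start is None after the last field."""
    in_block = data.startswith(quote, start)
    search_from = start
    close = -1
    if in_block:
        # Escaping is switched on at the start quote and off at the end quote,
        # so delimiters between the two are treated as data.
        close = data.find(quote, start + 1)
        if close == -1:
            raise ValueError("escape block is not terminated")
        search_from = close + 1
    delim_at = data.find(delimiter, search_from)
    end = len(data) if delim_at == -1 else delim_at
    raw = data[start:end]
    if trim_kind == "padChar":
        raw = raw.rstrip(pad)  # left-justified: padding is on the right
    if in_block:
        # After trimming, the end quote must be the last character.
        if len(raw) != close - start + 1:
            raise ValueError("text between end quote and next delimiter")
        raw = raw[1:-1]  # remove the quotes last
    next_start = None if delim_at == -1 else delim_at + 1
    return raw, next_start


def scan_record(data, **options):
    """Scan every field of one record using scan_field."""
    fields, pos = [], 0
    while pos is not None:
        value, pos = scan_field(data, pos, **options)
        fields.append(value)
    return fields


line = '"aaa,aaaa"  ,"bbbbbbb" ,"ccccccc"'
print(scan_record(line, trim_kind="padChar"))  # ['aaa,aaaa', 'bbbbbbb', 'ccccccc']
```

With trim_kind="none" the same line raises an error about text between the end quote and the next delimiter. In this sketch a right-justified field such as <sp><sp>"aaa,aaaa" is split at the embedded comma, because the start quote is not the first character when scanning begins.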
The above order of processing leads
to the following behaviour when trimKind is 'padChar'. Let's say I am exporting
CSV data from Excel:
Data: xx<sp><sp>
Infoset: xx
Data: "x,x<sp><sp>"
Infoset: x,x<sp><sp>
Explanation: The second data is the same
as the first except that I have added a comma, which causes Excel to escape
with quotes in its normal way. The trimming takes place before escapes
are removed, so the first data loses the spaces while the second keeps them
in the infoset. I don't think this is what a user would expect. (Note that
Excel escapes the whole field).
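To make the asymmetry concrete, here is a small Python sketch of the two orderings. The function names are hypothetical, a space stands in for the pad character, and the quote handling is reduced to stripping one pair of enclosing double quotes, so this is an illustration rather than what any DFDL processor does.

```python
def remove_escape_block(text, quote='"'):
    """Strip one pair of enclosing quotes, if both are present."""
    if len(text) >= 2 and text.startswith(quote) and text.endswith(quote):
        return text[1:-1]
    return text


def trim_then_unescape(raw, pad=" "):
    # Order currently in the spec: trim padding, then remove escapes.
    return remove_escape_block(raw.rstrip(pad))


def unescape_then_trim(raw, pad=" "):
    # Alternative order: remove escapes, then trim padding.
    return remove_escape_block(raw).rstrip(pad)


for raw in ['xx  ', '"x,x  "']:
    print(repr(raw), repr(trim_then_unescape(raw)), repr(unescape_then_trim(raw)))
```

Under these assumptions, trim_then_unescape keeps the two spaces for the quoted field but drops them for the unquoted one, while unescape_then_trim drops them in both cases. The alternative order has its own problem: for a field padded outside the quotes, such as "aaa,aaaa"<sp><sp>, it leaves the quotes in place, which is the case the spec ordering was meant to handle.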
It seems to me there are competing requirements
here, and we need to decide whether they all need to be satisfied by DFDL 1.0.
There are several possibilities to consider; here are some for starters:
- Keep current rule but only allow it with left-justified fields
- Keep current rule and trim 'as you go' rather than after extracting the data
- Extend current rule and trim before and after escape character removal
- Change rule to trim after escape character removal and handle leading/trailing text via delimiters
- Change rule to trim after escape character removal and allow %WSP; etc in escape block start/end strings
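As one illustration of these options, the third one (trim before and after escape character removal) might look like the sketch below. The function name is hypothetical, and it assumes space padding on the right and a single pair of enclosing double quotes; it is not proposed spec wording.

```python
def trim_unescape_trim(raw, pad=" ", quote='"'):
    """Trim padding, remove one enclosing escape block, then trim again."""
    text = raw.rstrip(pad)  # first trim exposes an end quote hidden by padding
    if len(text) >= 2 and text.startswith(quote) and text.endswith(quote):
        text = text[1:-1]  # remove the escape block start and end
    return text.rstrip(pad)  # second trim removes padding that was inside the quotes


for raw in ['xx  ', '"x,x  "', '"aaa,aaaa"  ']:
    print(repr(raw), repr(trim_unescape_trim(raw)))
```

Under these assumptions all three inputs lose their trailing spaces, at the cost that spaces deliberately placed inside the quotes can no longer be preserved.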
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 25/10/2012 13:45
Subject: [DFDL-WG] Clarification needed: pad/trim and delimited length
Sent by: dfdl-wg-bounces@ogf.org
The spec says that pad characters are removed before escape scheme processing.
However, in delimited context, I can't even determine the length of the
field to trim off the padding unless I can do the escape scheme processing.
This is either a chicken-and-egg problem, or the parsing algorithm is
substantially more complex because of padding.
Comments?
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU