Re: [DFDL-WG] Clarification needed: pad/trim and delimited length

From an email discussion several years ago, in the answer to Alan's question "Should we only look for escapeStartString at the beginning of
Decided on last DFDL WG call to leave the behaviour as currently specified, as it is possible to code the current behaviour.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 26/10/2012 13:39
Subject: Re: [DFDL-WG] Clarification needed: pad/trim and delimited length

Mike

For escape blocks, the escape start/end character must be the first/last character in the text. The order of processing stated in the spec was an attempt to handle the situation where the escape start/end character was not the first/last character in the text due to padding. So for examples like:

Variable length: "aaa,aaaa" ,"bbbbbbb" ,"ccccccc"
Fixed length: "aaa,aaaa" "bbbbbbb" "ccccccc"

the data "

Mike you said: "I'd prefer that we respect them anywhere, but canonical form when generated is at the beginning of the data. However, if we want to be more restrictive/conservative for v1.0 I'm fine with that."

I tested IBM DFDL's implementation. The delimited example above (left justified) worked ok - the parser recognised the start quote and switched on escaping, correctly escaped the comma, then found the end quote and switched off escaping, then found the delimiter. With trimKind 'none' it issued an error to the effect that there was text between quote and next delimiter. With trimKind 'padChar' it worked ok and trimmed off the pad before going on to remove the quotes. However when the scenario was right-justified, it got it wrong, which I think is your point.

The above order of processing leads to the following behaviour when trimKind is 'padChar'.
Let's say I am exporting CSV data from Excel:

Data: xx<sp><sp>
Infoset: xx

Data: "x,x<sp><sp>"
Infoset: xx,<sp><sp>

Explanation: The second data is the same as the first except I have added in a comma, which causes Excel to escape with quotes in its normal way. The trimming takes place before escapes are removed, so the first data loses the spaces while the second keeps them in the infoset. I don't think this is what a user would expect. (Note that Excel escapes the whole field.)

Seems to me there are competing requirements here; we need to decide whether they all need to be satisfied by DFDL 1.0. Several possibilities to consider, here are some for starters:

- Keep current rule but only allow it with left-justified fields
- Keep current rule and trim 'as you go' rather than after extracting the data
- Extend current rule and trim before and after escape character removal
- Change rule to trim after escape character removal and handle leading/trailing text via delimiters
- Change rule to trim after escape character removal and allow %WSP; etc in escape block start/end strings

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 25/10/2012 13:45
Subject: [DFDL-WG] Clarification needed: pad/trim and delimited length
Sent by: dfdl-wg-bounces@ogf.org

The spec says that pad characters are removed before escape scheme processing. However, in delimited context, I can't even determine the length of the field to trim off the padding unless I can do the escape scheme processing. This is either a Chicken-Egg, or the algorithm for parsing is substantially more complex due to padding. Comments?
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
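Editorial illustration (not part of the original thread): the ordering issue discussed in the messages above can be sketched in a few lines of Python. This is only a toy model under stated assumptions, not DFDL processor behaviour: the pad character is a space, fields are left-justified (trailing padding only), the escape block is a pair of double quotes that must be the first and last characters of the text, and the function names are hypothetical.

```python
PAD = " "    # assumed pad character
QUOTE = '"'  # assumed escape block start/end string


def strip_escape_block(text: str) -> str:
    """Remove the quotes only when they are the first and last characters."""
    if len(text) >= 2 and text.startswith(QUOTE) and text.endswith(QUOTE):
        return text[1:-1]
    return text


def trim_then_unescape(field: str) -> str:
    """Order discussed in the thread: trim padding first, then remove the escape block."""
    return strip_escape_block(field.rstrip(PAD))


def unescape_then_trim(field: str) -> str:
    """Alternative order: remove the escape block first, then trim padding."""
    return strip_escape_block(field).rstrip(PAD)


plain = "xx  "               # unquoted field with two trailing spaces
quoted_inside = '"x,x  "'    # Excel-style: the whole field, spaces included, is quoted
quoted_outside = '"x,x"  '   # padding added after the closing quote

for field in (plain, quoted_inside, quoted_outside):
    print(repr(field), repr(trim_then_unescape(field)), repr(unescape_then_trim(field)))
```

In this toy model, trimming first drops the spaces from the unquoted field but leaves them inside the quoted one, which is the inconsistency in the Excel example. Trimming last handles that case but leaves the quotes in place when the padding sits outside them, which is the situation the current order was meant to handle.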
participants (1)
-
Steve Hanson