Mike, your 2nd solution has a problem with the
simple case where the 3rd field is a single unquoted item: the List element
will consume C and D, unless you can put a discriminator on C to stop
it being parsed as a List, in which case it would work.
This even simpler solution suffers from the
same problem, but can also be cured with a discriminator on C.
<xs:sequence dfdl:separator=', ,%WSP*;" "%WSP*;,'>
  <xs:element name="A" type="xs:string"/>
  <xs:element name="B" type="xs:string"/>
  <xs:sequence dfdl:separator=",">
    <xs:element name="List" type="xs:string" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:element name="C" type="xs:string"/>
  <xs:element name="D" type="xs:string"/>
</xs:sequence>
An alternative would be to parse the 3rd
field as a string with the " as escape block start/end, but make it
hidden, then use an inputValueCalc expression to split the string and assign
each chunk to an occurrence of List. But that would mean relaxing the restriction
that inputValueCalc is not allowed on arrays.
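For illustration only, the split-after-parse idea can be sketched outside DFDL in Python. This is an analogue, not a DFDL schema: it leans on Python's csv module to handle the optional quotes, and the function name parse_record and the dictionary keys are assumptions made up for this example.

```python
import csv

def parse_record(line):
    """Parse one record of the form: a, b, <list>, c, d.

    The third field is first read as a single string (the quotes act as
    the escape block), then split on commas into the List occurrences.
    """
    # skipinitialspace drops whitespace that follows a separator, so a
    # quote preceded by a space is still recognised as an opening quote.
    fields = next(csv.reader([line], skipinitialspace=True))
    a, b, raw_list, c, d = (f.strip() for f in fields)
    items = [item.strip() for item in raw_list.split(",")]
    return {"A": a, "B": b, "List": items, "C": c, "D": d}

single = parse_record('a, b, notList, c, d')
multi = parse_record('a, b, "list1, list2, list3",c,d')
```

Both sample lines from Mike's note yield five fields, with List holding one item in the unquoted case and three in the quoted case.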
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 22/11/2017 17:41
Subject: Re: [DFDL-WG] how to trim inside of escape block?
Another related problem:
a, b, notList, c, d
a, b, "list1, list2, list3",c,d
Here the 3rd field is a comma-separated list. It is quoted
if there is more than one list item.
I think to parse this I have to treat the quotation marks
as initiator/terminator, and set dfdl:separator="", but since
the quotes are optional for the single-list-item case, I'm going to need
a choice.
I think the best I can do is
<ignore:ListOf1__XMLSchemaMakesMeHaveThisForUPA/>
<List>notList</List>
and
<ignore:ListOfN__XMLSchemaMakesMeHaveThisForUPA/>
<List>list1</List><List>list2</List><List>list3</List>
as the XML representations.
Are there any better/cleaner solutions?
I did think of this way: (note: I've omitted xs:annotation
and xs:appinfo for brevity), but it isn't exactly "clean".
This is what I call "modeling syntax as data"....
<dfdl:defineVariable name="foundOpenQuote" type="xs:boolean"/>
<xs:group name="optionalOpenQuote">
  <xs:choice>
    <xs:sequence dfdl:initiator='"'>
      <dfdl:setVariable ref="foundOpenQuote" value="{ fn:true() }"/>
    </xs:sequence>
    <xs:sequence dfdl:initiator=""/>
  </xs:choice>
</xs:group>
<xs:group name="matchingCloseQuote">
  <xs:choice>
    <xs:sequence dfdl:terminator='"'>
      <dfdl:discriminator>{ $foundOpenQuote eq fn:true() }</dfdl:discriminator>
    </xs:sequence>
    <xs:sequence/>
  </xs:choice>
</xs:group>
// The main sequence for the data would then have this
as the list element:
<xs:sequence>
  <dfdl:newVariableInstance ref="foundOpenQuote" defaultValue="false"/>
  <xs:sequence dfdl:hiddenGroupRef="optionalOpenQuote"/>
  <xs:sequence dfdl:separator=",">
    <xs:element name="List" type="xs:string" maxOccurs="unbounded"/>
  </xs:sequence>
  <xs:sequence dfdl:hiddenGroupRef="matchingCloseQuote"/>
</xs:sequence>
I'd try this out, except that we haven't got dfdl:newVariableInstance
yet.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
On Wed, Nov 22, 2017 at 4:11 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
I don't think there is a way to achieve
what you want. As you say, trimming pad chars takes precedence over applying
the escape scheme.
I wondered if you could define escapeBlockStart and escapeBlockEnd as "%WSP*;
and %WSP*;" respectively, but the white space entities are not allowed as the
escape character or in the escape block start/end.
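To make the ordering concrete, here is a minimal Python model of "trim pad chars first, then apply the escape block". It is a simplification written for this note, not the DFDL algorithm: parse_field and its parameters are hypothetical, and it ignores escaping of embedded quotes.

```python
def parse_field(raw, pad=" ", quote='"'):
    """Model the order described above: trim pad characters first,
    then remove the escape block start/end."""
    # Step 1: padding is only removed from outside the quotes.
    trimmed = raw.strip(pad)
    # Step 2: drop the escape block delimiters if both are present.
    if len(trimmed) >= 2 and trimmed.startswith(quote) and trimmed.endswith(quote):
        return trimmed[1:-1]
    return trimmed

outside = parse_field('   "x, y"   ')  # whitespace outside the quotes
inside = parse_field('" x, y "')       # whitespace inside the quotes
```

In this model the whitespace outside the quotes is removed, while the whitespace inside them survives, because by the time the quotes are stripped the trimming step has already run.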
Regards
Steve Hanson
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 22/11/2017 01:28
Subject: [DFDL-WG] how to trim inside of escape block?
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
I have a CSV file.
Some lines look like this:
a,b," started with spaces, appearing right after the escape block start ",c,d,e
I reviewed the spec, and I see that pad characters appear outside of the
quotation marks (escape block start/end).
What I'm trying to do is remove the whitespace after the escape block start,
and before the escape block end. This is just spurious whitespace that appears
because some of these CSV files were edited by people.
In my data the quoting characters are not always present. They are only
there if a comma appears in the data string.
Is there a technique for getting rid of the leading/trailing whitespace
inside the escape block start/end that I have forgotten?
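To show the result I am after, here is a small Python sketch that does the trimming as post-processing outside DFDL. It is only an illustration of the desired output, using Python's csv module; the helper name read_trimmed is made up for this example.

```python
import csv

def read_trimmed(line):
    """Read one CSV record and strip whitespace from every field,
    including whitespace that sat just inside the quotes."""
    fields = next(csv.reader([line]))
    return [field.strip() for field in fields]

row = read_trimmed(
    'a,b," started with spaces, appearing right after the escape block start ",c,d,e'
)
```

The quoted field keeps its embedded comma but loses the leading and trailing spaces.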
...mikeb
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU