OK, that changes the picture. You need an unordered sequence, which is not
yet supported by DFDL but is coming soon. Until then, I think you need a
choice wrapped in a repeating element, with element branches for Subject,
To, From and, as the last branch, Unwanted to hoover up anything else.
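
A minimal sketch of that choice-in-a-repeating-element layout might look like the following. The element names Headers and Header are hypothetical, xs:string is assumed for every branch, and a dfdl:format supplying the usual text and delimited-length defaults is assumed to be defined elsewhere. It is an illustration of the idea, not a tested schema.

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <!-- Assumption: a dfdl:format with text/delimited defaults is supplied elsewhere -->
  <xs:element name="Headers"> <!-- hypothetical wrapper name -->
    <xs:complexType>
      <xs:sequence>
        <!-- One occurrence per header line, in whatever order the lines arrive -->
        <xs:element name="Header" minOccurs="0" maxOccurs="unbounded"> <!-- hypothetical name -->
          <xs:complexType>
            <xs:choice>
              <!-- Wanted headers: each branch is identified by its initiator -->
              <xs:element name="Subject" type="xs:string"
                          dfdl:initiator="Subject:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
              <xs:element name="To" type="xs:string"
                          dfdl:initiator="To:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
              <xs:element name="From" type="xs:string"
                          dfdl:initiator="From:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
              <!-- Last branch: no initiator, so it takes any line the others rejected -->
              <xs:element name="Unwanted" type="xs:string"
                          dfdl:terminator="%NL;%WSP*;"/>
            </xs:choice>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
</xs:schema>
```

Because the branches are tried in order, Unwanted has to come last. In this sketch Unwanted is an ordinary element, so it still appears in the infoset.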
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 01/03/2013 15:25
Subject: Re: [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by: dfdl-wg-bounces@ogf.org
> If you don't know the initiators, but know what is coming next
I get your solution, and it looks useful, but… email headers can arrive in
any order (excepting a couple headers, like Received, which are supposed to
be first).
So I could get: To, From, Keywords, Subject
Or I could get: Keywords, To, Subject, From
Or I could get any other combination.
If the order is not known a priori, can I still use this approach?
How about something like this (pseudo code)?
HeaderArray (0 to unbounded)
  Sequence
    To
    From
    Subject
    UnwantedHeader (ref to the Group)
  /Sequence
/HeaderArray
Group (see the group Steve defined below)
The problem, of course, is that I don’t know what the Header keys will be.
Can I have a whole bunch of discriminators, one for every *allowed* header?
IOW, the discriminator is “any header that’s not one of the headers that I
want.”
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Friday, March 01, 2013 9:03 AM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Can I ignore data I don't want in DFDL?
The general solution in DFDL is to use the
combination of an optional repeating element inside a hidden group.
You need to be careful that this optional hidden element does not consume
the next piece of wanted data by mistake. If all the unwanted elements
have known initiators then you are ok. If you don't know the initiators,
but know what is coming next, then one approach is as follows:
<xs:complexType>
  <xs:sequence>
    <xs:element name="From" type="NameType" dfdl:initiator="From:%WSP*;"
                dfdl:terminator="%NL;%WSP*;"/>
    <xs:element name="To" type="NameType" dfdl:initiator="To:%WSP*;"
                dfdl:terminator="%NL;%WSP*;"/>
    <!-- Hidden group: parsed, but left out of the infoset -->
    <xs:sequence dfdl:hiddenGroupRef="UnwantedGroup"/>
    <xs:element name="Subject" type="xs:string" dfdl:initiator="Subject:%WSP*;"
                dfdl:terminator="%NL;%WSP*;"/>
  </xs:sequence>
</xs:complexType>

<xs:group name="UnwantedGroup">
  <xs:sequence>
    <!-- Optional and repeating: zero or more unwanted header lines -->
    <xs:element name="UnwantedHeaders" minOccurs="0" maxOccurs="unbounded">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Unwanted" type="xs:string" dfdl:terminator="%NL;%WSP*;">
            <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
              <!-- Accept the line only if it does not start with "Subject:" -->
              <dfdl:discriminator test="{fn:not(fn:starts-with(., 'Subject:'))}"/>
            </xs:appinfo></xs:annotation>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:group>
The hidden loop should consume all header lines that do not start with
"Subject:" and stop when it reaches one that does.
I've used a terminator for the header lines; you may have used a separator
with separatorPolicy 'suppressed'. Either should work, but the terminator
gives you the opportunity to handle data where the final CRLF is missing
(via property dfdl:documentFinalTerminatorCanBeMissing).
Regards
Steve Hanson
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 01/03/2013 13:15
Subject: [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by: dfdl-wg-bounces@ogf.org
Suppose I am using DFDL to parse email headers. Suppose the RFC only
allows 3 headers: To, From, Subject. DFDL can handle this,
no problem.
But suppose I get an email that includes a 4th header, one I have not planned
for (i.e., have not included in the DFDL schema), don’t care about, and
don’t want in the infoset. Like so:
From: <john@doe.com>
To: <jane@doe.com>
Keywords: sales    <-- this line should be ignored!
Subject: Latest sales figures
Can DFDL handle this? Does it have a mechanism for allowing me to
ignore (and thus drop) data I haven’t planned for and don’t care about?
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU