The general solution in DFDL is to use
an optional repeating element inside a hidden group.
You need to be careful that this optional
hidden element does not consume the next piece of wanted data by mistake.
If all the unwanted elements have known initiators then you are OK. If
you don't know the initiators, but know what is coming next, then one approach
is as follows:
<xs:complexType>
  <xs:sequence>
    <xs:element name="From" type="NameType"
                dfdl:initiator="From:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
    <xs:element name="To" type="NameType"
                dfdl:initiator="To:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
    <xs:sequence dfdl:hiddenGroupRef="UnwantedGroup"/>
    <xs:element name="Subject" type="xs:string"
                dfdl:initiator="Subject:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
  </xs:sequence>
</xs:complexType>
<xs:group name="UnwantedGroup">
  <xs:sequence>
    <!-- Optional and repeating: matches zero or more unwanted header lines -->
    <xs:element name="UnwantedHeaders" minOccurs="0" maxOccurs="unbounded">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="Unwanted" type="xs:string"
                      dfdl:terminator="%NL;%WSP*;">
            <xs:annotation>
              <xs:appinfo source="http://www.ogf.org/dfdl/">
                <!-- Fails, ending the loop, when the line starts with "Subject:" -->
                <dfdl:discriminator
                    test="{ fn:not(fn:starts-with(., 'Subject:')) }"/>
              </xs:appinfo>
            </xs:annotation>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:sequence>
</xs:group>
The hidden loop should consume all header
lines that do not start with "Subject:" and stop when it reaches
one that does.
I've used a terminator for the header
lines; you may have used a separator with separatorPolicy 'suppressed'.
Either should work, but the terminator gives you the opportunity to handle
data where the final CRLF is missing (via property dfdl:documentFinalTerminatorCanBeMissing).
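As a rough sketch of where that property would sit, assuming it is set on the schema-level dfdl:format annotation (the other format properties a real schema needs are omitted here, and you should check that your DFDL processor supports the value 'yes'):

```xml
<xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema"
           xmlns:dfdl="http://www.ogf.org/dfdl/dfdl-1.0/">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <!-- Illustrative only: a real schema sets many more format properties -->
      <!-- Allows the terminator of the last element in the data to be absent -->
      <dfdl:format documentFinalTerminatorCanBeMissing="yes"/>
    </xs:appinfo>
  </xs:annotation>
</xs:schema>
```

With this in place, a message whose Subject line is not followed by a final CRLF should still parse.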
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 01/03/2013 13:15
Subject: [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by: dfdl-wg-bounces@ogf.org
Suppose I am using DFDL to parse email
headers. Suppose the RFC only allows 3 headers: To, From, Subject.
DFDL can handle this, no problem.
But suppose I get an email that includes
a 4th header, one I have not planned for (i.e., have not included in the
DFDL schema), don’t care about, and don’t want in the infoset. Like
so:
From: <john@doe.com>
To: <jane@doe.com>
Keywords: sales     <-- this line should be ignored!
Subject: Latest sales figures
Can DFDL handle this? Does it have
a mechanism for allowing me to ignore (and thus drop) data I haven’t planned
for and don’t care about?