Can I ignore data I don't want in DFDL?

Suppose I am using DFDL to parse email headers. Suppose the RFC only allows 3 headers: To, From, Subject. DFDL can handle this, no problem.

But suppose I get an email that includes a 4th header, one I have not planned for (i.e., have not included in the DFDL schema), don't care about, and don't want in the infoset. Like so:

    From: <john@doe.com>
    To: <jane@doe.com>
    Keywords: sales    <-- this line should be ignored!
    Subject: Latest sales figures

Can DFDL handle this? Does it have a mechanism for allowing me to ignore (and thus drop) data I haven't planned for and don't care about?

The general solution in DFDL is to use the combination of an optional repeating element inside a hidden group. You need to be careful that this optional hidden element does not consume the next piece of wanted data by mistake. If all the unwanted elements have known initiators then you are ok. If you don't know the initiators, but know what is coming next, then one approach is as follows:

    <xs:complexType>
      <xs:sequence>
        <xs:element name="From" type="NameType" dfdl:initiator="From:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
        <xs:element name="To" type="NameType" dfdl:initiator="To:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
        <!-- Hidden: consumed on parse but left out of the infoset -->
        <xs:sequence dfdl:hiddenGroupRef="UnwantedGroup"/>
        <xs:element name="Subject" type="xs:string" dfdl:initiator="Subject:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
      </xs:sequence>
    </xs:complexType>

    <xs:group name="UnwantedGroup">
      <xs:sequence>
        <xs:element name="UnwantedHeaders" minOccurs="0" maxOccurs="unbounded">
          <xs:complexType>
            <xs:sequence>
              <xs:element name="Unwanted" type="xs:string" dfdl:terminator="%NL;%WSP*;">
                <xs:annotation><xs:appinfo source="http://www.ogf.org/dfdl/">
                  <!-- Fails, ending the loop, when the line starts with "Subject:" -->
                  <dfdl:discriminator test="{fn:not(fn:starts-with(., 'Subject:'))}"/>
                </xs:appinfo></xs:annotation>
              </xs:element>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:group>

The hidden loop should consume all header lines that do not start with "Subject:" and stop when it reaches one that does.

I've used a terminator for the header lines; you may have used a separator with separatorPolicy 'suppressed'. Either should work, but the terminator gives you the opportunity to handle data where the final CRLF is missing (via property dfdl:documentFinalTerminatorCanBeMissing).

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 01/03/2013 13:15
Subject: [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by: dfdl-wg-bounces@ogf.org

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
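
For the simpler case mentioned in the reply above, where every unwanted header has a known initiator, the hidden group might look something like the fragment below. This is only a sketch under stated assumptions: the group name KnownUnwantedGroup is hypothetical, "Keywords" is assumed to be the only unwanted header, and the xs: and dfdl: prefixes plus a default dfdl:format (delimited text) are assumed to be declared elsewhere in the schema.

```xml
<!-- Sketch only. KnownUnwantedGroup is a hypothetical name; "Keywords" is
     assumed to be the only unwanted header. -->
<xs:group name="KnownUnwantedGroup">
  <xs:sequence>
    <!-- Optional and repeating: matches zero or more "Keywords:" lines.
         The initiator stops it from consuming any other header line. -->
    <xs:element name="Keywords" type="xs:string"
                minOccurs="0" maxOccurs="unbounded"
                dfdl:occursCountKind="implicit"
                dfdl:initiator="Keywords:%WSP*;"
                dfdl:terminator="%NL;%WSP*;"/>
  </xs:sequence>
</xs:group>

<!-- Placed in the visible sequence wherever the unwanted lines may appear;
     elements parsed through a hidden group are left out of the infoset. -->
<xs:sequence dfdl:hiddenGroupRef="KnownUnwantedGroup"/>
```

Because the element carries its own initiator, no discriminator is needed here: a line that does not begin with "Keywords:" simply fails to match and the optional occurrence is skipped.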

> If you don't know the initiators, but know what is coming next

I get your solution, and it looks useful, but… email headers can arrive in any order (excepting a couple headers, like Received, which are supposed to be first). So I could get:

    To, From, Keywords, Subject

Or I could get:

    Keywords, To, Subject, From

Or I could get any other combination. If the order is not known a priori, can I still use this approach? How about something like this (pseudo code)?

    HeaderArray (0 to unbounded)
        Sequence
            To
            From
            Subject
            UnwantedHeader (ref to the Group)
        /Sequence
    /HeaderArray

    Group (see the group Steve defined below)

The problem, of course, is that I don’t know what the Header keys will be. Can I have a whole bunch of discriminators, one for every *allowed* header? IOW, the discriminator is “any header that’s not one of the headers that I want.”

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Friday, March 01, 2013 9:03 AM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Can I ignore data I don't want in DFDL?

> The general solution in DFDL is to use the combination of an optional repeating element inside a hidden group. You need to be careful that this optional hidden element does not consume the next piece of wanted data by mistake. If all the unwanted elements have known initiators then you are ok.

OK - that changes the picture - you need an unordered sequence - not yet supported by DFDL, but coming soon. Until then, I think a choice wrapped in a repeating element is needed, with element branches for Subject, To, From and Unwanted (last) to hoover up anything else.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 01/03/2013 15:25
Subject: Re: [DFDL-WG] Can I ignore data I don't want in DFDL?
Sent by: dfdl-wg-bounces@ogf.org
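
The suggestion of a choice wrapped in a repeating element could be sketched roughly as follows. Treat it as an illustration under assumptions, not as a tested schema: the element name Header is hypothetical, xs:string stands in for the NameType used earlier in the thread, the input is assumed to contain header lines only, and the xs: and dfdl: prefixes plus a default dfdl:format (delimited text) are assumed to be declared elsewhere.

```xml
<!-- Sketch only. "Header" is a hypothetical wrapper name; each occurrence
     holds exactly one header line, in whatever order the lines arrive. -->
<xs:element name="Header" minOccurs="0" maxOccurs="unbounded"
            dfdl:occursCountKind="implicit">
  <xs:complexType>
    <!-- Branches are tried in schema order. initiatedContent is "no"
         because the last branch has no initiator. -->
    <xs:choice dfdl:choiceLengthKind="implicit" dfdl:initiatedContent="no">
      <xs:element name="From" type="xs:string"
                  dfdl:initiator="From:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
      <xs:element name="To" type="xs:string"
                  dfdl:initiator="To:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
      <xs:element name="Subject" type="xs:string"
                  dfdl:initiator="Subject:%WSP*;" dfdl:terminator="%NL;%WSP*;"/>
      <!-- Catch-all, deliberately last: reached only after every wanted
           branch has failed to match its initiator. -->
      <xs:element name="Unwanted" type="xs:string"
                  dfdl:terminator="%NL;%WSP*;"/>
    </xs:choice>
  </xs:complexType>
</xs:element>
```

Putting the catch-all last matters because the wanted branches are distinguished only by their initiators; if Unwanted came first it would accept every line. In this sketch Unwanted still appears in the infoset, so dropping it would need a further step (for example hiding it, or filtering after the parse), which is not shown here.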

That makes sense. And, yes, I should have a choice, not a sequence, in my pseudo code.

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Friday, March 01, 2013 10:53 AM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Can I ignore data I don't want in DFDL?

> OK - that changes the picture - you need an unordered sequence - not yet supported by DFDL, but coming soon. Until then, I think a choice wrapped in a repeating element is needed, with element branches for Subject, To, From and Unwanted (last) to hoover up anything else.
participants (2)
- Garriss Jr., James P.
- Steve Hanson