I've been trying to work out a way to do this in a single pass, but so far
no luck. I think your conclusion is correct and it requires two passes.
The ability to handle formats like this is a goal of DFDL; it just didn't
make the cut for 1.0 of the spec.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 07/06/2013 13:18
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org
I have received no response
from this. :-(
Should I take this to mean
that folding whitespace (FWS) in email headers is a problem that DFDL cannot
handle? I think this is a reasonable conclusion, as FWS is conceptually
identical to IMF comments, which we have also identified as a problem that
DFDL cannot handle. I’m not trying to cast stones at DFDL, just
trying to understand its limitations.
The solution for these two
problems (and for encoded words as well) is for DFDL to support multiple
passes over some (or maybe all) of the data. Is this a feature that
is being considered for DFDL?
From: Garriss Jr., James P.
Sent: Wednesday, June 05, 2013 1:23 PM
To: dfdl-wg@ogf.org
Subject: RE: [DFDL-WG] Ignore extraneous CRLF w/ space?
Ok, so the good news is that
I completely understand what you’re talking about now. Thanks for
the example and explanation (with correction).
The bad news is that I don’t
see how this helps me. IOW, I now have an “array” of “data” elements,
but I need to validate the actual data. I should be breaking the
header into “from” data, “by” data, “via” data, etc., with “date”
data at the end. Something kinda like this:
<Received>
  <Tokens>
    <From>smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])</From>
    <By>localhost (Postfix)</By>
    <Via>Exchange Front-End Server webmail.afmc.af.mil ([131.28.34.85])</Via>
    <With>SMTP</With>
    <Id>0A8791F116E</Id>
    <For>&lt;jgarriss@mitre.org&gt;</For>
  </Tokens>
  <DateTime>
    <DateTimeStuff>
      <DayOfTheWeek>Tue</DayOfTheWeek>
      …more day time stuff here…
    </DateTimeStuff>
  </DateTime>
</Received>
I think we are saying
this is another 2-pass problem. In other words, the data is this
(only 1 CRLF at the end):
Received: from smtpksrv1.mitre.org
(localhost.localdomain [127.0.0.1]) by localhost (Postfix) via Exchange
Front-End Server webmail.afmc.af.mil ([131.28.34.85]) with SMTP id 0A8791F116E
for <jgarriss@mitre.org>;
Tue, 4 Jun 2013 14:03:24 -0400 (EDT)
But IMF adds what it calls
“folding whitespace” to break it into multiple lines (see http://tools.ietf.org/html/rfc5322#section-3.2.2),
like this:
Received: from smtpksrv1.mitre.org
 (localhost.localdomain [127.0.0.1])
 by localhost (Postfix)
 via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with
 SMTP id 0A8791F116E for <jgarriss@mitre.org>;
 Tue,
 4 Jun 2013 14:03:24
 -0400 (EDT)
So DFDL needs 2 passes to
correctly parse the data, one to remove the CRLFs and another to parse/validate
the data. (If you recall, Mike, we’ve found that DFDL has the same
problem with IMF comments and encoded words.) And DFDL can’t do
multiple passes.
That right?
If that’s right, this is
a serious problem, because *many* headers use folding whitespace
(though Received is probably the most important one). It’s a pretty
core concept for IMF.
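
To make the two passes concrete, here is a minimal sketch in Python (not DFDL). Pass 1 removes every CRLF that is immediately followed by a space or tab, as RFC 5322 section 3.2.2 describes; pass 2 splits the unfolded line into keyword tokens and the date. The names `unfold` and `parse_received` and the keyword list are assumptions made for illustration, and the keyword split is deliberately naive: it would be confused by a keyword appearing inside a comment or an address.

```python
import re

# CRLF immediately followed by a space or tab (folding, RFC 5322 section 3.2.2).
FOLD = re.compile(r"\r\n(?=[ \t])")

# Hypothetical keyword list, taken from the example header in this thread.
KEYWORDS = "from|by|via|with|id|for"
TOKEN = re.compile(
    r"\b(" + KEYWORDS + r")\s+(.*?)(?=\s+(?:" + KEYWORDS + r")\s|$)", re.S
)


def unfold(raw: str) -> str:
    """Pass 1: drop each CRLF that starts a fold, keeping the whitespace after it."""
    return FOLD.sub("", raw)


def parse_received(raw: str) -> dict:
    """Pass 2: split an unfolded Received header into tokens plus the date."""
    line = unfold(raw)
    if line.endswith("\r\n"):
        line = line[:-2]  # the single real terminator
    name, _, body = line.partition(":")
    if name.strip().lower() != "received":
        raise ValueError("not a Received header")
    # Everything after the last ';' is the date; everything before is tokens.
    tokens_part, _, date_part = body.rpartition(";")
    result = {m.group(1): m.group(2).strip()
              for m in TOKEN.finditer(tokens_part.strip())}
    result["date"] = date_part.strip()
    return result


folded = (
    "Received: from smtpksrv1.mitre.org\r\n"
    " (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix)\r\n"
    " via Exchange Front-End Server webmail.afmc.af.mil\r\n"
    " ([131.28.34.85]) with\r\n"
    " SMTP id 0A8791F116E for <jgarriss@mitre.org>;\r\n"
    " Tue,\r\n"
    " 4 Jun 2013 14:03:24\r\n"
    " -0400 (EDT)\r\n"
)
fields = parse_received(folded)
```

The point of the sketch is only that pass 2 never sees a fold: once `unfold` has run, the tokenising logic is the same whether the sender wrapped the header or not.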
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, June 05, 2013 12:44 PM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org;
dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>;
 Tue, 4 Jun 2013 14:03:24 -0400 (EDT)
<xs:element name="Received_Header" dfdl:initiator="Received:%WSP*;"
            dfdl:terminator="%CR;%LF;">
  <xs:complexType>
    <xs:sequence dfdl:separator="%CR;%LF;%SP;" dfdl:separatorPosition="infix">
      <xs:element name="data" type="xs:string"
                  maxOccurs="unbounded" dfdl:lengthKind="delimited" />
    </xs:sequence>
  </xs:complexType>
</xs:element>
DFDL consumes the initiator then starts processing the content
of the header as an array of records. The CR+LF+SP are consumed as the
separator, because that is the longest match. The CR+LF (no SP) is consumed
as the terminator of the header. Clearly that only works if there is no
SP straight after the CR+LF for the first line of a header. So you don't
need a discriminator.
You will have to stitch the data together post-parse. I guess you could
make the sequence hidden and get DFDL to stitch together the data lines
into one long string via an element with dfdl:inputValueCalc.
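
As a rough illustration of the behaviour described above, the following Python sketch mimics what the schema does rather than using a DFDL processor. It consumes the initiator, treats CR+LF+SP as a separator (the longest match), treats a bare CR+LF as the terminator, and then stitches the resulting items back together. The function names `split_header` and `stitch` are hypothetical, and the handling of `%WSP*;` is simplified to spaces and tabs.

```python
CRLF = "\r\n"
SEPARATOR = CRLF + " "  # models dfdl:separator="%CR;%LF;%SP;"
TERMINATOR = CRLF       # models dfdl:terminator="%CR;%LF;"


def split_header(text: str, initiator: str = "Received:"):
    """Return (data_items, remaining_text) for the header at the start of text."""
    if not text.startswith(initiator):
        raise ValueError("initiator not found")
    pos = len(initiator)
    # Simplified %WSP*; : skip optional spaces or tabs after the initiator.
    while pos < len(text) and text[pos] in " \t":
        pos += 1
    items = []
    while True:
        end = text.find(CRLF, pos)
        if end == -1:
            raise ValueError("terminator not found")
        items.append(text[pos:end])
        if text.startswith(SEPARATOR, end):
            # Longest match wins: CR+LF+SP is a separator, so keep reading items.
            pos = end + len(SEPARATOR)
        else:
            # A bare CR+LF ends the header.
            return items, text[end + len(TERMINATOR):]


def stitch(items):
    """Post-parse step: each separator consumed one space, so put one back."""
    return " ".join(items)


sample = (
    "Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix)\r\n"
    " Tue, 4 Jun 2013 14:03:24 -0400 (EDT)\r\n"
    "Subject: example\r\n"
)
items, rest = split_header(sample)
one_line = stitch(items)
```

Note that this mirrors a limitation of the schema as written: a fold that uses a tab instead of a space after the CR+LF would be taken as the terminator.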
Ah - I think I see where Mike's earlier append to the mailing list was
coming from?
Regards
Steve Hanson
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 05/06/2013 16:25
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org
> Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the header
> record is firing prematurely when it encounters the CRLF in the data?
Exactly.
> I would model the data as unbounded repeating records, and use a discriminator
> to distinguish the repeats from the next header.
Uh, could you repeat that in English? Maybe with a small example?
I freely admit that I don’t understand what you just said. Thanks!
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, June 05, 2013 5:21 AM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org;
dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
James
Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the header
record is firing prematurely when it encounters the CRLF in the data?
If so then I'm not sure that DFDL can ignore the extra %CR;%LF; without
using an escape scheme - but there isn't an escape scheme to use.
I would model the data as unbounded repeating records, and use a discriminator
to distinguish the repeats from the next header.
Regards
Steve Hanson
From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 04/06/2013 19:56
Subject: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org
Long IMF headers, such as Received, can be wrapped onto the next line by
using a CRLF and then a space. This example has 3 such wrappings:
Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>;
 Tue, 4 Jun 2013 14:03:24 -0400 (EDT)
How do I get DFDL to ignore these wrappings? For most of the header,
it’s not an issue, because I can use a lengthPattern to lookahead to the
; before the date starts. But once the date starts, I have no way
of knowing when it ends, so I need to ignore any CRLF with a space.
TIA
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU