I've been trying to work out a way to do this in a single pass, but so far no luck. I think your conclusion is correct and it requires two passes.

The ability to handle formats like this is a goal of DFDL, it just didn't make the cut for 1.0 of the spec.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 07/06/2013 13:18
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org

I have received no response from this. :-(

Should I take this to mean that folding whitespace (FWS) in email headers is a problem that DFDL cannot handle? I think this is a reasonable conclusion, as FWS is conceptually identical to IMF comments, which we have also identified as a problem that DFDL cannot handle. I’m not trying to cast stones at DFDL, just trying to understand its limitations.

The solution for these two problems (and for encoded words as well) is for DFDL to support multiple passes over some (or maybe all) of the data. Is this a feature that is being considered for DFDL?

From: Garriss Jr., James P.
Sent: Wednesday, June 05, 2013 1:23 PM
To: dfdl-wg@ogf.org
Subject: RE: [DFDL-WG] Ignore extraneous CRLF w/ space?

Ok, so the good news is that I completely understand what you’re talking about now. Thanks for the example and explanation (with correction).

The bad news is that I don’t see how this helps me. IOW, I now have an “array” of “data” elements, but I need to validate the actual data. I should be breaking the header into “from” data, “by” data, “via” data, etc., with “date” data at the end. Something kinda like this:

<Received>
  <Tokens>
    <From>smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])</From>
    <By>localhost (Postfix)</By>
    <Via>Exchange Front-End Server webmail.afmc.af.mil ([131.28.34.85])</Via>
    <With>SMTP</With>
    <Id>0A8791F116E</Id>
    <For>&lt;jgarriss@mitre.org&gt;</For>
  </Tokens>
  <DateTime>
    <DateTimeStuff>
      <DayOfTheWeek>Tue</DayOfTheWeek>
      …more day time stuff here…
    </DateTimeStuff>
  </DateTime>
</Received>

I think we are saying this is another 2-pass problem. In other words, the data is this (only 1 CRLF at the end):

Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1]) by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue, 4 Jun 2013 14:03:24 -0400 (EDT)

But IMF adds what it calls “folding whitespace” to break it into multiple lines (see http://tools.ietf.org/html/rfc5322#section-3.2.2), like this:

Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,
 4 Jun 2013 14:03:24 -0400 (EDT)

So DFDL needs 2 passes to correctly parse the data, one to remove the CRLFs and another to parse/validate the data. (If you recall, Mike, we’ve found that DFDL has the same problem with IMF comments and encoded words.) And DFDL can’t do multiple passes.

That right?

If that’s right, this is a serious problem, because *many* headers use folding whitespace (though Received is probably the most important one). It’s a pretty core concept for IMF.
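The two passes described above (unfold first, then parse) can be sketched outside DFDL. This is only a minimal Python illustration, not something DFDL does: the function names `unfold` and `parse_received` are made up for this example, the clause splitter is deliberately naive (it assumes the keywords never appear as standalone words inside a value and that no comment contains a `;`), and the date is kept as an unvalidated string.

```python
import re

KEYWORDS = ("from", "by", "via", "with", "id", "for")

def unfold(raw: str) -> str:
    """Pass 1: drop every CRLF that is immediately followed by a space or tab."""
    return re.sub(r"\r\n(?=[ \t])", "", raw)

def parse_received(line: str) -> dict:
    """Pass 2: split an unfolded Received header into clauses plus a date."""
    name, _, body = line.partition(":")
    if name.strip().lower() != "received":
        raise ValueError("not a Received header")
    # The date is whatever follows the last ';' in the header body.
    tokens, _, date = body.strip().rpartition(";")
    alternation = "|".join(KEYWORDS)
    # A clause is a keyword, then text up to the next keyword or the end.
    clause = re.compile(
        rf"\b({alternation})\s+(.*?)(?=\s+(?:{alternation})\s|$)"
    )
    result = {key: value for key, value in clause.findall(tokens)}
    result["date"] = date.strip()
    return result

folded = (
    "Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil\r\n"
    " ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,\r\n"
    " 4 Jun 2013 14:03:24 -0400 (EDT)\r\n"
)

header = parse_received(unfold(folded))
```

The point of the sketch is that pass 2 never sees a fold, which is exactly what a single-pass description cannot arrange.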

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, June 05, 2013 12:44 PM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?

Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,
 4 Jun 2013 14:03:24 -0400 (EDT)

<xs:element name="Received_Header" dfdl:initiator="Received:%WSP*;" dfdl:terminator="%CR;%LF;">
  <xs:complexType>
    <xs:sequence dfdl:separator="%CR;%LF;%SP;" dfdl:separatorPosition="infix">
      <xs:element name="data" type="xs:string" maxOccurs="unbounded" dfdl:lengthKind="delimited" />
    </xs:sequence>
  </xs:complexType>
</xs:element>

DFDL consumes the initiator, then starts processing the content of the header as an array of records. Each CR+LF+SP is consumed as the separator, because that is the longest match. The CR+LF (no SP) is consumed as the terminator of the header. Clearly that only works if the first line of a header never starts with SP, so that the CR+LF ending the previous header is never followed by SP. So you don't need a discriminator.

You will have to stitch the data together post-parse. I guess you could make the sequence hidden and get DFDL to stitch together the data lines into one long string via an element with dfdl:inputValueCalc.
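To make the longest-match behaviour and the post-parse stitching concrete, here is a rough Python simulation of what the schema above does to the sample header. It is a sketch under assumptions, not a DFDL processor: `split_header` is a made-up helper, it only models the initiator, the CR+LF+SP separator and the CR+LF terminator, and joining the records with a single space is my assumption about how to restore the SP that the separator consumed.

```python
CRLF = "\r\n"

def split_header(raw: str, initiator: str = "Received:"):
    """Return (records, remainder) for the one header at the start of raw."""
    if not raw.startswith(initiator):
        raise ValueError("initiator not found")
    pos = len(initiator)
    # Models %WSP*; in the initiator: skip optional whitespace.
    while pos < len(raw) and raw[pos] in " \t":
        pos += 1
    records = []
    start = pos
    while True:
        end = raw.find(CRLF, start)
        if end == -1:
            raise ValueError("terminator not found")
        records.append(raw[start:end])
        after = end + len(CRLF)
        if raw.startswith(" ", after):
            # Longest match wins: CR+LF+SP is the separator, so keep going.
            start = after + 1
        else:
            # A bare CR+LF is the terminator of the header.
            return records, raw[after:]

raw = (
    "Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil\r\n"
    " ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,\r\n"
    " 4 Jun 2013 14:03:24 -0400 (EDT)\r\n"
    "Subject: next header\r\n"
)

records, remainder = split_header(raw)
# Post-parse stitching: put back the single SP each separator consumed.
stitched = " ".join(records)
```

Here `records` holds the four "data" items and `stitched` is the single logical line, which would still need a second parse to break it into from/by/via and so on.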

Ah - I think I see where Mike's earlier append to the mailing list was coming from?

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 05/06/2013 16:25
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org

> Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the header record is firing prematurely when it encounters the CRLF in the data?

Exactly.

> I would model the data as unbounded repeating records, and use a discriminator to distinguish the repeats from the next header.

Uh, could you repeat that in English? Maybe with a small example? I freely admit that I don’t understand what you just said. Thanks!

From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Wednesday, June 05, 2013 5:21 AM
To: Garriss Jr., James P.
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Ignore extraneous CRLF w/ space?

James

Is the problem that the dfdl:terminator '%CR;%LF;' for the end of the header record is firing prematurely when it encounters the CRLF in the data?

If so then I'm not sure that DFDL can ignore the extra %CR;%LF; without using an escape scheme - but there isn't an escape scheme to use.

I would model the data as unbounded repeating records, and use a discriminator to distinguish the repeats from the next header.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Garriss Jr., James P." <jgarriss@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 04/06/2013 19:56
Subject: [DFDL-WG] Ignore extraneous CRLF w/ space?
Sent by: dfdl-wg-bounces@ogf.org

Long IMF headers, such as Received, can be wrapped onto the next line by using a CRLF and then a space. This example has 3 such wrappings:

Received: from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])
 by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil
 ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,
 4 Jun 2013 14:03:24 -0400 (EDT)

How do I get DFDL to ignore these wrappings? For most of the header, it’s not an issue, because I can use a lengthPattern to lookahead to the ; before the date starts. But once the date starts, I have no way of knowing when it ends, so I need to ignore any CRLF with a space.
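The asymmetry can be illustrated with an ordinary regular expression in Python. This is only a sketch: the pattern below is a hypothetical stand-in for the kind of lookahead a lengthPattern might use, not the pattern from my actual schema. Everything before the `;` can be captured regardless of folds, but the date has no closing delimiter, so stopping at the first CRLF cuts it short.

```python
import re

raw = (
    "from smtpksrv1.mitre.org (localhost.localdomain [127.0.0.1])\r\n"
    " by localhost (Postfix) via Exchange Front-End Server webmail.afmc.af.mil\r\n"
    " ([131.28.34.85]) with SMTP id 0A8791F116E for <jgarriss@mitre.org>; Tue,\r\n"
    " 4 Jun 2013 14:03:24 -0400 (EDT)\r\n"
    "Subject: next header\r\n"
)

# Hypothetical length pattern: everything up to, but not including, the ';'.
# [^;] also matches CR and LF, so the folds inside this part do no harm.
TOKENS = re.compile(r"[^;]*(?=;)")

tokens = TOKENS.match(raw).group(0)
rest = raw[len(tokens) + 1:]  # skip the ';' itself

# Treating the first CRLF as the end of the header truncates the date,
# because that CRLF is a fold, not the real end.
naive_date = rest.split("\r\n", 1)[0].strip()
```

With this input `naive_date` comes out as just `Tue,`, which is the premature termination I am trying to avoid.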

TIA

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU