
I ran into an interesting format today called RINEX that is problematic for DFDL v1.0:

http://acc.igs.org/misc/rinex304.pdf

This format has a multi-line header with various kinds of header lines which are 80 characters, but characters 1 to 60 are data, and 61 to 80 are the "label", and you need the label in order to know how to parse the data of each header line.

This either requires deep (and slow) speculation, or some sort of look ahead feature.

E.g., see page 69 of that spec document for an example. The headers appear not only at the start of the file, but at the start of each data block.

We have run into the need for short, fixed-distance look ahead before in other formats. In the case of RINEX the lookahead is exactly 60 characters.

In other formats it is also a very short distance, but one which is tricky to figure out as the spec doesn't say exactly how big the region to be looked past is. One would have to figure it out from the data format spec. It's always a constant of course.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
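To make the layout described above concrete, here is a minimal Python sketch (not DFDL) of the 60 + 20 column split. The helper names `make_line` and `split_header_line` are hypothetical, and the sample data field is illustrative only, not copied from the spec.

```python
def make_line(data: str, label: str) -> str:
    """Build an 80-character header line: 60 columns of data, 20 of label."""
    return data.ljust(60) + label.ljust(20)


def split_header_line(line: str) -> tuple[str, str]:
    """Split one header line into (data, label) by fixed column position."""
    padded = line.rstrip("\r\n").ljust(80)  # tolerate trimmed trailing blanks
    return padded[:60], padded[60:80].strip()


# Illustrative sample line (assumed content, for demonstration only).
sample = make_line("     3.04           OBSERVATION DATA", "RINEX VERSION / TYPE")
data, label = split_header_line(sample)
print(repr(label))
print(repr(data.strip()))
```

The point of the sketch is that the label, which decides how the data columns must be interpreted, only becomes available after the 60 data columns have been passed over.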

Hi Mike

Happy New Year.

Perhaps I am missing something, but for this format I would use a discriminator with testKind="pattern" and a regex that skips past the first 60 bytes and looks at the next 20. A pattern discriminator is our way of peeking ahead at the data.

Regards
Steve Hanson
IBM Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-7717-378890
Note: I work Tuesday to Friday

-----Original Message-----
From: Mike Beckerle <mbeckerle@apache.org>
Reply-To: mbeckerle@apache.org
To: DFDL-WG <dfdl-wg@ogf.org>
Subject: [EXTERNAL] [DFDL-WG] interesting format: RINEX
Date: Tue, 20 Dec 2022 09:23:26 -0500

--
dfdl-wg mailing list
dfdl-wg@lists.ogf.org
https://lists.ogf.org/mailman/listinfo/dfdl-wg

Unless otherwise stated above:
IBM United Kingdom Limited Registered in England and Wales with number 741598 Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU
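The peek-ahead idea in the reply above (skip 60 positions, then test the label) can be sketched with an ordinary regex. This is a Python illustration of the matching logic only, not DFDL and not a Daffodil or IBM DFDL API. The function names, the two-entry `LABELS` list, and the sample text are assumptions for the example.

```python
import re

# Assumed subset of header labels, for illustration.
LABELS = ["RINEX VERSION / TYPE", "END OF HEADER"]


def label_pattern(label: str) -> re.Pattern:
    """Regex that skips exactly 60 characters, then requires the label text."""
    return re.compile(r".{60}" + re.escape(label))


def peek_label(text: str, pos: int):
    """Return the label of the header line starting at pos, consuming nothing."""
    for candidate in LABELS:
        # Pattern.match anchors at pos; '.' never crosses a newline, so a
        # short line cannot borrow characters from the line after it.
        if label_pattern(candidate).match(text, pos):
            return candidate
    return None


# Two 80-character sample lines (illustrative content), newline-terminated.
lines = [
    "     3.04".ljust(60) + "RINEX VERSION / TYPE",
    "".ljust(60) + "END OF HEADER".ljust(20),
]
text = "\n".join(lines) + "\n"

first = peek_label(text, 0)    # first line starts at offset 0
second = peek_label(text, 81)  # 80 characters plus one newline later
print(first, "|", second)
```

Because the skip distance is a fixed count, each candidate is a single anchored test at the current position, which corresponds to one pattern discriminator per branch of a choice.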

Ah, the magic hammer: regex.

Of course I should have thought of that, but I'm always looking to avoid things that involve backtracking.

Thanks

On Wed, Jan 4, 2023 at 6:39 AM Steve Hanson <smh@uk.ibm.com> wrote:
Hi Mike
Happy New Year.
Perhaps I am missing something, but for this format I would use a discriminator with testKind="pattern" and a regex that skips past the first 60 bytes and looks at the next 20. A pattern discriminator is our way of peeking ahead at the data.
participants (2)
- Mike Beckerle
- Steve Hanson