Ah, the magic hammer: regex.

Of course I should have thought of that, but I'm always looking to avoid things that involve backtracking.

Thanks



On Wed, Jan 4, 2023 at 6:39 AM Steve Hanson <smh@uk.ibm.com> wrote:
Hi Mike

Happy New Year.

Perhaps I am missing something, but for this format I would use a discriminator with testKind="pattern" and a regex that skips past the first 60 bytes and looks at the next 20. A pattern discriminator is our way of peeking ahead at the data. 
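
As an illustration of the kind of regex meant here, a minimal Python sketch (outside DFDL, so only the pattern itself carries over). The pattern, the helper name is_version_line and the sample line are assumptions for illustration; "RINEX VERSION / TYPE" is used as an example label.

```python
import re

# Hypothetical pattern: skip 60 characters of data, then require the label
# to begin at column 61.
VERSION_LINE = re.compile(r".{60}RINEX VERSION / TYPE")


def is_version_line(line: str) -> bool:
    """Return True if the label in columns 61-80 is the version label."""
    return VERSION_LINE.match(line) is not None


# Build an 80-character sample line: 60 columns of data, 20 of label.
sample = "     3.04           OBSERVATION DATA".ljust(60) + "RINEX VERSION / TYPE"
other = "".ljust(60) + "END OF HEADER".ljust(20)
```

Here is_version_line(sample) matches and is_version_line(other) does not; the match only peeks at the label and leaves the 60 data characters to be parsed afterwards.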
 
Regards
 
Steve Hanson

IBM Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
tel:+44-7717-378890
Note: I work Tuesday to Friday

-----Original Message-----
From: Mike Beckerle <mbeckerle@apache.org>
Reply-To: mbeckerle@apache.org
To: DFDL-WG <dfdl-wg@ogf.org>
Subject: [EXTERNAL] [DFDL-WG] interesting format: RINEX
Date: Tue, 20 Dec 2022 09:23:26 -0500

I ran into an interesting format today called RINEX that is problematic for DFDL v1.0:

 http://acc.igs.org/misc/rinex304.pdf

This format has a multi-line header made up of various kinds of header lines, each 80 characters long. Characters 1 to 60 are data and characters 61 to 80 are the "label", and you need the label in order to know how to parse the data of each header line.
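
To make the layout concrete, a minimal Python sketch of splitting one header line into its two fields. The function name split_header_line and the padding of short lines to 80 columns are assumptions for illustration, not something the RINEX spec is quoted on here.

```python
def split_header_line(line: str) -> tuple[str, str]:
    """Split a header line into (data, label).

    Columns 1-60 are the data, columns 61-80 are the label.
    Assumption: lines shorter than 80 characters are padded with spaces.
    """
    padded = line.rstrip("\r\n").ljust(80)
    return padded[:60], padded[60:80].strip()


# The label has to be read first, because it decides how the data is parsed.
line = "     3.04           OBSERVATION DATA".ljust(60) + "RINEX VERSION / TYPE"
data, label = split_header_line(line)
```

A general-purpose program can read the label first like this; a parser that consumes the line strictly left to right reaches the data before it knows which kind of line it is, which is the difficulty described above.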

This either requires deep (and slow) speculation, or some sort of look ahead feature. 

For example, see page 69 of that spec document. The headers appear not only at the start of the file, but also at the start of each data block.

We have run into the need for short, fixed-distance look ahead before in other formats. 
In the case of RINEX the lookahead is exactly 60 characters. 

In other formats the distance is also very short, but it is tricky to determine because the spec does not state explicitly how big the region to be looked past is. One has to work it out from the rest of the data format spec. It is always a constant, of course.

Mike Beckerle 
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


--
  dfdl-wg mailing list
  dfdl-wg@lists.ogf.org
  https://lists.ogf.org/mailman/listinfo/dfdl-wg 
Unless otherwise stated above:

IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU