
I've been dealing with a lot of data formats lately that need lookahead, and I'm trying to come up with a cleaner, easier, and less ad hoc way to handle them in DFDL to propose for the future. I wanted to bounce this idea off the workgroup before prototyping.

Idea: a property dfdlx:peek="yes/no".

It is a property on a sequence model group, and it is compatible with a hidden group ref. Unlike other properties, which are disallowed on a sequence with dfdl:hiddenGroupRef, this one could be allowed on hidden sequences. That's actually a primary use case for it.

If "no", there is no behavior change (what things do now).

If "yes", the parse happens, including set variable assignments, infoset creation, etc. Then at the end of the sequence, if the parse is successful, the position in the data stream is reset to where it was at the start of the peek sequence. If the parse fails, you backtrack to the enclosing PoU as normal, and everything about the peek (when inside the PoU) is discarded.

This allows you to learn by parsing something in the data more than once. The first pass discovers something, which goes into the parser infoset (hidden or not) and into single-assignment DFDL variables. The second pass can then parse making use of this learning. This is sort of like backtracking at a PoU, but you don't undo anything except the position in the data stream.

On unparsing, all data written while unparsing the infoset for a sequence with dfdlx:peek="yes" is discarded. Or maybe we can just say the infoset corresponding to a sequence with dfdlx:peek="yes" is not unparsed at all.

Implementations could put a limit on how far ahead you can peek, but a guaranteed minimum of, say, 512 bytes or maybe a bit bigger makes sense, I think. That would be enough for every use case I have.

I believe the current restrictions in DFDL that ensure forward progress when parsing are sufficient to make it impossible to delay parsing forever with this.
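
As a rough illustration of the parse-side behavior proposed above, here is a minimal Python sketch of a toy model. It is not Daffodil code and not part of the proposal: the names `ParseState`, `ParseError`, `peek_sequence`, `point_of_uncertainty`, and `learn_kind` are all hypothetical, and the dictionary and list merely stand in for DFDL variables and the infoset. It only shows the intended difference between a successful peek (position reset, learning kept) and a failure that backtracks to the enclosing PoU (everything discarded).

```python
from dataclasses import dataclass, field


class ParseError(Exception):
    """Raised when a toy parse step fails."""


@dataclass
class ParseState:
    data: bytes
    pos: int = 0                                    # position in the data stream
    variables: dict = field(default_factory=dict)   # stand-in for DFDL variables
    infoset: list = field(default_factory=list)     # stand-in for the infoset

    def read_byte(self, name):
        """Parse one byte into the infoset and advance the position."""
        if self.pos >= len(self.data):
            raise ParseError("out of data")
        value = self.data[self.pos]
        self.pos += 1
        self.infoset.append((name, value))
        return value


def peek_sequence(state, body):
    """Model of a sequence with peek="yes": keep the learning, reset the position."""
    start = state.pos
    body(state)          # a ParseError propagates to the enclosing PoU
    state.pos = start    # only the stream position is undone


def point_of_uncertainty(state, body):
    """Model of an enclosing PoU: on failure, discard everything body did."""
    saved = (state.pos, dict(state.variables), list(state.infoset))
    try:
        body(state)
        return True
    except ParseError:
        state.pos, state.variables, state.infoset = saved
        return False


def learn_kind(state):
    """Hypothetical peek body: read a leading byte and remember it in a variable."""
    state.variables["kind"] = state.read_byte("peekedKind")
```

For example, after `peek_sequence(ParseState(b"\x02\x10"), learn_kind)` the state's `pos` is back at 0 while `variables["kind"]` holds 2, so a second parse of the same bytes can use what the first pass learned.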
Even with peeking, parsing can take a long time, but it still has to terminate (at least in theory, if there is enough memory for a big infoset).

I think this dfdlx:peek has some nice properties.

Pro: This is the most important one: no specialized constructs for looking ahead. Just use DFDL to learn about the data, and save what you learn in variables or in a piece of infoset that you can navigate with expressions to utilize the knowledge.

Pro: Composition properties are good. Nothing new to learn. I can think of no impact on backtracking or any other aspect.

Pro: Pretty cheap to implement, so long as the amount you can peek ahead is reasonably bounded.

Pro: Synergistic with existing things like newVariableInstance and hidden groups to capture learning from a peek ahead.

Con: Huge generality. Peeking ahead with a sequence that has really rich sub-structure, PoUs and backtracking inside it, etc. is all enabled by this feature, but none of the use cases I have need anything like that level of generality. This is one of those things where what people will invent with it is unanticipated.

All thoughts / musings are welcome.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com