I have been concerned for a while that combinations of the features in DFDL v1.0 can be used in ways that are just far too powerful for what is needed for data format expression.
For example, if the input is n bits, how large should the corresponding infoset be when the data is parsed using a DFDL schema? Clearly it must be allowed to reach O(n), since there are n bits and each bit could become its own element in the infoset.
But should it be able to reach O(n^2)?
I can think of no use case for this. Intuition about data formats suggests that the infoset should always be linear in the size of the input data, with a relatively small constant factor.
Furthermore, the time taken to parse should also be linear in the size of the data. Many people expect DFDL parsing to be implementable in hardware, as a circuit, with memory requirements proportional to the data size.
It doesn't make sense to try to make hardware implementations of DFDL if that's not going to be the case.
It turns out that in DFDL v1.0, combining dfdl:occursCountKind='expression' with dfdl:inputValueCalc can create an infoset that is super-exponentially larger than the input data. Yes, it is finite, and bounded by some function of the size of the input data. But we don't even have common notation for this kind of numeric growth.
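To make the mechanism concrete, here is a minimal sketch of the kind of construction I mean. It is NOT the schema in the PR: the element names are invented, it is only a fragment (it would sit inside the root element's sequence), and the usual dfdl:format defaults and namespace declarations are omitted. The key facts are that an element with dfdl:inputValueCalc consumes no input bits, and an array with dfdl:occursCountKind='expression' takes its count from an expression rather than from the data, so counts can be compounded from one array to the next without consuming anything:

  <!-- the only real data: a 3-bit unsigned integer -->
  <xs:element name="n" type="xs:unsignedInt"
              dfdl:representation="binary"
              dfdl:lengthKind="explicit" dfdl:length="3"
              dfdl:lengthUnits="bits"/>

  <!-- n occurrences, each consuming zero bits -->
  <xs:element name="a" maxOccurs="unbounded"
              dfdl:occursCountKind="expression"
              dfdl:occursCount="{ ../n }"
              dfdl:lengthKind="implicit">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="v" type="xs:int" dfdl:inputValueCalc="{ 0 }"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

  <!-- count(a) squared occurrences, still consuming zero bits -->
  <xs:element name="b" maxOccurs="unbounded"
              dfdl:occursCountKind="expression"
              dfdl:occursCount="{ fn:count(../a) * fn:count(../a) }"
              dfdl:lengthKind="implicit">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="v" type="xs:int" dfdl:inputValueCalc="{ 0 }"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Each additional level like 'b' squares the previous count while consuming no further input; the schema in the PR pushes this same idea much harder to get the super-exponential growth.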
The schema I created consumes a 3-bit integer as the data. The size of the infoset, counted in elements, grows like this:
- input 000 -> infoset size 1
- input 001 -> infoset size 1
- input 010 -> infoset size 256
- input 011 -> infoset size over 10,000
- ...
- input 111 -> infoset size larger than the number of atoms in the universe.
There are TDML tests for this schema that illustrate this (except for that last case).
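A parser test of this kind looks roughly like the following sketch (the test name, root element name, and model file name here are made up, namespace declarations are omitted, and the expected infoset is elided because it is far too large to write out by hand):

  <tdml:parserTestCase name="blowup_010" root="r" model="superexp.dfdl.xsd"
      description="3-bit input 010 parses to a 256-element infoset">
    <tdml:document>
      <tdml:documentPart type="bits">010</tdml:documentPart>
    </tdml:document>
    <tdml:infoset>
      <tdml:dfdlInfoset>
        <!-- expected infoset elided; for input 010 it already contains 256 elements -->
      </tdml:dfdlInfoset>
    </tdml:infoset>
  </tdml:parserTestCase>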
But it is always finite! No recursion is required to do this. With infinite space and time available it WILL terminate. I believe the language remains theoretically decidable.
The README in this schema project discusses this and other pathological cases, and suggests things we can do to restrict DFDL so that this kind of pathological behavior is not allowed.
Really we want the largest infoset that can be created from
- n bits of data
- a schema of size s
to be O(sn) in size. I have started an argument in the README about why O(sn) is the right target: basically, there are practical schemas whose infosets genuinely reach sizes on the order of sn (for example, formats where each parsed item carries several computed fields, so the infoset grows with both the data size and the schema size), so the target cannot be made any smaller.
I believe this can be achieved by adding more forward-progress requirements to DFDL - all arrays should have to make forward progress when parsing, not just unbounded arrays, and reused groups/types should have to make forward progress at points of reuse. By doing that, we ensure that to make a giant infoset, you have to be consuming a giant number of bits.
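As a sketch of what that would mean in practice (again with made-up names, and only a fragment): the zero-bit array shown earlier becomes acceptable once each occurrence is required to consume at least some input, because k occurrences then require at least k bits of data:

  <xs:element name="a" maxOccurs="unbounded"
              dfdl:occursCountKind="expression"
              dfdl:occursCount="{ ../n }"
              dfdl:lengthKind="implicit">
    <xs:complexType>
      <xs:sequence>
        <!-- at least one parsed bit per occurrence, so the number of
             occurrences can never exceed the number of input bits -->
        <xs:element name="payload" type="xs:unsignedInt"
                    dfdl:representation="binary"
                    dfdl:lengthKind="explicit" dfdl:length="1"
                    dfdl:lengthUnits="bits"/>
        <xs:element name="v" type="xs:int" dfdl:inputValueCalc="{ 0 }"/>
      </xs:sequence>
    </xs:complexType>
  </xs:element>

Under the proposed rule, the earlier zero-bit version of 'a' (where every child is computed and nothing is consumed) could no longer inflate the infoset, because each occurrence that consumes nothing would violate the forward-progress requirement.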
Right now this lives in a personal GitHub repo, but I'll give it a better, more official home later, either in DFDLSchemas or as part of Apache Daffodil.
If this is of interest, please take a look at the GitHub pull request below.
Illustrates DFDL issues where parsing small data can lead to explosively larger parse output: super-exponentially larger.
The expressive power of DFDL should be restricted so that this is not possible.
See the README.md for discussion of the issue.
You can view, comment on, or merge this pull request online at:
https://github.com/mbeckerle/superexp/pull/1
Commit Summary
- 46dbae6 First version of this example schema