[dfdl-wg] Usage scenarios

28 Sep 2005

Hi Bob,

Here is my list of ways in which DFDL could be used - I have probably
missed some but here's enough to kick off with. If you need more details
on any of the real-world examples let me know.

Cheers,

Martin

Description
-----------

In QCD physics, independent research groups from all over the world have
data which is always a 4D array of floating-point values. However,
different groups have different standards for precision, dimension
order, and byte order. It would be useful for them to have a simple,
canonical XML language for describing the format of a file. In the first
instance this need only be human readable.

Archiving
---------

Data needs to be stored, but the programs and systems for reading it
become obsolete. DFDL provides a valuable possibility of describing all
the details of a particular format so that even if there were no
programs able to read a format, the description (and the standard) would
provide sufficient information to access archived data. (There are lots
of examples of this type; they might include atmospheric measurements.)

A sophistication of this is that archived data may need to be
transformed as it is moved to up-to-date physical media (changes of
precision, etc.). It would be nice if DFDL could (a) annotate these
changes and (b) (perhaps) be used to ensure that the changes did not
result in data loss.
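
As a rough illustration of point (b), the sketch below checks whether a
list of double-precision values would survive being narrowed to 32-bit
floats during a migration. The function names are hypothetical and this
is plain Python, not anything defined by DFDL; it only shows the kind of
round-trip test a description-aware tool could run.

```python
import struct


def survives_as_float32(value):
    """Return True if `value` is unchanged by a round trip through float32."""
    # Pack as a 4-byte IEEE float, then unpack back to a Python float (double).
    (narrowed,) = struct.unpack("<f", struct.pack("<f", value))
    return narrowed == value


def check_migration(values):
    """Return the indices of values that would lose precision as float32."""
    return [i for i, v in enumerate(values) if not survives_as_float32(v)]


# 0.5 and 2.0 are exactly representable in float32; 0.1 is not.
lossy = check_migration([0.5, 0.1, 2.0])
```

Here `lossy` lists the positions that a migration tool would need to
flag before reducing the precision of the archive.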

Format abstraction
------------------

At the simplest level the QCD physicists (described above) would like to
be able to have a single API that would allow them to read any described
piece of data, and carry out all the transformations required to ensure
that they get the correct array in memory.
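
A minimal sketch of what such an API might do, assuming a description
that records only precision, byte order, dimension order and extents.
The dictionary keys and the function name `read_array` are invented for
illustration and are not DFDL syntax; the canonical index order
(t, x, y, z) is likewise an assumption.

```python
import struct
from itertools import product

CANONICAL = ("t", "x", "y", "z")  # assumed in-memory index order


def read_array(raw, desc):
    """Decode `raw` bytes into a dict keyed by canonical (t, x, y, z) index."""
    code = {32: "f", 64: "d"}[desc["precision"]]         # bits -> struct code
    endian = {"big": ">", "little": "<"}[desc["byte_order"]]
    order = desc["dimension_order"]                      # slowest-varying first
    shape = desc["shape"]                                # dimension name -> extent

    count = 1
    for name in order:
        count *= shape[name]
    values = struct.unpack(f"{endian}{count}{code}", raw)

    out = {}
    # product() varies the last dimension fastest, matching the file layout.
    for flat, idx in enumerate(product(*(range(shape[n]) for n in order))):
        named = dict(zip(order, idx))
        out[tuple(named[n] for n in CANONICAL)] = values[flat]
    return out


# One group's file: big-endian 32-bit floats stored in z, y, x, t order.
desc = {
    "precision": 32,
    "byte_order": "big",
    "dimension_order": ("z", "y", "x", "t"),
    "shape": {"t": 1, "x": 1, "y": 2, "z": 2},
}
raw = struct.pack(">4f", 0.0, 1.0, 2.0, 3.0)
data = read_array(raw, desc)
```

A second group with little-endian doubles in t, x, y, z order would
change only `desc`; the calling code would still index `data` by
(t, x, y, z).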

I have examples of potential users who are just interested in describing
byte-order in a standard way.

At the next level we would like to supply a high level DFDL description
that captures a standard view of the data, and have generic DFDL logic
that can transform an existing DFDL-described format into this generic
view. This is one of the primary motivations for "layers" in the
standard. It is a very powerful feature but it introduces scoping
issues: what transformations can DFDL not describe? And what
transformations can DFDL not describe efficiently?

Generic data access
-------------------

A DFDL library should provide the ability to interrogate a data
description and read all aspects of the data into memory. An example of
a generic tool is a browser that will allow arbitrary DFDL-described
data to be displayed in some sensible human-readable form. This case
requires the standard to specify an API for reading and interrogating
the data. The favoured suggestion for this is to extend DOM/SAX to allow
the reading of data fields directly into in-memory types (float, int,
char, etc.).
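
To make the idea concrete, here is a small sketch of a SAX-style event
stream that delivers native values instead of strings. It is not an
existing DOM/SAX extension: the function `typed_events` and the
(name, format) field list are assumptions made for this example, with
Python's `struct` format codes standing in for a real description.

```python
import struct


def typed_events(raw, fields):
    """Yield (field_name, native_value) pairs in document order."""
    offset = 0
    for name, fmt in fields:
        # unpack_from decodes one field directly into an int or float.
        (value,) = struct.unpack_from(fmt, raw, offset)
        offset += struct.calcsize(fmt)
        yield name, value


# Hypothetical record: a big-endian 32-bit int followed by a 64-bit float.
fields = [("count", ">i"), ("energy", ">d")]
raw = struct.pack(">id", 7, 2.5)
events = list(typed_events(raw, fields))
```

A generic browser could walk such events to display any described
record without ever converting the numbers to text and back.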

Data queries
------------

The DFDL description implies an associated XML document. This document
can be queried using XPath/XQuery to extract pieces of data.
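
For illustration, suppose the implied document for a small data set
looked like the XML below (the element and attribute names are made up
for this sketch). A query over it can then be written with the limited
XPath support in Python's standard `xml.etree.ElementTree` module:

```python
import xml.etree.ElementTree as ET

# Hypothetical implied XML document; a real one would be derived from
# the DFDL description rather than written out by hand.
IMPLIED = """<lattice>
  <site t="0" x="0"><value>0.5</value></site>
  <site t="0" x="1"><value>1.5</value></site>
  <site t="1" x="0"><value>2.5</value></site>
</lattice>"""

root = ET.fromstring(IMPLIED)

# Select the values of every site on time slice t=0.
selected = [float(v.text) for v in root.findall("./site[@t='0']/value")]
```

In this sketch the values come back as strings and are converted by
hand, which is exactly the step a typed query result would avoid.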

[Note: If the data comes back as an XML-XPath result then this process
is straightforward. With BinX we tried to return the data in a similar
format to the one it is represented in, with an accompanying description.
We found a number of issues arose in this case that may or may not also
arise for DFDL.]

Data annotations
----------------

The same XPath/XQuery expressions that can be used to query a document
can provide external (format independent) annotations. For example NASA
stores photographic images of hurricanes. A scientist can identify a
blob of pixels that corresponds to the hurricane in an image. They would
like to store this annotation in such a way that it will be preserved
through future transformations (e.g. new image format, different
pixel depth, or compression level). Note: the point here is that a byte
offset into the image data cannot do this.
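
The difference can be sketched as follows. The `Layout` class and its
fields are hypothetical stand-ins for two described image formats; the
annotation is held as logical (row, column) pixel coordinates, and byte
offsets are derived from whichever layout is current.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layout:
    """Assumed uncompressed, row-major image layout."""
    header_bytes: int
    width: int
    bytes_per_pixel: int

    def offset(self, row, col):
        """Byte offset of pixel (row, col) under this layout."""
        return self.header_bytes + (row * self.width + col) * self.bytes_per_pixel


# The annotation refers to pixels logically, not by byte position.
annotation = {"label": "hurricane", "pixels": [(2, 3), (2, 4)]}

old = Layout(header_bytes=16, width=10, bytes_per_pixel=1)  # 8-bit pixels
new = Layout(header_bytes=64, width=10, bytes_per_pixel=2)  # 16-bit pixels

old_offsets = [old.offset(r, c) for r, c in annotation["pixels"]]
new_offsets = [new.offset(r, c) for r, c in annotation["pixels"]]
```

The same annotation resolves to different bytes under each layout, so a
stored byte offset would go stale after the conversion while the logical
reference stays valid.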

XML without the tags
--------------------

There are groups who would like to use DFDL as a sort of cheap data
compression technique. An example here is particle physics collision
data. This is stored as a set of sparse (hence variable sized) trees of
results. The data consists of richly structured trees and they would like to
access it and talk about it as if it were in XML but they don't want to
(cannot afford to) represent it using XML markup or use conventional XML
tools to parse it.

The idea is that such a group would design a new binary format that
could be described in XML and then they would work with the implied XML
data. Note: naturally these folks do not want to access their floating
point values as strings so they would want the sort of DOM extensions
that we alluded to earlier. For this same reason things like Binary XML
do not solve their problem.

Another example comes from the astronomy community, which has recently
moved from a long-standing binary data format (FITS) to an XML version
(VOTable). FITS was very rich in metadata but also included binary
images and large tables of observational data. VOTable is great for
capturing the metadata in a standard way but leads to excessive bloat
for images and large tables. The community has ended up with a
complicated compromise in which they allow raw binary data at the
bottom of the XML file. A DFDL-described format could provide a
cleaner solution.

Martin Westhead