I agree element references are useful in the way you state, and we've built many DFDL schemas in this manner. It's a nice way to build XML Schemas, but it really has little to do with DFDL, other than being a pattern that encourages unit testing of individual record types, which is good practice.

The rest of this is probably TL;DR, but it explains why we don't need any change to DFDL, just a flag in Daffodil to escalate a particular warning to an SDE.

The challenge comes when you need to describe data in DFDL and then pass it to something that does not accept XML, or that treats XML as a giant string. In that case you really want to map the DFDL infoset directly to the native data structures rather than use an XML string. And... those native data structures have simple names, not namespaced names.

For example, suppose you define and populate Java POJOs from DFDL-described data. Types (classes) can live in packages with complex names that are like namespaces. But element names (members of classes) must be simple names, with allowed characters like "A-Za-z0-9_". You can make big long "simple" names, but that's undesirable.

If a DFDL (or XSD) Schema has two elements that are peer children within the same parent, and they differ in QName only by namespace, XSD has no issue with it.

Daffodil will issue a warning (which can be suppressed) that this will be incompatible with data that has no namespaces. If you then convert such data to, say, JSON, the conversion will happily populate that same local name with different things, resulting in data that cannot be unparsed, can't be reliably queried, can't have a JSON-Schema, etc.
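
To make the collision concrete, here is a minimal Python sketch. It is not Daffodil code; the namespaces, element names, and the naive conversion function are all made up for illustration. It keys each child by local name only, which is roughly what a namespace-unaware converter might do.

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical infoset: two sibling elements share the local name "value"
# but live in different (made-up) namespaces.
INFOSET = """
<record xmlns:a="urn:example:a" xmlns:b="urn:example:b">
  <a:value>1</a:value>
  <b:value>two</b:value>
</record>
"""

def local_name(tag):
    """Strip the '{namespace}' part that ElementTree puts in front of a tag."""
    return tag.rsplit("}", 1)[-1]

def to_namespace_free_dict(elem):
    """Naive conversion (an assumption, not a real tool) keyed by local name only."""
    children = list(elem)
    if not children:
        return elem.text
    result = {}
    for child in children:
        # Siblings with the same local name silently overwrite each other.
        result[local_name(child.tag)] = to_namespace_free_dict(child)
    return result

root = ET.fromstring(INFOSET)
as_dict = {local_name(root.tag): to_namespace_free_dict(root)}
print(json.dumps(as_dict))  # prints {"record": {"value": "two"}}
```

The a:value element is gone from the output, so the result can no longer be unparsed back to the original data.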

If the two elements have different namespace prefixes, one could append the prefix to the element name, and define a Java POJO class based on the global element declaration, using the namespace and prefix to create a unique class name.

But it is possible for the two elements to have the same prefix - as it can be redefined in any enclosing context. In that case one must generate a unique name given the namespace - probably by just adding a numeric suffix to the prefixes to make those prefixes unique.
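
A rough Python sketch of that renaming idea follows. The function name, the input triples, and the "prefix_localname" naming convention are all hypothetical choices for illustration; nothing here is an existing Daffodil feature.

```python
def unique_simple_names(qnames):
    """Map (prefix, namespace, local_name) triples to namespace-free names.

    Each distinct namespace gets one prefix. When a prefix is already taken
    by a different namespace, a numeric suffix is added to the prefix.
    """
    prefix_for_ns = {}     # namespace URI -> prefix made unique
    used_prefixes = set()  # prefixes already handed out
    names = []
    for prefix, namespace, local in qnames:
        if namespace not in prefix_for_ns:
            candidate, n = prefix, 1
            while candidate in used_prefixes:
                n += 1
                candidate = f"{prefix}{n}"
            used_prefixes.add(candidate)
            prefix_for_ns[namespace] = candidate
        names.append(f"{prefix_for_ns[namespace]}_{local}")
    return names

# Made-up example: prefix "v" is bound to two different namespaces.
example = [
    ("v", "urn:example:v1", "header"),
    ("v", "urn:example:v2", "header"),
    ("w", "urn:example:v1", "body"),
]
print(unique_simple_names(example))  # prints ['v_header', 'v2_header', 'v_body']
```

The names are unique for this input, but a consumer now has to know that "v2_header" and "v_header" are the same logical element. The sketch also does not guard against every possible collision, for example prefixes that themselves contain underscores.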

So it is possible (though a bit complex) to work around this problem by generating unique names. However, doing so makes the data harder to use.

As an example, Apache Drill is a "use SQL on anything" tool. We've built an integration (mostly, not quite done) which allows it to query any data described by a DFDL schema in combination with any of the other databases and types it can query.

But its data model does not have element namespaces. For now we just fail if you have two elements that differ only by namespace. I.e., your DFDL schema is considered not suitable for Drill querying, and we suggest you change the schema.

To avoid this in advance, I'm thinking of a Daffodil flag that escalates this name conflict warning (same local name, different namespaces) to an SDE, so that people will proactively get rid of it.

Unfortunately, we have found this element name problem sneaks into schemas. It occurs naturally if you are trying to create schemas that simultaneously handle multiple versions of the same data format. You end up wanting the same element name in one branch of a choice, in a namespace for version 1, and in another branch of the choice, in a namespace for version 2, where the choice is discriminated by the version information.

There is no getting around this: when querying such data, a query (such as XPath) can only be polymorphic over versions if it is able to ignore the namespace part and use only the local name of the element. This can be done in XQuery or XPath using predicates that match on fn:local-name(). DFDL expressions cannot do this, as we only allow indexing in predicates.
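
As an illustration of a version-polymorphic query, here is a small Python sketch. The document and namespaces are made up. Python's ElementTree does not implement fn:local-name(), so the sketch uses its `{*}` namespace wildcard (available in Python 3.8 and later), which plays the same role as the XPath expression `//*[local-name()='amount']`.

```python
import xml.etree.ElementTree as ET

# Hypothetical data: the same logical element in a version 1 namespace
# and in a version 2 namespace.
DOC = """
<messages xmlns:v1="urn:example:v1" xmlns:v2="urn:example:v2">
  <v1:payment><v1:amount>10</v1:amount></v1:payment>
  <v2:payment><v2:amount>25</v2:amount></v2:payment>
</messages>
"""

root = ET.fromstring(DOC)

# "{*}amount" matches elements whose local name is "amount" in any namespace,
# so one query covers both versions.
amounts = [int(e.text) for e in root.findall(".//{*}amount")]
print(amounts)  # prints [10, 25]
```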

On Tue, May 7, 2024 at 3:51 PM Steve Hanson <smhdfdl@gmail.com> wrote:

Mike

IBM DFDL as used by ACE has supported element refs since day one. They are really useful, as shown in the DFDL schemas for EDIFACT. Each EDIFACT message is a global element, so can be parsed on its own. But there is also the EDIFACT interchange global element, which is a collection of EDIFACT messages, so the natural approach is to use element refs to pull in the EDIFACT messages.

I'll try and join on Thursday but I am away Wed and Thurs, it all depends when I get home.

Regards
Steve

On Mon, May 6, 2024 at 11:20 PM Mike Beckerle <mbeckerle@apache.org> wrote:
I'm interested in which DFDL implementations support element references.

IBM ACE?
IBM zTPF?
DFDL4Space?

Can you let me know whether these implementations support element refs?

The reason I ask is below, which may be of interest or perhaps TL;DR.

We support element references in Daffodil, but I'm coming around to the view that element refs are a bad idea in DFDL schemas.

They add no expressive power for describing any specific data format. That suggests we should have left them out of DFDL, but for some reason we didn't.

The problem is that most data languages have nothing like element references, nor the element namespace management complexity that comes with them.

So as soon as you want to use a DFDL schema but not use it to interchange data as XML, element refs become a problem.

I'm playing around with a best practice/subset/profile suggestion where:

* The only global element declarations in the schema are for root elements.
* Element references are disallowed.
* The root elements are declared in a root schema file that contains ONLY the root elements.
* Root elements should always be declared by one-liners like this: `<element name="rootElement" type="prefix:rootElementType"/>`
* The root elements schema file has no target namespace.
* All group, type, and DFDL format/escapeScheme/variable definitions must be declared in different schema files that may (and probably should) have a target namespace.

The benefit of these restrictions is that the elements nested within a DFDL infoset never have any namespaces.
This makes them compatible with non-namespaced data systems like JSON, Apache Drill, Apache NiFi, generated C code, etc.
This makes integration with those things *massively* simpler.

Such schemas are still easily reused by reusing the type of the root element, so there is no need to ever use an element reference, and a nice composition property occurs - you don't need element references to assemble schemas from component schemas, and the assembled component has the same characteristic.

There are a few other things this discipline also simplifies. Reusing test data becomes simpler if namespace URIs aren't getting embedded in every test infoset XML file, for example.

All comments are welcome.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
