possible issue - can't use pattern to validate DFDL content containing newlines

Since data can contain any characters, the DFDL infoset allows any characters. However, XML does not allow any characters.

Furthermore, XML Schema pattern facets are expressed using this XML Schema fragment:
<xs:pattern value="...some regex pattern here ..."/>

But XML attributes are normalized by XML readers/parsers. Line endings in them are converted to single spaces. So a pattern whose value attribute has a line ending between "abc" and "def" is equivalent to:
<xs:pattern value="abc def"/>

Furthermore, the same value with the line ending written as a character entity is also normalized to:
<xs:pattern value="abc def"/>

As far as I can tell there is no alternate notation for this.

This means that if you want to use a pattern facet to specify that a DFDL infoset string can contain A-Za-z0-9, spaces, and line endings, there is no way to express this.

The example I was dealing with was an xs:pattern whose value is a character class of A-Za-z0-9, a space, and line-ending characters, followed by "*". If you look at the string for the value attribute of this pattern element, that string already has the line-ending characters converted into spaces. The attribute value is "[A-Za-z0-9   ]*" which has 3 spaces before the "]".

I think there is no workaround for this in XML, XSD, or DFDL.

I dug into the Daffodil implementation, and in the code that accesses this attribute you don't even get a NodeSeq containing a mixture of Text and Entity nodes. You just get a single Text node. So it is pretty well hopeless without reaching into the XML parser/reader's guts.

Hence, in DFDL, if you want to "validate" with a regex that a DFDL string contains content that includes line endings, you have to use dfdl:assert with failureType="recoverableError", testKind="pattern", and testPattern set to the regex of interest. This is then a DFDL regex, which is a Java regex, and you can be explicit about which line endings are allowed.

You can't do it with a pattern facet.

Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Owl Cyber Defense | www.owlcyberdefense.com Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
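
The normalization described in the message above can be reproduced outside Daffodil. The following is a minimal sketch, assuming Python's standard xml.etree.ElementTree parser as a stand-in for the XML reader that Daffodil actually uses (it is not Daffodil's loader). It shows only the literal line-break case.

```python
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# The value attribute contains a literal line break between "abc" and "def".
doc = '<xs:pattern xmlns:xs="%s" value="abc\ndef"/>' % XS

value = ET.fromstring(doc).get("value")

# Attribute-value normalization has replaced the line break with one space,
# so the regex engine never sees a line-ending character.
print(repr(value))
```

By the time application code reads the attribute, the line break is already gone, which matches the single Text node observation above.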

Never mind. I figured this out. Just slow today I guess.

You can't put the line-ending characters themselves into the value attribute, but you can use \n in the regex. And similarly \r, \t, etc.

The pattern in question is just:
<xs:pattern value="[A-Za-z0-9 \n\r]*"/>

(In reply to the message of Wed, Mar 3, 2021 at 6:32 PM from Mike Beckerle <mbeckerle.dfdl@gmail.com>.)
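
A minimal sketch of why the \n and \r escapes work where line-ending characters do not: the backslash sequences are ordinary characters to the XML parser, so they pass through attribute normalization untouched and are only interpreted later by the regex engine. The sketch assumes Python's xml.etree.ElementTree and re modules as stand-ins for the XML reader and the XSD regex engine. XSD and Python regex dialects differ in other details, and fullmatch is used here to mimic the implicit anchoring of pattern facets.

```python
import re
import xml.etree.ElementTree as ET

XS = "http://www.w3.org/2001/XMLSchema"

# Raw string: the backslashes reach the XML parser as ordinary characters.
doc = r'<xs:pattern xmlns:xs="%s" value="[A-Za-z0-9 \n\r]*"/>' % XS

value = ET.fromstring(doc).get("value")

# The escapes are still present in the attribute value after parsing.
print(value)

# The regex engine, not the XML parser, turns \n and \r into line endings.
pattern = re.compile(value)
matches_multiline = pattern.fullmatch("line 1\r\nline 2\n") is not None
rejects_punctuation = pattern.fullmatch("line 1!\n") is None
print(matches_multiline, rejects_punctuation)
```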