Re: [DFDL-WG] DFDL regular expressions and Unicode

Update: I just found errata 3.29, which answers this question, I think.
From the description in the errata, and from the documentation for Java 7 regular expressions, it looks like DFDL regular expressions conform to Level 1 of Unicode Regular Expressions (UTS #18).
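A few Level 1 features can be probed directly in Java. The sketch below is only an illustration, not a conformance test: it assumes Java 7 or later and `java.util.regex`, the class name `Level1Probe` and the helper `matches` are made up for this example, and it says nothing about ICU or about what a DFDL processor actually does.

```java
import java.util.regex.Pattern;

// Hypothetical probe class; the name and helper are illustrative only.
public class Level1Probe {
    // Returns true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // UTS #18 RL1.2 (properties): general category "uppercase letter"
        // tested against U+00C9 (LATIN CAPITAL LETTER E WITH ACUTE).
        System.out.println(matches("\\p{Lu}", 0, "\u00C9"));

        // RL1.2 again: script property, using the Java 7+ "Is" prefix,
        // tested against U+03B1 (GREEK SMALL LETTER ALPHA).
        System.out.println(matches("\\p{IsGreek}", 0, "\u03B1"));

        // RL1.7 (code points): "." should consume the supplementary
        // character U+1D538 (a surrogate pair in UTF-16) as one unit.
        System.out.println(matches(".", 0, "\uD835\uDD38"));

        // \w is ASCII-only by default, so U+00E9 is not a word character...
        System.out.println(matches("\\w", 0, "\u00E9"));

        // ...unless the Java 7 UNICODE_CHARACTER_CLASS flag is set.
        System.out.println(matches("\\w", Pattern.UNICODE_CHARACTER_CLASS, "\u00E9"));
    }
}
```

On a Java 7+ runtime the fourth probe should print `false` and the others `true`, which shows that some Unicode-aware behavior depends on flags rather than being on by default.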
I still think there would be value in stating such conformance in the DFDL spec, but I suppose that would take some legwork for someone to actually confirm the conformance of ICU and Java 7 to Level 1.
Very respectfully,
-- Jonathan Cranford
-----Original Message-----
From: Cranford, Jonathan W.
Sent: Friday, July 05, 2013 1:36 PM
To: dfdl-wg@ogf.org
Subject: DFDL regular expressions and Unicode
I've been going through the spec recently, and I have a few questions about DFDL regular expressions.
Rather than put them into one long email, I'll break them up into separate emails.
First question: What level of conformance to Unicode Technical Standard #18 UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
For example,
* XML Schema regular expressions are "targeted at support of 'Level 1' features" (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
* Java 1.4 regular expressions "implement its second level of support" (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
* Perl 5.18 seems to implement most of Level 1 (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
I think the conformance level should be specified in the DFDL spec so that it is clear to schema designers what a regular expression would really match against. Details like case conversion and canonical equivalence make a difference when matching against a Unicode string.
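To make the point about case conversion and canonical equivalence concrete, here is a minimal sketch using `java.util.regex` on Java 7 or later. The class name `UnicodeMatchDemo` and the helper `matches` are hypothetical, and the behavior shown is Java's; whether a given DFDL processor behaves the same way is exactly the open question.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; the name and helper are illustrative only.
public class UnicodeMatchDemo {
    // Returns true if the whole input matches the regex under the given flags.
    static boolean matches(String regex, int flags, String input) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    public static void main(String[] args) {
        // The same visible character, encoded two ways:
        String decomposed = "a\u030A";  // 'a' followed by COMBINING RING ABOVE
        String precomposed = "\u00E5";  // LATIN SMALL LETTER A WITH RING ABOVE

        // Without canonical equivalence the code points differ, so no match.
        System.out.println(matches(decomposed, 0, precomposed));

        // With CANON_EQ the two forms are treated as equivalent.
        System.out.println(matches(decomposed, Pattern.CANON_EQ, precomposed));

        // Case-insensitive matching is ASCII-only by default, so the
        // lowercase U+00E9 does not match the uppercase U+00C9...
        System.out.println(matches("\u00E9", Pattern.CASE_INSENSITIVE, "\u00C9"));

        // ...but it does once UNICODE_CASE is added.
        System.out.println(matches("\u00E9",
                Pattern.CASE_INSENSITIVE | Pattern.UNICODE_CASE, "\u00C9"));
    }
}
```

The four probes should print `false`, `true`, `false`, `true`. The same pattern and the same data give different answers depending on options that a schema author cannot see unless the spec states them.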
Thanks in advance,
-- Jonathan W. Cranford <jcranford@mitre.org>
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)

Jonathan,
I've copied Andy, who added regex support to IBM DFDL recently. He might have an idea as to the effort involved in stating conformance.
We will discuss your other two emails on the next DFDL-WG call or so.
Regards,
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: "Cranford, Jonathan W." <jcranford@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 06/07/2013 00:56
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
Sent by: dfdl-wg-bounces@ogf.org

Update: I just found errata 3.29, which answers this question, I think.

Ok, thanks Steve. I'll try to start dialing into the weekly meetings to join in the conversation. -Jonathan
Participants (2): Cranford, Jonathan W.; Steve Hanson