Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance

Jonathan
No need for us to contact ICU, as Andy indicates below ICU and Java both claim conformance.
Here's the words from errata 3.29. Please can you rephrase to combine the conformance requirement and the restrictions, so that we end up with a form you are happy with, then we can update the errata?
A DFDL regular expression is defined by a set of valid pattern characters. For portability, a DFDL regular expression pattern is restricted to the inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB,
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org, "Cranford, Jonathan W." <jcranford@mitre.org>
Date: 11/07/2013 14:19
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
Hi Jonathan,
Sorry for the delay; first week back in the office...
As you've noted, errata 3.29 describes what DFDL regexes are supported. Specifically, it is a subset of Java 7's java.util.regex (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and ICU's regular expression support (http://userguide.icu-project.org/strings/regexp), both of which conform with level 1 of Unicode technical standard #18
It looks like there are 2 stages to checking conformance:
* Logical - do the available regex constructs provide conformance to the technical standard. This is probably just a couple of hours of reading the Unicode standard rules and cross-checking the constructs in each matching engine.
* Actual - do Java 7 and ICU really match properly for each of the conformance statements. This can take an ever increasing amount of time testing various sets of data and regex patterns, and it risks the only reward being that we find bugs in Java 7 or ICU. Minimum would be 3 or 4 days of test generation.
Does that answer the issue?
Andy
Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards@uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Steve Hanson/UK/IBM
08/07/2013 11:08
To "Cranford, Jonathan W." <jcranford@mitre.org>,
cc "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Subject Re: [DFDL-WG] DFDL regular expressions and Unicode
Jonathan
I've copied Andy who added regexs support into IBM DFDL recently. He might have an idea as to the effort involved in stating conformance.
We will discuss your other two emails on next DFDL-WG call or so.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Cranford, Jonathan W." <jcranford@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>,
Date: 06/07/2013 00:56
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
Sent by: dfdl-wg-bounces@ogf.org
Update: I just found errata 3.29, which answers this question, I think.
From the description in the errata, and looking at the documentation for java 7 regular expressions, it looks like DFDL regular expressions conform to level 1 of Unicode Regular expressions (UTS#18).
I still think there would be value in stating such conformance in the DFDL spec, but I suppose that would take some legwork for someone to actually confirm the conformance of ICU and Java7 to level 1.
Very respectfully,
-- Jonathan Cranford
-----Original Message-----
From: Cranford, Jonathan W.
Sent: Friday, July 05, 2013 1:36 PM
To: dfdl-wg@ogf.org
Subject: DFDL regular expressions and Unicode
I've been going through the spec recently, and I have a few questions about DFDL regular expressions.
Rather than put them into one long email, I'll break them up into separate emails.
First question: What level of conformance to Unicode Technical Standard #18 UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
For example,
* XML Schema regular expressions are "targeted at support of 'Level 1' features" (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
* Java 1.4 regular expressions "implement its second level of support" (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
* Perl 5.18 seems to implement most of Level 1 (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
I think the conformance level should be specified in the DFDL spec so that it is clear to schema designers what a regular expression would really match against. Details like case conversion and canonical equivalence make a difference when matching against a Unicode string.
Thanks in advance,
--
Jonathan W. Cranford <jcranford@mitre.org>
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

How does this sound? I just added a sentence on the end.
A DFDL regular expression is defined by a set of valid pattern characters. For
portability, a DFDL regular expression pattern is restricted to the inclusive subset
of the ICU regular expression [ICURE] and the Java(R) 7 regular expression
[JAVARE] with the Unicode flags UNICODE_CASE and
UNICODE_CHARACTER_CLASS turned on. DFDL regular expressions thereby conform to
Unicode Technical Standard #18, Unicode Regular Expressions, level 1 [UNICODERE].
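For readers who want to see what the two Java flags named in the proposed wording change in practice, here is a minimal sketch. It only exercises java.util.regex on Java 7 or later; the class name DfdlRegexFlagsDemo and its helper methods are hypothetical names chosen for illustration, and the sketch says nothing about how any DFDL processor compiles its patterns. Note that UNICODE_CASE only has a visible effect when case-insensitive matching is also switched on.

```java
import java.util.regex.Pattern;

// Hypothetical demo class: illustrates the Java flags named in errata 3.29.
public class DfdlRegexFlagsDemo {
    // The two flags the errata text asks to have "turned on" (Java 7+).
    static final int DFDL_FLAGS = Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // Full-string match using only the flags supplied by the caller.
    static boolean matchesPlain(String regex, String input, int flags) {
        return Pattern.compile(regex, flags).matcher(input).matches();
    }

    // Full-string match with the errata flags added to the caller's flags.
    static boolean matchesWithDfdlFlags(String regex, String input, int extraFlags) {
        return Pattern.compile(regex, DFDL_FLAGS | extraFlags).matcher(input).matches();
    }

    public static void main(String[] args) {
        String word = "\u00e9t\u00e9"; // the French word for "summer", with accented letters

        // By default \w covers only [a-zA-Z_0-9], so accented letters are not word characters.
        System.out.println(matchesPlain("\\w+", word, 0));
        // With UNICODE_CHARACTER_CLASS, \w follows the Unicode definition of a word character.
        System.out.println(matchesWithDfdlFlags("\\w+", word, 0));

        // CASE_INSENSITIVE on its own only folds US-ASCII letters.
        System.out.println(matchesPlain("\u00e9", "\u00c9", Pattern.CASE_INSENSITIVE));
        // Adding UNICODE_CASE makes the case folding Unicode-aware.
        System.out.println(matchesWithDfdlFlags("\u00e9", "\u00c9", Pattern.CASE_INSENSITIVE));
    }
}
```

The same pattern text can therefore match different data depending on the flags, which is why the wording pins them down explicitly.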
-----Original Message-----
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Tuesday, July 16, 2013 9:13 AM
To: Andrew Edwards
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org; Cranford, Jonathan W.
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance
Jonathan
No need for us to contact ICU, as Andy indicates below ICU and Java both claim
conformance.
Here's the words from errata 3.29. Please can you rephrase to combine the
conformance requirement and the restrictions, so that we end up with a form you
are happy with, then we can update the errata?
A DFDL regular expression is defined by a set of valid pattern characters. For
portability, a DFDL regular expression pattern is restricted to the inclusive subset
of the ICU regular expression [ICURE] and the Java(R) 7 regular expression
[JAVARE] with the Unicode flags UNICODE_CASE and
UNICODE_CHARACTER_CLASS turned on.
DFDL regular expressions thereby conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1,
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB,
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org,
"Cranford, Jonathan W." <jcranford@mitre.org>
Date: 11/07/2013 14:19
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
________________________________
Hi Jonathan,
Sorry for the delay; first week back in the office...
As you've noted, errata 3.29 describes what DFDL regexes are supported.
Specifically, it is a subset of Java 7's java.util.regex
(http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and
ICU's regular expression support (http://userguide.icu-project.org/strings/regexp),
both of which conform with
level 1 of Unicode technical standard #18
It looks like there are 2 stages to checking conformance:
* Logical - do the available regex constructs provide conformance to the
technical standard. This is probably just a couple of hours of reading the Unicode
standard rules and cross-checking the constructs in each matching engine.
* Actual - do Java 7 and ICU really match properly for each of the
conformance statements. This can take an ever increasing amount of time
testing various sets of data and regex patterns, and it risks the only reward being
that we find bugs in Java 7 or ICU. Minimum would be 3 or 4 days of test
generation.
Does that answer the issue?
Andy
Andy Edwards - IBM Integration Bus <http://www-03.ibm.com/software/products/us/en/integration-bus>
- DFDL <https://w3-connections.ibm.com/wikis/home?lang=en-gb#!/wiki/IBM%20Data%20Format%20Description%20Language>
Email: andy.edwards@uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Steve Hanson/UK/IBM
08/07/2013 11:08 To
"Cranford, Jonathan W." <jcranford@mitre.org>,
cc
"dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Subject
Re: [DFDL-WG] DFDL regular expressions and Unicode
Jonathan
I've copied Andy who added regexs support into IBM DFDL recently. He might
have an idea as to the effort involved in stating conformance.
We will discuss your other two emails on next DFDL-WG call or so.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Cranford, Jonathan W." <jcranford@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>,
Date: 06/07/2013 00:56
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
Sent by: dfdl-wg-bounces@ogf.org
________________________________
Update: I just found errata 3.29, which answers this question, I think.
From the description in the errata, and looking at the documentation for java 7
regular expressions, it looks like DFDL regular expressions conform to level 1 of
Unicode Regular expressions (UTS#18).
I still think there would be value in stating such conformance in the DFDL spec,
but I suppose that would take some legwork for someone to actually confirm the
conformance of ICU and Java7 to level 1.
Very respectfully,
-- Jonathan Cranford
-----Original Message-----
From: Cranford, Jonathan W.
Sent: Friday, July 05, 2013 1:36 PM
To: dfdl-wg@ogf.org
Subject: DFDL regular expressions and Unicode
I've been going through the spec recently, and I have a few questions about
DFDL regular expressions.
Rather than put them into one long email, I'll break them up into separate
emails.
First question: What level of conformance to Unicode Technical Standard #18
UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
For example,
* XML Schema regular expressions are "targeted at support of 'Level 1'
features"
* Java 1.4 regular expressions "implement its second level of support"
(http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
* Perl 5.18 seems to implement most of Level 1
(http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
I think the conformance level should be specified in the DFDL spec so that it is
clear to schema designers what a regular expression would really match against.
Details like case conversion and canonical equivalence make a difference when
matching against a Unicode string.
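To make the canonical-equivalence point concrete, here is a small sketch in Java. As far as I can tell, canonical equivalence sits under Level 2 of UTS #18 rather than Level 1, so a Level 1 engine is not expected to treat a precomposed character and its decomposed form as the same text. The class name CanonicalEquivalenceDemo and its helpers are hypothetical, and normalising to NFC before matching is just one possible workaround, not something the thread or the errata prescribes.

```java
import java.text.Normalizer;
import java.util.regex.Pattern;

// Hypothetical demo class: shows why canonical equivalence matters to a schema designer.
public class CanonicalEquivalenceDemo {
    // Flags named in errata 3.29 for the Java side (Java 7+).
    static final int DFDL_FLAGS = Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    static final String PRECOMPOSED = "\u00e9"; // single code point: e with acute accent
    static final String DECOMPOSED = "e\u0301"; // letter e followed by a combining acute accent

    // Match the literal text against the input exactly as supplied.
    static boolean matchesAsIs(String literal, String input) {
        return Pattern.compile(Pattern.quote(literal), DFDL_FLAGS).matcher(input).matches();
    }

    // Normalise both sides to NFC first (an assumed workaround, not part of the errata).
    static boolean matchesAfterNfc(String literal, String input) {
        String normalizedLiteral = Normalizer.normalize(literal, Normalizer.Form.NFC);
        String normalizedInput = Normalizer.normalize(input, Normalizer.Form.NFC);
        return Pattern.compile(Pattern.quote(normalizedLiteral), DFDL_FLAGS)
                .matcher(normalizedInput).matches();
    }

    public static void main(String[] args) {
        // The two strings render identically but differ code point by code point.
        System.out.println(matchesAsIs(PRECOMPOSED, DECOMPOSED));
        // After NFC normalisation both sides are the same single code point.
        System.out.println(matchesAfterNfc(PRECOMPOSED, DECOMPOSED));
    }
}
```

Whether data is normalised before a pattern is applied is a separate question from the regex conformance level, so it would need to be settled elsewhere.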
Thanks in advance,
--
Jonathan W. Cranford <jcranford@mitre.org>
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)
--
dfdl-wg mailing list
dfdl-wg@ogf.org

That looks good to me. Let's close on Tues WG call.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Cranford, Jonathan W." <jcranford@mitre.org>
To: Steve Hanson/UK/IBM@IBMGB, Andrew Edwards/UK/IBM@IBMGB,
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 19/07/2013 20:12
Subject: RE: [DFDL-WG] DFDL regular expressions and Unicode - conformance
How does this sound? I just added a sentence on the end.
A DFDL regular expression is defined by a set of valid pattern characters. For portability, a DFDL regular expression pattern is restricted to the inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on. DFDL regular expressions thereby conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1 [UNICODERE].
-----Original Message-----
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: Tuesday, July 16, 2013 9:13 AM
To: Andrew Edwards
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org; Cranford, Jonathan W.
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode - conformance
Jonathan
No need for us to contact ICU, as Andy indicates below ICU and Java both claim conformance.
Here's the words from errata 3.29. Please can you rephrase to combine the conformance requirement and the restrictions, so that we end up with a form you are happy with, then we can update the errata?
A DFDL regular expression is defined by a set of valid pattern characters. For portability, a DFDL regular expression pattern is restricted to the inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on.
DFDL regular expressions thereby conform to Unicode Technical Standard #18, Unicode Regular Expressions, level 1,
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB,
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org, "Cranford, Jonathan W." <jcranford@mitre.org>
Date: 11/07/2013 14:19
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
________________________________
Hi Jonathan,
Sorry for the delay; first week back in the office...
As you've noted, errata 3.29 describes what DFDL regexes are supported. Specifically, it is a subset of Java 7's java.util.regex (http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html) and ICU's regular expression support (http://userguide.icu-project.org/strings/regexp), both of which conform with level 1 of Unicode technical standard #18
It looks like there are 2 stages to checking conformance:
* Logical - do the available regex constructs provide conformance to the technical standard. This is probably just a couple of hours of reading the Unicode standard rules and cross-checking the constructs in each matching engine.
* Actual - do Java 7 and ICU really match properly for each of the conformance statements. This can take an ever increasing amount of time testing various sets of data and regex patterns, and it risks the only reward being that we find bugs in Java 7 or ICU. Minimum would be 3 or 4 days of test generation.
Does that answer the issue?
Andy
Andy Edwards - IBM Integration Bus <http://www-03.ibm.com/software/products/us/en/integration-bus> - DFDL <https://w3-connections.ibm.com/wikis/home?lang=en-gb#!/wiki/IBM%20Data%20Format%20Description%20Language>
Email: andy.edwards@uk.ibm.com
Snail Mail: MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Steve Hanson/UK/IBM
08/07/2013 11:08
To "Cranford, Jonathan W." <jcranford@mitre.org>,
cc "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Subject Re: [DFDL-WG] DFDL regular expressions and Unicode
Jonathan
I've copied Andy who added regexs support into IBM DFDL recently. He might have an idea as to the effort involved in stating conformance.
We will discuss your other two emails on next DFDL-WG call or so.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: "Cranford, Jonathan W." <jcranford@mitre.org>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>,
Date: 06/07/2013 00:56
Subject: Re: [DFDL-WG] DFDL regular expressions and Unicode
Sent by: dfdl-wg-bounces@ogf.org
________________________________
Update: I just found errata 3.29, which answers this question, I think.
From the description in the errata, and looking at the documentation for java 7 regular expressions, it looks like DFDL regular expressions conform to level 1 of Unicode Regular expressions (UTS#18).
I still think there would be value in stating such conformance in the DFDL spec, but I suppose that would take some legwork for someone to actually confirm the conformance of ICU and Java7 to level 1.
Very respectfully,
-- Jonathan Cranford
-----Original Message-----
From: Cranford, Jonathan W.
Sent: Friday, July 05, 2013 1:36 PM
To: dfdl-wg@ogf.org
Subject: DFDL regular expressions and Unicode
I've been going through the spec recently, and I have a few questions about DFDL regular expressions.
Rather than put them into one long email, I'll break them up into separate emails.
First question: What level of conformance to Unicode Technical Standard #18 UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
For example,
* XML Schema regular expressions are "targeted at support of 'Level 1' features" (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
* Java 1.4 regular expressions "implement its second level of support" (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
* Perl 5.18 seems to implement most of Level 1 (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
I think the conformance level should be specified in the DFDL spec so that it is clear to schema designers what a regular expression would really match against. Details like case conversion and canonical equivalence make a difference when matching against a Unicode string.
Thanks in advance,
--
Jonathan W. Cranford <jcranford@mitre.org>
Senior Information Systems Engineer
The MITRE Corporation (http://www.mitre.org)
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
participants (2): Cranford, Jonathan W.; Steve Hanson