Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org
Date: 24/07/2013 01:33
Subject: for errata r14 - Fwd: [DFDL-WG] DFDL regular expressions and Unicode - conformance
I am assuming this issue will get handled as part of an r14 erratum.
No need for us to contact ICU, as Andy indicates below, ICU and Java both
claim conformance.
Here are the words from errata 3.29. Please can you rephrase to combine
the conformance requirement and the restrictions, so that we end up with
a form you are happy with? Then we can update the errata.
A DFDL regular expression is defined by a set of valid pattern characters.
For portability, a DFDL regular expression pattern is restricted
to the inclusive subset of the ICU regular expression [ICURE] and the Java(R)
7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS
turned on.
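To make the effect of the two flags named in the errata text concrete, here is a minimal Java sketch. It is an illustration only, not part of the errata: the class name DfdlRegexFlags and the helper compileDfdlStyle are hypothetical, and it exercises only java.util.regex, not ICU.

```java
import java.util.regex.Pattern;

public class DfdlRegexFlags {
    // Hypothetical helper: compile a pattern with the two flags the errata
    // text names for Java 7, plus any extra flags the caller wants.
    public static Pattern compileDfdlStyle(String regex, int extraFlags) {
        return Pattern.compile(regex,
                Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS | extraFlags);
    }

    public static void main(String[] args) {
        String word = "\u00E9t\u00E9"; // "ete" with acute accents on both e's

        // Default Java behaviour: \w covers only [a-zA-Z_0-9], so this is false.
        System.out.println(Pattern.compile("\\w+").matcher(word).matches());

        // With UNICODE_CHARACTER_CLASS, \w follows the Unicode definition: true.
        System.out.println(compileDfdlStyle("\\w+", 0).matcher(word).matches());

        String upper = "\u00C9T\u00C9";
        // CASE_INSENSITIVE alone folds case for US-ASCII only: false.
        System.out.println(
                Pattern.compile(upper, Pattern.CASE_INSENSITIVE).matcher(word).matches());

        // Adding UNICODE_CASE makes case-insensitive matching Unicode-aware: true.
        System.out.println(
                compileDfdlStyle(upper, Pattern.CASE_INSENSITIVE).matcher(word).matches());
    }
}
```

The same pattern string can therefore match different data depending on the flags, which is why the errata wording pins them down.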
It looks like there are 2 stages to checking conformance:
1) Logical - do the available regex constructs provide conformance to the
technical standard? This is probably just a couple of hours of reading the
Unicode standard rules and cross-checking the constructs in each matching engine.
2) Actual - do Java 7 and ICU really match properly for each of the conformance
statements? This can take an ever-increasing amount of time testing various sets
of data and regex patterns, and it risks the only reward being that we find bugs
in Java 7 or ICU. The minimum would be 3 or 4 days of test generation.
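As a rough idea of what the "Actual" stage could look like on the Java side, here is a small table-driven spot check. It is a sketch under assumptions: the class ConformanceSpotCheck, its Case type and the four sample cases are invented for illustration, it covers only a few Unicode character-class constructs, and an equivalent run against ICU would be needed to compare the two engines.

```java
import java.util.regex.Pattern;

public class ConformanceSpotCheck {
    // Hypothetical test case: a pattern, an input, and whether the whole
    // input is expected to match.
    static final class Case {
        final String regex;
        final String input;
        final boolean expected;

        Case(String regex, String input, boolean expected) {
            this.regex = regex;
            this.input = input;
            this.expected = expected;
        }
    }

    // The flags named in errata 3.29 for Java 7.
    static final int FLAGS = Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    // Illustrative cases only; a real suite would be generated from the
    // UTS#18 conformance statements.
    static final Case[] CASES = {
        new Case("\\w+", "\u00E9t\u00E9", true),   // accented Latin letters are word characters
        new Case("\\d+", "\u0663\u0664", true),    // Arabic-Indic digits are decimal digits
        new Case("\\p{Lu}", "\u00C9", true),       // uppercase E with acute
        new Case("\\p{Lu}", "\u00E9", false),      // lowercase e with acute
    };

    // Runs every case and returns how many disagreed with the expectation.
    public static int countFailures() {
        int failures = 0;
        for (Case c : CASES) {
            boolean actual = Pattern.compile(c.regex, FLAGS).matcher(c.input).matches();
            if (actual != c.expected) {
                failures++;
                System.out.println("MISMATCH: /" + c.regex + "/ expected " + c.expected);
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures: " + countFailures());
    }
}
```

The cost estimate above comes from growing the CASES table, not from the harness itself, which stays this small.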
MP211, Hursley Park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-Mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Update: I just found errata 3.29, which answers this question, I think.
From the description in the errata, and looking at the documentation for
Java 7 regular expressions, it looks like DFDL regular expressions conform
to level 1 of Unicode Regular Expressions (UTS#18).
I still think there would be value in stating such conformance in the DFDL
spec, but I suppose that would take some legwork for someone to actually
confirm the conformance of ICU and Java 7 to level 1.
Very respectfully,
-- Jonathan Cranford
>-----Original Message-----
>From: Cranford, Jonathan W.
>Sent: Friday, July 05, 2013 1:36 PM
>To: dfdl-wg@ogf.org
>Subject: DFDL regular expressions and Unicode
>
>I've been going through the spec recently, and I have a few questions about DFDL
>regular expressions.
>
>Rather than put them into one long email, I'll break them up into separate emails.
>
>First question: What level of conformance to Unicode Technical Standard #18
>UNICODE REGULAR EXPRESSIONS do DFDL regular expressions claim?
>
> For example,
> * XML Schema regular expressions are "targeted at support of 'Level 1' features"
> (http://www.w3.org/TR/xmlschema-2/#dt-ccesN)
> * Java 1.4 regular expressions "implement its second level of support"
> (http://docs.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html)
> * Perl 5.18 seems to implement most of Level 1
> (http://perldoc.perl.org/perlunicode.html#Unicode-Regular-Expression-Support-Level)
>
> I think the conformance level should be specified in the DFDL spec so that it is
> clear to schema designers what a regular expression would really match against.
> Details like case conversion and canonical equivalence make a difference when
> matching against a Unicode string.
>
>Thanks in advance,
>
>--
>Jonathan W. Cranford <jcranford@mitre.org>
>Senior Information Systems Engineer
>The MITRE Corporation (http://www.mitre.org)