The errata for action 193 have been updated below. Please review for the
next WG call.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 04/02/2013 12:50 -----
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org
Date: 28/01/2013 13:49
Subject: Re: [DFDL-WG] Action 193: First draft for errata for RegEx
To me this is excellent work, much appreciated. I'd like to be much more
directive about the non-portable constructs.
We should decide among only these choices:
1) the non-portable constructs are disallowed. It is an SDE to use them.
The check is required for all compliant DFDL implementations (that implement
regular expressions at all.)
2) the non-portable constructs are allowed, but not recommended, and DFDL
implementations are *required* to issue non-portability warnings if these
constructs are used.
Not checking this and hoping for the best (user beware) is a bad idea. A
scanner to find these syntaxes and disallow them is pretty easy to write.
Regular expressions are, by their very nature, not very rich. You handle the
escape scheme, then scan everything else for the offending constructs.
Ironically, it's something that can be done with a regular expression itself.
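To make the scanner idea concrete, here is a minimal sketch in Java. It is
only an illustration: the class and method names are hypothetical, the list of
constructs is taken from the draft table in the errata, and it assumes the
pattern contains no character classes or \Q...\E quoting that happen to
contain one of the offending sequences (a real check would have to skip
those too).

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of any DFDL implementation.
final class NonPortableRegexScanner {

    // One alternation does the whole job. Order matters: the offending
    // escapes are tried first, then any other backslash pair is consumed as
    // a harmless escape, so "\\X" (escaped backslash, then X) is not flagged.
    private static final Pattern SCAN = Pattern.compile(
          "\\\\N\\{"                                  // \N{NAME}
        + "|\\\\X"                                    // \X
        + "|\\\\U[0-9A-Fa-f]{8}"                      // \Uhhhhhhhh
        + "|(?<escape>\\\\.)"                         // any other escape: skip
        + "|\\(\\?#"                                  // (?# comment )
        + "|\\(\\?(?=[a-zA-Z-]*[wduU])[a-zA-Z-]+\\)"  // (?w) (?d) (?u) (?U)
        + "|\\(\\?[a-zA-Z-]+:",                       // (?flags:X)
        Pattern.DOTALL);

    // Returns the text of every non-portable construct found, in order.
    static List<String> findNonPortable(String dfdlRegex) {
        List<String> hits = new ArrayList<>();
        Matcher m = SCAN.matcher(dfdlRegex);
        while (m.find()) {
            if (m.group("escape") == null) {   // null means: not a plain escape
                hits.add(m.group());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findNonPortable("a\\Xb(?i)c"));
        System.out.println(findNonPortable("\\\\X(?:ab)"));
        System.out.println(findNonPortable("(?u)(?i:ab)"));
    }
}
```

An implementation could raise an SDE (choice 1) or a warning (choice 2) when
the returned list is non-empty.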
...mike
On Mon, Jan 28, 2013 at 8:22 AM, Steve Hanson <smh@uk.ibm.com> wrote:
Here's a draft errata for action 193, for review on the next WG call.
============================================================
Section 24 to read as follows:
A DFDL regular expression may be specified for the dfdl:lengthPattern format
property and the dfdl:testPattern attribute of the dfdl:assert and
dfdl:discriminator annotations. DFDL regular expressions do not interpret
DFDL entities.
A DFDL regular expression is defined by a set of valid pattern characters.
For portability, a DFDL regular expression pattern is restricted to the
constructs common to both the ICU regular expression [ICURE] and the
Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and
UNICODE_CHARACTER_CLASS turned on. The following regular expression
constructs are not common to both ICU and Java(R) 7, and it is a schema
definition error if any are used in a DFDL regular expression:
*Construct*                 *Meaning*                                           *Notes*
\N{UNICODE CHARACTER NAME}  Match the named character                           ICU only
\X                          Match a Grapheme Cluster                            ICU only
\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh     ICU only
(?# ... )                   Free-format comment                                 ICU only
(?w-w)                      UREGEX_UWORD - Controls the behaviour of \b in      ICU only
                            a pattern.
(?d-d)                      UNIX_LINES - Enables Unix lines mode.               Java 7 only
(?u-u)                      UNICODE_CASE - Enables Unicode-aware case folding.  Java 7 only (1)
(?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode       Java 7 only (2)
                            version of Predefined character classes and
                            POSIX character classes.
(?imsx-imsx:X)              X, as a non-capturing group with the given flags.   Java 7 only
                            Note that the flags i,s,m,x are valid, but
                            appending :X to the flag is not.
Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to
match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by
default to match ICU.
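As an illustration of notes (1) and (2), a Java 7 based implementation might
compile every DFDL regular expression through a small wrapper like the one
sketched below. The class and method names are hypothetical; only
java.util.regex.Pattern and its two flag constants are real API.

```java
import java.util.regex.Pattern;

// Hypothetical wrapper; the names are illustrative only.
final class DfdlPatternCompiler {

    // Always compile with both Unicode flags on, as notes (1) and (2) require.
    static Pattern compile(String dfdlRegex) {
        return Pattern.compile(dfdlRegex,
                Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        // With UNICODE_CASE, (?i) folds case beyond US-ASCII.
        System.out.println(compile("(?i)\u00e9").matcher("\u00c9").matches());
        // With UNICODE_CHARACTER_CLASS, \d also matches non-ASCII decimal digits.
        System.out.println(compile("\\d+").matcher("\u0663\u0664").matches());
    }
}
```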
Additionally, the behaviour of the word character construct (\w) is not
consistent in ICU and Java 7. In Java 7 \w is
[\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is a larger
set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. The use of \w is
not recommended in DFDL regular expressions in conjunction with Unicode
encodings, and an implementation must issue a warning if such usage is
detected.
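The difference can be seen with a short Java sketch. It does not invoke ICU;
it assumes the ICU definition is exactly the set quoted above and emulates it
by spelling that set out as an explicit character class. The class and member
names are hypothetical.

```java
import java.util.regex.Pattern;

// Hypothetical comparison helper; names are illustrative only.
final class WordClassComparison {

    // \w as Java 7 evaluates it with UNICODE_CHARACTER_CLASS set.
    static final Pattern JAVA_WORD =
            Pattern.compile("\\w", Pattern.UNICODE_CHARACTER_CLASS);

    // The ICU set quoted in the text, written out explicitly (an emulation).
    static final Pattern ICU_WORD_AS_QUOTED =
            Pattern.compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]");

    // True when the two definitions disagree about the given character.
    static boolean differs(String ch) {
        return JAVA_WORD.matcher(ch).matches()
                != ICU_WORD_AS_QUOTED.matcher(ch).matches();
    }

    public static void main(String[] args) {
        System.out.println(differs("a"));       // letter: in both sets
        System.out.println(differs("_"));       // connector punctuation (Pc)
        System.out.println(differs("\u0301"));  // combining acute accent (Mn)
    }
}
```

A schema author who needs identical behaviour on both engines could write the
intended set out explicitly, as in the second pattern, instead of using \w.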
Character properties are detailed by the Unicode Regular Expressions
[UNICODERE].
Section 30 to add:
[ICURE] - http://userguide.icu-project.org/strings/regexp
[JAVARE] - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html
[UNICODERE] - http://www.unicode.org/reports/tr18/
Section 30 to remove:
[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
[JAVARE] - http://download.oracle.com/javase/1.4.2/docs/api/java/util/regex/Pattern.html
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire
PO6 3AU
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com