Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK smh@uk.ibm.com
tel:+44-1962-815848
From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Mike Beckerle <mbeckerle.dfdl@gmail.com>
Date: 27/06/2013 17:18
Subject: Re: [DFDL-WG] regex free-spacing mode
Hi Mike (et al),
I've gone back to the ICU doc for this and run a few tests locally. It
looks like both forms of the inline flag syntax (the plain flag setting
and the non-capturing group with flags) can now be used in Java 7 and
ICU 51.1. In other words, both of the following constructs are supported:
(?imsx-imsx)
(?imsx-imsx:X)
So the quick answer is that what you
are trying to do in your example below is supported.
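As a quick illustration of the two forms, here is a minimal sketch using java.util.regex only. It does not exercise ICU, and the class and method names are made up for the example. The flag i is used because its scope is easy to observe, and one pattern uses x to show free-spacing mode.

```java
import java.util.regex.Pattern;

// Hypothetical demo class; exercises java.util.regex only, not ICU.
class InlineFlagForms {
    // (?i) turns case-insensitive matching on for the rest of the pattern.
    static final Pattern WHOLE = Pattern.compile("(?i)abc");
    // (?i:X) applies the flag only inside the non-capturing group.
    static final Pattern SPAN = Pattern.compile("a(?i:b)c");
    // (?x) turns on free-spacing mode: whitespace and # comments are ignored.
    static final Pattern FREE = Pattern.compile("(?x) a b c  # spaces and this comment are ignored");

    static boolean wholeMatches(String s) { return WHOLE.matcher(s).matches(); }
    static boolean spanMatches(String s)  { return SPAN.matcher(s).matches(); }
    static boolean freeMatches(String s)  { return FREE.matcher(s).matches(); }

    public static void main(String[] args) {
        System.out.println(wholeMatches("ABC")); // flag covers the whole pattern
        System.out.println(spanMatches("aBc"));  // flag covers only the 'b'
        System.out.println(spanMatches("ABC"));  // 'a' and 'c' stay case-sensitive
        System.out.println(freeMatches("abc"));  // whitespace in the pattern is ignored
    }
}
```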
The long answer is that erratum 3.29 can probably be updated by removing
the restriction on (?imsx-imsx:X), as below:
3.29 Sections 24 and 30. The DFDL
specification is not prescriptive enough when specifying what is allowed
for regular expressions used in the length property and testPattern property.
Section 24 is replaced by the following.
"A DFDL regular expression may be
specified for the dfdl:lengthPattern format property and the dfdl:testPattern
attribute of the dfdl:assert and dfdl:discriminator annotations. DFDL regular
expressions do not interpret DFDL entities.
A DFDL regular expression is defined by
a set of valid pattern characters. For portability, a DFDL regular expression
pattern is restricted to the inclusive subset of the ICU regular expression
[ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode
flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on. The following
regular expression constructs are not common to both ICU and Java(R) 7
and it is a schema definition error if any are used in a DFDL regular expression:
*Construct*                 *Meaning*                                      *Notes*
\N{UNICODE CHARACTER NAME}  Match the named character                      ICU only
\X                          Match a Grapheme Cluster                       ICU only
\Uhhhhhhhh                  Match the character with the hex value         ICU only
                            hhhhhhhh.
(?# ... )                   Free-format comment                            ICU only
(?w-w)                      UREGEX_UWORD - Controls the behaviour of \b    ICU only
                            in a pattern.
(?d-d)                      UNIX_LINES - Enables Unix lines mode.          Java 7 only
(?u-u)                      UNICODE_CASE - Enables Unicode-aware case      Java 7 only (1)
                            folding.
(?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode  Java 7 only (2)
                            version of Predefined character classes and
                            POSIX character classes.
(?imsx-imsx:X)              X, as a non-capturing group with the given     Java 7 only
                            flags. Note that the flags i,s,m,x are valid,
                            but appending :X to the flag is not.
Notes:
(1) Implementations using Java 7 must set
flag UNICODE_CASE by default to match ICU.
(2) Implementations using Java 7 must set
flag UNICODE_CHARACTER_CLASS by default to match ICU.
Additionally, the behaviour of the word
character construct (\w) is not consistent in ICU and Java 7. In Java 7
\w is [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is
a larger set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. The
use of \w is not recommended in DFDL regular expressions in conjunction
with Unicode encodings, and an implementation must issue a warning if such
usage is detected.
Character properties are detailed by the
Unicode Regular Expressions [UNICODERE]."
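To make notes (1) and (2) and the \w caveat concrete, here is a rough Java sketch. It is only an illustration under stated assumptions: the class and method names are hypothetical, and the "ICU" character class is simply the set quoted in the text above, evaluated by Java's own engine rather than by ICU.

```java
import java.util.regex.Pattern;

// Hypothetical helper; not taken from any DFDL implementation.
class UnicodeFlagsDemo {
    // Flags that notes (1) and (2) ask a Java-based implementation to set by default.
    static final int DEFAULT_FLAGS = Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS;

    static Pattern compile(String regex) {
        return Pattern.compile(regex, DEFAULT_FLAGS);
    }

    // Java's \w with the default flags above.
    static final Pattern JAVA_WORD = compile("\\w");
    // The set quoted above for ICU's \w, evaluated here by Java (assumption: not real ICU).
    static final Pattern QUOTED_ICU_WORD = compile("[\\p{Ll}\\p{Lu}\\p{Lt}\\p{Lo}\\p{Nd}]");

    // True when the two definitions of \w disagree on the given single character.
    static boolean differs(String ch) {
        return JAVA_WORD.matcher(ch).matches() != QUOTED_ICU_WORD.matcher(ch).matches();
    }

    public static void main(String[] args) {
        // With UNICODE_CASE set, (?i) folds non-ASCII letters as well.
        System.out.println(compile("(?i)\u00e9").matcher("\u00c9").matches());
        System.out.println(differs("a"));      // a plain letter is in both sets
        System.out.println(differs("_"));      // connector punctuation (gc=Pc)
        System.out.println(differs("\u0301")); // combining acute accent (gc=Mn)
    }
}
```

Characters such as the underscore and combining marks are where the two sets diverge, which is the kind of usage the warning is meant to catch.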
Section 30 is updated to correct the references
used in section 24:
MP211, Hursley Park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17
The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
Steve Hanson/UK/IBM, 27/06/2013 10:26
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Mike, I believe that is the case, but I have copied Andy Edwards, who is
the person in the IBM DFDL team who added our regex support.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 26/06/2013 18:56
Subject: Re: [DFDL-WG] regex free-spacing mode
Sent by: dfdl-wg-bounces@ogf.org
To clarify, errata v13 has this in the table for erratum 3.29 in the
list of non-portables:
(?imsx-imsx:X)  X, as a non-capturing group with the given flags.  Java 7 only
                Note that the flags i,s,m,x are valid, but
                appending :X to the flag is not.
I interpret this as meaning that only the so-called modifier-span notation
(the :X suffix) is disallowed, and that a plain (?x) is still allowed, but I
wanted to be sure that was the correct interpretation.
<dfdl:property name="lengthPattern"><![CDATA[(?x)
  # regex free spacing mode
  #
  # match the front matter of the document
  #
  .{1,8192}?                  # up to 8K of front matter content
  #
  # front matter ends at the first message description page
  #
  (?=                         # lookahead (followed by but not including...)
    \f                        # a formfeed character
    (?> \s | \x08 ){1,100}?   # whitespace or backspace (x08)
    MESSAGE\ DESCRIPTION\r    # this literal text
    \s{1,100}?                # up to 100 whitespaces
    -{19}\r                   # exactly 19 hyphens and a CR
  )                           # end lookahead
]]></dfdl:property>
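
For anyone who wants to try the pattern outside a DFDL processor, here is a minimal sketch that compiles it with java.util.regex. It does not test ICU or any DFDL implementation. The class name, the helper method and the sample data are invented for the example, and lookingAt() is used on the assumption that a length pattern is matched from the start of the data.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; the sample data below is made up.
class FreeSpacingDemo {
    // The pattern from the lengthPattern property, one Java string per line.
    static final Pattern FRONT_MATTER = Pattern.compile(
        "(?x)                          # regex free spacing mode\n" +
        ".{1,8192}?                    # up to 8K of front matter content\n" +
        "(?=                           # lookahead (followed by but not including...)\n" +
        "  \\f                         # a formfeed character\n" +
        "  (?> \\s | \\x08 ){1,100}?   # whitespace or backspace (x08)\n" +
        "  MESSAGE\\ DESCRIPTION\\r    # this literal text\n" +
        "  \\s{1,100}?                 # up to 100 whitespaces\n" +
        "  -{19}\\r                    # exactly 19 hyphens and a CR\n" +
        ")                             # end lookahead\n");

    // Length of the front matter at the start of data, or -1 if the pattern does not match.
    static int frontMatterLength(String data) {
        Matcher m = FRONT_MATTER.matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String hyphens = new String(new char[19]).replace('\0', '-'); // exactly 19 hyphens
        String sample = "FRONT MATTER" + "\f MESSAGE DESCRIPTION\r\n" + hyphens + "\r" + "rest";
        // Prints the number of characters before the formfeed.
        System.out.println(frontMatterLength(sample));
    }
}
```

Note that without the s flag the dot does not match line terminators in Java, so front matter containing CR or LF would need (?sx) or an equivalent; the sample above avoids this by keeping the front matter on one line.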