Andy, did you test both ICU4J and ICU4C?

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848




From:        Andrew Edwards/UK/IBM
To:        Steve Hanson/UK/IBM@IBMGB,
Cc:        dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Mike Beckerle <mbeckerle.dfdl@gmail.com>
Date:        27/06/2013 17:18
Subject:        Re: [DFDL-WG] regex free-spacing mode



Hi Mike (et al),

I've gone back to the ICU documentation for this and run a few tests locally.  It looks like both cases for non-capturing groups can now be used in Java 7 and ICU 51.1.  In other words, both of the following constructs are supported:

        (?imsx-imsx)
        (?imsx-imsx:X)

So the quick answer is that what you are trying to do in your example below is supported.
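
For anyone who wants to repeat the Java side of that check, here is a minimal sketch using java.util.regex. The class name InlineFlagsDemo and the helper matchesWholeInput are made up for illustration, and the sample patterns are mine rather than the ones used in the tests described above. It covers only Java; ICU would need an equivalent check of its own.

```java
import java.util.regex.Pattern;

final class InlineFlagsDemo {
    // Hypothetical helper: true if the regex matches the entire input.
    static boolean matchesWholeInput(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        // (?i) switches the flag on for the remainder of the pattern.
        System.out.println(matchesWholeInput("(?i)abc", "ABC"));       // true

        // (?i:X) applies the flag only inside the non-capturing group.
        System.out.println(matchesWholeInput("(?i:a)bc", "Abc"));      // true
        System.out.println(matchesWholeInput("(?i:a)bc", "ABC"));      // false

        // (?x) enables free-spacing mode: whitespace and # comments are ignored.
        System.out.println(matchesWholeInput("(?x) a b c # note", "abc")); // true

        // (?x:X) limits free-spacing mode to the group.
        System.out.println(matchesWholeInput("(?x: a b )c", "abc"));   // true
    }
}
```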

The long answer is that erratum 3.29 can probably be updated by removing the restriction on (?imsx-imsx:X), as below


3.29  Sections 24 and 30. The DFDL specification is not prescriptive enough when specifying what is allowed for regular expressions used in the length property and testPattern property.

Section 24 is replaced by the following.

"A DFDL regular expression may be specified for the dfdl:lengthPattern format property and the dfdl:testPattern attribute of the dfdl:assert and dfdl:discriminator annotations. DFDL regular expressions do not interpret DFDL entities.

A DFDL regular expression is defined by a set of valid pattern characters. For portability, a DFDL regular expression pattern is restricted to the inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on.  The following regular expression constructs are not common to both ICU and Java(R) 7 and it is a schema definition error if any are used in a DFDL regular expression:

*Construct*                *Meaning*                                           *Notes*
\N{UNICODE CHARACTER NAME}  Match the named character                           ICU only

\X                          Match a Grapheme Cluster                            ICU only

\Uhhhhhhhh                  Match the character with the hex value hhhhhhhh.    ICU only

(?# ... )                   Free-format comment                                 ICU only

(?w-w)                      UREGEX_UWORD - Controls the behaviour of \b in      ICU only
                            a pattern.

(?d-d)                      UNIX_LINES - Enables Unix lines mode.               Java 7 only

(?u-u)                      UNICODE_CASE - Enables Unicode-aware case folding.  Java 7 only (1)

(?U-U)                      UNICODE_CHARACTER_CLASS - Enables the Unicode       Java 7 only (2)
                            version of Predefined character classes and POSIX    
                            character classes.                                  

(?imsx-imsx:X)              X, as a non-capturing group with the given flags.   Java 7 only
                            Note that the flags i,s,m,x are valid, but
                            appending :X to the flag is not.


Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by default to match ICU.

Additionally, the behaviour of the word character construct (\w) is not consistent in ICU and Java 7. In Java 7 \w is [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is a larger set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}].  The use of \w is not recommended in DFDL regular expressions in conjunction with Unicode encodings, and an implementation must issue a warning if such usage is detected.

Character properties are detailed by the Unicode Regular Expressions [UNICODERE]."

Section 30 is updated to correct the references used in section 24:

-Add:[ICURE] - http://userguide.icu-project.org/strings/regexp
-Add:[UNICODERE] - http://www.unicode.org/reports/tr18/
-Remove:[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
-Change:[JAVARE] - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html



Cheers,
Andy
Andy Edwards - IBM Integration Bus - DFDL

Email: andy.edwards@uk.ibm.com
Snail Mail:   MP211, Hursley park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17

The Feynman problem solving Algorithm
 1) Write down the problem
 2) Think real hard
 3) Write down the answer
-- Murray Gell-Mann in the NY Times






Steve Hanson/UK/IBM

27/06/2013 10:26

To
Mike Beckerle <mbeckerle.dfdl@gmail.com>,
cc
dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Subject
Re: [DFDL-WG] regex free-spacing mode




Mike, I believe that is the case but I have copied Andy Edwards who is the person in the IBM DFDL team who added our regex support.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848




From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:        dfdl-wg@ogf.org,
Date:        26/06/2013 18:56
Subject:        Re: [DFDL-WG] regex free-spacing mode
Sent by:        dfdl-wg-bounces@ogf.org




To clarify, errata v13 has this in the table for erratum 3.29 in the list of non-portables:

(?imsx-imsx:X)

X, as a non-capturing group with the
given flags. Note that the flags i,s,m,x
are valid, but appending :X to the flag is
not.

Java 7 only

I interpret this as meaning that only the so-called modifier-span notation (the :X suffix) is disallowed, and that plain (?x) is still allowed, but I wanted to be sure that was the correct interpretation.


On Wed, Jun 26, 2013 at 1:13 PM, Mike Beckerle <mbeckerle.dfdl@gmail.com> wrote:

I wrote this complicated regex today and it works in Daffodil.

The question is this: is (?x), which turns on regex free-spacing mode, officially supported in DFDL?

You can see from below that it is VERY desirable that it works.....

  <xs:simpleType name="frontMatterType">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <dfdl:simpleType lengthKind="pattern" terminator="%FF;">

            <dfdl:property name="lengthPattern"><![CDATA[(?x) # regex free spacing mode
            #
            # match the front matter of the document
            #
            .{1,8192}?                # up to 8K of front matter content
            #
            # front matter ends at the first message description page
            #
            (?=                       # lookahead (followed by but not including...)
              \f                      # a formfeed character
              (?> \s | \x08 ){1,100}? # whitespace or backspace (x08)
              MESSAGE\ DESCRIPTION\r  # this literal text
              \s{1,100}?              # up to 100 whitespaces
              -{19}\r                 # exactly 19 hyphens and a CR
            )                         # end lookahead
            ]]></dfdl:property>

           </dfdl:simpleType>
        </xs:appinfo>
      </xs:annotation>
      <xs:restriction base="xs:string" />
    </xs:simpleType>
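
For readers who want to try the pattern outside a DFDL processor, here is a minimal java.util.regex sketch. The class FrontMatterDemo, the helpers frontMatter and sampleDocument, and the sample text are all invented for illustration. Using lookingAt() to anchor the match at the start of the input is an assumption about how a lengthPattern is applied, not a statement of what Daffodil or the IBM implementation does.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class FrontMatterDemo {
    // The lengthPattern from the schema above, as a Java string literal.
    static final Pattern FRONT_MATTER = Pattern.compile(
          "(?x) # regex free spacing mode\n"
        + ".{1,8192}?                # up to 8K of front matter content\n"
        + "(?=                       # lookahead\n"
        + "  \\f                     # a formfeed character\n"
        + "  (?> \\s | \\x08 ){1,100}? # whitespace or backspace (x08)\n"
        + "  MESSAGE\\ DESCRIPTION\\r  # this literal text\n"
        + "  \\s{1,100}?             # up to 100 whitespaces\n"
        + "  -{19}\\r                # exactly 19 hyphens and a CR\n"
        + ")                         # end lookahead\n");

    // Hypothetical helper: returns the matched front matter, or null.
    static String frontMatter(String document) {
        Matcher m = FRONT_MATTER.matcher(document);
        return m.lookingAt() ? m.group() : null;
    }

    // Invented sample input: front matter, then the page the lookahead expects.
    static String sampleDocument() {
        StringBuilder hyphens = new StringBuilder();
        for (int i = 0; i < 19; i++) {
            hyphens.append('-');
        }
        return "Title page"
             + "\f" + " \b "            // formfeed, then space, backspace, space
             + "MESSAGE DESCRIPTION\r"
             + "\n  " + hyphens + "\r"
             + "rest of the document";
    }

    public static void main(String[] args) {
        System.out.println(frontMatter(sampleDocument()));   // Title page
        System.out.println(frontMatter("no marker here"));   // null
    }
}
```

One thing to watch when testing this way: in java.util.regex, "." does not match line terminators unless DOTALL (?s) is set, so front matter containing CR or LF would need that flag for the same pattern to match.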


--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com




--
 dfdl-wg mailing list
 dfdl-wg@ogf.org
 https://www.ogf.org/mailman/listinfo/dfdl-wg