
Andy, did you test both ICU4J and ICU4C?

Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Andrew Edwards/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB,
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Mike Beckerle <mbeckerle.dfdl@gmail.com>
Date: 27/06/2013 17:18
Subject: Re: [DFDL-WG] regex free-spacing mode

Hi Mike (et al),

I've gone back to the ICU doc for this and run a few tests locally. It looks like both cases for non-capturing groups can now be used in Java 7 and ICU 51.1. In other words, both of the following constructs are supported:

(?imsx-imsx)
(?imsx-imsx:X)

So the quick answer is that what you are trying to do in your example below is supported.

The long answer is that errata 3.29 can probably be updated by removing the restriction on (?imsx-imsx:X), as below.

3.29 Sections 24 and 30. The DFDL specification is not prescriptive enough when specifying what is allowed for regular expressions used in the length property and testPattern property.

Section 24 is replaced by the following.

"A DFDL regular expression may be specified for the dfdl:lengthPattern format property and the dfdl:testPattern attribute of the dfdl:assert and dfdl:discriminator annotations. DFDL regular expressions do not interpret DFDL entities.

A DFDL regular expression is defined by a set of valid pattern characters. For portability, a DFDL regular expression pattern is restricted to the inclusive subset of the ICU regular expression [ICURE] and the Java(R) 7 regular expression [JAVARE] with the Unicode flags UNICODE_CASE and UNICODE_CHARACTER_CLASS turned on.

The following regular expression constructs are not common to both ICU and Java(R) 7 and it is a schema definition error if any are used in a DFDL regular expression:

*Construct*                  *Meaning*                                          *Notes*
\N{UNICODE CHARACTER NAME}   Match the named character                          ICU only
\X                           Match a Grapheme Cluster                           ICU only
\Uhhhhhhhh                   Match the character with the hex value hhhhhhhh.   ICU only
(?# ... )                    Free-format comment                                ICU only
(?w-w)                       UREGEX_UWORD - Controls the behaviour of \b in     ICU only
                             a pattern.
(?d-d)                       UNIX_LINES - Enables Unix lines mode.              Java 7 only
(?u-u)                       UNICODE_CASE - Enables Unicode-aware case          Java 7 only (1)
                             folding.
(?U-U)                       UNICODE_CHARACTER_CLASS - Enables the Unicode      Java 7 only (2)
                             version of Predefined character classes and
                             POSIX character classes.
(?imsx-imsx:X)               X, as a non-capturing group with the given         Java 7 only
                             flags. Note that the flags i,s,m,x are valid,
                             but appending :X to the flag is not.

Notes:
(1) Implementations using Java 7 must set flag UNICODE_CASE by default to match ICU.
(2) Implementations using Java 7 must set flag UNICODE_CHARACTER_CLASS by default to match ICU.

Additionally, the behaviour of the word character construct (\w) is not consistent in ICU and Java 7. In Java 7 \w is [\p{Alpha}\p{gc=Mn}\p{gc=Me}\p{gc=Mc}\p{Digit}\p{gc=Pc}], which is a larger set than ICU where \w is [\p{Ll}\p{Lu}\p{Lt}\p{Lo}\p{Nd}]. The use of \w is not recommended in DFDL regular expressions in conjunction with Unicode encodings, and an implementation must issue a warning if such usage is detected.

Character properties are detailed by the Unicode Regular Expressions [UNICODERE]."
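
For anyone who wants to try the Java side of this locally, here is a minimal sketch of what notes (1) and (2) and the two (?imsx-imsx) forms look like with java.util.regex. It only exercises Java, not ICU, and the class name DfdlRegexFlags and the helper compileDfdl are hypothetical names made up for illustration; they are not part of the erratum or of any DFDL implementation.

```java
import java.util.regex.Pattern;

public class DfdlRegexFlags {
    // Hypothetical helper (an assumption, not from the erratum): compile a pattern
    // with the two Unicode flags that notes (1) and (2) ask Java 7 implementations
    // to set by default.
    public static Pattern compileDfdl(String regex) {
        return Pattern.compile(regex,
                Pattern.UNICODE_CASE | Pattern.UNICODE_CHARACTER_CLASS);
    }

    public static void main(String[] args) {
        // Flag-only form (?i): with UNICODE_CASE set, case folding covers
        // non-ASCII letters such as e-acute.
        System.out.println(compileDfdl("(?i)\u00e9").matcher("\u00c9").matches()); // true

        // Plain Java default, for contrast: (?i) alone folds US-ASCII only.
        System.out.println(Pattern.compile("(?i)\u00e9").matcher("\u00c9").matches()); // false

        // UNICODE_CHARACTER_CLASS: the predefined class for digits also matches
        // ARABIC-INDIC DIGIT THREE.
        System.out.println(compileDfdl("\\d").matcher("\u0663").matches()); // true

        // Scoped form (?i:X): the flag applies inside the group only.
        System.out.println(compileDfdl("(?i:abc)def").matcher("ABCdef").matches()); // true
        System.out.println(compileDfdl("(?i:abc)def").matcher("ABCDEF").matches()); // false
    }
}
```

The last two lines are the case in question: the i flag covers abc but not def, which is the behaviour the (?imsx-imsx:X) row describes.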

Section 30 is updated to correct the references used in section 24:

-Add:[ICURE] - http://userguide.icu-project.org/strings/regexp
-Add:[UNICODERE] - http://www.unicode.org/reports/tr18/
-Remove:[PERLRE] - http://perldoc.perl.org/perlre.html#Extended-Patterns
-Change:[JAVARE] - http://docs.oracle.com/javase/7/docs/api/java/util/regex/Pattern.html

Cheers, Andy

Andy Edwards - IBM Integration Bus - DFDL
Email: andy.edwards@uk.ibm.com
Snail Mail: MP211, Hursley Park, Hursley, WINCHESTER, Hants, SO21 2JN
Tel int: 247222
Tel ext: +44 (0)1962 817222
Desk: DE3 V17

The Feynman problem solving Algorithm
1) Write down the problem
2) Think real hard
3) Write down the answer
-- Murray Gell-mann in the NY Times

From: Steve Hanson/UK/IBM
Date: 27/06/2013 10:26
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Andrew Edwards/UK/IBM@IBMGB
Subject: Re: [DFDL-WG] regex free-spacing mode

Mike, I believe that is the case but I have copied Andy Edwards, who is the person in the IBM DFDL team who added our regex support.

Regards
Steve Hanson

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org,
Date: 26/06/2013 18:56
Subject: Re: [DFDL-WG] regex free-spacing mode
Sent by: dfdl-wg-bounces@ogf.org

To clarify, errata v13 has this in the table for erratum 3.29 in the list of non-portables:

(?imsx-imsx:X)               X, as a non-capturing group with the given         Java 7 only
                             flags. Note that the flags i,s,m,x are valid,
                             but appending :X to the flag is not.

I interpret this as meaning that only the so-called modifier-span notation (the : suffix) is disallowed, and that plain (?x) is still allowed, but I wanted to be sure that was the correct interpretation.

On Wed, Jun 26, 2013 at 1:13 PM, Mike Beckerle <mbeckerle.dfdl@gmail.com> wrote:

I wrote this complicated regex today and it works in Daffodil.

The question is this: is (?x), which turns on regex free-spacing mode, officially supported in DFDL?

You can see from below that it is VERY desirable that it works.....

<xs:simpleType name="frontMatterType">
  <xs:annotation>
    <xs:appinfo source="http://www.ogf.org/dfdl/">
      <dfdl:simpleType lengthKind="pattern" terminator="%FF;">
        <dfdl:property name="lengthPattern"><![CDATA[(?x) # regex free spacing mode
          #
          # match the front matter of the document
          #
          .{1,8192}?   # up to 8K of front matter content
          #
          # front matter ends at the first message description page
          #
          (?=          # lookahead (followed by but not including...)
            \f                        # a formfeed character
            (?> \s | \x08 ){1,100}?   # whitespace or backspace (x08)
            MESSAGE\ DESCRIPTION\r    # this literal text
            \s{1,100}?                # up to 100 whitespaces
            -{19}\r                   # exactly 19 hyphens and a CR
          )            # end lookahead
        ]]></dfdl:property>
      </dfdl:simpleType>
    </xs:appinfo>
  </xs:annotation>
  <xs:restriction base="xs:string" />
</xs:simpleType>

--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
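
As a rough way to check the (?x) behaviour outside a DFDL processor, the lengthPattern above can be tried directly against java.util.regex. The sketch below is only that: it tests Java, not ICU, and the class name FreeSpacingDemo, the helpers frontMatterLength and sample, and the sample data are all hypothetical and made up for illustration. Each line of the pattern string ends with \n because in free-spacing mode a # comment runs to the end of its line.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class FreeSpacingDemo {
    // The lengthPattern from the schema above, written as a Java string.
    // Every line ends with \n so that each '#' comment stops at its own line end.
    public static final String FRONT_MATTER_REGEX =
        "(?x)                        # regex free spacing mode\n" +
        ".{1,8192}?                  # up to 8K of front matter content\n" +
        "(?=                         # lookahead\n" +
        "  \\f                       # a formfeed character\n" +
        "  (?> \\s | \\x08 ){1,100}? # whitespace or backspace (x08)\n" +
        "  MESSAGE\\ DESCRIPTION\\r  # this literal text\n" +
        "  \\s{1,100}?               # up to 100 whitespaces\n" +
        "  -{19}\\r                  # exactly 19 hyphens and a CR\n" +
        ")                           # end lookahead\n";

    // Made-up front matter used by the sample data below.
    public static final String FRONT = "HEADER page 1";

    // Returns the matched length at the start of data, or -1 if there is no match.
    public static int frontMatterLength(String data) {
        Matcher m = Pattern.compile(FRONT_MATTER_REGEX).matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    // Builds made-up sample data: front matter, a formfeed, some whitespace,
    // the literal heading, and a rule of exactly 19 hyphens followed by CR.
    public static String sample() {
        StringBuilder hyphens = new StringBuilder();
        for (int i = 0; i < 19; i++) {
            hyphens.append('-');
        }
        return FRONT + "\f\r\n  " + "MESSAGE DESCRIPTION\r" + "\n"
                + hyphens + "\r" + "rest of document";
    }

    public static void main(String[] args) {
        // Prints the length of the front matter found in the sample data.
        System.out.println(frontMatterLength(sample()));
        // Prints -1: no message description page follows.
        System.out.println(frontMatterLength("no marker here"));
    }
}
```

Note that the escaped space in MESSAGE\ DESCRIPTION is what keeps the literal space significant once (?x) is on; an unescaped space there would be ignored.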