Clarification needed: regular expressions - does '.' match newlines by default?
 
            A key behavior distinction in regular expressions is whether the '.' wildcard matches line endings or not. Regular expression libraries can be configured, usually by some sort of expression modifier, either way so that the '.' will not match a line ending or so that it will. Question is, how is it configured by default in DFDL regular expressions? This is part of the overall issue of tightening up regular expressions as part of DFDL. I.e., what exactly is the regex dialect, and how is it configured by default. ...mike -- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
 
            I would vote for this feature to be switched off by default in DFDL processors. It is mainly useful when dealing with lines of text, but DFDL formats are not always lines of text. So to be 100% clear, I think the '.' wildcard should match all characters, including line endings.
regards,
Tim Kimber, DFDL Team, Hursley, UK Internet: kimbert@uk.ibm.com Tel. 01962-816742 Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: dfdl-wg@ogf.org, Date: 14/11/2012 12:53 Subject: [DFDL-WG] Clarification needed: regular expressions - does '.' match newlines by default? Sent by: dfdl-wg-bounces@ogf.org
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
 
            I agree with Tim's opinion, but add that this is *NOT* the default behavior of the Java regex library we're using in Daffodil currently. I believe one must prefix every regex with (?s) to get the non-default line-ending behavior, in which '.' also matches line endings.
In reply to Tim Kimber <KIMBERT@uk.ibm.com>, Wed, Nov 14, 2012 at 11:15 AM.
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
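The default behavior described in this message can be checked with a minimal sketch against the standard java.util.regex API. The class name DotAllDemo and the helper fullMatch are made up for illustration and are not part of Daffodil; the sketch only shows how the library itself behaves, not how any DFDL processor configures it.

```java
import java.util.regex.Pattern;

public class DotAllDemo {
    // Hypothetical helper: true if the whole input matches the regex.
    static boolean fullMatch(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String twoLines = "a\nb";

        // Default: '.' does not match a line terminator such as \n.
        System.out.println(fullMatch("a.b", twoLines));      // prints false

        // The inline flag (?s) enables DOTALL, so '.' matches \n as well.
        System.out.println(fullMatch("(?s)a.b", twoLines));  // prints true

        // Same effect by passing Pattern.DOTALL at compile time.
        boolean viaFlag = Pattern.compile("a.b", Pattern.DOTALL)
                                 .matcher(twoLines)
                                 .matches();
        System.out.println(viaFlag);                         // prints true
    }
}
```

A processor that wanted '.' to match line endings by default could presumably compile every pattern with Pattern.DOTALL rather than asking schema authors to write (?s) themselves; that is an implementation choice, not something settled in this thread.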
 
            I came across this issue a couple of weeks ago. The regular expression syntax used in XML Schema is stricter than what is supported in Java regular expressions. DFDL regular expression syntax and restrictions should match the XML Schema specification.
Here is an example for which an APAR has been opened; we will be supplying a fix in the WMB toolkit to make regular expressions comply with the XML Schema spec.
The following line causes the XML Schema compiler to fail:
<xsd:pattern value="([a-zA-Z0-9 ]|\-|\.|_|\(|\)|\\|\/|.&|\')*"/>
Here the customer has escaped the forward slash and single quote characters. Instead of \/ it should be / and instead of \' it should be '
The following is accepted by the XML Schema compiler:
<xsd:pattern value="([a-zA-Z0-9 ]|\-|\.|_|\(|\)|\\|/|.&|')*"/>
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
From: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: Tim Kimber <KIMBERT@uk.ibm.com>, Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org Date: 11/14/2012 12:46 PM Subject: Re: [DFDL-WG] Clarification needed: regular expressions - does '.' match newlines by default? Sent by: dfdl-wg-bounces@ogf.org
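To illustrate the Java side of the difference reported in this message, the sketch below feeds both forms of the pattern to java.util.regex. Java permits a backslash before any non-alphabetic character, so both forms compile and match the same input. The class name EscapeLeniencyDemo, the helper, and the sample string are invented for illustration; the sketch shows nothing about how an XML Schema compiler treats either pattern beyond what the message itself reports.

```java
import java.util.regex.Pattern;

public class EscapeLeniencyDemo {
    // Pattern as the customer wrote it, with the \/ and \' escapes.
    static final String ESCAPED =
        "([a-zA-Z0-9 ]|\\-|\\.|_|\\(|\\)|\\\\|\\/|.&|\\')*";

    // Pattern in the form reported as accepted by the XML Schema compiler.
    static final String UNESCAPED =
        "([a-zA-Z0-9 ]|\\-|\\.|_|\\(|\\)|\\\\|/|.&|')*";

    // Hypothetical helper: true if the whole input matches the regex.
    static boolean matches(String regex, String input) {
        return Pattern.compile(regex).matcher(input).matches();
    }

    public static void main(String[] args) {
        String sample = "O'Neil a/b";
        // Java compiles both forms without complaint and treats them alike.
        System.out.println(matches(ESCAPED, sample));
        System.out.println(matches(UNESCAPED, sample));
    }
}
```

Because Java is the more lenient of the two here, a pattern that works in a Java-based tool is not thereby guaranteed to be a legal xs:pattern value.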
 
            I was in a meeting the other day where a number of people said they believe the regex capabilities offered in XML Schema are not sufficient. I am not exactly sure what XML Schema leaves out, but I have many examples making use of look-ahead/look-behind features, and I suspect those may be an issue. ...mike
In reply to Suman Kalia <kalia@ca.ibm.com>, Wed, Nov 14, 2012 at 12:59 PM.
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
 
            Let's be clear on the two kinds of regex that DFDL requires.
1) Regexes as used in lengthKind 'pattern' and testKind 'pattern' must absolutely not be XML Schema regexes. Those are way too restrictive and don't allow any of the look-ahead capability that you get with Java or Perl. This has caused no end of problems with IBM MRM's TDS pattern facility.
2) Regexes as used in the xs:pattern facet for validation. These must be regular XSDL regexes so that a DFDL schema is a genuine XML Schema.
Regards
Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: Suman Kalia <kalia@ca.ibm.com>, Cc: dfdl-wg@ogf.org Date: 14/11/2012 18:24 Subject: Re: [DFDL-WG] Clarification needed: regular expressions - does '.' match newlines by default? Sent by: dfdl-wg-bounces@ogf.org
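As a rough illustration of why look-ahead matters for the first kind of regex, the sketch below uses java.util.regex to find the length of a field that runs up to, but does not include, a terminating marker. Everything specific here is an assumption made for the example: the class name LookaheadLengthDemo, the helper matchedLength, the sample data and the "\nEND" marker. It is not taken from the DFDL specification or from any DFDL processor.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaheadLengthDemo {
    // Hypothetical helper: number of characters matched by the regex
    // starting at the beginning of the data, or -1 if it does not match there.
    static int matchedLength(String regex, String data) {
        Matcher m = Pattern.compile(regex).matcher(data);
        return m.lookingAt() ? m.end() : -1;
    }

    public static void main(String[] args) {
        String data = "line one\nline two\nENDtrailer";

        // (?s) lets '.' cross line endings; the look-ahead (?=\nEND) requires
        // the marker to follow but leaves it out of the matched text.
        int len = matchedLength("(?s).*?(?=\\nEND)", data);
        System.out.println(len);                    // prints 17
        System.out.println(data.substring(0, len)); // the two lines, without the marker

        // Without (?s), '.' stops at the first \n, the look-ahead never
        // succeeds, and no length can be determined.
        System.out.println(matchedLength(".*?(?=\\nEND)", data)); // prints -1
    }
}
```

The second call also ties back to the original question in this thread: the same pattern gives a usable length only when '.' is allowed to match line endings.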
participants (4)
- Mike Beckerle
- Steve Hanson
- Suman Kalia
- Tim Kimber