Re: [DFDL-WG] Action 313: Plus '+' sign and lax textNumberCheckPolicy

https://unicode-org.atlassian.net/browse/ICU-20896 issue raised. I still think we need to pin DFDL 1.0 to a specific release(s).
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com tel:+44-1962-815848 mob:+44-7717-378890
Note: I work Tuesday to Friday
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

Our team has observed that upgrading to a newer ICU was actually done to fix some other bugs, so backing out to a prior rev may be trading one set of bugs for another. I cannot recollect exactly what issues/bugs though.
Since ICU is on GitHub, we do have the option to actually fix the bug (by adding some compatibility flag that selects the older/preferred behavior) and issue a pull request.
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
On Wed, Nov 13, 2019 at 12:31 PM Steve Hanson <smh@uk.ibm.com> wrote:
https://unicode-org.atlassian.net/browse/ICU-20896 issue raised.
I still think we need to pin DFDL 1.0 to a specific release(s).
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
smh@uk.ibm.com tel:+44-1962-815848 mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>, slawrence@apache.org
Cc: DFDL-WG <dfdl-wg@ogf.org>, Liam O'Neill/UK/IBM@IBMGB
Date: 30/08/2019 15:48
Subject: Re: [DFDL-WG] Action 313: Plus '+' sign and lax textNumberCheckPolicy
------------------------------
ICU changing behaviour in an incompatible way is not good.
IBM DFDL is way behind, and is still on ICU 51.2. We are limited in what we can do as we try to keep the same level as IBM Integration Bus & WTX as we have had C namespacing issues in the past.
Looking at the links, there are other changes that have crept in when lenient.
- The string must contain a complete prefix and suffix. For example, if the pattern is "{#};(#)", then "{123}" or "(123)" would match, but "{123", "123}", and "123" would all fail. (The latter strings would be accepted in lenient mode.)
- Minus and plus signs can only appear if specified in the pattern. In lenient mode, a plus or minus sign can always precede a number.
In typical ICU fashion, even this is not complete. It says nothing about what happens if the pattern has a sign and the data doesn't.
I suggest you test all the combos with Daffodil and establish the truth.
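One way to lay out "all the combos" is to enumerate them up front. The sketch below only generates the combinations; it does not call ICU or Daffodil, and the two patterns ("#" and "+#") and the three inputs are hypothetical choices made for illustration. The policy values are the two values of dfdl:textNumberCheckPolicy. No expected results are recorded, because the point of the exercise is to find out what the parser actually does.

```java
import java.util.ArrayList;
import java.util.List;

/** Enumerates pattern/data/policy combinations to try; it does not call ICU or Daffodil. */
class SignCheckMatrix {

    /** One combination to run against a real parser. */
    static final class Case {
        final String pattern;
        final String data;
        final String policy;

        Case(String pattern, String data, String policy) {
            this.pattern = pattern;
            this.data = data;
            this.policy = policy;
        }

        @Override
        public String toString() {
            return "pattern=" + pattern + " data=" + data + " policy=" + policy;
        }
    }

    static List<Case> cases() {
        // Hypothetical patterns: one without an explicit sign, one with a '+' prefix.
        String[] patterns = {"#", "+#"};
        // Data with no sign, a plus sign, and a minus sign.
        String[] inputs = {"123", "+123", "-123"};
        // The two values of dfdl:textNumberCheckPolicy.
        String[] policies = {"strict", "lax"};

        List<Case> out = new ArrayList<>();
        for (String p : patterns) {
            for (String d : inputs) {
                for (String c : policies) {
                    out.add(new Case(p, d, c));
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Print the matrix; each row is one test to run and record by hand.
        for (Case c : cases()) {
            System.out.println(c);
        }
    }
}
```

The rows where the pattern has a sign and the data does not are the ones the documentation is silent about, so those are the ones worth recording first.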
Then we need to decide what to do. If there is no way of controlling this (e.g., a parameter or env var), then the safest option is to back Daffodil off to the latest ICU release that matches the DFDL 1.0 spec, and to change the spec so that the link to ICU is specific rather than the generic link that is in the spec today (http://www.icu-project.org/apiref/icu4c/classDecimalFormat.html#_details), which floats to the latest release. We can't have a moving target.
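If the spec were pinned to a range of ICU releases, an implementation could fail fast when it finds itself running against a release outside that range. The sketch below is a minimal illustration of such a guard using only the JDK. The class name, the method names, and the bounds used in main are all hypothetical; the thread does not say which releases would be chosen, only that IBM DFDL is on 51.2, Daffodil is on 62, and the change looks to have landed around 61.

```java
/** Minimal version-range guard; names and bounds are illustrative assumptions. */
class IcuPinCheck {

    /** Extracts the major component of a version string such as "62.1". */
    static int major(String version) {
        int dot = version.indexOf('.');
        String head = dot < 0 ? version : version.substring(0, dot);
        return Integer.parseInt(head.trim());
    }

    /** Returns true if the major version lies in [minMajor, maxMajor]. */
    static boolean withinPin(String version, int minMajor, int maxMajor) {
        int m = major(version);
        return m >= minMajor && m <= maxMajor;
    }

    public static void main(String[] args) {
        // Hypothetical pin: majors 51 through 60 inclusive.
        System.out.println(withinPin("51.2", 51, 60)); // true
        System.out.println(withinPin("62.1", 51, 60)); // false
    }
}
```

In a real implementation the version string would come from the ICU library in use rather than from a literal.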
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
smh@uk.ibm.com tel:+44-1962-815848 mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 29/08/2019 19:49
Subject: [DFDL-WG] Action 313: Plus '+' sign and lax textNumberCheckPolicy
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
------------------------------
Looks like ICU changed behavior....
From: Steve Lawrence <slawrence@apache.org>
Sent: Thursday, August 29, 2019 1:30 PM
To: users@daffodil.apache.org
Subject: Re: Plus '+' sign and lax textNumberCheckPolicy - was: Re: How to model a fixed-length integer that may be padded with space on the left?
I think this is a difference in ICU version?
After a little grepping through the ICU source, I found a change [1] to their number parsing logic in Dec 2017:
+ if (!isStrict) {
+     parser.addMatcher(WhitespaceMatcher.getInstance());
+     parser.addMatcher(new PlusSignMatcher());
+ }
That looks to me like a change to make it so plus signs are always matched in lax/lenient mode regardless of the pattern (Daffodil's current behavior). A couple of minor changes have been made to that section, but nothing that allows you to turn it off if lenient is on.
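The effect described above can be illustrated with a toy model. This is not ICU code: ToyNumberParser and its fields are invented for illustration, and the only thing taken from the quoted diff is the shape of the condition, namely that non-strict parsing unconditionally enables whitespace and plus-sign matching. The model accepts optional leading spaces, at most one '+', and then one or more digits.

```java
/** Toy model of the quoted change; none of these names are ICU classes. */
class ToyNumberParser {
    private final boolean plusAllowed;
    private final boolean whitespaceAllowed;

    /**
     * @param patternHasPlus whether the (hypothetical) pattern declares a '+' prefix
     * @param strict         false models lenient/lax parsing
     */
    ToyNumberParser(boolean patternHasPlus, boolean strict) {
        boolean plus = patternHasPlus;
        boolean whitespace = false;
        if (!strict) {
            // Mirrors the shape of the quoted diff: lenient mode always
            // enables whitespace and '+' matching, whatever the pattern says.
            whitespace = true;
            plus = true;
        }
        this.plusAllowed = plus;
        this.whitespaceAllowed = whitespace;
    }

    /** Accepts optional leading spaces, at most one '+', then one or more digits. */
    boolean accepts(String text) {
        int i = 0;
        if (whitespaceAllowed) {
            while (i < text.length() && text.charAt(i) == ' ') {
                i++;
            }
        }
        if (plusAllowed && i < text.length() && text.charAt(i) == '+') {
            i++;
        }
        if (i == text.length()) {
            return false; // no digits left to match
        }
        for (; i < text.length(); i++) {
            if (!Character.isDigit(text.charAt(i))) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        // The pattern has no '+', yet lenient mode still accepts one.
        System.out.println(new ToyNumberParser(false, false).accepts("+123")); // true
        System.out.println(new ToyNumberParser(false, true).accepts("+123"));  // false
    }
}
```

In this model there is no constructor argument that keeps lenient whitespace handling while switching off the plus sign, which is the gap noted above.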
It's hard to tell in the git history what release that was in, but it looks like around version 61, which is relatively new (Daffodil is on version 62).
Also, the latest version of DecimalFormatProperties.java (which looks to be an internal implementation, so there are no online javadocs) has javadocs stating that plus signs are always allowed in lenient/lax mode [2].
I think this is a change in ICU behavior in newer versions.
- Steve
[1] https://github.com/unicode-org/icu/commit/68340c8464bd988477d6c88f46f9dfe4562a6d02#diff-565b07c255337881b4e06f766691667cR119-R122
[2] https://github.com/unicode-org/icu/blob/master/icu4j/main/classes/core/src/com/ibm/icu/impl/number/DecimalFormatProperties.java#L53-L54
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg

You'd need to change both the Java and C versions.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com tel:+44-1962-815848 mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson <smh@uk.ibm.com>
Cc: slawrence@apache.org, DFDL-WG <dfdl-wg@ogf.org>, "Liam O'Neill" <WILONEIL@uk.ibm.com>
Date: 14/11/2019 16:22
Subject: [EXTERNAL] Re: [DFDL-WG] Action 313: Plus '+' sign and lax textNumberCheckPolicy
Our team has observed that upgrading to a newer ICU was actually done to fix some other bugs, so backing out to a prior rev may be trading one set of bugs for another. I cannot recollect exactly what issues/bugs though.
Since ICU is on GitHub, we do have the option to actually fix the bug (by adding some compatibility flag that selects the older/preferred behavior) and issue a pull request.
participants (2)
- Mike Beckerle
- Steve Hanson