Action 204: Establish strict versus lax behaviour for ICU calendar patterns

For the subset of ICU symbols that DFDL supports, here is what ICU claims:

1) Lenient parsing behaviour when in 'strict' mode:
a) case insensitive matching for text fields
b) MMM, MMMM, MMMMM all accept either short or long form of Month
c) E, EE, EEE, EEEE, EEEEE **, EEEEEE *** all accept any of the abbreviated, full, narrow and short forms of Day of Week
d) accept truncated leftmost numeric field (eg, pattern "HHmmss" allows "123456" (12:34:56) and "23456" (2:34:56) but not "3456")

2) Additional lenient parsing behaviour when in 'lax' mode:
a) values outside valid ranges are normalized (eg, "March 32 1996" is treated as "April 1 1996")
b) ignoring a trailing dot after a non-numeric field
c) leading and trailing whitespace in the data but not in the pattern is accepted ****
d) whitespace in the pattern can be missing in the data
e) partial matching on literal strings (eg, data "20130621d" allowed for pattern "yyyyMMdd'date'") ****

** Bug found when testing this - EEEEE 'narrow' form completely broken - ICU ticket raised.
*** EEEEEE and eeeeee are new and support a 2 char version of 'short' form - eg Tu or Mo. Not currently allowed by DFDL; we should consider allowing them.
**** Only currently in ICU4C. ICU4J will be changed to match ICU4C.

Note: IBM is in discussion with ICU to provide a 'really strict' mode (name tbd) which has no leniency at all. We need to decide whether to reflect all three variants in the dfdl:calendarCheckPolicy, or whether to remap our 'strict' to the new 'really strict' mode when it appears. Given where we are, I think this is a DFDL 2.0 item.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
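As a rough illustration of two of the behaviours listed in the message above (strict-mode point 1d, the truncated leftmost numeric field, and lax-mode point 2a, normalization of out-of-range values), here is a minimal Python sketch. It does not call ICU at all; the function names `parse_hhmmss` and `resolve_date` are hypothetical, and the code only mimics what ICU is reported to do for the examples given.

```python
from datetime import date, time, timedelta


def parse_hhmmss(text: str) -> time:
    """Parse data against the pattern "HHmmss".

    Mimics point 1d: only the leftmost field (HH) may be one digit short,
    so 6 or 5 digits are accepted and anything else is rejected.
    """
    if not (text.isascii() and text.isdigit()) or len(text) not in (5, 6):
        raise ValueError(f"cannot parse {text!r} with pattern HHmmss")
    # mm and ss always take the last four digits; HH is whatever remains.
    hh, mm, ss = text[:-4], text[-4:-2], text[-2:]
    return time(int(hh), int(mm), int(ss))


def resolve_date(year: int, month: int, day: int, lax: bool) -> date:
    """Turn parsed fields into a date.

    With lax=False an out-of-range field raises ValueError.
    With lax=True the value is normalized (point 2a), so day 32 of
    March rolls over into April.
    """
    if not lax:
        return date(year, month, day)
    # Roll surplus months into the year, then add the days as an offset
    # from the first of the month so that overflow carries forward.
    year += (month - 1) // 12
    month = (month - 1) % 12 + 1
    return date(year, month, 1) + timedelta(days=day - 1)
```

Under these assumptions, `parse_hhmmss("23456")` gives 02:34:56 while `parse_hhmmss("3456")` is rejected, and `resolve_date(1996, 3, 32, lax=True)` gives 1 April 1996 while the same call with `lax=False` raises an error.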

This is helpful.

Given where we are, let's just put this in as doc of what strict and lax mean.

I'm in favor of adding the variations of EEEE... and eeee... which are supported by ICU. This is upward compatible, and will avoid need for a special check to exclude them.

The broken EEEEE form is just a bug - I'd say this is just a release note item for products providing DFDL, unless ICU fixes it 'real soon now'.
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
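To make the day-of-week leniency discussed in this thread concrete (case insensitive matching, and every E pattern length accepting any form of the name), here is a small Python sketch. It is not ICU code: the name table is a hand-written English one, `match_day_of_week` is a hypothetical helper, and the single-letter 'narrow' form is deliberately left out because letters such as T and S are ambiguous in English.

```python
DAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
        "Friday", "Saturday", "Sunday"]


def build_lookup() -> dict:
    """Map every accepted spelling (case-folded) to a day index, 0 = Monday."""
    lookup = {}
    for index, full in enumerate(DAYS):
        # full form, 3-letter abbreviated form, 2-letter short form
        for form in (full, full[:3], full[:2]):
            lookup[form.casefold()] = index
    return lookup


_LOOKUP = build_lookup()


def match_day_of_week(text: str, pos: int = 0):
    """Return (day_index, new_pos) for the longest day name at text[pos:].

    The pattern length (E through EEEEEE) is intentionally not a parameter:
    any of the known forms is accepted whichever length was written.
    Returns None when no day name starts at pos.
    """
    best = None
    for form, index in _LOOKUP.items():
        end = pos + len(form)
        if text[pos:end].casefold() == form:
            if best is None or end > best[1]:
                best = (index, end)
    return best
```

With this table, "TUESDAY", "Tue" and "tu" all resolve to the same day, which is the behaviour that makes allowing EEEEEE and eeeeee in DFDL upward compatible on the parsing side.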

Agreed on call to add in these descriptions, minus the footnotes. Errata will be raised to add EEEEEE and eeeeee.

There are several bugs in ICU, all of which should ideally be documented in the release notes for a DFDL implementation. The broken EEEEE behaviour and the ICU4C v ICU4J differences both come under this.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

Adding these as Errata 2.150 (6 E's patterns added) and 2.151 clarification of strict/lax for calendarCheckPolicy.
-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
participants (2)
- Mike Beckerle
- Steve Hanson