For the subset of ICU symbols that DFDL supports, here is what ICU claims:
1) Lenient parsing behaviour when in 'strict' mode:
a) case insensitive matching for text fields
b) MMM, MMMM, MMMMM all accept either short or long form of Month
c) E, EE, EEE, EEEE, EEEEE **, EEEEEE *** all accept any of the abbreviated, full, narrow and short forms of Day of Week
d) accept truncated leftmost numeric field (eg, pattern "HHmmss" allows "123456" (12:34:56) and "23456" (2:34:56) but not "3456")
2) Additional lenient parsing behaviour when in 'lax' mode:
a) values outside valid ranges are normalized (eg, "March 32 1996" is treated as "April 1 1996")
b) ignoring a trailing dot after a non-numeric field
c) leading and trailing whitespace in the data but not in the pattern is accepted ****
d) whitespace in the pattern can be missing in the data
e) partial matching on literal strings (eg, data "20130621d" allowed for pattern "yyyyMMdd'date'") ****
** Bug found when testing this - EEEEE 'narrow' form completely broken - ICU ticket raised.
*** EEEEEE and eeeeee are new and support a 2 char version of 'short' form - eg Tu or Mo. These are not currently allowed by DFDL; we should consider allowing them.
**** Only currently in ICU4C. ICU4J will be changed to match ICU4C.
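To make 1(d) and 2(a) concrete, here is a minimal Python sketch. It does not call ICU: the function names are made up for illustration, and it only mimics the behaviour ICU describes for those two cases (a truncated leftmost numeric field for pattern "HHmmss", and roll-over of an out-of-range day of month). Treat it as a sketch under those assumptions, not as how ICU implements it.

```python
from datetime import date, timedelta


def parse_hhmmss_truncated(text):
    """Parse data for pattern "HHmmss" (hypothetical helper).

    Only the leftmost field (HH) may be short by one digit, so the
    data must be 5 or 6 ASCII digits. No range checking is done here.
    """
    if not (text.isascii() and text.isdigit()) or len(text) not in (5, 6):
        raise ValueError("expected 5 or 6 digits for HHmmss: %r" % text)
    # mm and ss are always two digits, so read them from the right;
    # whatever is left over (1 or 2 digits) is the hour.
    return int(text[:-4]), int(text[-4:-2]), int(text[-2:])


def normalize_day(year, month, day):
    """Roll an out-of-range day of month forward (hypothetical helper).

    Mimics the 'lax' normalization in 2(a): day 32 of March becomes
    day 1 of April.
    """
    return date(year, month, 1) + timedelta(days=day - 1)


print(parse_hhmmss_truncated("123456"))
print(parse_hhmmss_truncated("23456"))
print(normalize_day(1996, 3, 32))
```

With the examples from the list, "123456" and "23456" both parse, "3456" is rejected with a ValueError, and March 32 1996 comes out as April 1 1996.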
Note: IBM is in discussion with ICU to provide a 'really strict' mode (name tbd) which has no leniency at all. We need to decide whether to reflect all three variants in dfdl:calendarCheckPolicy, or whether to remap our 'strict' to the new 'really strict' mode when it appears. Given where we are, I think this is a DFDL 2.0 item.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg