Public comment 116 - Japanese CCSID 943


I suggest that "100% ASCII compatible" is too strong a statement, and that we can dial it back.

Example: UTF-8 will not handle the asciiTandemModified style, because that style sets the most-significant bit to produce the 0x80-0x89 negative sign + digit bytes. That is, asciiTandemModified makes use of extended-ASCII codepoints, and there are many flavors of extended ASCII that assign different characters to the codepoints in the range 0x80-0x89.

So the question arises: who is in charge, the characters or the codepoints? That is, does the zoned sign style even care about the encoding, or does it translate the associated codepoints without concern for their mappings to actual characters?

I think the codepoints should be in charge. So long as you have an encoding where all the characters used by your zoned decimals (as digits, or digits with overpunched signs) have legal codepoints in that encoding, you are good to go. If one charset encoding assigns the euro symbol to one of those codepoints, and other charset encodings assign some other character to that same codepoint, it should not matter so long as the codepoint doesn't cause a decode error.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>

On Mon, Oct 7, 2013 at 9:30 AM, Steve Hanson <smh@uk.ibm.com> wrote:
Please see below discussion of the issue raised by public comment 116.
While there is a simple workaround, the real issue is that ccsid 943 is in daily use in Japan for zoned decimals and DFDL does not support that.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 07/10/2013 14:05 -----
ICU library treats the ccsid '943' as ICU *ibm-943_P130-1999* <http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&s=IANA&s=IBM> which is not 100% ASCII compatible due to two code points being different - 0x5C and 0x7E. There is another encoding ICU *ibm-943_P15A-2003* <http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P15A-2003&s=IANA&s=IBM> which is ASCII compatible, commonly called Shift_JIS. The difference is one that you are probably familiar with - the backslash is replaced by the Yen symbol.

Here is the extract from ICU converter site:

Internal Converter Name   IBM       IANA
ibm-943_P15A-2003                   Shift_JIS, MS_Kanji, csShiftJIS, windows-31j, csWindows31J
ibm-943_P130-1999         ibm-943
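To make the difference between the two converters concrete, here is a minimal Python sketch of their single-byte (0x00-0x7F) range. Python ships no ibm-943 codec, so the table is written by hand. The 0x5C to Yen mapping is stated in the message; the 0x7E to overline (U+203E) mapping and the helper names `decode_low_byte` and `non_ascii_bytes` are assumptions added for illustration, not part of ICU.

```python
# Hand-written sketch; not generated from ICU's conversion tables.
# Bytes in the ASCII range that ibm-943_P130-1999 decodes differently from ASCII.
# 0x5C -> YEN SIGN is stated in the message; 0x7E -> OVERLINE is an assumption.
IBM943_P130_OVERRIDES = {0x5C: "\u00a5", 0x7E: "\u203e"}


def decode_low_byte(value: int, overrides: dict[int, str]) -> str:
    """Decode one byte in 0x00-0x7F, applying converter-specific overrides."""
    if not 0 <= value <= 0x7F:
        raise ValueError("only the single-byte ASCII range is modelled here")
    return overrides.get(value, chr(value))


def non_ascii_bytes(overrides: dict[int, str]) -> list[int]:
    """List the bytes whose decoded character differs from plain ASCII."""
    return [b for b in range(0x80) if decode_low_byte(b, overrides) != chr(b)]


if __name__ == "__main__":
    # Modelled ibm-943_P130-1999: two bytes differ from ASCII.
    print([hex(b) for b in non_ascii_bytes(IBM943_P130_OVERRIDES)])
    # Modelled ibm-943_P15A-2003 (Shift_JIS): no overrides in this range.
    print([hex(b) for b in non_ascii_bytes({})])
```

Every other byte below 0x80 is treated as identical in both converters, which is what the message implies when it says only two code points differ.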
This causes DFDL to reject the ccsid '943' when used on a zoned decimal on the grounds that it is not ASCII-compatible. Why does DFDL do this? It's because it is being safe. So far, we have identified 4 different 'overpunching' schemes for ASCII zoned decimals, and there might well be one or two more used by less common machine architectures:
*asciiStandard*: ASCII characters '0123456789' (0x30-0x39) and 'pqrstuvwxy' (0x70-0x79) for negative sign punch.
*asciiTranslatedEBCDIC*: ASCII characters '{ABCDEFGHI' (0x7B, 0x41-0x49) for positive sign punch and '}JKLMNOPQR' (0x7D, 0x4A-0x52) for negative sign punch.
*asciiCARealiaModified*: ASCII characters '0123456789' (0x30-0x39) and '<SP>!"#$%&'()' (0x20-0x29) for negative sign punch.
*asciiTandemModified*: ASCII characters '0123456789' (0x30-0x39) and control characters 0x80-0x89 for negative sign punch.
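The four schemes listed above can be sketched as byte lookup tables. The following Python is an illustration only: the scheme names and byte ranges come from the list, while the assumption that the overpunched sign sits on the last byte, and the names `_table`, `SCHEMES` and `decode_zoned`, are hypothetical and not taken from the DFDL specification or any DFDL implementation.

```python
# Sketch only: scheme names and byte ranges are taken from the list above;
# the trailing-sign assumption and all function names are illustrative.

def _table(positive: bytes, negative: bytes) -> dict[int, tuple[int, int]]:
    """Map each byte to (digit, sign) for one overpunching scheme."""
    table = {0x30 + d: (d, 1) for d in range(10)}  # plain digits are always accepted
    table.update({b: (d, 1) for d, b in enumerate(positive)})
    table.update({b: (d, -1) for d, b in enumerate(negative)})
    return table


SCHEMES = {
    "asciiStandard": _table(b"0123456789", b"pqrstuvwxy"),
    "asciiTranslatedEBCDIC": _table(b"{ABCDEFGHI", b"}JKLMNOPQR"),
    "asciiCARealiaModified": _table(b"0123456789", b" !\"#$%&'()"),
    "asciiTandemModified": _table(b"0123456789", bytes(range(0x80, 0x8A))),
}


def decode_zoned(data: bytes, scheme: str) -> int:
    """Decode a zoned decimal whose last byte carries the sign overpunch."""
    if not data:
        raise ValueError("empty zoned decimal")
    table = SCHEMES[scheme]
    value, sign = 0, 1
    for position, byte in enumerate(data):
        if byte not in table:
            raise ValueError(f"byte 0x{byte:02X} is not valid in {scheme}")
        digit, byte_sign = table[byte]
        if byte_sign < 0 and position != len(data) - 1:
            raise ValueError("negative overpunch is only accepted on the last byte")
        value = value * 10 + digit
        sign = byte_sign  # after the loop this holds the sign of the last byte
    return sign * value


if __name__ == "__main__":
    print(decode_zoned(b"12s", "asciiStandard"))         # 's' is 0x73
    print(decode_zoned(b"12L", "asciiTranslatedEBCDIC"))  # 'L' is 0x4C
    print(decode_zoned(b"12\x83", "asciiTandemModified"))
```

The sketch works on raw bytes and never decodes them to characters, so it behaves the same whichever character encoding the surrounding data declares.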
In case other schemes are discovered that use a different byte range for negative sign punch (e.g., 0x50 to 0x59), the DFDL specification has said that ASCII zoned decimals must be in a 100% ASCII compatible encoding.
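The strict rule stated here, and the relaxed "codepoints in charge" rule proposed in the reply at the top of this thread, can be contrasted with a short Python sketch. It uses only Python's built-in codecs (which do not include ibm-943), and the function names `is_ascii_compatible` and `can_decode_each` are hypothetical names chosen for this example.

```python
from collections.abc import Iterable


def is_ascii_compatible(encoding: str) -> bool:
    """Strict rule: every byte 0x00-0x7F decodes to the same ASCII character."""
    for value in range(0x80):
        try:
            if bytes([value]).decode(encoding) != chr(value):
                return False
        except UnicodeDecodeError:
            return False
    return True


def can_decode_each(encoding: str, values: Iterable[int]) -> bool:
    """Relaxed rule: each byte, taken on its own, decodes without error."""
    for value in values:
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            return False
    return True


# Negative sign bytes used by asciiTandemModified.
TANDEM_NEGATIVE = range(0x80, 0x8A)

if __name__ == "__main__":
    for name in ("utf-8", "latin-1", "cp037"):
        print(name, is_ascii_compatible(name), can_decode_each(name, TANDEM_NEGATIVE))
```

UTF-8 passes the strict check yet cannot carry the asciiTandemModified sign bytes, because 0x80-0x89 are not valid as standalone bytes in UTF-8. This suggests the two rules test different things.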
What we can observe is that *ibm-943_P130-1999* <http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&s=IANA&s=IBM> is actually safe for representing ASCII zoned decimals in all the above schemes, because the 0x5C and 0x7E characters do not match any of the ranges of bytes used by the schemes. And we can also observe that (apart from the special case of asciiTranslatedEBCDIC) the overpunching schemes simply use the bits in the first nibble of the byte, so any new scheme we discover is very unlikely to affect 0x5C or 0x7E.
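Both observations can be checked mechanically. The Python sketch below copies the byte ranges from the scheme list in this message; the names `CONFLICT_BYTES`, `SCHEME_BYTES`, `clashing_schemes` and `nibble_scheme_can_clash` are hypothetical. The second function assumes that a new scheme would keep the digit 0-9 in the low nibble and vary only the first nibble, which is an assumption about future schemes rather than something the message guarantees.

```python
# Bytes in the ASCII range where ibm-943_P130-1999 departs from ASCII (from the message).
CONFLICT_BYTES = {0x5C, 0x7E}

# Byte values each scheme uses, copied from the list earlier in the message.
SCHEME_BYTES = {
    "asciiStandard": set(range(0x30, 0x3A)) | set(range(0x70, 0x7A)),
    "asciiTranslatedEBCDIC": {0x7B, 0x7D} | set(range(0x41, 0x53)),
    "asciiCARealiaModified": set(range(0x30, 0x3A)) | set(range(0x20, 0x2A)),
    "asciiTandemModified": set(range(0x30, 0x3A)) | set(range(0x80, 0x8A)),
}


def clashing_schemes() -> list[str]:
    """Names of known schemes that use a byte ibm-943_P130-1999 remaps."""
    return [name for name, used in SCHEME_BYTES.items() if used & CONFLICT_BYTES]


def nibble_scheme_can_clash() -> bool:
    """Could any scheme of the form (first nibble, digit 0-9) hit a remapped byte?"""
    return any(
        ((zone << 4) | digit) in CONFLICT_BYTES
        for zone in range(16)
        for digit in range(10)
    )


if __name__ == "__main__":
    print(clashing_schemes())
    print(nibble_scheme_can_clash())
```

The low nibbles of 0x5C and 0x7E are 0xC and 0xE, neither of which is a decimal digit. That is why a scheme of this shape, including the hypothetical 0x50 to 0x59 range mentioned earlier, does not reach them.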
So, the DFDL specification *could* be changed to treat *ibm-943_P130-1999* <http://demo.icu-project.org/icu-bin/convexp?conv=ibm-943_P130-1999&s=IANA&s=IBM> as ASCII compatible for zoned decimals.
The alternative is to use a workaround: in a DFDL schema that models a data stream in ccsid 943, the default encoding is 943, but zoned decimals override this and use Shift_JIS. When this workaround was discussed with a Japanese user of IBM DFDL, the reaction was "Why do I have to go to that trouble? It should just work in 943."
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
participants (2)
- Mike Beckerle
- Steve Hanson