Proposed Errata Language: Issue 156 - ICU fallback mappings - character encoding/decoding errors

Below is proposed errata language for Issue 156 (ICU fallback mappings, and character encoding/decoding errors).

*Errata - Character encoding/decoding errors*

A format property is added to section 11 (properties common to both content and framing) at the bottom of the table. The property is encodingErrorPolicy, and it is an Enum. Valid values are 'skip', 'error', and 'replace'.

*For Parsing/Decoding Errors*

There are two errors that can occur when decoding characters into Unicode/ISO 10646:

1. The data is broken: invalid byte sequences that don't match the definition of the encoding are encountered.
2. Not enough bytes are found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.

The behavior in these cases is controlled by dfdl:inputEncodingErrorPolicy.

If 'replace', then the Unicode *replacement character* <http://en.wikipedia.org/wiki/Replacement_character> '�' (U+FFFD) is substituted for the offending bytes, with one replacement character for any incorrect fragment of an encoding.

If 'skip', then the invalid byte sequences are dropped/ignored. No corresponding characters are created in the DFDL infoset.

If 'error', then a processing error occurs.

It is suggested that a DFDL user who wants to preserve data whose encodings have these kinds of errors model such data as xs:hexBinary, or as an xs:string using an encoding such as iso-8859-1, which preserves all bytes.

The following specific situations involving UTF-16, UTF-16LE, and UTF-16BE do not cause an encoding error:

- *utf-16 and unpaired surrogate code-point*
- *utf-16 and out-of-order surrogate code-point pair*
- *utf-16 encoding utf16Width='fixed' and a surrogate code point is encountered*
- *utf-16 byte-order-marks found not at the beginning of the data*

In all these cases the code-point becomes part of the DFDL Information Item for the string.

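As a rough illustration of the three decoding policies (not part of the proposed errata text), Python's built-in codec error handlers behave much like 'error', 'skip', and 'replace'. The mapping table and the helper name decode_with_policy below are assumptions made for this sketch. Python's codecs are not ICU and not a DFDL processor, so the details may differ from what an implementation would do.

```python
# Illustration only: Python's built-in codec error handlers stand in for the
# three proposed policy values. This is not a DFDL processor.
POLICY_TO_HANDLER = {"error": "strict", "skip": "ignore", "replace": "replace"}


def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Decode data, handling malformed input according to the given policy."""
    return data.decode(encoding, errors=POLICY_TO_HANDLER[policy])


broken = b"ab\xffcd"      # case 1: 0xFF never occurs in well-formed UTF-8
fragment = b"ab\xe2\x82"  # case 2: first two bytes of a three-byte sequence

for policy in ("replace", "skip"):
    print(policy,
          repr(decode_with_policy(broken, "utf-8", policy)),
          repr(decode_with_policy(fragment, "utf-8", policy)))

try:
    decode_with_policy(broken, "utf-8", "error")
except UnicodeDecodeError as exc:
    print("error:", exc.reason)

# iso-8859-1 maps every byte value to a character, so the original bytes
# can be recovered by encoding the string again.
as_text = broken.decode("iso-8859-1")
print(as_text.encode("iso-8859-1") == broken)
```

The last two lines show why modelling the data as a string in iso-8859-1 preserves everything: every byte value decodes to some character, so re-encoding returns the original bytes.
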
*For Unparsing/Encoding Errors*

The following are kinds of errors when unparsing characters:

1. No mapping is provided by the encoding specification.
2. There is not enough room to output the entire encoding of the character (e.g., 2 bytes are needed for a DBCS character, but only 1 byte remains in the available length).

The behavior in these cases is controlled by dfdl:encodingErrorPolicy.

If the policy is 'error', then a processing error occurs.

If the policy is 'skip', then the character is skipped. No character is encoded to be output for case 1, and no partial character is attempted in case 2.

If the policy is 'replace', then the behavior is determined by the encoding specification. Each encoding has a replacement/substitution character specified by the ICU. (These can be found conveniently in the *ICU Converter Explorer* <http://demo.icu-project.org/icu-bin/convexp>.) This character is substituted for the unmapped character or the character that has too large an encoding.

It is a processing error if it is not possible to output the replacement character because there is not enough room for its representation.

It is a processing error if a character encoding does not provide a substitution/replacement character definition and one is needed because of dfdl:encodingErrorPolicy='replace'. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)
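
The unparsing rules can be sketched the same way. This is a minimal sketch under stated assumptions, not how a DFDL implementation has to work. The names unparse_fixed_length and ProcessingError are hypothetical. The substitute defaults to '?' instead of being looked up in ICU. Characters are encoded one at a time, which only works for stateless encodings. Under 'skip', later characters that still fit are written, which the text above does not spell out.

```python
class ProcessingError(Exception):
    """Stand-in for a DFDL processing error (hypothetical name)."""


def unparse_fixed_length(text, encoding, max_bytes, policy, substitute="?"):
    """Encode text into at most max_bytes bytes under the given policy.

    Assumes a stateless encoding so characters can be encoded one at a time.
    """
    if policy not in ("error", "skip", "replace"):
        raise ValueError(f"unknown policy: {policy!r}")
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
            # Case 2: the character has a mapping but does not fit.
            fits = len(out) + len(encoded) <= max_bytes
            problem = None if fits else "no room"
        except UnicodeEncodeError:
            # Case 1: the encoding has no mapping for this character.
            problem = "no mapping"
        if problem is None:
            out += encoded
        elif policy == "error":
            raise ProcessingError(f"{problem} for {ch!r}")
        elif policy == "replace":
            sub = substitute.encode(encoding)
            if len(out) + len(sub) > max_bytes:
                raise ProcessingError("no room for the replacement character")
            out += sub
        # policy == "skip": emit nothing for this character
    return bytes(out)


# Case 1: U+00E9 has no mapping in ASCII.
print(unparse_fixed_length("a\u00e9b", "ascii", 10, "skip"))
print(unparse_fixed_length("a\u00e9b", "ascii", 10, "replace"))

# Case 2: U+3042 needs 2 bytes in Shift_JIS, but only 1 byte remains.
print(unparse_fixed_length("a\u3042", "shift_jis", 2, "replace"))
```

With max_bytes=1 the last call would raise ProcessingError, because even the one-byte substitute no longer fits. That matches the rule that failing to output the replacement character is itself a processing error.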

Mike - one bug noted - (dfdl:inputEncodingErrorPolicy should be just dfdl:encodingErrorPolicy) - otherwise good.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org
Date: 16/01/2012 22:47
Subject: Proposed Errata Language: Issue 156 - ICU fallback mappings - character encoding/decoding errors