3.13 on encoding errors - rewording - was Re: Second draft of DFDL Errata v011

*Re: Errata v11 3.13 I concluded that this needs a rewrite entirely, and that the policy we adopt has to allow for a reasonable implementation of scanning that doesn't require re-invention of the entire world of character I/O. I believe no changes are needed to section 4.1.2, so this erratum is no longer about that section, only about section 11. Suggested revised errata 3.13 text follows:*
The encodingErrorPolicy property specified in Errata v10 is removed.
A new sub-section is added to section 11. *(this is probably 11.2, if 11.1 is about Unicode byte order marks)*
11.2 Character Encoding and Decoding Errors
When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646.
1. The data is broken - invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
2. Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.
When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding.
1. No mapping provided by the encoding specification.
2. Not enough room to output the entire encoding of the character (e.g., need 3 bytes for a character encoding that uses 3 bytes for that character, but only 1 byte remains in the available length).
The subsections below describe how these errors are handled.
11.2.1 Parsing: Decoding Errors When Scanning
Scanning is searching of text for data matching:
- a delimiter (this can be with or without an associated escape scheme)
  - initiators, terminators, and separators
- a literal nil value (nilKind of 'literalValue' or 'literalCharacter')
- a regular expression
  - for lengthKind='pattern'
  - for dfdl:assert or dfdl:discriminator with testKind='pattern'
In all cases, if scanning encounters a decode error, then it stops with the successfully decoded character immediately preceding the decode error.
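The stop-before-the-error rule for scanning can be pictured with a small Python sketch. This is only an illustration and not part of the proposed errata text: the helper names `decodable_prefix` and `scan_for_delimiter` are hypothetical, and a real scanner would presumably decode incrementally rather than decoding the whole buffer at once.

```python
def decodable_prefix(data: bytes, encoding: str) -> str:
    """Return the text decoded from data up to, but not including,
    the first decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte,
        # so everything before it decodes cleanly.
        return data[:err.start].decode(encoding)


def scan_for_delimiter(data: bytes, encoding: str, delimiter: str) -> int:
    """Return the character index of delimiter, or -1 if it does not occur
    before the end of the data or before the first decode error."""
    return decodable_prefix(data, encoding).find(delimiter)
```

With the input `b"abc,\xffdef;"` in UTF-8, the scan can find the "," but never sees the ";", because the undecodable byte 0xFF ends the scannable text.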
This means that neither framing nor content that is scanned can ever encompass data that caused decode errors. It is worth specific note that the "." regular expression character, which matches any character, does not allow a regular expression to subsume a decode error.
It is also worth noting that the Unicode Replacement Character (U+FFFD) can appear in data as an ordinary character, but the scanning process will never create these from decode errors.
11.2.2 Parsing: Decode Errors in Elements of Specified Length
For elements with specified lengths (see glossary), the content region is processed, and decode errors in conversion of this content region to text will result in the replacement of the non-decodable part of the data by one or more Unicode Replacement Characters (U+FFFD). If lengthUnits='characters', then such a Unicode Replacement Character counts as contributing a single character to the length. The trimming of padding characters happens after this replacement.
11.2.2.2 Parsing: The Not-Enough-Data Decode Error
There is one special case for the 'not enough data' decode error. It applies for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to decode a character from these left-over bytes. They are skipped when parsing (and filled with the dfdl:fillByte on unparsing).
11.2.3 Parsing: Unicode Decoding Non-Errors
The following specific situations involving UTF-16, UTF-16LE, and UTF-16BE do not cause a decoding or encoding error.
• UTF-16 and unpaired surrogate code-point
• UTF-16 and out-of-order surrogate code-point pair
• UTF-16 encoding utf16Width='fixed' and a surrogate code point is encountered
• UTF-16 byte-order-marks found not at the beginning of the data
In all these cases the code-point becomes a character code in the DFDL Information Item for the string.
11.2.4 Unparsing: Substitution Character
For unparsing, each encoding has a replacement/substitution character specified by the ICU. This character is substituted for the unmapped character or the character that has too large an encoding.
The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer ( http://demo.icu-project.org/icu-bin/convexp).
An encoding error is a processing error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)
11.2.4.1 Unparsing: The Not-Enough-Room Encoding Error
There is one special case for the 'not enough room' encoding error. It applies for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes of space left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to encode a character into these left-over bytes. They are filled with the dfdl:fillByte. (On parsing they are skipped.)
11.2.5 Preserving Data Containing Decoding Errors
There can be situations where data needs to be preserved exactly even if it contains errors.
It is suggested that a DFDL schema author who wants to preserve data that may contain decoding errors model such data as xs:hexBinary, or as xs:string but using an encoding such as iso-8859-1 which preserves all bytes.

New further simplified proposal for rewrite of errata 3.13. Goal: allow simple implementations where a flag is simply set in the decoder/encoder. Dropped the problematic 'skip' option. I also incorporated a point about UTF-16 errors from a prior email on this topic from Tim. I added a point about how many Unicode Replacement Characters are inserted when the data contains multiple adjacent errors. Basically, I made this implementation dependent. There are two TBDs in the text below that I'd like to resolve.
---------------------------------------------------------
Errata 3.13 (Revised)
A new sub-section is added to section 11. *(this is probably 11.2, if 11.1 is about Unicode byte order marks)*
11.2 Character Encoding and Decoding Errors
When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646.
1. The data is broken - invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
2. Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.
When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding.
1. No mapping provided by the encoding specification.
2. Not enough room to output the entire encoding of the character (e.g., need 3 bytes for a character encoding that uses 3 bytes for that character, but only 1 byte remains in the available length).
The subsections below describe how these errors are handled.
11.2.1 property dfdl:encodingErrorPolicy
The property dfdl:encodingErrorPolicy has two possible values: 'error' and 'replace'.
11.2.1.1 dfdl:encodingErrorPolicy='error'
If 'error', then any error when decoding characters while parsing causes a parse error. For unparsing, any error when encoding characters causes an unparse error.
When parsing, it does not matter if this happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.
11.2.1.2 dfdl:encodingErrorPolicy='replace' for Parsing
If 'replace' then any error results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.
It does not matter if this error and replacement happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element. Note however, that unless a DFDL schema specifically uses the Unicode Replacement Character in a delimiter or nil value, then this character is certain to not match.
*TBD: I believe we should disallow using the Unicode Replacement Character in a delimiter, as a pad char, in a pattern, as a feature character (text number group separator, in the text representation of NaN, boolean true, calendarPattern, etc.) We could simply say it is an SDE if this appears in any of these locations.*
Note that the "." wildcard in regular expressions will match the Unicode Replacement Character, so ".*" and ".+" regular expressions can potentially cause very large matches (up to the entire data stream) to occur when data contains errors and dfdl:encodingErrorPolicy='replace'. Bounded length regular expressions can help in this case. E.g., ".{0,50}" says to match any character (including Unicode Replacement Characters), but only up to length 50.
It is also worth noting that the Unicode Replacement Character can appear in data as an ordinary character, and this cannot be distinguished from the insertion of the Unicode Replacement Character due to a decode error.
If lengthUnits='characters', then a Unicode Replacement Character counts as contributing a single character to the length.
If the data contains more than one adjacent decode error, then the specific number of Unicode Replacement Characters that are inserted as the replacement of these errors is implementation dependent. That is, some implementations may view, for example, three consecutive erroneous bytes as three separate decode errors, while others may view them as one or two decode errors. All implementations MUST, however, insert some number of Unicode Replacement Characters, and then continue to decode characters following the erroneous data.
The trimming of padding characters always happens after Unicode Replacement Characters have been inserted into the data.
11.2.1.3 dfdl:encodingErrorPolicy='replace' for Unparsing
For unparsing, each encoding has a replacement/substitution character specified by the ICU. This character is substituted for the unmapped character or the character that has too large an encoding to fit in the available space.
The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer ( http://demo.icu-project.org/icu-bin/convexp).
An encoding error is an unparse error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)
*TBD: should we rule out this case by providing some default mapping that can always be used? E.g., in the above corner case '?' is used as the substitution character.*
11.2.1.4 Parsing: The Not-Enough-Data Decode Error
There is one special case for the 'not enough data' decode error. It applies for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes left over at the end of the data which are insufficient to hold an entire character code.
In this case no attempt is made to decode a character from these left-over bytes. They are skipped when parsing (and filled with the dfdl:fillByte on unparsing).
11.2.1.5 Parsing: Unicode Decoding Non-Errors
The following specific situations involving encodings UTF-16, UTF-16LE, and UTF-16BE when utf16Width="fixed" do not cause a decoding or encoding error.
• unpaired surrogate code-point
• out-of-order surrogate code-point pair
• surrogate code point pair is encountered
In all these cases the code-point(s) become character codes in the DFDL Information Item for the string.
11.2.1.6 Unparsing: The Not-Enough-Room Encoding Error
There is one special case for the 'not enough room' encoding error. It applies for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes of space left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to encode a character into these left-over bytes. They are filled with the dfdl:fillByte. (On parsing they are skipped.)
11.2.2 Preserving Data Containing Decoding Errors
There can be situations where data needs to be preserved exactly even if it contains errors.
It is suggested that a DFDL schema author who wants to preserve data that may contain decoding errors model such data as xs:hexBinary, or as xs:string but using an encoding such as iso-8859-1 which preserves all bytes.
-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

Revised per call on 2013-01-16. TBDs resolved. Language about lengthUnits='bytes' and fragment characters at the end changed to drop the requirement of fixed-width characters. Since these rules are simpler now, I moved them directly into the primary sections describing the functionality rather than using sub-sections at the end. I will fold this into errata v11 (when I get it from Steve), to create v11.1.
---------------------------------------------------------
Errata 3.13 (Revised)
A new sub-section is added to section 11. *(this is probably 11.2, if 11.1 is about Unicode byte order marks)*
11.2 Character Encoding and Decoding Errors
When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646.
1. The data is broken - invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
2. Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.
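To make the two decode errors concrete, here is a minimal Python sketch that tells them apart using the standard `codecs` incremental decoder. The function name `classify_decode_error` and its return labels are invented for this illustration; how a DFDL processor detects the two cases is up to the implementation.

```python
import codecs


def classify_decode_error(data: bytes, encoding: str = "utf-8") -> str:
    """Return 'ok', 'invalid' (error 1: broken data), or
    'fragment' (error 2: data ends part-way through a character)."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    try:
        # final=False: an incomplete trailing character is buffered,
        # so only genuinely invalid sequences raise here.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"
    try:
        # final=True: anything still buffered is an incomplete character.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "fragment"
    return "ok"
```

For UTF-8, `b"\xe2\x82\xac"` (the euro sign) is classified as 'ok', `b"\xff"` as 'invalid', and `b"\xe2\x82"` (the euro sign with its last byte missing) as 'fragment'.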
When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding.
1. No mapping provided by the encoding specification.
2. Not enough room to output the entire encoding of the character (e.g., need 3 bytes for a character encoding that uses 3 bytes for that character, but only 1 byte remains in the available length).
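The two encoding errors can be sketched in Python as follows. This is an illustration under stated assumptions: `encode_into`, `available_bytes`, and the `NotEnoughRoom` exception are hypothetical names, and the per-character loop assumes an encoding that does not emit a byte order mark on each call (so "utf-8" or "utf-16-le", not "utf-16").

```python
class NotEnoughRoom(Exception):
    """Hypothetical error for case 2: the character does not fit."""


def encode_into(text: str, encoding: str, available_bytes: int) -> bytes:
    """Encode text character by character into at most available_bytes."""
    out = bytearray()
    for ch in text:
        # Case 1: raises UnicodeEncodeError when the encoding has no mapping.
        encoded = ch.encode(encoding)
        # Case 2: the whole encoding of this character must fit.
        if len(out) + len(encoded) > available_bytes:
            remaining = available_bytes - len(out)
            raise NotEnoughRoom(
                f"need {len(encoded)} bytes for {ch!r}, only {remaining} remain"
            )
        out += encoded
    return bytes(out)
```

Encoding "a€" as ASCII hits case 1, while encoding it as UTF-8 into 2 bytes hits case 2: the euro sign needs 3 bytes but only 1 remains after the "a".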
The subsections below describe how these errors are handled.
11.2.1 property dfdl:encodingErrorPolicy
The property dfdl:encodingErrorPolicy has two possible values: 'error' and 'replace'.
11.2.1.1 dfdl:encodingErrorPolicy='error'
If 'error', then any error when decoding characters while parsing causes a parse error. For unparsing, any error when encoding characters causes an unparse error.
When parsing, it does not matter if this happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.
There is one exception. When lengthUnits='bytes', the 'not enough data' decode error is ignored, and the data making up the fragment character is skipped over. Symmetrically, when unparsing, the 'not enough room' encoding error is ignored and the left-over bytes are filled with the dfdl:fillByte.
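A rough Python sketch of the 'error' policy together with this exception, for an element whose length is given in bytes. The function names and the `fill_byte` parameter are hypothetical stand-ins (the latter for dfdl:fillByte); text that is longer than the available length is outside what this sketch tries to model.

```python
import codecs


def parse_text_bytes(data: bytes, encoding: str) -> str:
    """Strict decode: invalid data raises, but a trailing fragment of a
    character is silently skipped."""
    decoder = codecs.getincrementaldecoder(encoding)(errors="strict")
    # final=False leaves an incomplete trailing character in the decoder's
    # buffer instead of reporting it; discarding the decoder skips it.
    return decoder.decode(data, final=False)


def unparse_text_bytes(text: str, encoding: str, length: int,
                       fill_byte: int = 0x20) -> bytes:
    """Strict encode into exactly length bytes, filling left-over bytes."""
    encoded = text.encode(encoding)  # raises UnicodeEncodeError if unmappable
    if len(encoded) > length:
        raise ValueError("text does not fit; not handled in this sketch")
    return encoded + bytes([fill_byte]) * (length - len(encoded))
```

For example, five bytes of UTF-16LE data hold two characters plus one left-over byte: parsing yields the two characters, and unparsing them back into five bytes writes the fill byte into the last position.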
11.2.1.2 dfdl:encodingErrorPolicy='replace' for Parsing
If 'replace' then any error results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.
It does not matter if this error and replacement happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.
There is one exception. When lengthUnits='bytes', the 'not enough data' decode error is ignored, and no replacement character is created. The data making up the fragment character is skipped over. (It will be filled with the dfdl:fillByte when unparsing.)
The Unicode Replacement Character must not appear in any delimiter, padCharacter, nilValue, regular expression, textNumberPattern, or in any other property value or test pattern where the Unicode Replacement Character would be expected in the data being parsed. It is a schema definition error if the Unicode Replacement Character appears in any of these locations of a DFDL schema, or is part of the value of an expression that returns a string to be used as the value of a DFDL property.
Note that the "." wildcard in regular expressions will match the Unicode Replacement Character, so ".*" and ".+" regular expressions can potentially cause very large matches (up to the entire data stream) to occur when data contains errors and dfdl:encodingErrorPolicy='replace'. Bounded length regular expressions can help in this case. E.g., ".{0,50}" says to match any character (including Unicode Replacement Characters), but only up to length 50.
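The effect of bounded versus unbounded patterns on replaced data can be seen in a few lines of Python. The sketch assumes that Python's `re` module behaves like the DFDL regular expression dialect for these simple patterns, and the input is made up for the illustration.

```python
import re

REPLACEMENT = "\ufffd"

# Two good characters followed by 100 bytes that are invalid in UTF-8.
decoded = (b"ok" + b"\xff" * 100).decode("utf-8", errors="replace")

# "." matches the replacement character, so ".*" runs to the end of the data.
unbounded = re.match(r".*", decoded)

# The bounded form stops after at most 50 characters.
bounded = re.match(r".{0,50}", decoded)

print(len(unbounded.group()) == len(decoded))  # the whole input was matched
print(len(bounded.group()))                    # 50
```

Both matches contain replacement characters; only the bounded one limits how much of the damaged data is consumed.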
It is also worth noting that the Unicode Replacement Character can appear in data as an ordinary character, and this cannot be distinguished from the insertion of the Unicode Replacement Character due to a decode error.
If lengthUnits='characters', then a Unicode Replacement Character counts as contributing a single character to the length.
If the data contains more than one adjacent decode error, then the specific number of Unicode Replacement Characters that are inserted as the replacement of these errors is implementation dependent. That is, some implementations may view, for example, three consecutive erroneous bytes as three separate decode errors, while others may view them as one or two decode errors. All implementations MUST, however, insert some number of Unicode Replacement Characters, and then continue to decode characters following the erroneous data.
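As an illustration of what a schema author can and cannot rely on, the following Python sketch decodes data with three adjacent bad bytes. Python's built-in `errors="replace"` handler is used only as one example decoder; the helper name is hypothetical, and the count it produces is that decoder's choice, not something the text above guarantees.

```python
def decode_with_replacement(data: bytes, encoding: str = "utf-8") -> str:
    """Decode, substituting U+FFFD for undecodable data."""
    return data.decode(encoding, errors="replace")


text = decode_with_replacement(b"abc\xff\xfe\xfddef")
replacement_count = text.count("\ufffd")

# Portable expectations: at least one replacement character is inserted,
# and decoding resumes after the erroneous bytes.
print(replacement_count >= 1)
print(text.startswith("abc") and text.endswith("def"))
```

A schema or test that depends on the exact value of `replacement_count` would be relying on implementation-dependent behavior.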
The trimming of padding characters always happens after Unicode Replacement Characters have been inserted into the data.
11.2.1.3 dfdl:encodingErrorPolicy='replace' for Unparsing
For unparsing, each encoding has a replacement/substitution character specified by the ICU. This character is substituted for the unmapped character or the character that has too large an encoding to fit in the available space.
There is one exception. When lengthUnits='bytes', the 'not enough room' encoding error is ignored. The left-over bytes are filled with the dfdl:fillByte. (They are skipped when parsing.)
The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer ( http://demo.icu-project.org/icu-bin/convexp).
An encoding error is an unparse error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)
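The substitution rule for unparsing might be sketched in Python like this. The `substitution` argument stands in for whatever bytes the encoding's ICU definition supplies; it is passed in explicitly because Python's own `errors="replace"` handler uses "?" for the standard codecs, which is not necessarily the ICU substitution character for a given encoding. The function and exception names are hypothetical.

```python
from typing import Optional


class UnparseError(Exception):
    """Hypothetical error raised when no substitution is defined."""


def encode_with_substitution(text: str, encoding: str,
                             substitution: Optional[bytes]) -> bytes:
    """Encode text, writing substitution for each unmappable character."""
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if substitution is None:
                # No substitution defined for this encoding: unparse error.
                raise UnparseError(f"no substitution available for {ch!r}")
            out += substitution
    return bytes(out)
```

For instance, `encode_with_substitution("a€b", "ascii", b"?")` yields `b"a?b"`, the same bytes as Python's built-in `"a€b".encode("ascii", errors="replace")`.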
11.2.1.4 Parsing: Unicode Decoding Non-Errors
The following specific situations involving encodings UTF-16, UTF-16LE, and UTF-16BE when utf16Width="fixed" do not cause a decoding or encoding error.
• unpaired surrogate code-point
• out-of-order surrogate code-point pair
• surrogate code point pair is encountered
In all these cases the code-point(s) become character codes in the DFDL Information Item for the string.
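One way to picture fixed-width UTF-16 decoding is to treat every 16-bit code unit as a character code of its own, with no surrogate pairing at all. The Python sketch below does exactly that; `decode_utf16_fixed` is a hypothetical helper, and the `byteorder` argument uses the `struct` module's "<" (little-endian) and ">" (big-endian) markers.

```python
import struct


def decode_utf16_fixed(data: bytes, byteorder: str = "<") -> str:
    """Map each 16-bit code unit to one character code, surrogates included."""
    count = len(data) // 2
    units = struct.unpack(f"{byteorder}{count}H", data[:count * 2])
    return "".join(chr(unit) for unit in units)


# 'A', an unpaired high surrogate (U+D800), then 'B', in UTF-16LE.
unpaired = decode_utf16_fixed(b"A\x00\x00\xd8B\x00")

# A well-formed surrogate pair (U+D83D U+DE00) stays as two character codes.
pair = decode_utf16_fixed(b"\x3d\xd8\x00\xde")

print(len(unpaired), len(pair))  # 3 2
```

None of the three listed situations raises an error here, because the decoder never tries to combine or validate surrogates.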
11.2.2 Preserving Data Containing Decoding Errors
There can be situations where data needs to be preserved exactly even if it contains errors.
It is suggested that a DFDL schema author who wants to preserve data that may contain decoding errors model such data as xs:hexBinary, or as xs:string but using an encoding such as iso-8859-1 which preserves all bytes.
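Both suggestions rest on the fact that the representation is lossless for every byte value, which is easy to check in Python. The helper names below are invented for the illustration; `bytes.hex` plays the role of the xs:hexBinary lexical form (modulo letter case).

```python
def preserve_as_latin1(data: bytes) -> str:
    """Every byte value 0x00-0xFF maps to exactly one iso-8859-1 character."""
    return data.decode("iso-8859-1")


def preserve_as_hex(data: bytes) -> str:
    """Hex text in the spirit of xs:hexBinary."""
    return data.hex().upper()


# All 256 byte values, including ones that are invalid in UTF-8.
raw = bytes(range(256))

print(preserve_as_latin1(raw).encode("iso-8859-1") == raw)  # True
print(bytes.fromhex(preserve_as_hex(raw)) == raw)           # True
```

Neither path can hit a decode error, so nothing in the original data is replaced or skipped.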
-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com