
*Re: Errata v11 3.13*

*I concluded that this needs to be rewritten entirely, and that the policy we adopt has to allow for a reasonable implementation of scanning that doesn't require re-inventing the entire world of character I/O. I believe no changes are needed to section 4.1.2, so this errata is no longer about that section, only about section 11. Suggested revised errata 3.13 text follows:*

The encodingErrorPolicy property specified in Errata v10 is removed. A new sub-section is added to section 11. *(This is probably 11.2, if 11.1 is about Unicode byte order marks.)*

11.2 Character Encoding and Decoding Errors

When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646:

1. The data is broken: invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
2. Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.

When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding:

1. No mapping for the character is provided by the encoding specification.
2. Not enough room to output the entire encoding of the character (e.g., the character's encoding requires 3 bytes, but only 1 byte remains in the available length).

The subsections below describe how these errors are handled.

11.2.1 Parsing: Decoding Errors When Scanning

Scanning is the searching of text for data matching:

- a delimiter (this can be with or without an associated escape scheme)
- initiators, terminators, and separators
- a literal nil value (nilKind of 'literalValue' or 'literalCharacter')
- a regular expression
  - for lengthKind='pattern'
  - for dfdl:assert or dfdl:discriminator with testKind='pattern'

In all cases, if scanning encounters a decode error, then it stops with the successfully decoded character immediately preceding the decode-error.
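As a rough illustration of this stop-at-the-error rule, the Python sketch below limits a scan to the text that decodes cleanly. The helper names `decodable_prefix` and `scan_for_delimiter` are hypothetical and are not part of DFDL; Python's strict decoder merely stands in for whatever decoder a DFDL implementation uses, and a real scanner would decode incrementally rather than all at once.

```python
import re


def decodable_prefix(data: bytes, encoding: str = "utf-8") -> str:
    """Return the text decoded from data up to, but not including, the first decode error."""
    try:
        return data.decode(encoding, errors="strict")
    except UnicodeDecodeError as err:
        # err.start is the byte offset of the first undecodable byte.
        return data[:err.start].decode(encoding, errors="strict")


def scan_for_delimiter(data: bytes, delimiter: str, encoding: str = "utf-8"):
    """Return the content before delimiter, or None if it is not found before a decode error."""
    text = decodable_prefix(data, encoding)
    index = text.find(delimiter)
    return None if index < 0 else text[:index]


# Delimiter found before the bad byte 0xFF: the scan succeeds.
print(scan_for_delimiter(b"ab,cd\xff", ","))   # ab
# The bad byte precedes the delimiter: the scan stops and never reaches it.
print(scan_for_delimiter(b"ab\xffcd,", ","))   # None
# "." matches any character, yet the match cannot extend across the decode error.
print(re.match(r".*", decodable_prefix(b"ab\xffcd")).group(0))   # ab
```

Under these assumptions a truncated multi-byte sequence at the end of the data behaves the same way as an invalid byte: the scan sees only the characters before the fragment.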
This means that neither framing nor content that is scanned can ever encompass data that caused decode errors. It is worth noting specifically that the "." regular expression character, which matches any character, does not allow a regular expression to subsume a decode error. It is also worth noting that the Unicode Replacement Character (U+FFFD) can appear in data as an ordinary character, but the scanning process will never create one from a decode error.

11.2.2 Parsing: Decode Errors in Elements of Specified Length

For elements with specified lengths (see glossary), the content region is converted to text, and any decode error in that conversion results in the non-decodable part of the data being replaced by one or more Unicode Replacement Characters (U+FFFD). If lengthUnits='characters', then each such Unicode Replacement Character counts as contributing a single character to the length. The trimming of padding characters happens after this replacement.

11.2.2.2 Parsing: The Not-Enough-Data Decode Error

There is one special case for the 'not enough data' decode error: lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to decode a character from these left-over bytes. They are skipped when parsing (and filled with the dfdl:fillByte when unparsing).

11.2.3 Parsing: Unicode Decoding Non-Errors

The following specific situations involving UTF-16, UTF-16LE, and UTF-16BE do not cause a decoding or encoding error:
• UTF-16 and an unpaired surrogate code-point
• UTF-16 and an out-of-order surrogate code-point pair
• UTF-16 encoding with utf16Width='fixed' and a surrogate code-point is encountered
• UTF-16 byte-order-marks found other than at the beginning of the data

In all these cases the code-point becomes a character code in the DFDL Information Item for the string.

11.2.4 Unparsing: Substitution Character

For unparsing, each encoding has a replacement/substitution character specified by ICU. This character is substituted for the unmapped character or for the character whose encoding is too large to fit. The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer (http://demo.icu-project.org/icu-bin/convexp).

An encoding error is a processing error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)

11.2.4.1 Unparsing: The Not-Enough-Room Encoding Error

There is one special case for the 'not enough room' encoding error: lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes of space left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to encode a character into these left-over bytes. They are filled with the dfdl:fillByte. (On parsing they are skipped.)

11.2.5 Preserving Data Containing Decoding Errors

There can be situations where data must be preserved exactly even if it contains errors. It is suggested that a DFDL schema author who wants to preserve data that may contain decoding errors model such data as xs:hexBinary, or as xs:string using an encoding such as iso-8859-1, which preserves all bytes.
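The difference between replacement and preservation can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, Python's `errors="replace"` handler stands in for the U+FFFD substitution described in 11.2.2 (the number of replacement characters it emits per bad sequence is Python's choice, not something this text mandates), and the hex string stands in for an xs:hexBinary value.

```python
def decode_with_replacement(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode raw, substituting U+FFFD for each undecodable part (lossy)."""
    return raw.decode(encoding, errors="replace")


def decode_preserving_bytes(raw: bytes) -> str:
    """Decode raw as iso-8859-1, which maps every byte value to a code point (lossless)."""
    return raw.decode("iso-8859-1")


def to_hex_binary(raw: bytes) -> str:
    """Render raw as an upper-case hex string, as an xs:hexBinary value might appear."""
    return raw.hex().upper()


raw = bytes([0x41, 0xFF, 0xFE, 0x42])  # 'A', two bytes invalid in UTF-8, 'B'

lossy = decode_with_replacement(raw)
print(lossy.encode("utf-8") == raw)  # False: the original bytes cannot be recovered

text = decode_preserving_bytes(raw)
print(text.encode("iso-8859-1") == raw)  # True: the round trip is exact

print(bytes.fromhex(to_hex_binary(raw)) == raw)  # True: the hex form is exact as well
```

The iso-8859-1 route keeps the data as a string at the cost of the characters no longer meaning what the original encoding intended; the hexBinary route makes no claim about characters at all.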