[DFDL-WG] 3.13 on encoding errors - rewording - was Re: Second draft of DFDL Errata v011

4 Dec 2012

Re: Errata v11 3.13

I concluded that this needs to be rewritten entirely, and that the policy we
adopt has to allow for a reasonable implementation of scanning that doesn't
require re-invention of the entire world of character I/O.

I believe no changes are needed to section 4.1.2, so this erratum is no
longer about that section, only about section 11.

Suggested revised errata 3.13 text follows:

The encodingErrorPolicy property specified in Errata v10 is removed.

A new sub-section is added to section 11. *(this is probably 11.2, if 11.1
is about Unicode byte order marks)*

11.2    Character Encoding and Decoding Errors

When parsing, these are the errors that can occur when decoding characters
into Unicode/ISO 10646.

1.    The data is broken - invalid bit/byte sequences are found which do
not match the definition of a character for the encoding.
2.    Not enough data is found to make up the entire encoding of a
character. That is, a fragment of a valid encoding is found.
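
The two decode-error cases can be told apart mechanically. The following is a minimal Python sketch, not part of the proposed spec text. The function name classify_decode_error is hypothetical, and the sketch relies on Python's incremental decoders, which buffer an incomplete trailing sequence until they are told the input is final.

```python
import codecs


def classify_decode_error(data: bytes, encoding: str) -> str:
    """Return 'ok', 'invalid', or 'truncated' for the given bytes.

    'invalid'   corresponds to case 1 (broken data).
    'truncated' corresponds to case 2 (a fragment of a valid encoding).
    """
    decoder = codecs.getincrementaldecoder(encoding)()
    try:
        # With final=False an incomplete trailing sequence is buffered,
        # so only genuinely malformed bytes raise here.
        decoder.decode(data, final=False)
    except UnicodeDecodeError:
        return "invalid"
    try:
        # Declaring end-of-input turns any buffered fragment into an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return "truncated"
    return "ok"
```

For example, the single byte 0xFF is never valid in UTF-8 and is classified as 'invalid', while the first two bytes of the three-byte encoding of the euro sign are classified as 'truncated'.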

When unparsing, these are the errors that can occur when encoding
characters from Unicode/ISO 10646 into the specified encoding.

1.    No mapping provided by the encoding specification.
2.    Not enough room to output the entire encoding of the character (e.g.,
3 bytes are needed for a character whose encoding uses 3 bytes, but only
1 byte remains in the available length).
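
As an illustration of the two encode-error cases, here is a small Python sketch. The function name encode_problem and the bytes_remaining parameter are hypothetical; a real DFDL unparser would track the available length from the schema rather than take it as an argument.

```python
def encode_problem(ch: str, encoding: str, bytes_remaining: int) -> str:
    """Return 'unmapped', 'no_room', or 'ok' for encoding one character."""
    try:
        encoded = ch.encode(encoding)
    except UnicodeEncodeError:
        # Case 1: the target encoding has no mapping for this character.
        return "unmapped"
    if len(encoded) > bytes_remaining:
        # Case 2: the character's encoding does not fit in the space left.
        return "no_room"
    return "ok"
```

The euro sign (U+20AC) has no mapping in US-ASCII, and in UTF-8 it needs 3 bytes, so it triggers the second case when only 1 byte remains.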

The subsections below describe how these errors are handled.

11.2.1  Parsing: Decoding Errors When Scanning

Scanning is searching of text for data matching:

   - a delimiter (this can be with or without an associated escape scheme)
      - initiators, terminators, and separators
      - a literal nil value (nilKind of 'literalValue' or
   'literalCharacter')
   - a regular expression
      - for lengthKind='pattern'
      - for dfdl:assert or dfdl:discriminator with testKind='pattern'

In all cases, if scanning encounters a decode error, then it stops at the
successfully decoded character immediately preceding the decode error.
This means that neither framing nor content that is scanned can ever
encompass data that caused decode errors.
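
One way to picture this rule is that the scanner only ever sees the text that decodes cleanly before the first error. The Python sketch below is an illustration under stated assumptions, not a prescribed implementation: the names scannable_prefix and scan_for_delimiter are hypothetical, the whole input is assumed to be in memory, and the encoding is assumed to be one such as UTF-8 where the bytes before the error position decode on their own.

```python
from typing import Optional


def scannable_prefix(data: bytes, encoding: str) -> str:
    """Decode data up to, but not including, the first decode error."""
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as err:
        # err.start is the byte offset where the undecodable data begins.
        return data[:err.start].decode(encoding)


def scan_for_delimiter(data: bytes, delimiter: str,
                       encoding: str = "utf-8") -> Optional[str]:
    """Return the content before the delimiter, or None if no match.

    The search is confined to the cleanly decoded prefix, so a delimiter
    that lies beyond a decode error is never found.
    """
    text = scannable_prefix(data, encoding)
    index = text.find(delimiter)
    return None if index < 0 else text[:index]
```

With this sketch, a separator that appears before the bad byte is found normally, while a separator that appears only after it is not found at all.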

It is worth noting specifically that the "." regular expression character,
which matches any character, does not allow a regular expression to subsume
a decode error.

It is also worth noting that the Unicode Replacement Character (U+FFFD) can
appear in data as an ordinary character, but the scanning process will
never create these from decode-errors.
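
The two notes above can be illustrated with Python's re module. This is a sketch only: scan_pattern is a hypothetical helper, UTF-8 is assumed, and the pattern is matched against the cleanly decoded prefix so that "." has nothing to match at the point of the decode error.

```python
import re
from typing import Optional


def scan_pattern(data: bytes, pattern: str,
                 encoding: str = "utf-8") -> Optional[str]:
    """Match pattern at the start of the decodable prefix of data."""
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as err:
        # Strict decoding: stop before the error, never substitute U+FFFD.
        text = data[:err.start].decode(encoding)
    match = re.match(pattern, text)
    return match.group(0) if match else None
```

Here the pattern ".*" applied to the bytes b"ab\xffcd" can match only "ab", whereas a U+FFFD that is genuinely present in the data (encoded as EF BF BD in UTF-8) is an ordinary character and is matched by "." like any other.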

11.2.2 Parsing: Decode Errors in Elements of Specified Length

For elements with specified lengths (see glossary), the content region is
processed and decode errors in conversion of this content region to text
will result in the replacement of the non-decodable part of the data by one
or more Unicode replacement characters (U+FFFD). If
lengthUnits='characters', then such a Unicode replacement character counts
as contributing a single character to the length.

The trimming of padding characters happens after this replacement.
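
A rough Python sketch of this behaviour follows. It assumes that Python's errors="replace" handler is an acceptable stand-in for the replacement described here (how many U+FFFD characters a given malformed sequence produces is left to the decoder), and that padding is on the right. The function name and the pad_char parameter are hypothetical.

```python
def parse_specified_length_text(content: bytes, encoding: str,
                                pad_char: str = " ") -> str:
    """Decode a content region of known length, then trim right padding."""
    # Step 1: undecodable data becomes U+FFFD instead of raising an error.
    text = content.decode(encoding, errors="replace")
    # Step 2: trimming operates on the already-replaced text.
    return text.rstrip(pad_char)
```

Because trimming follows replacement, a U+FFFD at the end of the content shields any pad characters before it from being trimmed.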

11.2.2.1 Parsing: The Not-Enough-Data Decode Error

There is one special case for the 'not enough data' decode error. It arises
for lengthUnits='bytes' when the encoding is a fixed-width character set (see
section 12.3.7.1.1 Character Width). If the number of bytes is not a
multiple of the character set width, then there will be some number of
bytes left over at the end of the data which are insufficient to hold an
entire character code. In this case no attempt is made to decode a
character from these left-over bytes. They are skipped when parsing (and
filled with the dfdl:fillByte on unparsing).
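
The left-over-bytes rule amounts to simple modular arithmetic on the byte length. The sketch below is illustrative only; decode_fixed_width and its parameters are hypothetical names, and UTF-16BE is used in the example as a 2-byte fixed-width encoding.

```python
from typing import Tuple


def decode_fixed_width(content: bytes, encoding: str,
                       char_width: int) -> Tuple[str, bytes]:
    """Decode whole characters only; return (text, skipped left-over bytes)."""
    # Largest multiple of char_width that fits in the content region.
    usable = len(content) - (len(content) % char_width)
    text = content[:usable].decode(encoding, errors="replace")
    # The trailing bytes are never handed to the decoder.
    return text, content[usable:]
```

For a 5-byte region holding 2-byte characters, 4 bytes are decoded as two characters and the final byte is skipped.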

11.2.3  Parsing: Unicode Decoding Non-Errors

The following specific situations involving UTF-16, UTF-16LE, and UTF-16BE
do not cause a decoding or encoding error.
•    UTF-16 and unpaired surrogate code-point
•    UTF-16 and out-of-order surrogate code-point pair
•    UTF-16 encoding utf16Width='fixed' and a surrogate code point is
encountered
•    UTF-16 byte-order-marks found not at the beginning of the data

In all these cases the code-point becomes a character code in the DFDL
Information Item for the string.
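
For readers who want to experiment, Python's "surrogatepass" error handler gives behaviour that resembles these non-error cases for UTF-16BE. This is an analogy only, not a statement of how a DFDL processor must be built; the function name is hypothetical.

```python
def decode_utf16be_keeping_surrogates(data: bytes) -> str:
    """Decode UTF-16BE, passing lone surrogates through as code points."""
    # "surrogatepass" keeps an unpaired or out-of-order surrogate as a code
    # point in the result instead of raising UnicodeDecodeError. The
    # utf-16-be codec also leaves U+FEFF in place wherever it occurs.
    return data.decode("utf-16-be", errors="surrogatepass")
```

With this handler an unpaired high surrogate, a low-then-high surrogate pair, and a byte-order mark in the middle of the data all appear in the resulting string as individual code points.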

11.2.4   Unparsing: Substitution Character

For unparsing, each encoding has a replacement/substitution character
specified by ICU. This character is substituted for the unmapped
character or for a character whose encoding is too large to fit.

The definitions of these substitution characters can be conveniently found
for many encodings in the ICU Converter Explorer (
http://demo.icu-project.org/icu-bin/convexp).

An encoding error is a processing error if the encoding does not provide a
substitution/replacement character definition. (This would be rare, but
could occur if a DFDL implementation allows many encodings beyond the
minimum set.)
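
The substitution rule, including the processing-error case, might be sketched as follows in Python. The function name is hypothetical, and the substitution bytes are supplied by the caller as a stand-in for whatever the encoding's ICU definition provides; the sketch encodes one character at a time, which assumes a stateless encoding without a byte-order mark.

```python
from typing import Optional


def encode_with_substitution(text: str, encoding: str,
                             substitution: Optional[bytes]) -> bytes:
    """Encode text, replacing unmapped characters with substitution bytes.

    substitution=None models an encoding with no substitution character
    defined, in which case an unmapped character is a processing error.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if substitution is None:
                raise ValueError(
                    "processing error: no substitution character defined")
            out += substitution
    return bytes(out)
```

In the tests for this sketch the value 0x1A is passed as the substitution byte purely for illustration.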

11.2.4.1 Unparsing: The Not-Enough-Room Encoding Error

There is one special case for the 'not enough room' encoding error. It arises
for lengthUnits='bytes' when the encoding is a fixed-width character set (see
section 12.3.7.1.1 Character Width). If the number of bytes is not a
multiple of the character set width, then there will be some number of
bytes of space left over at the end of the data which are insufficient to
hold an entire character code. In this case no attempt is made to encode a
character into these left-over bytes. They are filled with the
dfdl:fillByte. (On parsing they are skipped.)
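
The unparsing side of the left-over-bytes rule can be sketched in the same way. The names below are hypothetical, text padding is out of scope, and the sketch simply fills everything after the encoded text with the fill byte.

```python
def unparse_fixed_width(text: str, encoding: str, char_width: int,
                        length_bytes: int, fill_byte: int) -> bytes:
    """Encode text into a region of length_bytes, filling the remainder."""
    # Only a whole number of characters can be written into the region.
    usable = length_bytes - (length_bytes % char_width)
    encoded = text.encode(encoding)
    if len(encoded) > usable:
        raise ValueError("text does not fit in the whole-character space")
    # Left-over bytes (and any unused space) receive the fill byte.
    return encoded + bytes([fill_byte]) * (length_bytes - len(encoded))
```

For a 5-byte region and a 2-byte encoding, two characters occupy 4 bytes and the fifth byte is written as the fill byte.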

11.2.5    Preserving Data Containing Decoding Errors

There can be situations where data needs to be preserved exactly even if it
contains errors.

It is suggested that a DFDL schema author who wants to preserve data that
may contain decoding errors model such data as xs:hexBinary, or as xs:string
with an encoding such as iso-8859-1, which preserves all bytes.
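
The reason iso-8859-1 works for this purpose is that every one of the 256 byte values maps to exactly one character and back. A quick Python check of that round trip, with hypothetical helper names:

```python
def preserve_as_latin1(data: bytes) -> str:
    """Decode arbitrary bytes as iso-8859-1; this never raises."""
    return data.decode("iso-8859-1")


def restore_from_latin1(text: str) -> bytes:
    """Recover the original bytes from text produced by preserve_as_latin1."""
    return text.encode("iso-8859-1")
```

Any byte sequence, including one that is malformed in its nominal encoding, survives this round trip unchanged.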

Mike Beckerle