New further simplified proposal for rewrite of errata 3.13.

Goal: allow simple implementations where a flag is simply set in the decoder/encoder.

Dropped the problematic 'skip' option.

I also incorporated a point about UTF-16 errors from prior email on this topic from Tim.

I added a point about how many Unicode Replacement Characters are inserted when the data contains multiple adjacent errors. Basically, I made this implementation-dependent.

There are two TBDs in the text below that I'd like to resolve.

---------------------------------------------------------

Errata 3.13 (Revised)

A new sub-section is added to section 11. (this is probably 11.2, if 11.1 is about Unicode byte order marks)

11.2    Character Encoding and Decoding Errors

When parsing, these are the errors that can occur when decoding characters into Unicode/ISO 10646.

1.    The data is broken - invalid bit/byte sequences are found which do not match the definition of a character for the encoding.
2.    Not enough data is found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.

When unparsing, these are the errors that can occur when encoding characters from Unicode/ISO 10646 into the specified encoding.

1.    No mapping provided by the encoding specification.
2.    Not enough room to output the entire encoding of the character (e.g., the encoding needs 3 bytes for that character, but only 1 byte remains in the available length).

The subsections below describe how these errors are handled.

11.2.1 property dfdl:encodingErrorPolicy

The property dfdl:encodingErrorPolicy has two possible values: 'error' and 'replace'.

11.2.1.1 dfdl:encodingErrorPolicy='error'

If 'error', then any error when decoding characters while parsing causes a parse error. For unparsing, any error when encoding characters causes an unparse error.

When parsing, it does not matter if this happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.
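
As a rough illustration only (Python's codec machinery stands in for a DFDL implementation's decoder, and the function name decode_strict is hypothetical), the 'error' policy behaves like a strict decode that raises instead of substituting:

```python
def decode_strict(data: bytes, encoding: str) -> str:
    """Decode with no substitution; any malformed sequence raises."""
    # errors="strict" is Python's default; it is spelled out here for clarity.
    return data.decode(encoding, errors="strict")


try:
    # 0xFF can never appear in well-formed UTF-8.
    decode_strict(b"abc\xffdef", "utf-8")
    outcome = "decoded"
except UnicodeDecodeError:
    # A DFDL processor would report a parse error at this point.
    outcome = "parse error"
```

The same exception would surface whether the decoder was invoked for delimiter scanning, pattern matching, or building an element value, which is the point of the paragraph above.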

11.2.1.2 dfdl:encodingErrorPolicy='replace' for Parsing

If 'replace' then any error results in the insertion of the Unicode Replacement Character (U+FFFD) as the replacement for that error.

It does not matter if this error and replacement happens when scanning for delimiters, matching a regular expression, matching a literal nil value, or constructing the value of a textual element.
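
A minimal sketch of the 'replace' behavior when parsing, again using Python's built-in decoder as a stand-in (decode_replace is a hypothetical name, not part of any DFDL API). It covers both decode errors listed earlier: an invalid byte and a truncated fragment of a valid encoding.

```python
REPLACEMENT = "\ufffd"  # Unicode Replacement Character


def decode_replace(data: bytes, encoding: str) -> str:
    """Decode, substituting U+FFFD wherever the bytes are malformed."""
    return data.decode(encoding, errors="replace")


# Error 1: an invalid byte in the middle of otherwise good data.
broken = decode_replace(b"abc\xffdef", "utf-8")

# Error 2: a fragment (first two bytes of a three-byte UTF-8 sequence)
# at the end of the data.
fragment = decode_replace(b"ab\xe2\x82", "utf-8")
```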

Note, however, that unless a DFDL schema specifically uses the Unicode Replacement Character in a delimiter or nil value, this character is certain not to match.

TBD: I believe we should disallow using the Unicode Replacement Character in a delimiter, as a pad char, in a pattern, or as a feature character (text number group separator, in the text representation of NaN, boolean true, calendarPattern, etc.). We could simply say it is an SDE if this appears in any of these locations.

Note that the "." wildcard in regular expressions will match the Unicode Replacement Character, so ".*" and ".+" regular expressions can potentially cause very large matches (up to the entire data stream) to occur when data contains errors and dfdl:encodingErrorPolicy='replace'. Bounded length regular expressions can help in this case. E.g., ".{0,50}" says to match any character (including Unicode Replacement Characters), but only up to length 50.
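
To make the runaway-match concern concrete, here is a small sketch using Python's re module as a stand-in for the DFDL regular expression engine (the sample data is invented for illustration):

```python
import re

# Simulated decoder output: a short field followed by a long run of
# replacement characters produced by corrupt data.
text = "ID" + "\ufffd" * 200 + ";rest"

# The unbounded wildcard consumes everything up to the end of the data
# (there is no newline in this sample to stop it).
unbounded = re.match(r".*", text)

# The bounded form stops after at most 50 characters.
bounded = re.match(r".{0,50}", text)

unbounded_len = len(unbounded.group())
bounded_len = len(bounded.group())
```

With the bounded pattern the match is capped no matter how much erroneous data follows.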

It is also worth noting that the Unicode Replacement Character can appear in data as an ordinary character, and this cannot be distinguished from the insertion of the Unicode Replacement Character due to a decode error.

If lengthUnits='characters', then a Unicode Replacement Character counts as contributing a single character to the length.

If the data contains more than one adjacent decode error, then the specific number of Unicode Replacement Characters that are inserted as the replacement of these errors is implementation-dependent. That is, some implementations may view, for example, three consecutive erroneous bytes as three separate decode errors, while others may view them as one or two decode errors. All implementations MUST, however, insert some number of Unicode Replacement Characters, and then continue to decode characters following the erroneous data.
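
A short sketch of what a schema author or test writer can and cannot rely on here. Python's decoder is used only as one example implementation; the exact count it produces is deliberately not asserted, since another decoder may legitimately differ.

```python
# Three adjacent bytes that are each invalid in UTF-8, between two good ones.
data = b"A\xff\xff\xffB"

text = data.decode("utf-8", errors="replace")
count = text.count("\ufffd")

# Portable expectations: at least one replacement character was inserted,
# and decoding resumed correctly after the erroneous bytes.
resumed = text.startswith("A") and text.endswith("B")
```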

The trimming of padding characters always happens after Unicode Replacement Characters have been inserted into the data.

11.2.1.3 dfdl:encodingErrorPolicy='replace' for Unparsing

For unparsing, each encoding has a replacement/substitution character specified by ICU. This character is substituted for the unmapped character or for the character whose encoding is too large to fit in the available space.

The definitions of these substitution characters can be conveniently found for many encodings in the ICU Converter Explorer (http://demo.icu-project.org/icu-bin/convexp). 

An encoding error is an unparse error if the encoding does not provide a substitution/replacement character definition. (This would be rare, but could occur if a DFDL implementation allows many encodings beyond the minimum set.)
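
The unmapped-character case might be sketched as follows. This is an assumption-laden illustration: encode_replace is a hypothetical helper, the substitution byte is passed in by the caller rather than looked up from ICU, and the 0x1A used in the example is a placeholder whose real value should be checked in the ICU Converter Explorer. The "not enough room" case is not modeled.

```python
from typing import Optional


def encode_replace(text: str, encoding: str,
                   substitution: Optional[bytes]) -> bytes:
    """Encode character by character, substituting for unmapped characters.

    Assumes a BOM-less encoding, since each character is encoded separately.
    """
    out = bytearray()
    for ch in text:
        try:
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            if substitution is None:
                # No substitution defined for this encoding: unparse error.
                raise ValueError(
                    "unparse error: no substitution character for " + encoding)
            out += substitution
    return bytes(out)


# U+00E9 has no mapping in US-ASCII, so the placeholder byte is emitted.
encoded = encode_replace("caf\u00e9", "ascii", b"\x1a")
```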

TBD: should we rule out this case by providing some default mapping that can always be used? E.g., in the above corner case '?' is used as the substitution character.

11.2.1.4 Parsing: The Not-Enough-Data Decode Error

There is one special case for the 'not enough data' decode error. It arises for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to decode a character from these left-over bytes. They are skipped when parsing (and filled with the dfdl:fillByte on unparsing).
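
A minimal sketch of this special case, assuming a 4-byte fixed-width encoding (UTF-32BE) and a hypothetical helper name; it is not taken from any DFDL implementation:

```python
from typing import Tuple


def decode_fixed_width(data: bytes, encoding: str,
                       width: int) -> Tuple[str, int]:
    """Decode whole characters only; report how many trailing bytes were skipped."""
    leftover = len(data) % width
    usable = len(data) - leftover
    # The leftover bytes are never handed to the decoder, so they produce
    # neither a character nor a replacement character.
    text = data[:usable].decode(encoding, errors="replace")
    return text, leftover


# 10 bytes of UTF-32BE: two whole characters plus 2 left-over bytes.
text, skipped = decode_fixed_width(
    b"\x00\x00\x00A\x00\x00\x00B\x00\x00", "utf-32-be", 4)
```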

11.2.1.5  Parsing: Unicode Decoding Non-Errors

The following specific situations can arise with the encodings UTF-16, UTF-16LE, and UTF-16BE when utf16Width="fixed". They do not cause a decoding or encoding error.
•    an unpaired surrogate code-point is encountered
•    an out-of-order surrogate code-point pair is encountered
•    a surrogate code-point pair is encountered

In all these cases each code-point becomes a character code in the DFDL Information Item for the string.
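
One way to picture the fixed-width behavior is to treat every 16-bit code unit as its own character, with no surrogate validation at all. The sketch below does that with Python's struct module; decode_utf16_fixed is a hypothetical name, and any odd trailing byte is simply ignored here.

```python
import struct


def decode_utf16_fixed(data: bytes, big_endian: bool = True) -> str:
    """Map each 16-bit code unit to one character, surrogates included."""
    n = len(data) // 2
    fmt = (">" if big_endian else "<") + str(n) + "H"
    units = struct.unpack(fmt, data[: n * 2])
    # chr() accepts surrogate code points, so nothing is rejected or merged.
    return "".join(chr(u) for u in units)


# A well-formed surrogate pair stays as two separate characters.
pair = decode_utf16_fixed(b"\xd8\x3d\xde\x00")

# An unpaired low surrogate after 'A' is passed through unchanged.
lone = decode_utf16_fixed(b"\x00A\xdc\x00")
```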

11.2.1.6 Unparsing: The Not-Enough-Room Encoding Error

There is one special case for the 'not enough room' encoding error. It arises for lengthUnits='bytes' when the encoding is a fixed-width character set (see section 12.3.7.1.1 Character Width). If the number of bytes is not a multiple of the character set width, then there will be some number of bytes of space left over at the end of the data which are insufficient to hold an entire character code. In this case no attempt is made to encode a character into these left-over bytes. They are filled with the dfdl:fillByte. (On parsing they are skipped.)
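
The unparsing side can be sketched the same way. The helper below is hypothetical and simplified: it assumes the text has already been padded or truncated to exactly the number of whole characters that fit, and it only shows the left-over bytes being filled.

```python
def encode_fixed_width(text: str, encoding: str, width: int,
                       length_bytes: int, fill_byte: int) -> bytes:
    """Encode whole characters, then fill the unusable tail with fill_byte."""
    capacity = length_bytes // width
    if len(text) != capacity:
        # Text padding/truncation is outside the scope of this sketch.
        raise ValueError("text must contain exactly %d characters" % capacity)
    body = text.encode(encoding)
    # The remaining bytes cannot hold a whole character code.
    return body + bytes([fill_byte]) * (length_bytes - len(body))


# 10 bytes available, 4-byte characters: 2 characters plus 2 fill bytes.
out = encode_fixed_width("AB", "utf-32-be", 4, 10, 0x20)
```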

11.2.2    Preserving Data Containing Decoding Errors

There can be situations where data must be preserved exactly even if it contains errors.

It is suggested that a DFDL schema author who wants to preserve data that may contain decoding errors model such data as xs:hexBinary, or as xs:string using an encoding such as iso-8859-1, which preserves all bytes.
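
The byte-preserving property of iso-8859-1 is easy to check. The following sketch uses Python's codec of that name, which maps each of the 256 byte values to a distinct character:

```python
# Every possible byte value, including ones that are invalid in UTF-8.
raw = bytes(range(256))

# Decoding never fails and never substitutes: one character per byte.
as_text = raw.decode("iso-8859-1")

# Re-encoding reproduces the original bytes exactly.
round_tripped = as_text.encode("iso-8859-1")
```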


--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com