Mike, I think this proposal looks good and provides an adequate solution
for DFDL 1.0. Let's discuss further on today's WG call.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle
To: Steve Hanson/UK/IBM@IBMGB
Cc: Andreas Martens1/UK/IBM@IBMGB, dfdl-wg@ogf.org
Date: 07/12/2011 15:02
Subject: Re: [DFDL-WG] Issue 156 - ICU fallback mappings -
character encoding/decoding errors (version 2 - modified per call
2011-12-06)
Alright, I was able to convince myself that a substitution character is
available, and associated with the IANA character set ID aliases. Even
us-ascii has one (\x1A). E.g.,
http://demo.icu-project.org/icu-bin/convexp?conv=US-ASCII&s=ALL
So our original language that said to just use "the replacement character
for the encoding" was actually correct!
Revised proposal below. Basically, it's just an error, skip, or replace
flag for the encoding error policy. We still have to figure out the TBDs in there
with respect to how many substitution/replacements will occur, and what to
do about some of these Unicode-encoding related issues.
...mikeb
---------------------------------------------------------------------------
Issue 156 - ICU fallback mappings - character encoding/decoding errors
(modified per email thread on standardized ICU substitution/replacement
characters)
(Modified per workgroup discussion on 2011-12-06 - removed rationale and
discussion, simplified to just the minimum. Note a couple of important TBDs
in here: topics we forgot to discuss.)
Summary
DFDL currently does not have adequate capability to handle encoding and
decoding errors. Language in the spec is incorrect/infeasible to
implement. ICU provides mechanisms giving a degree of control over this
issue; the question is whether and how to embrace those mechanisms, or
provide some other alternative solution.
Discussion
This language in section 4.1.2 about character set decoding/encoding just
doesn't work:
This first part is unacceptable because it fails to specify what happens
when the decoding fails because of data errors:
    "During parsing, characters whose value is unknown or unrepresentable
    in ISO 10646 are replaced by the Unicode Replacement Character U+FFFD."
This second part also is inadequate:
    "During unparsing, characters that are unrepresentable in the target
    encoding will be replaced by the replacement character for that
    encoding."
This needs a citation for where these replacement characters are
specified. It also needs to specify what happens in certain error
situations.
Suggested Resolution: Summary
DFDL property dfdl:encodingErrorPolicy with values 'skip', 'error',
'replace'
For Parsing/Decoding Errors
There are two errors that can occur when decoding characters into
Unicode/ISO 10646.
1. the data is broken - invalid byte sequences that don't match the
definition of the encoding are encountered.
2. not enough bytes are found to make up the entire encoding of a
character. That is, a fragment of a valid encoding is found.
The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
If 'replace', then the Unicode replacement character '�' (U+FFFD) is
substituted for the offending bytes, one replacement character for each
invalid byte, one replacement character for any fragment of an encoding.
(TBD: Should this say 'byte' or 'unit'? I.e., in UTF-16BE, will the ICU
error callback occur once for a broken codepoint, or once per byte?)
(TBD: Assumptions to validate: I am assuming here that if there are 6
invalid bytes, none of which can validly be unit 1 of the encoding of any
character, that ICU will call the error hook either (a) 6 times, or (b)
once but notifying about all 6 bad units - but providing a way for the
hook-writer to say they want to substitute 6 characters for the 6 units.
I am also assuming in the end-of-data fragment case that the ICU hook gets
called once for the fragment, not once per byte of the fragment.)
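As a rough analogue for the granularity question in the TBDs above, here is a
minimal Python sketch. It uses Python's standard codecs error-handler hook,
not ICU, so it says nothing about what ICU actually does; it only illustrates
the difference between "one replacement per invalid byte" and "one
replacement per fragment". The handler name replace_per_byte and the sample
bytes are made up for illustration.

```python
import codecs

calls = []  # (start, end) byte ranges reported to the handler


def replace_per_byte(err):
    """Hypothetical handler: emit one U+FFFD per offending byte."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    calls.append((err.start, err.end))
    # Return the replacement text and the position at which to resume.
    return ("\ufffd" * (err.end - err.start), err.end)


codecs.register_error("replace_per_byte", replace_per_byte)

# Two bytes that can never start a UTF-8 character, two valid ASCII
# characters, then the first two bytes of a three-byte character.
data = b"\xff\xfeab\xe2\x82"

per_byte = data.decode("utf-8", errors="replace_per_byte")
built_in = data.decode("utf-8", errors="replace")
```

In CPython 3 the hook is invoked once for each invalid lead byte and once for
the whole trailing fragment, so the per-byte handler produces four
replacement characters where the built-in 'replace' handler produces three.
Whether ICU reports errors at the same granularity is exactly the open
question.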
(TBD: We did not discuss on the call on Dec 6th the issue of errors in
Unicode encodings. While there are no encodings where a properly encoded
character is unmapped to Unicode, the Unicode UTF encodings themselves can
contain things that are errors. Here's a short list of some things that
can happen:
  - utf-16 and unpaired surrogate code-point
  - utf-16 and out-of-order surrogate code-point pair
  - utf-8 parsing and 3-byte encoding of a surrogate code-point is found
  - utf-8 unparsing and code-point of an isolated surrogate is to be encoded
  - utf-8 decoding, and if you assemble the bits the usual way, you get a
    code point out of range (higher than 0x10FFFF)
  - utf-8 encoding, and code-point to encode is higher than 0x10FFFF
  - utf-16 encoding utf16Width='fixed' and a surrogate code point is
    encountered
  - utf-16 byte-order-marks found not at the beginning of the data
We have an option here to be 'tolerant' of unicode-encoding foibles. We
can preserve isolated surrogates in a natural way if we wish. I believe
many Unicode and UTF implementations tolerate these situations. For
example, the standard Java utf-8 decoder/encoder (InputStreamReader and
OutputStreamWriter) is tolerant of incorrectly paired and isolated
surrogate code points in the Java string data.
I do not know what ICU does in these cases, i.e., if it provides us enough
flexibility to do whatever we want, or if it doesn't even detect some of
these things as errors.)
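To make a few of the Unicode-encoding cases listed above concrete, here is a
small Python sketch. It exercises Python's own codecs, which are strict by
default, and is not a statement about ICU or Java behavior. The helper name
fails is hypothetical.

```python
def fails(action):
    """Return True if the callable raises a Unicode coding error."""
    try:
        action()
    except UnicodeError:
        return True
    return False


# utf-8 parsing: 3-byte encoding of a surrogate code point (U+D800).
surrogate_in_utf8 = fails(lambda: b"\xed\xa0\x80".decode("utf-8"))
# utf-8 unparsing: an isolated surrogate is to be encoded.
isolated_surrogate_out = fails(lambda: "\ud800".encode("utf-8"))
# utf-8 decoding: the assembled bits give 0x110000, above 0x10FFFF.
out_of_range = fails(lambda: b"\xf4\x90\x80\x80".decode("utf-8"))
# utf-16: a low surrogate with no preceding high surrogate.
unpaired_utf16 = fails(lambda: b"\x00\xdcA\x00".decode("utf-16-le"))

# A 'tolerant' mode: Python's "surrogatepass" handler round-trips the
# isolated surrogate instead of rejecting it.
tolerant_bytes = "\ud800".encode("utf-8", errors="surrogatepass")
tolerant_text = tolerant_bytes.decode("utf-8", errors="surrogatepass")
```

The strict/tolerant split here is the same choice the TBD describes: either
treat these sequences as encoding errors subject to the policy, or preserve
them as they are.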
If 'skip' then the invalid byte sequences are dropped/ignored. No
corresponding characters are created in the DFDL infoset.
If 'error' then a processing error occurs.
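The three parsing behaviors can be sketched with Python's built-in decode
error handlers. The mapping from policy values to handler names below is an
assumption made for illustration, as is the helper name decode_with_policy;
neither is part of the proposal or of ICU.

```python
# Assumed mapping from the proposed policy values to Python's built-in
# decode error handlers; the proposal itself does not define this.
POLICY_TO_HANDLER = {"error": "strict", "skip": "ignore", "replace": "replace"}


def decode_with_policy(data, encoding, policy):
    """Hypothetical helper: decode bytes under an encodingErrorPolicy value."""
    return data.decode(encoding, errors=POLICY_TO_HANDLER[policy])


broken = b"ab\xffcd"  # 0xFF is never valid in UTF-8

replaced = decode_with_policy(broken, "utf-8", "replace")
skipped = decode_with_policy(broken, "utf-8", "skip")
try:
    decode_with_policy(broken, "utf-8", "error")
    raised = False
except UnicodeDecodeError:
    raised = True  # stands in for a DFDL processing error
```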
It is suggested that a DFDL user who wants to preserve information
containing data where the encodings have these kinds of errors model such
data as xs:hexBinary, or as an xs:string using an encoding such as
iso-8859-1, which preserves all bytes.
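A quick Python check of the byte-preserving property, offered only as an
illustration of why iso-8859-1 works for this purpose (the variable names are
arbitrary):

```python
# Every byte value 0x00-0xFF maps to exactly one character in iso-8859-1,
# so decoding followed by encoding returns the original bytes.
all_bytes = bytes(range(256))
as_string = all_bytes.decode("iso-8859-1")
round_tripped = as_string.encode("iso-8859-1")

# Data that is broken as UTF-8 survives when modelled as iso-8859-1 text.
broken_utf8 = b"ab\xe2\x82"
preserved = broken_utf8.decode("iso-8859-1").encode("iso-8859-1")
```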
Suggested Resolution - Unparsing/Encoding Errors
The following are kinds of errors when encoding characters:
1. no mapping provided by the encoding specification.
2. not enough room to output the entire encoding of the character
(e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available
length).
The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
If the policy is 'error' then a processing error occurs.
If the policy is 'skip' then the character is skipped. No character is
encoded to be output for case 1, and no partial character is attempted in
case 2.
If the policy is 'replace' then the behavior is determined by the encoding
specification.
Each encoding has a replacement/substitution character specified by
ICU. These can be found conveniently in the ICU Converter Explorer. This
character is substituted for the unmapped character or the character that
has too large an encoding (errors 1 and 2 above).
It is a processing error if it is not possible to output the replacement
character because there is not enough room for its representation.
It is a processing error if a character encoding does not provide a
substitution/replacement character definition and one is needed because of
dfdl:encodingErrorPolicy='replace'. (This would be rare, but could occur
if a DFDL implementation allows many encodings beyond the minimum set.)
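The unparsing rules above can be sketched as follows. This is a minimal
Python illustration under stated assumptions, not an implementation of the
proposal: encode_with_policy and ProcessingError are hypothetical names, the
substitution bytes are passed in by the caller rather than looked up from
ICU, and for 'skip' the sketch assumes later characters are still attempted
after one is dropped (the proposal does not say).

```python
class ProcessingError(Exception):
    """Stand-in for a DFDL processing error (hypothetical name)."""


def encode_with_policy(text, encoding, policy, substitution, max_bytes):
    """Hypothetical helper: encode text into at most max_bytes bytes.

    substitution is the encoding's replacement character as bytes, or
    None if the encoding defines none. Characters are encoded one at a
    time, which only suits stateless encodings without a byte-order mark.
    """
    out = bytearray()
    for ch in text:
        try:
            unit = ch.encode(encoding)  # may hit error 1: no mapping
            fits = len(out) + len(unit) <= max_bytes
        except UnicodeEncodeError:
            unit, fits = None, False
        if not fits:  # error 1 (no mapping) or error 2 (not enough room)
            if policy == "error":
                raise ProcessingError(f"cannot encode {ch!r}")
            if policy == "skip":
                continue  # assumed: later characters are still tried
            if substitution is None:
                raise ProcessingError("no substitution character defined")
            if len(out) + len(substitution) > max_bytes:
                raise ProcessingError("no room for substitution character")
            unit = substitution
        out += unit
    return bytes(out)


SUB = b"\x1a"  # substitution byte cited earlier in this thread for us-ascii

replaced = encode_with_policy("a\u20acb", "ascii", "replace", SUB, 10)
skipped = encode_with_policy("a\u20acb", "ascii", "skip", SUB, 10)
```

Note that Python's own errors="replace" handler substitutes '?' when
encoding, which is why the sketch takes the substitution bytes as a
parameter instead of relying on it.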