Re: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)

7 Dec 2011

      Mike

Comments below (SMH).  I've noted what Andreas said in his reply.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:     dfdl-wg@ogf.org, Andreas Martens1/UK/IBM@IBMGB
Date:   06/12/2011 17:35
Subject:        [DFDL-WG] Issue 156 - ICU fallback mappings - character 
encoding/decoding errors (version 2 - modified per call 2011-12-06)
Sent by:        dfdl-wg-bounces@ogf.org

Issue 156 - ICU fallback mappings - character encoding/decoding errors

(Modified per workgroup discussion on 2011-12-06 - removed rationale and 
discussion, simplified to just the minimum. Note a couple of important TBDs 
in here: topics we forgot to discuss.)

Summary

DFDL currently does not have adequate capability to handle encoding and 
decoding errors. Language in the spec is incorrect/infeasible to 
implement. ICU provides mechanisms giving a degree of control over this 
issue; the question is whether and how to embrace those mechanisms, or 
provide some other alternative solution.

Discussion

This language in section 4.1.2 about character set decoding/encoding just 
doesn't work:

This first part is unacceptable because it fails to specify what happens 
when the decoding fails because of errors. It specifies what to do when 
there is no mapping to Unicode (which is, frankly, a very unlikely 
situation today) meaning a character is legally decoded, but then has no 
mapping.

During parsing, characters whose value is unknown or unrepresentable in 
ISO 10646 are replaced by the Unicode Replacement Character U+FFFD. 

This second part also fails to work:

During unparsing, characters that are unrepresentable in the target 
encoding will be replaced by the replacement character for that encoding.

Sounds symmetric and expedient, but the problem is that some character 
encodings have no reserved replacement character, and we expect that DFDL 
users will need a variety of different choices for how to deal with 
characters that cannot be encoded. 

SMH: From what Andreas says, the vast majority of encodings supported by 
ICU have a replacement character.

Suggested Resolution: Summary

DFDL property dfdl:inputEncodingErrorPolicy with values 'skip', 'error', 
'replace'
DFDL property dfdl:outputEncodingErrorPolicy with values 'skip', 'error', 
'replace'
DFDL annotation element. Example: <dfdl:encodingReplacementCharacter 
encoding="ASCII" character="%#x7c;"/> 

SMH: If the vast majority of encodings have a replacement character, then 
we don't need the new annotation.  
If we do keep it, then:
1) It is not clear from the syntax that this is intended for unparse only. 
The word 'output' should be in there, e.g., 
outputEncodingReplacementCharacter. 
2) As proposed, there is no way to place encoding modifiers in one xsd and 
have them picked up by another. We need a scoping mechanism.

My vote for 1.0 is not to add the new annotation, just the new properties.
Can we get away with just one property that applies to both input and 
output? Most DFDL properties apply to both. 

For Parsing/Decoding Errors

There are two errors that can occur when decoding characters into 
Unicode/ISO 10646. 
1.      the data is broken - invalid byte sequences that don't match the 
definition of the encoding are encountered.
2.      not enough bytes are found to make up the entire encoding of a 
character. That is, a fragment of a valid encoding is found.
The behavior in these cases is controlled by 
dfdl:inputEncodingErrorPolicy.

If 'replace', then the Unicode replacement character '�' (U+FFFD) is 
substituted for the offending bytes, one replacement character for each 
invalid byte, one replacement character for any fragment of an encoding.

(TBD: Assumptions to validate: I am assuming here that if there are 6 
invalid bytes, none of which can validly be unit 1 of the encoding of any 
character, that ICU will call the error hook either (a) 6 times, or (b) 
once but notifying about all 6 bad units - but providing a way for the 
hook-writer to say they want to substitute 6 characters for the 6 units.

I am also assuming in the end-of-data fragment case that the ICU hook gets 
called once for the fragment, not once per byte of the fragment.)

(TBD: We did not discuss on the call on Dec 6th the issue of errors in 
Unicode encodings. While there are no encodings where a properly encoded 
character is unmapped to Unicode, the Unicode UTF encodings themselves can 
contain things that are errors. Here's a short list of some things that 
can happen:
utf-16 and unpaired surrogate code-point
utf-16 and out-of-order surrogate code-point pair
utf-8 parsing and 3-byte encoding of a surrogate code-point is found
utf-8 unparsing and code-point of an isolated surrogate is to be encoded
utf-8 decoding, and if you assemble the bits the usual way, you get a code point out of range (higher than 0x10FFFF)
utf-8 encoding, and code-point to encode is higher than 0x10FFFF
utf-16 encoding utf16Width='fixed' and a surrogate code point is encountered
utf-16 byte-order-marks found not at the beginning of the data

We have an option here to be 'tolerant' of Unicode-encoding foibles. We 
can preserve isolated surrogates in a natural way if we wish. I believe 
many Unicode and UTF implementations tolerate these situations. For 
example, the standard Java utf-8 decoder/encoder (InputStreamReader and 
OutputStreamWriter) is tolerant of incorrectly paired and isolated 
surrogate code points in the Java string data. 

I do not know what ICU does in these cases, i.e., if it provides us enough 
flexibility to do whatever we want, or if it doesn't even detect some of 
these things as errors.)
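
To make a few of the cases in the list above concrete, here is a minimal sketch using Python's built-in UTF-8 codec. It is only an illustration of "strict" versus "tolerant" handling in one implementation; it says nothing about what ICU does, and the helper names try_decode and try_encode are made up for this example.

```python
# Illustration only: Python's UTF-8 codec, not ICU and not a DFDL processor.

def try_decode(data):
    """Return the decoded text, or the name of the exception raised."""
    try:
        return data.decode("utf-8")  # errors="strict" is the default
    except UnicodeDecodeError as exc:
        return type(exc).__name__


def try_encode(text):
    """Return the encoded bytes, or the name of the exception raised."""
    try:
        return text.encode("utf-8")
    except UnicodeEncodeError as exc:
        return type(exc).__name__


lone_surrogate = "\ud800"                 # isolated high surrogate code point
surrogate_as_utf8 = b"\xed\xa0\x80"       # 3-byte UTF-8 form of U+D800
above_max = b"\xf4\x90\x80\x80"           # bits assemble to 0x110000

# Strict handling treats all three situations as errors.
strict_decode_surrogate = try_decode(surrogate_as_utf8)
strict_decode_above_max = try_decode(above_max)
strict_encode_surrogate = try_encode(lone_surrogate)

# "Tolerant" handling: the surrogatepass handler preserves the surrogate.
tolerant_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")
tolerant_text = surrogate_as_utf8.decode("utf-8", errors="surrogatepass")
```

The point of the sketch is only that both behaviours are implementable; which one DFDL should require is the open question.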

If 'skip' then the invalid byte sequences are dropped/ignored. No 
corresponding characters are created in the DFDL infoset.

If 'error' then a processing error occurs.
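
As a rough sketch of the three input policies, Python's codec error handlers can stand in for the proposed values. The mapping table and the function decode_with_policy are assumptions made for this example, and the number of replacement characters produced is what Python's UTF-8 decoder happens to do; whether ICU's error hook behaves the same way is exactly the TBD above.

```python
# Stand-in sketch: Python codec error handlers, not ICU callbacks.
# The policy names come from the proposal; the mapping is an assumption.
POLICY_TO_HANDLER = {"error": "strict", "skip": "ignore", "replace": "replace"}


def decode_with_policy(data, encoding, policy):
    """Decode data under a hypothetical dfdl:inputEncodingErrorPolicy value."""
    return data.decode(encoding, errors=POLICY_TO_HANDLER[policy])


broken = b"A\xff\xfeB"    # case 1: two bytes that are never valid in UTF-8
fragment = b"A\xe2\x82"   # case 2: first two bytes of a 3-byte encoding

# 'replace': U+FFFD is substituted for the offending bytes.
replaced_broken = decode_with_policy(broken, "utf-8", "replace")
replaced_fragment = decode_with_policy(fragment, "utf-8", "replace")

# 'skip': the invalid bytes produce no characters at all.
skipped_broken = decode_with_policy(broken, "utf-8", "skip")

# 'error': decoding fails (a processing error in DFDL terms).
try:
    decode_with_policy(broken, "utf-8", "error")
    error_raised = False
except UnicodeDecodeError:
    error_raised = True
```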

It is suggested that a DFDL user who wants to preserve data whose 
encodings have these kinds of errors model such data as xs:hexBinary, or 
as an xs:string using an encoding such as iso-8859-1, which preserves all 
bytes.
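
The reason iso-8859-1 works for this is that every one of the 256 byte values decodes to a distinct character and encodes back to the same byte. A small check of that property, using Python's codec as the example implementation:

```python
# Every byte value 0x00..0xFF survives a decode/encode round trip.
all_bytes = bytes(range(256))
as_text = all_bytes.decode("iso-8859-1")
round_tripped = as_text.encode("iso-8859-1")

# Data that is broken as UTF-8 is still carried through unchanged.
damaged_utf8 = b"A\xff\xfe\xe2\x82"
preserved = damaged_utf8.decode("iso-8859-1").encode("iso-8859-1")
```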

Note for errata: The language in section 4.1.2 Item 5 about decoding data 
into infoset Unicode has to change of course as well.

Suggested Resolution - Unparsing/Encoding Errors

The following are kinds of errors when encoding characters:
1.      no mapping provided by the encoding specification.
2.      not enough room to output the entire encoding of the character 
(e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available 
length).
The behavior in these cases is controlled by 
dfdl:outputEncodingErrorPolicy.

If the policy is 'error' then a processing error occurs (both case 1 and 
case 2).

If the policy is 'skip' then the character is skipped. No character is 
encoded to be output for case 1, and no partial character is attempted in 
case 2.
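
A minimal sketch of the 'error' and 'skip' policies for both cases, assuming a fixed available length in bytes. The names ProcessingError and encode_field are hypothetical, the per-character loop assumes a stateless encoding with no byte-order mark, and continuing with later characters after a skip in case 2 is my reading of the proposal rather than something it states.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def encode_field(text, encoding, policy, max_bytes):
    """Encode text into at most max_bytes under 'error' or 'skip'."""
    if policy not in ("error", "skip"):
        raise ValueError("this sketch covers only 'error' and 'skip'")
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError as exc:
            # Case 1: the encoding has no mapping for this character.
            if policy == "error":
                raise ProcessingError(f"no mapping for {ch!r}") from exc
            continue  # 'skip': nothing is output for this character
        if len(out) + len(encoded) > max_bytes:
            # Case 2: not enough room for the whole encoded character.
            if policy == "error":
                raise ProcessingError(f"no room for {ch!r}")
            continue  # 'skip': no partial character is attempted
        out += encoded
    return bytes(out)


# Case 1 with 'skip': the unmappable character is dropped.
ascii_skip = encode_field("caf\u00e9", "ascii", "skip", 10)

# Case 2 with 'skip': a 2-byte character does not fit in the last byte.
dbcs_skip = encode_field("A\u3042", "shift_jis", "skip", 2)
```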

If the policy is 'replace' then the behavior is determined by the encoding 
specification, and by the dfdl:encodingReplacementCharacter annotation 
element.

If provided, the dfdl:encodingReplacementCharacter annotation can appear 
anywhere that DFDL annotations are allowed. 

It has two attributes, which are 'encoding', and 'character'. The 
'encoding' attribute specifies the encoding for which a replacement 
character is being specified. This takes the same values as the 
dfdl:encoding format property.
The 'character' attribute specifies a DFDL literal string specifying 
exactly one character. This character is used as the replacement character 
for the specified encoding whenever that encoding is in use, and the 
dfdl:encodingReplacementCharacter annotation is in scope according to the 
usual scoping rules.
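
To illustrate the "exactly one character" rule for the 'character' attribute, here is a small sketch that resolves a value such as "%#x7c;" to a single character. It handles only the hexadecimal and decimal numeric entity forms of DFDL string literals plus plain literal characters; the function name is made up for this example and the rest of the literal syntax is deliberately ignored.

```python
import re

# Numeric character entities only: %#x<hex>; and %#<decimal>;
_NUMERIC_ENTITY = re.compile(r"%#(x[0-9A-Fa-f]+|[0-9]+);")


def replacement_char_from_literal(literal):
    """Resolve a 'character' attribute value to exactly one character."""

    def expand(match):
        body = match.group(1)
        if body.startswith("x"):
            return chr(int(body[1:], 16))
        return chr(int(body, 10))

    text = _NUMERIC_ENTITY.sub(expand, literal)
    if len(text) != 1:
        raise ValueError(f"expected exactly one character, got {text!r}")
    return text


# The example from the summary: %#x7c; is the vertical bar.
pipe = replacement_char_from_literal("%#x7c;")
```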

There are these cases to consider when the policy is 'replace':
1.      there is no standard replacement character defined as part of the 
encoding specification, and there is no dfdl:encodingReplacementCharacter 
annotation element.
2.      there is a standard replacement character defined as part of the 
encoding, and there is no dfdl:encodingReplacementCharacter annotation 
element.
3.      there is a dfdl:encodingReplacementCharacter annotation element.
In case 1, since no replacement is possible, a processing error occurs.
In case 2, the standard replacement character is used to replace the 
unmapped or error data.
In case 3, the character specified by the 
dfdl:encodingReplacementCharacter annotation is used to replace the 
unmapped or error data. Note specifically, if the character set has a 
standard replacement character, the dfdl:encodingReplacementCharacter 
annotation can be used to override use of the standard replacement 
character.

In cases 2 and 3, it is still possible to be unable to output the 
replacement character if there is not enough room for its encoding. This 
situation is always a processing error. 
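
The 'replace' rules above can be sketched as follows. Everything named here is hypothetical: ProcessingError, the STANDARD_REPLACEMENT table (whose entries are illustrative values, not taken from ICU or any encoding specification), and the overrides dictionary standing in for in-scope dfdl:encodingReplacementCharacter annotations. Trying the replacement character when a mappable character does not fit is my reading of "unmapped or error data", not something the proposal spells out.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


# Illustrative only: encodings assumed to define a standard replacement.
STANDARD_REPLACEMENT = {"ascii": "?", "utf-8": "\ufffd"}


def resolve_replacement(encoding, overrides):
    """Pick the replacement character according to cases 1 to 3."""
    if encoding in overrides:
        return overrides[encoding]             # case 3: annotation wins
    if encoding in STANDARD_REPLACEMENT:
        return STANDARD_REPLACEMENT[encoding]  # case 2: standard character
    raise ProcessingError(f"no replacement for {encoding}")  # case 1


def encode_with_replace(text, encoding, max_bytes, overrides=None):
    """Encode text under the 'replace' policy within max_bytes."""
    overrides = overrides or {}
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
            fits = len(out) + len(encoded) <= max_bytes
        except UnicodeEncodeError:
            fits = False                       # unmapped character
        if not fits:
            # Substitute the replacement character for the failed one.
            replacement = resolve_replacement(encoding, overrides)
            encoded = replacement.encode(encoding)
            if len(out) + len(encoded) > max_bytes:
                # No room even for the replacement: always an error.
                raise ProcessingError("no room for replacement character")
        out += encoded
    return bytes(out)


standard = encode_with_replace("caf\u00e9", "ascii", 10)
overridden = encode_with_replace("caf\u00e9", "ascii", 10, {"ascii": "|"})
```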

Note for errata: The language in section 4.1.2 about encoding data from 
infoset Unicode has to change as well.
