Mike
Comments below (SMH).
I've noted what Andreas said in his reply.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org, Andreas Martens1/UK/IBM@IBMGB
Date: 06/12/2011 17:35
Subject: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)
Sent by: dfdl-wg-bounces@ogf.org
Issue 156 - ICU fallback mappings - character encoding/decoding errors
(Modified per workgroup discussion on 2011-12-06: removed rationale and
discussion, simplified to just the minimum. Note that there are a couple of
important TBDs in here: topics we forgot to discuss.)
Summary
DFDL currently does not have adequate capability to handle encoding and
decoding errors. The language in the spec is incorrect or infeasible to implement.
ICU provides mechanisms giving a degree of control over this issue; the question
is whether and how to embrace those mechanisms, or whether to provide some
alternative solution.
Discussion
This language in section 4.1.2 about character set decoding/encoding just
doesn't work:
This first part is unacceptable because it fails to specify what happens
when decoding fails because of errors. It only specifies what to do when
a character is legally decoded but then has no mapping to Unicode (which is,
frankly, a very unlikely situation today):
"During parsing, characters whose value is unknown or
unrepresentable in ISO 10646 are replaced by the Unicode Replacement Character
U+FFFD."
This second part also fails to work:
"During unparsing, characters that are unrepresentable
in the target encoding will be replaced by the replacement character for
that encoding."
Sounds symmetric and expedient, but the problem is that some character
encodings have no reserved replacement character, and we expect that DFDL
users will need a variety of different choices for how to deal with characters
that cannot be encoded.
SMH: From what Andreas says, the vast majority
of encodings supported by ICU have a replacement character.
Suggested Resolution: Summary
- DFDL property dfdl:inputEncodingErrorPolicy with values
'skip', 'error', 'replace'
- DFDL property dfdl:outputEncodingErrorPolicy with values
'skip', 'error', 'replace'
- DFDL annotation element. Example: <dfdl:encodingReplacementCharacter encoding="ASCII" character="%#x7c;"/>
SMH: If the vast majority of encodings
have a replacement character, then we don't need the new annotation.
If we do keep it, then:
1) It is not clear from the syntax that this is intended for unparse only.
The word 'output' should be in there, e.g., outputEncodingReplacementCharacter.
2) As proposed, there is no way to place encoding modifiers in one xsd and
have them picked up by another. We need a scoping mechanism.
My vote for 1.0 is not to add the new annotation, just the new properties.
Can we get away with just one property that
applies to both input and output? Most DFDL properties apply to both.
For Parsing/Decoding Errors
There are two errors that can occur when decoding characters into Unicode/ISO
10646:
1. The data is broken: invalid byte sequences that don't match the definition
of the encoding are encountered.
2. Not enough bytes are found to make up the entire encoding of a character.
That is, a fragment of a valid encoding is found.
The behavior in these cases is controlled by dfdl:inputEncodingErrorPolicy.
If 'replace', then the Unicode replacement
character '�' (U+FFFD) is substituted for
the offending bytes, one replacement character for each invalid byte, one
replacement character for any fragment of an encoding.
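
For illustration only, here is a minimal sketch of the "one replacement per invalid byte, one per fragment" rule. It uses Python's built-in UTF-8 decoder as a stand-in, not ICU, so it says nothing about how the ICU error hook is actually called; the byte values are chosen only for the example.

```python
# Python's built-in UTF-8 decoder, used here as an analogy only (not ICU).
REPLACEMENT = "\ufffd"

# Three bytes that can never start or continue a UTF-8 sequence.
invalid = b"\xff\xfe\xfd"
# First two bytes of the three-byte encoding of U+20AC, cut off at end of data.
fragment = b"\xe2\x82"

decoded_invalid = invalid.decode("utf-8", errors="replace")
decoded_fragment = (b"ab" + fragment).decode("utf-8", errors="replace")

# One replacement character per invalid byte.
print(len(decoded_invalid), decoded_invalid == REPLACEMENT * 3)
# One replacement character for the whole truncated fragment.
print(decoded_fragment == "ab" + REPLACEMENT)
```

Whether ICU reports the same granularity is exactly the assumption flagged in the TBD that follows.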
(TBD: Assumptions to validate: I am assuming here
that if there are 6 invalid bytes, none of which can validly be unit 1
of the encoding of any character, that ICU will call the error hook either
(a) 6 times, or (b) once but notifying about all 6 bad units - but providing
a way for the hook-writer to say they want to substitute 6 characters for
the 6 units.
I am also assuming in the end-of-data fragment case that the ICU hook gets
called once for the fragment, not once per byte of the fragment.)
(TBD: On the call on Dec 6th we did not discuss the issue of errors in
Unicode encodings. While there are no encodings where a properly encoded
character is unmapped to Unicode, the Unicode UTF encodings themselves
can contain things that are errors. Here's a short list of some things
that can happen:
- utf-16 and unpaired surrogate code-point
- utf-16 and out-of-order surrogate code-point pair
- utf-8 parsing and 3-byte encoding of a surrogate
code-point is found
- utf-8 unparsing and code-point of an isolated surrogate
is to be encoded.
- utf-8 decoding, and if you assemble the bits the
usual way, you get a code point out of range (higher than 0x10FFFF)
- utf-8 encoding, and code-point to encode is higher
than 0x10FFFF.
- utf-16 encoding utf16Width='fixed' and a surrogate
code point is encountered
- utf-16 byte-order-marks found not at the beginning
of the data
We have an option here to be 'tolerant' of unicode-encoding
foibles. We can preserve isolated surrogates in a natural way if we wish.
I believe many Unicode and UTF implementations tolerate these situations.
For example, the standard Java utf-8 decoder/encoder classes, InputStreamReader
and OutputStreamWriter, are tolerant of incorrectly paired and isolated surrogate
code points in the Java string data.
I do not know what ICU does in these cases, i.e., if it provides us enough
flexibility to do whatever we want, or if it doesn't even detect some of
these things as errors.)
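
To make the strict versus tolerant choice concrete, here is a small sketch using Python's UTF-8 codec. This is Python behaviour only, shown as an analogy; it does not answer the open question about what ICU does. The helper name fails is made up for the example.

```python
lone_surrogate = "\ud800"        # isolated high-surrogate code point
surrogate_bytes = b"\xed\xa0\x80"  # 3-byte utf-8 style encoding of U+D800
too_large = b"\xf4\x90\x80\x80"    # bits assemble to 0x110000, above 0x10FFFF


def fails(action):
    """Return True if the action raises a Unicode encode/decode error."""
    try:
        action()
    except UnicodeError:
        return True
    return False


# Strict behaviour: each of these is reported as an error.
strict_encode_fails = fails(lambda: lone_surrogate.encode("utf-8"))
strict_decode_fails = fails(lambda: surrogate_bytes.decode("utf-8"))
range_check_fails = fails(lambda: too_large.decode("utf-8"))

# Tolerant behaviour: the opt-in "surrogatepass" handler preserves the
# isolated surrogate in both directions.
tolerant_bytes = lone_surrogate.encode("utf-8", errors="surrogatepass")
tolerant_text = surrogate_bytes.decode("utf-8", errors="surrogatepass")

print(strict_encode_fails, strict_decode_fails, range_check_fails)
print(tolerant_bytes == surrogate_bytes, tolerant_text == lone_surrogate)
```

The point of the sketch is only that both behaviours are implementable; which one DFDL should require remains the TBD above.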
If 'skip' then the invalid byte sequences are dropped/ignored. No corresponding
characters are created in the DFDL infoset.
If 'error' then a processing error occurs.
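
As a rough sketch of the three proposed parse-time policies, the following maps the dfdl:inputEncodingErrorPolicy values onto Python's standard decode error handlers. The mapping table and the function name decode_with_policy are assumptions made for the example; a DFDL processor built on ICU would use ICU's own callbacks instead.

```python
# Proposed policy value -> Python decode error handler (assumed analogy).
POLICY_TO_HANDLER = {
    "error": "strict",     # raise, i.e. a processing error
    "skip": "ignore",      # drop the offending bytes
    "replace": "replace",  # substitute U+FFFD
}


def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Decode data, handling invalid bytes according to the given policy."""
    handler = POLICY_TO_HANDLER[policy]
    return data.decode(encoding, errors=handler)


sample = b"ab\xffcd"  # 0xFF is not a valid ASCII byte

print(decode_with_policy(sample, "ascii", "skip"))
print(decode_with_policy(sample, "ascii", "replace") == "ab\ufffdcd")
try:
    decode_with_policy(sample, "ascii", "error")
except UnicodeDecodeError as exc:
    print("processing error:", exc.reason)
```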
It is suggested that a DFDL user who wants to preserve data whose encodings
contain these kinds of errors should model such data as xs:hexBinary, or as
an xs:string using an encoding such as iso-8859-1, which preserves all bytes.
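
A quick way to see why iso-8859-1 is suitable for this: every one of the 256 byte values decodes to exactly one character and encodes back to the same byte. The sketch below checks that with Python's codec (an illustration only, not a DFDL processor).

```python
# Every possible byte value, 0x00 through 0xFF.
raw = bytes(range(256))

# Decoding never fails: each byte maps to the code point of the same value.
as_text = raw.decode("iso-8859-1")

# Encoding the resulting string gives back the original bytes unchanged.
round_tripped = as_text.encode("iso-8859-1")

print(len(as_text), round_tripped == raw)
```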
Note for errata: The language in section 4.1.2 Item 5 about decoding data
into infoset Unicode has to change of course as well.
Suggested Resolution - Unparsing/Encoding Errors
The following are kinds of errors when encoding characters:
1. No mapping is provided by the encoding specification.
2. There is not enough room to output the entire encoding of the character
(e.g., 2 bytes are needed for a DBCS character, but only 1 byte remains in
the available length).
The behavior in these cases is controlled by dfdl:outputEncodingErrorPolicy.
If the policy is 'error' then a processing error occurs (both case 1 and
case 2).
If the policy is 'skip' then the character is skipped. No character is
encoded to be output for case 1, and no partial character is attempted
in case 2.
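
The 'error' and 'skip' behaviour for both cases can be sketched as follows. Everything here is hypothetical and for illustration: the ProcessingError class, the unparse_text function, and the available_bytes parameter standing in for the available length. The sketch encodes one character at a time, so it assumes a stateless encoding without a byte-order mark, and it assumes that after a skipped character later characters are still attempted, which the proposal does not spell out.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error (hypothetical name)."""


def unparse_text(text: str, encoding: str, policy: str, available_bytes: int) -> bytes:
    """Encode text into at most available_bytes bytes under 'error' or 'skip'."""
    if policy not in ("error", "skip"):
        raise ValueError("this sketch covers only 'error' and 'skip'")
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            encoded = None  # case 1: no mapping in the target encoding
        # case 2: the character maps, but its bytes do not fit
        no_room = encoded is not None and len(out) + len(encoded) > available_bytes
        if encoded is None or no_room:
            if policy == "error":
                raise ProcessingError(f"cannot encode {ch!r} as {encoding}")
            continue  # 'skip': emit nothing, not even a partial character
        out += encoded
    return bytes(out)


# Case 1: the euro sign has no ASCII mapping, so 'skip' drops it.
print(unparse_text("a\u20acb", "ascii", "skip", 10))
# Case 2: the second character needs 2 bytes in shift_jis but only 1 remains.
print(unparse_text("a\u3042", "shift_jis", "skip", 2))
```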
If the policy is 'replace' then the behavior is determined by the encoding
specification, and by the dfdl:encodingReplacementCharacter
annotation element.
If provided, the dfdl:encodingReplacementCharacter annotation can appear
anywhere that DFDL annotations are allowed.
It has two attributes, which are 'encoding', and 'character'.
The 'encoding' attribute specifies the encoding for which a replacement
character is being specified. This takes the same values as the dfdl:encoding
format property.
The 'character' attribute specifies a DFDL literal string specifying exactly
one character. This character is used as the replacement character for
the specified encoding whenever that encoding is in use, and the dfdl:encodingReplacementCharacter
annotation is in scope according to the usual scoping rules.
There are these cases to consider when the policy is 'replace':
1. There is no standard replacement character defined as part of the encoding
specification, and there is no dfdl:encodingReplacementCharacter annotation element.
2. There is a standard replacement character defined as part of the encoding, and
there is no dfdl:encodingReplacementCharacter annotation element.
3. There is a dfdl:encodingReplacementCharacter annotation element.
In case 1, since no replacement is possible, a processing
error occurs.
In case 2, the standard replacement character is used to replace the unmapped
or error data.
In case 3, the character specified by the dfdl:encodingReplacementCharacter
annotation is used to replace the unmapped or error data. Note specifically,
if the character set has a standard replacement character, the dfdl:encodingReplacementCharacter
annotation can be used to override use of the standard replacement character.
In cases 2 and 3, it may still be impossible to output the replacement
character if there is not enough room for its encoding. This situation is
always a processing error.
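
The three 'replace' cases, plus the no-room rule, can be sketched like this. All names are hypothetical: ProcessingError, resolve_replacement, encode_replacing, and the two dictionaries, where standard stands in for whatever the encoding specification (or ICU) defines and overrides stands in for in-scope dfdl:encodingReplacementCharacter annotations. The 0x1A value shown for ASCII is only an illustrative entry, and the sketch assumes a stateless encoding.

```python
class ProcessingError(Exception):
    """Stands in for a DFDL processing error (hypothetical name)."""


def resolve_replacement(encoding: str, standard: dict, overrides: dict) -> str:
    """Pick the replacement character for an encoding per cases 1 to 3."""
    if encoding in overrides:   # case 3: annotation wins, even over a standard one
        return overrides[encoding]
    if encoding in standard:    # case 2: fall back to the standard character
        return standard[encoding]
    raise ProcessingError(f"no replacement character for {encoding}")  # case 1


def encode_replacing(text, encoding, standard, overrides, available_bytes):
    """Encode text under the 'replace' policy into at most available_bytes."""
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            encoded = None  # unmapped character
        if encoded is None or len(out) + len(encoded) > available_bytes:
            replacement = resolve_replacement(encoding, standard, overrides)
            encoded = replacement.encode(encoding)
            if len(out) + len(encoded) > available_bytes:
                raise ProcessingError("no room for the replacement character")
        out += encoded
    return bytes(out)


standard = {"ascii": "\x1a"}  # illustrative entry only
print(encode_replacing("a\u20ac", "ascii", standard, {}, 10))              # case 2
print(encode_replacing("a\u20ac", "ascii", standard, {"ascii": "|"}, 10))  # case 3
```

Passing an empty standard table and no overrides exercises case 1, and shrinking available_bytes to 1 exercises the no-room error.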
Note for errata: The language in section 4.1.2 about encoding data from
infoset Unicode has to change as well.