Re: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors - version 001

6 Dec 2011

Hi Mike,

I've taken a read through the proposal below. Very good summary of the 
problem. Comments in-line below.

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:   Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:     dfdl-wg@ogf.org
Date:   22/11/2011 19:30
Subject:        [DFDL-WG] Issue 156 - ICU fallback mappings - character 
encoding/decoding errors - version 001
Sent by:        dfdl-wg-bounces@ogf.org

Issue 156 - ICU fallback mappings - character encoding/decoding errors

Summary

DFDL currently lacks adequate capability to handle encoding and decoding 
errors. The language in the spec is incorrect or infeasible to implement. 
ICU provides mechanisms that give a degree of control over this issue; the 
question is whether and how to embrace those mechanisms, or whether to 
provide some alternative solution.

Discussion

This language in section 4.1.2 about character set decoding/encoding just 
doesn't work:

This first part is unacceptable because it fails to specify what happens 
when decoding fails because of errors. It only specifies what to do when 
there is no mapping to Unicode (which is, frankly, a very unlikely 
situation today), meaning a character is legally decoded but then has no 
mapping.

"During parsing, characters whose value is unknown or unrepresentable in 
ISO 10646 are replaced by the Unicode Replacement Character U+FFFD."

This second part also fails to work:

"During unparsing, characters that are unrepresentable in the target 
encoding will be replaced by the replacement character for that encoding."

Sounds symmetric and expedient, but the problem is that most character 
encodings have no reserved replacement character, and we expect that DFDL 
users will need a variety of different choices for how to deal with 
characters that cannot be encoded. 

The most general solution is to provide a table of what we can call 
fallback mappings which say what ISO 10646 characters translate into what 
character codes in the target character set. This is, in effect, 
specifying all or part of a table-driven character set encoder.

ICU provides a mechanism for this. DFDL could provide a mechanism for 
specifying these fallback mappings which can be converted by 
implementations into what the ICU libraries provide.
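To make the idea of a table of fallback mappings concrete, here is a minimal sketch. It uses Python's standard codecs error-handler hook as a stand-in for ICU, so it illustrates the concept rather than ICU's API. The table contents and the handler name "dfdl_fallback_sketch" are assumptions made up for this illustration.

```python
import codecs

# Hypothetical fallback table: Unicode character -> bytes in the target encoding.
# The entries are illustrative only; they are not taken from any real codepage.
FALLBACKS = {
    "\u00a5": b"\x5c",  # YEN SIGN -> byte 0x5C
    "\ufffd": b"\x3f",  # REPLACEMENT CHARACTER -> byte 0x3F ('?')
}


def fallback_handler(err):
    """Consult FALLBACKS for the first character the codec could not encode."""
    if not isinstance(err, UnicodeEncodeError):
        raise err
    ch = err.object[err.start]
    if ch in FALLBACKS:
        # Return the raw bytes and resume encoding just after this character.
        return FALLBACKS[ch], err.start + 1
    raise err  # no fallback mapping: surface the original error


# "dfdl_fallback_sketch" is a made-up handler name.
codecs.register_error("dfdl_fallback_sketch", fallback_handler)

encoded = "price \u00a5100".encode("ascii", errors="dfdl_fallback_sketch")
```

Note that the table author has to know what byte 0x5C means in the target encoding, which is exactly the usability concern raised later in this proposal.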

An implicit goal for DFDL is to take advantage of ICU so as to reduce 
implementation complexity.

SMH: We talk about ICU in the context of text number & calendar patterns, 
but not for encodings. We should change that.

A few concerns/issues we will discuss:
1.      Do we provide a "failure" option where a processing-error occurs 
rather than a substitution of a different character code?
2.      Do we require a default fallback mapping to be used for any 
character for which there is no other fallback? (Thereby making it 
possible to state that no processing error will occur.)
These issues are addressed in the suggested resolution below.

But first....

Some Useful Bits....

ICU lets you specify fallback mappings to be used when a primary encoding 
has no mapping. 

Example:

<UFFE4> \xFA\x55 
<UFFE5> \x81\x8F 
<UFFFD> \xFC\xFC

You also need the encoding name, but the lines above are ICU's notation 
for three individual fallback mappings.

Issue - you have to understand the encoding. You can't just easily specify 
that one character should be substituted when an unmapped character is 
encountered. An example of this is trying to round-trip data that has 
unmapped characters. On parsing, a Unicode string is created, and the 
Unicode Replacement Character with code 0xFFFD is substituted for any 
undefined characters. If we wish to round-trip this data back to the same 
external character encoding, we must say what we want substituted for 
these 0xFFFD replacement characters in the output encoding.

In the above example, you see that 0xFFFD is substituted by the two bytes 
xFC and xFC again. However, you have to understand what those bytes mean 
in the encoding. There is no easy way to, for example, specify that 0xFFFD 
should be mapped to a space (%SP;) or to an "_" character, or a short 
string perhaps like "&#xFFFD;" or "<ERR>" or "\E". Instead you have to 
find out the encoding of such things and express them in bytes.

The ICU libraries seem to have the ability to detect any sort of 
conversion error, and very general hooks for specifying which are ignored, 
and/or how they are resolved.

SMH: For the record, I believe ICU allows:
 - Skip  -> skip the character.
 - Stop  ->  stop and throw an error
 - Substitute  -> substitute with character defined in codepage
 - Escape  -> replaces with the hexadecimal representation of the illegal 
character.
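For readers who want to see the four behaviours side by side, here is a rough analogue using the error handlers built into Python's standard codecs. These are Python's handlers, not ICU's callbacks, and the pairing with Skip/Stop/Substitute/Escape is an approximation made for illustration.

```python
text = "caf\u00e9"  # the last character has no ASCII encoding

# Skip: drop the unencodable character.
skipped = text.encode("ascii", errors="ignore")

# Substitute: use the codec's own substitution character ('?' for ASCII).
substituted = text.encode("ascii", errors="replace")

# Escape: emit a textual escape of the offending code point.
escaped = text.encode("ascii", errors="backslashreplace")

# Stop: raise an error at the first unencodable character.
try:
    text.encode("ascii", errors="strict")
    stopped = False
except UnicodeEncodeError:
    stopped = True
```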

There is also a draft for a Unicode standard for specifying and 
standardizing these mappings: Unicode Technical Standard #22, which has a 
syntax more like this:

  <!--Fallbacks-->
  <fub u="00A1" b="21" />
  <fub u="00A2" b="81 91" />
  <fub u="00A3" b="81 92" />
  <fub u="00A5" b="5C" />
  <fub u="00A6" b="7C" />
  <fub u="00A9" b="63" />

Same information content as the ICU stuff. Just more XML-ish format. Same 
problems with having to understand the encoding in detail.

Suggested Resolution - Part 1 - Fallback Mappings

For DFDL, I propose a new annotation element, e.g., 
dfdl:encodingModifier, like so:

   <dfdl:encodingModifier encoding="ASCII" character="%#x00A2;"  
replacement="%#x81;%#x91;" />

SMH: Not clear from the syntax that this is intended for unparse only. I 
think we need the word 'output' in there, like we do for 'outputNewLine' 
and 'textOutputMinLength'.

This annotation element specifies the encoding being modified, and is 
otherwise a way of specifying the same information content as a UTS22 fub 
element but in a way consistent with the rest of the DFDL language.
The character attribute is directly analogous to the u attribute in the 
fub element from the UTS22 proposal except we specify a single Unicode 
character instead of hex. The replacement attribute is a DFDL literal 
string, illustrated above as specifying two bytes in the DFDL-way of doing 
so.

This element also allows things like:

  <dfdl:encodingModifier encoding="ASCII" character="%#xFFFD;" 
replacement="&lt;ERR&gt;" />

That is, the DFDL user does not need to manually figure out the 
consecutive bytes that make up "<ERR>" in their target encoding, rather 
they can use printing characters in the usual DFDL way for a string 
literal.

SMH: What if the encoding translation from the replacement string to the 
target fails?  By allowing DFDL entities are we providing anything that 
ICU can't achieve?  What if the replacement causes a fixed length to be 
exceeded?
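The first question can be made concrete with a small sketch: if the replacement is written as a string, an implementation has to encode it in the target encoding, and that step can itself fail. The function name resolve_replacement is hypothetical, and reporting the failure as a ValueError is only one possible choice.

```python
def resolve_replacement(replacement, encoding):
    """Encode a replacement string once, up front, in the target encoding.

    Raises ValueError if the replacement itself cannot be encoded, so the
    problem is reported when the mapping is defined rather than mid-unparse.
    """
    try:
        return replacement.encode(encoding, errors="strict")
    except UnicodeEncodeError as exc:
        raise ValueError(
            f"replacement {replacement!r} is not representable in {encoding}"
        ) from exc


ascii_err = resolve_replacement("<ERR>", "ascii")
ebcdic_err = resolve_replacement("<ERR>", "cp037")  # same text, different bytes
```

The same printable literal yields different bytes per encoding, which is the convenience the string form offers over raw byte values.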

This element is straightforwardly converted into either the UTS22 or ICU 
fallback notation for ease of implementations.

A beneficial side effect is that our use of a character for the input side 
of these mappings will naturally allow for things like this in a DFDL 
schema:

<dfdl:encodingModifier encoding="ASCII" character="年" replacement="Y"/>

That character is the Japanese Kanji character for year, used in dates 
like: 2003年08月27日. It clearly has no representation in ASCII, so 
representing it as "Y" is a plausible substitute character. Of course if 
the string data will be processed by a program that can process, say, the 
XML-style of entity notation, then one can do this:

<dfdl:encodingModifier encoding="ASCII" character="年" 
replacement="&#x5E74;"/>

This would write "&#x5E74;" to the output, which is pure ASCII, and will 
read back in as the Kanji character in a Unicode-capable program that 
understands these entities. Note: there is no mechanism here for, say, 
translating any Japanese Kanji character into its corresponding entity 
format. That's beyond the scope of what we're trying to achieve here, and 
is really a data transformation. The above is really just a pleasant side 
effect of taking the UTS22/ICU stuff, and mapping it into DFDL in a way 
that is consistent with the rest of DFDL.
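A short sketch of the round trip described above, with a single illustrative mapping. Python's html.unescape plays the part of the "Unicode-capable program that understands these entities"; that choice is an assumption made for this example.

```python
import html

# Single illustrative mapping, mirroring the encodingModifier example above.
ENTITY_FALLBACK = {ord("\u5e74"): "&#x5E74;"}

text = "2003\u5e7408"

# Apply the one mapping, then encode; the result is pure ASCII.
ascii_bytes = text.translate(ENTITY_FALLBACK).encode("ascii")

# A consumer that understands XML-style character references recovers the Kanji.
restored = html.unescape(ascii_bytes.decode("ascii"))
```

Only the one mapped character is handled; any other Kanji in the text would still fail to encode, which matches the stated scope.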

SMH: As proposed there is no way to place encoding modifiers in one xsd 
and have them picked up by another. We need a scoping mechanism. 
Suggestions:

a) Syntax as you propose but allow dfdl:encodingModifier only as a child 
of dfdl:format.

        <dfdl:format encoding="ASCII" ...>
                <dfdl:encodingModifier encoding="ASCII" character="年" 
replacement="&#x5E74;"/>
                <dfdl:encodingModifier encoding="ASCII" 
character="%#x00A2;"  replacement="%#x81;%#x91;" />
        </dfdl:format>

b) Syntax like defineEscapeScheme and escapeSchemeRef. That way we get 
scoping rules behaving as expected, with infinite flexibility.

<dfdl:defineEncodingModifier name="ASCII-mod-1" >
        <dfdl:encodingModifier character="年" replacement="&#x5E74;"/>
        <dfdl:encodingModifier character="%#x00A2;"  
replacement="%#x81;%#x91;" />
</dfdl:defineEncodingModifier>

<dfdl:format encoding="ASCII" encodingModifierRef="ASCII-mod-1" ... />

My vote is for b).

Wild-Card Mappings

A wild-card mapping, specified as below by simply leaving out the 
character attribute, would allow substitution for any otherwise unmapped 
character (no standard mapping, and no other fallback mapping):

  <dfdl:encodingModifier encoding="ASCII" replacement="$$" /> 

This example would translate any Unicode character headed into ASCII, for 
which there is no default and no fallback mapping, into two dollar signs.

SMH: Are wildcard encodingModifiers allowed in conjunction with specific 
encodingModifiers?  I would say 'yes', the specific takes precedence.
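The precedence suggested here (standard mapping first, then a specific modifier, then the wild-card) could be sketched as follows. The function name and the dictionary representation of the modifiers are hypothetical, and encoding one character at a time is a simplification that is only safe for stateless encodings.

```python
def encode_with_fallbacks(text, encoding, specific, wildcard=None):
    """Encode text, consulting specific mappings before the wild-card."""
    out = bytearray()
    for ch in text:
        try:
            # Standard mapping, if the encoding has one.
            out += ch.encode(encoding)
        except UnicodeEncodeError:
            # Specific mapping takes precedence over the wild-card.
            replacement = specific.get(ch, wildcard)
            if replacement is None:
                raise  # no mapping of any kind: processing error
            out += replacement.encode(encoding)
    return bytes(out)


result = encode_with_fallbacks(
    "a\u5e74b\u6708", "ascii", specific={"\u5e74": "Y"}, wildcard="$$"
)
```

Without a wild-card, the second Kanji in this input would raise, which corresponds to the "processing error" outcome discussed later.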

Suggested Resolution - Part 2 - Parsing/Decoding Errors

There are three errors that can occur when decoding characters into 
Unicode/ISO 10646. 
1.      a validly decoded character has no assigned mapping to Unicode 
(TBD: can this really happen?)
2.      the data is broken - invalid byte sequences that don't match the 
definition of the encoding are encountered.
3.      not enough bytes are found to make up the entire encoding of a 
character. That is, a fragment of a valid encoding is found.

For (1) the Unicode replacement character '�' (U+FFFD) is substituted. 

For (2), the private use area (PUA) Unicode code points U+E000..U+E0FF are 
used, where the low 8 bits of each code point hold the value of one invalid 
byte. This preserves the information content of such data, albeit with some 
processing required to manipulate it. (Note: This is a variation on a 
popular way of dealing with invalid byte sequences, described in the 
"Invalid Encodings" reference below, though using the reserved private use 
area (PUA) codepoints instead of the illegal reserved surrogate 
codepoints.)

For (3) the Unicode replacement character '�' (U+FFFD) is substituted. 
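As an illustration of the PUA idea for case (2), here is a sketch built on Python's codecs error-handler hook. The handler name "pua_sketch" is made up, and for brevity the sketch does not distinguish case (2) from case (3): every undecodable byte, including a truncated trailing sequence, is mapped into U+E000..U+E0FF.

```python
import codecs

PUA_BASE = 0xE000  # invalid byte 0xNN becomes code point U+E0NN


def pua_decode_handler(err):
    """Map each undecodable byte to a private use code point."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad_bytes = err.object[err.start:err.end]
    replacement = "".join(chr(PUA_BASE + b) for b in bad_bytes)
    # Resume decoding after the bytes the codec reported as invalid.
    return replacement, err.end


# "pua_sketch" is a made-up handler name.
codecs.register_error("pua_sketch", pua_decode_handler)

decoded = b"ab\xffcd".decode("utf-8", errors="pua_sketch")
```

Python's built-in "surrogateescape" handler implements the surrogate-based variant of the same idea.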

In case (2), if a DFDL schema author wishes to disallow these characters 
and get a processing error, then a dfdl:assert can be used with a pattern 
expression to inspect the characters and cause an assertion failure if any 
of the above character codes are found in the infoset string. A derived 
simple subtype from string can make this notationally convenient if the 
behavior is desired for a large number of string elements.

Example: <dfdl:assert testKind="pattern" 
test="^[-%#xFFFD;|-%#xE000;-%#xE0FF]+" message="illegal or substitution 
characters were found"/>

TBD: That pattern regex is of course an attempt to say a string not 
containing any of those character codes. I am not sure it is correct, but 
I am sure one is possible so long as we have negation and can depend on 
the ability to put DFDL (or XML) entities into the string literal that 
makes up a regex.
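For reference, the intended check ("a string containing none of those character codes") can be written with a negated character class. The sketch below uses Python's re syntax rather than the DFDL pattern dialect, so it shows the shape of the expression only; the name is_clean is hypothetical.

```python
import re

# Matches only strings made entirely of characters other than U+FFFD
# and the private use range U+E000..U+E0FF.
CLEAN = re.compile("[^\uFFFD\uE000-\uE0FF]*")


def is_clean(value):
    """Return True if value contains no substitution or PUA-escaped characters."""
    return CLEAN.fullmatch(value) is not None
```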

SMH: I am ok with (1) but maybe the user needs to be given the choice of a 
processing error? How is the key question. Is this a property or a 
processing option? 
For (2) and (3) I think this is a genuine error and the default behaviour 
should be processing error. Does ICU let us distinguish (1) from (2) and 
(3) ? 
I think use of a dfdl:assert as a workaround to throwing an error is too 
heavyweight here. You'd have to do this on every element, there's no way 
to scope it.

The language in section 4.1.2 about decoding data into infoset Unicode has 
to change of course as well.

Suggested Resolution - Part 3 - Unparsing/Encoding Errors

The following are kinds of errors when encoding characters:
1.      no mapping by default, no fallback mapping specified - and no 
wild-card mapping specified
2.      not enough room to output the entire encoding of the character 
(e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available 
length)
3.      infoset data contains the reserved private use Unicode code points 
U+E000 to U+E0FF

Case (1) is a processing error. A DFDL schema author can defend against 
this by providing a wild-card mapping.

In case (2), the subset of the bytes that fit in the allowed space is 
output; any subsequent bytes of the character encoding are silently dropped.

SMH: Surely this is just a b-a-u processing error - the data can't fit 
into the specified length ? 
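The two positions on case (2) can be contrasted in a small sketch. The function name encode_fixed and the on_overflow switch are hypothetical; Shift_JIS is used only because it is a readily available double-byte encoding.

```python
def encode_fixed(text, encoding, length_bytes, on_overflow="error"):
    """Encode text into at most length_bytes bytes.

    on_overflow="truncate" keeps only the bytes that fit (the proposal);
    on_overflow="error" reports the overflow (the processing-error view).
    """
    data = text.encode(encoding)
    if len(data) <= length_bytes:
        return data
    if on_overflow == "truncate":
        return data[:length_bytes]  # may leave a fragment of a character
    raise ValueError(
        f"{len(data)} bytes do not fit in a fixed length of {length_bytes}"
    )


kanji = "\u5e74"  # needs two bytes in Shift_JIS
fragment = encode_fixed(kanji, "shift_jis", 1, on_overflow="truncate")
```

The truncated result is a one-byte fragment that a later parse would hit as decoding error (3), which is one argument for treating the overflow as an error.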

In case (3), the bytes 0x00 to 0xFF are output, corresponding to the 
characters U+E000 to U+E0FF. This enables round-tripping of data that 
contains character encoding errors.
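A sketch of the unparse side of case (3). Because the PUA code points are themselves encodable in encodings such as UTF-8, an encoder's error path would never see them, so this sketch checks for them explicitly before encoding. The function name is hypothetical, and the per-character encoding is a simplification that assumes a stateless encoding.

```python
def unparse_with_pua(text, encoding):
    """Encode text, turning U+E000..U+E0FF back into the raw bytes 0x00..0xFF."""
    out = bytearray()
    for ch in text:
        code_point = ord(ch)
        if 0xE000 <= code_point <= 0xE0FF:
            out.append(code_point - 0xE000)  # restore the original invalid byte
        else:
            out += ch.encode(encoding)
    return bytes(out)


restored_bytes = unparse_with_pua("ab\ue0ffcd", "utf-8")
```

One consequence visible here is that genuine data using U+E000..U+E0FF can no longer be written out as those characters.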

The language in section 4.1.2 about encoding data from infoset Unicode has 
to change as well.

References: 

* ICU conversion rules tables - 
http://userguide.icu-project.org/conversion/data#TOC-Examples-for-codepage-s...
* Unicode Technical Standard #22 - Unicode Character Mapping Markup 
Language (CharMapML) - http://www.unicode.org/reports/tr22/tr22-7.html
* Invalid Encodings - 
http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences 

-- 
Mike Beckerle | OGF DFDL WG Co-Chair 
Tel:  781-330-0412