Sounds symmetric and expedient, but the problem is that most character
encodings have no reserved replacement character, and we expect that DFDL
users will need a variety of different choices for how to deal with characters
that cannot be encoded.
The most general solution is to provide a table of what we can call fallback
mappings, which says which ISO 10646 characters translate into which character
codes in the target character set. This is, in effect, specifying all or
part of a table-driven character set encoder.
ICU provides a mechanism for this. DFDL could provide a mechanism for specifying
these fallback mappings which can be converted by implementations into
what the ICU libraries provide.
An implicit goal for DFDL is to take advantage of ICU so as to reduce implementation
complexity.
SMH: We talk about ICU in the context of text number & calendar patterns, but not for encodings. We should change that.
A few concerns/issues we will discuss:
1. Do we provide a "failure" option where a processing error occurs rather than a substitution of a different character code?
2. Do we require a default fallback mapping to be used for any character for which there is no other fallback? (Thereby making it possible to state that no processing error will occur.)
These issues are addressed in the suggested resolution below.
But first....
Some Useful Bits....
ICU lets you specify fallback mappings to be used when a primary encoding
has no mapping.
Example:
<UFFE4> \xFA\x55
<UFFE5> \x81\x8F
<UFFFD> \xFC\xFC
You also need the encoding name, but the lines above are
ICU's notation for three individual fallback mappings.
Issue - you have to understand the encoding. You can't easily specify
that one character should be substituted when an unmapped character is
encountered. An example of this is trying to round-trip data that has unmapped
characters. On parsing, a Unicode string is created, and the Unicode Replacement
Character with code 0xFFFD is substituted for any undefined characters.
If we wish to round-trip this data back to the same external character
encoding, we must say what we want substituted for these 0xFFFD replacement
characters in the output encoding.
In the above example, you see that 0xFFFD is replaced by the two bytes
0xFC 0xFC. However, you have to understand what those bytes mean
in the encoding. There is no easy way to, for example, specify that 0xFFFD
should be mapped to a space (&SP;) or to an "_" character,
or a short string perhaps like "&#xFFFD;" or "<ERR>"
or "\E". Instead you have to find out the encoding of such things
and express them in bytes.
The ICU libraries seem to have the ability to detect any sort of conversion
error, and very general hooks for specifying which are ignored, and/or
how they are resolved.
SMH: For the record, I believe ICU allows:
- Skip -> skip the character.
- Stop -> stop and throw an error.
- Substitute -> substitute with the character defined in the codepage.
- Escape -> replace with the hexadecimal representation of the illegal character.
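For what it's worth, these four behaviours have close analogues in other codec APIs. A quick illustration using Python's built-in encode error handlers (illustrative only; DFDL implementations would sit on ICU, not Python):

```python
# "年" (U+5E74) has no mapping in ASCII, so each handler's policy kicks in.
s = "year: \u5e74"

print(s.encode("ascii", "ignore"))            # Skip: drop the character
print(s.encode("ascii", "replace"))           # Substitute: codec's sub char, '?'
print(s.encode("ascii", "backslashreplace"))  # Escape: hex escape of the char
try:
    s.encode("ascii", "strict")               # Stop: raise an error
except UnicodeEncodeError as e:
    print("strict raised:", e.reason)
```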
There is also a draft Unicode standard for specifying and standardizing
these mappings: Unicode Technical Standard #22, which uses a syntax more
like this:
<!--Fallbacks-->
<fub u="00A1" b="21" />
<fub u="00A2" b="81 91" />
<fub u="00A3" b="81 92" />
<fub u="00A5" b="5C" />
<fub u="00A6" b="7C" />
<fub u="00A9" b="63" />
Same information content as the ICU stuff, just in a more XML-ish
format. Same problems with having to understand the encoding in detail.
Suggested Resolution - Part 1 - Fallback Mappings
For DFDL, I propose a new annotation element, e.g., dfdl:encodingModifier,
like so:
<dfdl:encodingModifier encoding="ASCII" character="%#x00A2;"
replacement="%#x81;%#x91;" />
SMH: Not clear from the syntax that this
is intended for unparse only. I think we need the word 'output' in there,
like we do for 'outputNewLine' and 'textOutputMinLength'.
This annotation element specifies the encoding being modified, and is otherwise
a way of specifying the same information content as a UTS22 fub element
but in a way consistent with the rest of the DFDL language.
The character attribute is directly analogous to the u attribute in the
fub element from the UTS22 proposal except we specify a single Unicode
character instead of hex. The replacement attribute is a DFDL literal string,
illustrated above as specifying two bytes in the DFDL-way of doing so.
This element also allows things like:
<dfdl:encodingModifier encoding="ASCII" character="%#xFFFD;"
replacement="<ERR>" />
That is, the DFDL user does not need to manually figure out the consecutive
bytes that make up "<ERR>" in their target encoding, rather
they can use printing characters in the usual DFDL way for a string literal.
SMH: What if the encoding translation from
the replacement string to the target fails? By allowing DFDL entities
are we providing anything that ICU can't achieve? What if the replacement
causes a fixed length to be exceeded?
This element is straightforwardly converted into either the UTS22 or the ICU
fallback notation, for ease of implementation.
A beneficial side effect is that our use of a character for the input side
of these mappings will naturally allow for things like this in a DFDL schema:
<dfdl:encodingModifier encoding="ASCII" character="年"
replacement="Y"/>
That character is the Japanese Kanji character for year, used in dates
like: 2003年08月27日. It clearly has no representation in ASCII, so
representing it as "Y" is a plausible substitute character. Of
course if the string data will be processed by a program that can process
say, the XML-style of entity notation, then one can do this:
<dfdl:encodingModifier encoding="ASCII" character="年"
replacement="&#x5E74;"/>
This would output the literal string "&#x5E74;", which is pure
ASCII, and it will read back in as the Kanji character in a Unicode-capable
program that understands these entities. Note: there is no mechanism here
for say, translating any Japanese Kanji character into its corresponding
entity format. That's beyond the scope of what we're trying to achieve
here, and is really a data transformation. The above is really just a pleasant
side effect of taking the UTS22/ICU stuff, and mapping it into DFDL in
a way that is consistent with the rest of DFDL.
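To make the intended semantics concrete, here is a sketch of a table-driven fallback encoder using Python's codec error hooks. The table contents mirror the examples above; the handler name and mechanism are my own illustration, not part of the proposal (a DFDL implementation would feed the same table to ICU):

```python
import codecs

# Hypothetical fallback table mirroring the dfdl:encodingModifier examples:
# U+5E74 (年) falls back to "Y"; U+FFFD falls back to the string "<ERR>".
# Replacements are given as strings and encoded into the target charset,
# so nobody has to spell out raw bytes by hand.
FALLBACKS = {0x5E74: "Y", 0xFFFD: "<ERR>"}

def dfdl_fallback(err):
    parts = []
    for ch in err.object[err.start:err.end]:
        rep = FALLBACKS.get(ord(ch))
        if rep is None:
            raise err  # no fallback mapping: a processing error
        parts.append(rep)
    return "".join(parts), err.end

codecs.register_error("dfdl-fallback", dfdl_fallback)

print("2003\u5e74".encode("ascii", "dfdl-fallback"))  # b'2003Y'
print("\ufffd!".encode("ascii", "dfdl-fallback"))     # b'<ERR>!'
```

Note how the replacement string is encoded by the codec itself, which is exactly the convenience the proposal is after.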
SMH: As proposed there is no way to place
encoding modifiers in one xsd and have them picked up by another. We need
a scoping mechanism. Suggestions:
a) Syntax as you propose but allow dfdl:encodingModifier only as a child of dfdl:format.
<dfdl:format encoding="ASCII" ...>
  <dfdl:encodingModifier encoding="ASCII" character="年" replacement="&#x5E74;"/>
  <dfdl:encodingModifier encoding="ASCII" character="%#x00A2;" replacement="%#x81;%#x91;" />
</dfdl:format>
b) Syntax like defineEscapeScheme and escapeSchemeRef.
That way we get scoping rules behaving as expected, with infinite flexibility.
<dfdl:defineEncodingModifier name="ASCII-mod-1">
  <dfdl:encodingModifier character="年" replacement="&#x5E74;"/>
  <dfdl:encodingModifier character="%#x00A2;" replacement="%#x81;%#x91;" />
</dfdl:defineEncodingModifier>
<dfdl:format encoding="ASCII" encodingModifierRef="ASCII-mod-1" ... />
My vote is for b).
Wild-Card Mappings
A wild-card mapping, specified as below by just leaving out the character
attribute, would allow substitution for any otherwise unmapped character
(no standard mapping, and no other fallback mapping):
<dfdl:encodingModifier encoding="ASCII" replacement="$$" />
This example would translate any Unicode character being encoded into ASCII,
for which there is no default and no fallback mapping, into two dollar signs.
SMH: Are wildcard encodingModifiers allowed
in conjunction with specific encodingModifiers? I would say 'yes',
the specific takes precedence.
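The specific-takes-precedence behaviour SMH suggests composes naturally: consult the specific table first, then fall through to the wild-card. Sketched with the same Python error-hook device (illustrative only; the names are mine):

```python
import codecs

SPECIFIC = {0x5E74: "Y"}   # specific fallback: 年 -> "Y"
WILDCARD = "$$"            # wild-card: any other unmapped character -> "$$"

def wildcard_fallback(err):
    # The specific mapping takes precedence; the wild-card catches the rest.
    reps = (SPECIFIC.get(ord(c), WILDCARD) for c in err.object[err.start:err.end])
    return "".join(reps), err.end

codecs.register_error("dfdl-wildcard", wildcard_fallback)

print("\u5e74 \u00a9".encode("ascii", "dfdl-wildcard"))  # b'Y $$'
```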
Suggested Resolution - Part 2 - Parsing/Decoding Errors
There are three errors that can occur when decoding characters into Unicode/ISO 10646:
1. a validly decoded character has no assigned mapping to Unicode (TBD: can this really happen?)
2. the data is broken - invalid byte sequences that don't match the definition of the encoding are encountered.
3. not enough bytes are found to make up the entire encoding of a character. That is, a fragment of a valid encoding is found.
For (1), the Unicode replacement character '�' (U+FFFD) is substituted.
For (2), the private use area (PUA) Unicode code points U+E000..U+E0FF
are used, where the low 8 bits are each invalid byte's value. This preserves
the information content of such data, albeit with some processing required
to manipulate it. (Note: This is a variation on an idea suggested
as a popular way of dealing with invalid byte sequences - see the Invalid
Encodings reference below - though using the reserved private use area (PUA)
codepoints instead of the illegal reserved surrogate codepoints.)
For (3), the Unicode replacement character '�' (U+FFFD) is substituted.
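Python's built-in 'surrogateescape' error handler does exactly the surrogate-codepoint variant mentioned in the note; the PUA variant proposed here is easy to sketch the same way (my illustration; a real implementation would also need to distinguish case (2) from cases (1) and (3), which this hook does not attempt):

```python
import codecs

def pua_escape(err):
    # Map each undecodable byte to U+E000 + byte value, so the raw
    # data survives the trip into the Unicode infoset.
    bad = err.object[err.start:err.end]
    return "".join(chr(0xE000 + b) for b in bad), err.end

codecs.register_error("pua-escape", pua_escape)

print(b"ok \xff\xfe".decode("utf-8", "pua-escape"))  # 'ok \ue0ff\ue0fe'
```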
In case (2), if a DFDL schema author wishes to disallow
these characters and get a processing error, then a dfdl:assert can be
used with a pattern expression to inspect the characters and cause an assertion
failure if any of the above character codes are found in the infoset string.
A derived simple subtype from string can make this notationally convenient
if the behavior is desired for a large number of string elements.
Example: <dfdl:assert testKind="pattern" test="^[^%#xFFFD;%#xE000;-%#xE0FF;]*$"
message="illegal or substitution characters were found"/>
TBD: That pattern regex is of course an attempt to say a string not containing
any of those character codes. I am not sure it is correct, but I am sure
one is possible so long as we have negation and can depend on the ability
to put DFDL (or XML) entities into the string literal that makes up a regex.
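In a general-purpose regex engine the intent is a negated character class over U+FFFD and U+E000-U+E0FF. A quick sanity check of that shape (Python regex syntax, not DFDL entity notation):

```python
import re

# Matches only strings free of the replacement character and of the PUA
# byte-escape range introduced during decoding.
clean = re.compile(r"^[^\uFFFD\uE000-\uE0FF]*$")

print(bool(clean.match("hello")))            # True
print(bool(clean.match("bad \ufffd data")))  # False
print(bool(clean.match("raw \ue042 byte")))  # False
```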
SMH: I am ok with (1), but maybe the user needs to be given the choice
of a processing error? How is the key question. Is this a property or a
processing option?
For (2) and (3) I think this is a genuine error and the default behaviour
should be a processing error. Does ICU let us distinguish (1) from (2) and (3)?
I think use of a dfdl:assert as a workaround for throwing an error is too
heavyweight here. You'd have to do this on every element; there's no way
to scope it.
The language in section 4.1.2 about decoding data into infoset Unicode
has to change of course as well.
Suggested Resolution - Part 3 - Unparsing/Encoding Errors
The following are the kinds of errors that can occur when encoding characters:
1. no mapping by default, no fallback mapping specified - and no wild-card mapping specified
2. not enough room to output the entire encoding of the character (e.g., need 2 bytes for a DBCS, but only 1 byte remains in the available length)
3. infoset data contains the reserved private use Unicode code points U+E000 to U+E0FF
Case (1) is a processing error. A DFDL schema author can
defend against this by providing a wild-card mapping.
Case (2): the subset of the bytes that fit in the allowed
space is output; any subsequent bytes of the character's encoding are silently
dropped.
SMH: Surely this is just a business-as-usual processing
error - the data can't fit into the specified length?
Case (3): the bytes 0x00 to 0xFF are output, corresponding
to the characters U+E000 to U+E0FF. This enables round-tripping of data
that contains character encoding errors.
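The encode direction of this round-trip can be sketched with the same error-hook device as before (my illustration, not normative): the handler turns each U+E0xx code point back into the raw byte it stands for.

```python
import codecs

def pua_unescape(err):
    # Convert U+E000..U+E0FF back into the original raw bytes on output.
    out = bytearray()
    for ch in err.object[err.start:err.end]:
        cp = ord(ch)
        if 0xE000 <= cp <= 0xE0FF:
            out.append(cp - 0xE000)
        else:
            raise err  # any other unmappable character is still case (1)
    return bytes(out), err.end

codecs.register_error("pua-unescape", pua_unescape)

print("ok \ue0ff\ue0fe".encode("ascii", "pua-unescape"))  # b'ok \xff\xfe'
```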
The language in section 4.1.2 about encoding data from infoset Unicode
has to change as well.
References:
* ICU conversion rules tables - http://userguide.icu-project.org/conversion/data#TOC-Examples-for-codepage-state-tables
* Unicode Technical Standard #22 - Unicode Character Mapping Markup Language
(CharMapML) http://www.unicode.org/reports/tr22/tr22-7.html
* Invalid Encodings: http://en.wikipedia.org/wiki/UTF-8#Invalid_byte_sequences
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412