Actually, your original wording is correct; my memory was at fault.
UTF-8 can go up to 4 bytes. The confusion in my mind was caused by a
distant memory of the CESU 6-byte thing. Your questions below are valid ones.
The CESU-8 question will naturally arise
because DFDL offers the dfdl:utf16Width property. The implication of utf16Width
is that DFDL recognises the fact that some applications do not distinguish
between a UTF-16 code unit (16 bits) and a UTF-16 character (16 or
32 bits). The existence of the property implies that we want the decision
to be an explicit decision taken by the modeller. I think that argues for
strict serialization of UTF-8, with support for the (non-Unicode) CESU-8
encoding being an optional feature in DFDL processors.
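To make the size difference concrete, here is a minimal Python sketch of the two serializations for one character outside the BMP. The helper name cesu8_encode is hypothetical (Python has no built-in CESU-8 codec), and the sketch only illustrates the byte layouts, not anything a DFDL processor is required to do.

```python
def cesu8_encode(text: str) -> bytes:
    """Encode text as CESU-8: every UTF-16 code unit is encoded on its own."""
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if cp > 0xFFFF:
            # Split into a UTF-16 surrogate pair, then write each half as its
            # own 3-byte sequence (this is what strict UTF-8 forbids).
            cp -= 0x10000
            high = 0xD800 + (cp >> 10)
            low = 0xDC00 + (cp & 0x3FF)
            for unit in (high, low):
                out += chr(unit).encode("utf-8", "surrogatepass")
        else:
            out += ch.encode("utf-8", "surrogatepass")
    return bytes(out)


ch = "\U00010400"                                 # a character outside the BMP
utf8 = ch.encode("utf-8")                         # standard 4-byte form
cesu8 = cesu8_encode(ch)                          # two 3-byte sequences
utf16_units = len(ch.encode("utf-16-be")) // 2    # code units in UTF-16
print(len(utf8), len(cesu8), utf16_units)         # prints: 4 6 2
```

The last number is the reason the utf16Width question comes up at all: one character, two 16-bit code units.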
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB, Steve Hanson/UK/IBM@IBMGB, dfdl-wg@ogf.org
Date: 14/12/2011 20:53
Subject: Re: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)
Tim, do you think you were thinking of this
encoding (CESU-8) (pronounced "sez you")
of surrogate pairs as two 3-byte UTF-8 sequences?
I believe there is also this hack by which code point 0 is encoded as two
bytes instead of just a 0. Not sure why this was needed, but it was a Java
object-serialization convention.
I was expecting that the ICU UTF-8 parser would deal with these, but it
traps them as errors. Using the callback hook one could change it to handle
them, or an encoding description that is more flexible could be created.
On parsing, being able to accept everything possible seems good.
The big concern is what to generate on unparse. E.g., for a floating surrogate,
generate CESU-8 3-byte sequence? or error out/substitute? For a surrogate
pair, generate two 3-byte CESU sequences for 6 byte total, or the UTF-8
standard 4-byte encoding?
Or, perhaps we're just trying to squeeze too much into one encoding, and
we actually need a strict and a tolerant variant of UTF-8? Like maybe people
should say CESU-8 if that's what they mean?
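As a rough illustration of the strict versus tolerant split on the parse side, here is a Python sketch. It uses Python's own codecs as a stand-in (this is not ICU, and says nothing about what the ICU callback hook can do), and the function name tolerant_utf8_decode is made up for the example.

```python
def tolerant_utf8_decode(data: bytes) -> str:
    """Decode UTF-8, also accepting CESU-8 pairs and the 2-byte NUL (C0 80)."""
    # The Java serialization convention writes U+0000 as the two bytes C0 80.
    data = data.replace(b"\xc0\x80", b"\x00")
    # 'surrogatepass' lets 3-byte encoded surrogates through as code points.
    text = data.decode("utf-8", errors="surrogatepass")
    # A round trip through UTF-16 joins an adjacent high/low surrogate pair
    # into one character; unpaired surrogates are kept as they are.
    return text.encode("utf-16-le", "surrogatepass").decode(
        "utf-16-le", "surrogatepass"
    )


cesu8_bytes = b"\xed\xa0\x81\xed\xb0\x80"   # U+10400 as two 3-byte sequences

try:
    cesu8_bytes.decode("utf-8")             # strict decoder
    strict_accepts = True
except UnicodeDecodeError:
    strict_accepts = False

tolerant_result = tolerant_utf8_decode(cesu8_bytes)
```

Here the strict decoder rejects the CESU-8 bytes, while the tolerant one yields the single character U+10400. The unparse question (4 bytes or 6) is not answered by this sketch.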
...mikeb
On Wed, Dec 14, 2011 at 5:29 AM, Tim Kimber <KIMBERT@uk.ibm.com>
wrote:
This is a little picky, but as the whole
point is to tighten up the spec....
UTF-8 characters should only ever be 1,2, or 3 bytes in length.
In some applications a single Unicode character that is outside of the
BMP ( so needs to be a surrogate pair in UTF-16 ) can end up as a pair
of 2-byte UTF-8 characters. So the end result is 4 bytes of UTF-8 for a
single Unicode character. But that's frowned upon by the Unicode consortium.
The application should convert the single Unicode character to a single
3-byte UTF-8 character.
regards,
Tim Kimber, Common Transformation Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 246742
From: Steve Hanson/UK/IBM@IBMGB
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org, Andreas Martens1/UK/IBM@IBMGB
Date: 14/12/2011 07:45
Subject: Re: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)
Sent by: dfdl-wg-bounces@ogf.org
Mike, I think this proposal looks good and provides an adequate solution
for DFDL 1.0. Let's discuss further on today's WG call.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF
DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Andreas Martens1/UK/IBM@IBMGB, dfdl-wg@ogf.org
Date: 07/12/2011 15:02
Subject: Re: [DFDL-WG] Issue 156 - ICU fallback mappings - character encoding/decoding errors (version 2 - modified per call 2011-12-06)
Alright, I was able to convince myself that a substitution character is
available, and associated with the IANA character set ID aliases. Even
us-ascii has one (\x1A); see, e.g., http://demo.icu-project.org/icu-bin/convexp?conv=US-ASCII&s=ALL
So our original language that said to just use "the replacement character
for the encoding" was actually correct!
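For comparison, a small Python sketch of substituting \x1A when encoding to us-ascii. Python's built-in 'replace' handler uses '?' rather than the ICU substitution character, so the sketch registers its own handler; the handler name "sub_1a" is invented for the example, and the \x1A value is simply the one quoted above.

```python
import codecs


def substitute_1a(err):
    """Replace each unencodable character with \\x1A, then resume encoding."""
    if isinstance(err, UnicodeEncodeError):
        return ("\x1a" * (err.end - err.start), err.end)
    raise err


codecs.register_error("sub_1a", substitute_1a)

encoded = "caf\u00e9".encode("ascii", errors="sub_1a")    # b"caf\x1a"
builtin = "caf\u00e9".encode("ascii", errors="replace")   # b"caf?"
```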
Revised proposal below. Basically, it's just error, skip or replace flag
for encoding error policy. We still have to figure out the TBDs in there
with respect to how many substitution/replacements will occur, and what
to do about some of these Unicode-encoding related issues.
...mikeb
---------------------------------------------------------------------------
Issue 156 - ICU fallback mappings - character encoding/decoding errors
(modified per email thread on standardized ICU substitution/replacement
characters)
(Modified per workgroup discussion on 2011-12-06 - removed rationale and
discussion, simplified to just the minimum. Note couple of important TBDs
in here. Topics we forgot to discuss.)
Summary
DFDL currently does not have adequate capability to handle encoding and
decoding errors. Language in the spec is incorrect/infeasible to implement.
ICU provides mechanisms giving a degree of control over this issue. The question
is whether and how to embrace those mechanisms, or provide some other alternative
solution.
Discussion
This language in section 4.1.2 about character set decoding/encoding just
doesn't work:
This first part is unacceptable because it fails to specify what happens
when the decoding fails because of data errors.
"During parsing, characters whose value is unknown or unrepresentable in
ISO 10646 are replaced by the Unicode Replacement Character U+FFFD."
This second part also is inadequate:
"During unparsing, characters that are unrepresentable in the target encoding
will be replaced by the replacement character for that encoding."
This needs a citation for where these replacement characters are specified.
It also needs to specify what happens in certain error situations.
Suggested Resolution: Summary
- DFDL property dfdl:encodingErrorPolicy with values 'skip',
'error', 'replace'
For Parsing/Decoding Errors
There are two errors that can occur when decoding characters into Unicode/ISO
10646.
1. the data is broken -
invalid byte sequences that don't match the definition of the encoding
are encountered.
2. not enough bytes are
found to make up the entire encoding of a character. That is, a fragment
of a valid encoding is found.
The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
If 'replace', then the Unicode replacement
character '�' (U+FFFD) is substituted for
the offending bytes, one replacement character for each invalid byte, one
replacement character for any fragment of an encoding.
(TBD: Should this say 'byte' or 'unit' ?? I.e., in UTF-16BE, will ICU error
callback occur once for a broken codepoint, or once per byte?)
(TBD: Assumptions to validate: I am assuming here that if there are 6 invalid
bytes, none of which can validly be unit 1 of the encoding of any character,
that ICU will call the error hook either (a) 6 times, or (b) once but notifying
about all 6 bad units - but providing a way for the hook-writer to say
they want to substitute 6 characters for the 6 units.
I am also assuming in the end-of-data fragment case that the ICU hook gets
called once for the fragment, not once per byte of the fragment.)
(TBD: We did not discuss on the call on Dec 6th the issue of errors in
Unicode encodings. While there are no encodings where a properly encoded
character is unmapped to Unicode, the Unicode UTF encodings themselves
can contain things that are errors. Here's a short list of some things
that can happen:
- utf-16 and unpaired surrogate code-point
- utf-16 and out-of-order surrogate code-point pair
- utf-8 parsing and 3-byte encoding of a surrogate
code-point is found
- utf-8 unparsing and code-point of an isolated surrogate
is to be encoded.
- utf-8 decoding, and if you assemble the bits the
usual way, you get a code point out of range (higher than 0x10FFFF)
- utf-8 encoding, and code-point to encode is higher
than 0x10FFFF.
- utf-16 encoding utf16Width='fixed' and a surrogate
code point is encountered
- utf-16 byte-order-marks found not at the beginning
of the data
We have an option here to be 'tolerant' of Unicode-encoding foibles. We can
preserve isolated surrogates in a natural way if we wish. I believe many Unicode
and UTF implementations tolerate these situations. For example, the standard Java
UTF-8 decoder/encoder classes, InputStreamReader and OutputStreamWriter, are tolerant
of incorrectly paired and isolated surrogate code points in the Java string
data.
I do not know what ICU does in these cases, i.e.,
if it provides us enough flexibility to do whatever we want, or if it doesn't
even detect some of these things as errors.)
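To show what a strict implementation does with several of the situations listed in the TBD above, here is a short Python sketch. It exercises Python's built-in codecs only, as a stand-in; it does not tell us what ICU does, which remains the open question. The helper name rejects and the case labels are made up for the example.

```python
def rejects(action) -> bool:
    """Return True if the codec call raises a Unicode encode/decode error."""
    try:
        action()
    except UnicodeError:
        return True
    return False


cases = {
    "utf-16 unpaired surrogate":
        lambda: b"\x00\xd8\x41\x00".decode("utf-16-le"),
    "utf-16 out-of-order surrogate pair":
        lambda: b"\x00\xdc\x00\xd8".decode("utf-16-le"),
    "utf-8 3-byte encoding of a surrogate":
        lambda: b"\xed\xa0\x80".decode("utf-8"),
    "utf-8 encoding of an isolated surrogate":
        lambda: "\ud800".encode("utf-8"),
    "utf-8 decoding to a code point above 0x10FFFF":
        lambda: b"\xf4\x90\x80\x80".decode("utf-8"),
}

results = {name: rejects(action) for name, action in cases.items()}
for name, rejected in results.items():
    print(f"{name}: {'rejected' if rejected else 'accepted'}")
```

The strict codecs treat every one of these cases as an error. A tolerant variant would have to opt in explicitly, for example via Python's 'surrogatepass' handler.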
If 'skip' then the invalid byte sequences are dropped/ignored. No corresponding
characters are created in the DFDL infoset.
If 'error' then a processing error occurs.
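The three parse-side policies can be sketched with Python's standard decoder error handlers. The mapping below ('replace' to "replace", 'skip' to "ignore", 'error' to "strict") is my own analogy for illustration; how many replacement characters ICU would produce for the same input is exactly the TBD above, so the counts here describe Python's behavior only.

```python
# Assumed mapping from the proposed policy values to Python error handlers.
POLICY_TO_HANDLER = {"replace": "replace", "skip": "ignore", "error": "strict"}


def decode_with_policy(data: bytes, encoding: str, policy: str) -> str:
    """Decode bytes, handling invalid sequences according to the policy."""
    return data.decode(encoding, errors=POLICY_TO_HANDLER[policy])


# Two invalid bytes (FF, FE), then a fragment: the first two bytes of the
# 3-byte encoding of U+20AC, cut off by the end of the data.
data = b"A\xff\xfeB\xe2\x82"

replaced = decode_with_policy(data, "utf-8", "replace")
skipped = decode_with_policy(data, "utf-8", "skip")

try:
    decode_with_policy(data, "utf-8", "error")
    raised = False
except UnicodeDecodeError:   # stands in for a DFDL processing error
    raised = True
```

With Python's decoder, 'replace' gives one U+FFFD per invalid byte and a single U+FFFD for the trailing fragment, and 'skip' gives just "AB".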
It is suggested that if a DFDL user wants to preserve information containing
data where the encodings have these kinds of errors, that they model such
data as xs:hexBinary, or as a xs:string, but using an encoding such as
iso-8859-1 which preserves all bytes.
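The iso-8859-1 suggestion works because that encoding maps every byte value 0x00 to 0xFF to exactly one character, so nothing can fail to decode. A quick Python check of the round trip:

```python
all_bytes = bytes(range(256))                    # every possible byte value
as_text = all_bytes.decode("iso-8859-1")         # never raises
round_tripped = as_text.encode("iso-8859-1")     # original bytes recovered

print(len(as_text), round_tripped == all_bytes)  # prints: 256 True
```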
Suggested Resolution - Unparsing/Encoding Errors
The following are kinds of errors when encoding characters:
1. no mapping provided by the encoding specification.
2. not enough room to output the entire encoding of the character (e.g.,
need 2 bytes for a DBCS, but only 1 byte remains in the available length).
The behavior in these cases is controlled by dfdl:encodingErrorPolicy.
If the policy is 'error' then a processing error occurs.
If the policy is 'skip' then the character is skipped. No character is
encoded to be output for case 1, and no partial character is attempted
in case 2.
If the policy is 'replace' then the behavior is determined by the encoding
specification.
Each encoding has a replacement/substitution character specified by
ICU. These can be found conveniently in the ICU Converter Explorer.
This character is substituted for the unmapped character or the character
that has too large an encoding (errors 1 and 2 above).
It is a processing error if it is not possible to output the replacement
character because there is not enough room for its representation.
It is a processing error if a character encoding does not provide a substitution/replacement
character definition and one is needed because of dfdl:encodingErrorPolicy='replace'.
(This would be rare, but could occur if a DFDL implementation allows many
encodings beyond the minimum set.)
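The unparse rules above can be sketched as a single function. This is a Python illustration under stated assumptions, not a proposed implementation: the names unparse_text and ProcessingError are hypothetical, the substitution bytes are passed in by the caller rather than looked up from ICU, and the encoding is assumed to be stateless with no BOM, since characters are encoded one at a time.

```python
class ProcessingError(Exception):
    """Stand-in for a DFDL processing error (hypothetical name)."""


def unparse_text(text, encoding, policy, substitution, available):
    """Encode text into at most `available` bytes under an error policy.

    `substitution` is the encoding's substitution bytes, or None if the
    encoding defines none.
    """
    out = bytearray()
    for ch in text:
        try:
            encoded = ch.encode(encoding)
        except UnicodeEncodeError:
            encoded = None          # error 1: no mapping in the encoding
        if encoded is not None and len(out) + len(encoded) <= available:
            out += encoded
            continue
        # Here we have either error 1 (unmapped) or error 2 (not enough room).
        if policy == "error":
            raise ProcessingError(f"cannot encode {ch!r}")
        if policy == "skip":
            continue                # nothing is output for this character
        if policy != "replace":
            raise ValueError(f"unknown policy {policy!r}")
        if substitution is None:
            raise ProcessingError("encoding defines no substitution character")
        if len(out) + len(substitution) > available:
            raise ProcessingError("no room for the substitution character")
        out += substitution
    return bytes(out)
```

For example, unparse_text("a\u20acb", "ascii", "replace", b"\x1a", 10) returns b"a\x1ab", and the same call with policy "skip" returns b"ab".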
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org//mailman/listinfo/dfdl-wg
--
Mike Beckerle | OGF DFDL WG Co-Chair
Tel: 781-330-0412