From: DFDL <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Alan Powell/UK/IBM@IBMGB, "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, "dfdl-wg-bounces@ogf.org" <dfdl-wg-bounces@ogf.org>
Date: 05/05/2009 15:35
Subject: Re: [DFDL-WG] Infoset codepage
"Mike Beckerle"
<mbeckerle.dfdl@gmail.com> Sent by: dfdl-wg-bounces@ogf.org 05/05/2009 14:09
I wanted to comment on this.
There are three choices here:
1. Unicode code points - we may need to preserve the mapping table
(from representation encoding to Unicode) as part of the infoset.
2. "As encoded" code points - we must add the encoding to the infoset.
3. Both.
In favor of Unicode code points: simplicity. A minor issue is that some
mappings lose information, making perfect round-tripping of string contents
impossible.
E.g., EBCDIC has two different line-ending characters, both of which are
normally translated to the ASCII/Unicode linefeed, so translating back is
ambiguous.
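To make the round-trip problem concrete, here is a minimal sketch in Python. The mapping table is hypothetical and hand-built for illustration (it is not the table of any particular converter); it assumes, as described above, that both the EBCDIC NL byte (0x15) and the EBCDIC LF byte (0x25) are translated to U+000A. Real converters differ in how they treat these two bytes.

```python
# Hypothetical many-to-one decoding table, for illustration only.
# Assumption: 0x15 (EBCDIC NL) and 0x25 (EBCDIC LF) both become U+000A.
DECODE = {0x15: "\n", 0x25: "\n", 0xC1: "A", 0xC2: "B"}

# Build the inverse table. Two source bytes share one character, so the
# later entry overwrites the earlier one and one source byte is lost.
ENCODE = {}
for byte, char in DECODE.items():
    ENCODE[char] = byte


def decode(data: bytes) -> str:
    """Translate representation bytes to Unicode using the table above."""
    return "".join(DECODE[b] for b in data)


def encode(text: str) -> bytes:
    """Translate Unicode back to representation bytes (lossy inverse)."""
    return bytes(ENCODE[c] for c in text)


original = bytes([0xC1, 0x15, 0xC2, 0x25])  # "A", NL, "B", LF
text = decode(original)
round_tripped = encode(text)

print(text == "A\nB\n")           # both line endings look identical
print(round_tripped == original)  # the NL byte did not survive
```

Once the infoset holds only the Unicode string, nothing in it says which of the two bytes each linefeed came from, so the unparser has to pick one.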
In favor of "as encoded": simplicity. We just add an encoding
attribute to the string infoset object, which returns the information that the
dfdl:encoding representation property contained. Note that the encoding
information is already available via the schema component associated with
the string, so there is some redundancy here. There is also the question of
whether one wants code points or raw access to the bytes.
E.g., if the encoding is UTF-8 or Shift JIS, then each character takes up one
or more bytes. Do you want the bytes, the interpreted code points, or
both?
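As a small illustration of the bytes-versus-code-points question, the following Python sketch encodes the same two-character string in UTF-8 and in Shift JIS. The sample string and variable names are my own choices for the example; only Python's standard codecs are used.

```python
# Two characters: LATIN CAPITAL LETTER A and HIRAGANA LETTER A.
text = "A\u3042"

# The "interpreted code points" view is the same whatever the encoding.
code_points = [ord(ch) for ch in text]

# The "raw bytes" view depends on the encoding in use.
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

print([hex(cp) for cp in code_points])  # ['0x41', '0x3042']
print(len(utf8_bytes), utf8_bytes)      # 4 b'A\xe3\x81\x82'
print(len(sjis_bytes), sjis_bytes)      # 3 b'A\x82\xa0'
```

The string has two code points in both cases, but four bytes in UTF-8 and three in Shift JIS, so an "as encoded" infoset has to say which of these views it exposes.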
In favor of "both": it eliminates all the ambiguity, at the cost of
added complexity.
My suggestion: keep it simple for v1.0 and choose number 1, because we
can always expand the capabilities later by providing access to the unencoded
representation one way or another.
If you badly need infoset-level contents that
expose the actual representation character codes, you can always model this as
an array of bytes instead of a character string.
...mike
Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451 | mbeckerle.dfdl@gmail.com
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU