4. Infoset codepage and encoding
The spec does not say which codepage and encoding are used for string fields,
so I wanted to comment on this.

There are three choices here:
- Unicode codepoints - we may need to preserve the mapping table (from
  representation encoding to Unicode) as part of the infoset.
- "As encoded" codepoints - we must add the encoding to the infoset.
- Both.

In favor of Unicode codepoints: simplicity. A minor issue is that some
mappings lose information, making perfect round-tripping of string contents
impossible. E.g., EBCDIC has two different line endings, both of which are
normally translated to the ASCII/Unicode linefeed. Hence, translating back is
ambiguous.

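To make the round-trip problem concrete, here is a minimal Python sketch. The two mapping tables are hand-written for illustration and are an assumption, not a real codec table; they only model a converter that collapses the EBCDIC NL (0x15) and LF (0x25) controls into a single linefeed.

```python
# Illustrative only: tiny hand-written tables, not a real codec.
EBCDIC_NL = 0x15  # EBCDIC "new line" control
EBCDIC_LF = 0x25  # EBCDIC "line feed" control

# Assumed lossy decode table: both line endings collapse to U+000A.
DECODE = {EBCDIC_NL: "\n", EBCDIC_LF: "\n", 0xC1: "A", 0xC2: "B"}
# The reverse table must pick one byte for "\n"; LF is chosen arbitrarily.
ENCODE = {"\n": EBCDIC_LF, "A": 0xC1, "B": 0xC2}


def decode(data: bytes) -> str:
    """Map each representation byte to a Unicode character."""
    return "".join(DECODE[b] for b in data)


def encode(text: str) -> bytes:
    """Map each Unicode character back to a representation byte."""
    return bytes(ENCODE[ch] for ch in text)


original = bytes([0xC1, EBCDIC_NL, 0xC2, EBCDIC_LF])  # "A" NL "B" LF
round_tripped = encode(decode(original))

print(original.hex())       # c115c225
print(round_tripped.hex())  # c125c225
```

The NL byte comes back as LF, so the original representation cannot be recovered from the codepoints alone.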
In favor of "as encoded": simplicity. We just add an encoding attribute to the
string infoset object, which returns the information that the dfdl:encoding
representation property contained. Note that the encoding information is
already available via the schema component associated with the string, so
there is some redundancy here. There is also the question of whether one wants
codepoints or raw access to the bytes. E.g., if the encoding is UTF-8 or
Shift JIS, then a character takes up one or more bytes. Do you want the bytes,
the interpreted codepoints, or both?

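As a small illustration of the bytes-versus-codepoints question, the following Python sketch encodes the same two-character string in UTF-8 and Shift JIS. The sample string is an arbitrary choice made for this example.

```python
# One ASCII letter followed by one kanji (U+65E5).
text = "A\u65e5"

code_points = [ord(ch) for ch in text]   # interpreted codepoints
utf8_bytes = text.encode("utf-8")        # raw bytes under UTF-8
sjis_bytes = text.encode("shift_jis")    # raw bytes under Shift JIS

print(len(code_points))  # 2
print(len(utf8_bytes))   # 4
print(len(sjis_bytes))   # 3
```

The codepoint view is identical under both encodings, while the byte view differs in both length and content.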
In favor of "both": it is more complex, but it eliminates all the ambiguity.

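One possible shape for a "both" string item, sketched in Python. The class name InfosetString and its fields are hypothetical and are not defined by the spec; the sketch also assumes the dfdl:encoding value can be used directly as a Python codec name.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class InfosetString:
    """Hypothetical string item; not part of the DFDL spec."""
    raw: bytes      # bytes exactly as they appeared in the data
    encoding: str   # assumed to be usable as a Python codec name

    @property
    def text(self) -> str:
        # Interpreted characters, decoded on demand from the raw bytes.
        return self.raw.decode(self.encoding)

    @property
    def code_points(self) -> List[int]:
        return [ord(ch) for ch in self.text]


item = InfosetString(raw="A\u65e5".encode("shift_jis"), encoding="shift_jis")
print(item.code_points)  # [65, 26085]
print(len(item.raw))     # 3
```

Keeping the raw bytes alongside the encoding means the original representation is always recoverable, at the cost of a larger infoset item.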
My suggestion: keep it simple for v1.0 and choose the first option (Unicode
codepoints), because we can always expand the capabilities later by providing
access to the unencoded representation one way or another.

If you badly need infoset-level contents that expose the actual representation
character codes, you can always model the data as an array of bytes instead of
a character string.

...mike
Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel: 781-810-2125 | 100 Fifth Ave., 4th Floor, Waltham MA 02451 | mbeckerle.dfdl@gmail.com