Isn't choice 2 the most flexible? The caller can convert to what they need.

Alan Powell

MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898

From:	DFDL <mbeckerle.dfdl@gmail.com>
To:	Steve Hanson/UK/IBM@IBMGB
Cc:	Alan Powell/UK/IBM@IBMGB, "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>, "dfdl-wg-bounces@ogf.org" <dfdl-wg-bounces@ogf.org>
Date:	05/05/2009 15:35
Subject:	Re: [DFDL-WG] Infoset codepage

How about we specify Unicode codepoints, but allow implementations to have limitations on the numeric range of codepoints they support?

Reason: keeps us out of the codepoints vs. encodings morass.
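
A minimal sketch of what such a range limitation could look like, assuming a hypothetical implementation that only supports codepoints up to U+FFFF (the Basic Multilingual Plane). The names MAX_CODEPOINT and out_of_range are made up for illustration and are not part of DFDL:

```python
from typing import List

# Hypothetical limit: this implementation only supports the Basic Multilingual Plane.
MAX_CODEPOINT = 0xFFFF


def out_of_range(value: str, max_codepoint: int = MAX_CODEPOINT) -> List[int]:
    """Return the codepoints in value that exceed the implementation's limit."""
    return [ord(ch) for ch in value if ord(ch) > max_codepoint]


# An infoset string is acceptable to this implementation only if nothing is out of range.
supported = out_of_range("abc")
unsupported = out_of_range("a\U0001F600")  # U+1F600 lies outside the BMP
```

The infoset still speaks only in Unicode codepoints; the limit is a property of the implementation, not of the encoding.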

...mikeb

On May 5, 2009, at 10:20 AM, Steve Hanson <smh@uk.ibm.com> wrote:

There is a 4th option - remain silent and leave it up to the implementation.

Reason: Within IBM we have different products that will embed a DFDL parser/unparser. WMB requires strings in UTF-16, but that is not always the case for the others.

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
Sent by: dfdl-wg-bounces@ogf.org

05/05/2009 14:09

Please respond to
mbeckerle.dfdl@gmail.com

To	Alan Powell/UK/IBM@IBMGB, <dfdl-wg@ogf.org>
cc
Subject	[DFDL-WG] Infoset codepage

4. Infoset codepage and encoding

The spec does not say what codepage and encoding are used for string fields.

I wanted to comment on this.

There are three choices here:
1. Unicode codepoints - we may need to preserve the mapping table (from representation encoding to Unicode) as part of the infoset.
2. "As Encoded" codepoints - we must add the encoding to the infoset.
3. Both
In favor of Unicode codepoints - simplicity. A minor issue is that some mappings lose information, making perfect round-tripping of string contents impossible.
E.g., EBCDIC has two different line-ending characters, both of which are normally translated to the ASCII/Unicode linefeed. Hence, translating back is ambiguous.
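
The round-trip problem can be sketched as follows. The mapping table here is a deliberately tiny, hand-written illustration that assumes a conversion which folds both EBCDIC NL (0x15) and LF (0x25) into U+000A; real conversion tables vary, and some keep the two characters distinct:

```python
# Hypothetical, partial EBCDIC-to-Unicode table that folds both line endings into "\n".
EBCDIC_TO_UNICODE = {
    0x15: "\n",  # EBCDIC NL
    0x25: "\n",  # EBCDIC LF
    0xC1: "A",
}

# The reverse table has to pick one byte for "\n"; here it picks NL.
UNICODE_TO_EBCDIC = {"\n": 0x15, "A": 0xC1}


def decode(data: bytes) -> str:
    """Map each EBCDIC byte to its Unicode character."""
    return "".join(EBCDIC_TO_UNICODE[b] for b in data)


def encode(text: str) -> bytes:
    """Map each Unicode character back to a single EBCDIC byte."""
    return bytes(UNICODE_TO_EBCDIC[ch] for ch in text)


original = bytes([0xC1, 0x25])          # "A" followed by EBCDIC LF
round_tripped = encode(decode(original))
lossless = round_tripped == original    # False: LF came back as NL
```

Once both bytes have been decoded to the same codepoint, nothing in the string itself records which one was in the data.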

In favor of "as encoded" - simplicity. We just add an encoding attribute to the string infoset object, which returns the information that the dfdl:encoding representation property contained. Note that the encoding information is already available via the schema component associated with the string, so there is some redundancy here. There is also the question of whether one wants codepoints or raw access to the bytes. E.g., if the encoding is UTF-8 or Shift JIS, then each character takes up one or more bytes. Do you want the bytes, the interpreted codepoints, or both?
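
To make the bytes-versus-codepoints distinction concrete, here is a small sketch using Python's standard codecs. It is only an illustration of the question, not a proposal for how the infoset should expose either view:

```python
# Two Japanese characters, written as escapes: U+65E5 and U+672C.
text = "\u65e5\u672c"

# The "interpreted" view: one Unicode codepoint per character.
code_points = [ord(ch) for ch in text]

# The "as encoded" view: the same string as raw bytes in two different encodings.
utf8_bytes = text.encode("utf-8")
sjis_bytes = text.encode("shift_jis")

# Same two characters, but a different byte count in each encoding.
lengths = (len(code_points), len(utf8_bytes), len(sjis_bytes))
```

Here the string has two codepoints, six bytes in UTF-8, and four bytes in Shift JIS, so "the characters" and "the bytes" are different things a caller might ask for.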

In favor of "both" - complexity, but eliminates all the ambiguity.

My suggestion: keep it simple for v1.0 and choose option 1, because we can always expand the capabilities later by providing access to the unencoded representation one way or another.

If you badly need infoset-level contents which expose the actual representation character codes, you can always model this as an array of bytes instead of a character string.

...mike
