Re: [DFDL-WG] Infoset codepage

5 May 2009

The problem with choice 2 is that when you have a string with an encoding,
there's the question of what you get when you index into the string at,
say, position 3. Do you get the 3rd byte of the encoded form? Or is the
encoded form somehow decoded into individual character codepoints? For
many encodings that's not crisply defined.

If we go with choice 2 we should flat out say that the string is an array of
bytes representing a string by way of the encoding.
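
To make the indexing question concrete, here is a minimal Python sketch of
what "an array of bytes plus the encoding" could look like. The class name
EncodedString and its fields are hypothetical, not anything from the spec;
the point is only that indexing yields a byte, and the caller decodes when
it wants codepoints.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EncodedString:
    """Hypothetical infoset string for choice 2: raw bytes plus encoding name."""
    raw: bytes
    encoding: str

    def __getitem__(self, index: int) -> int:
        # Indexing returns a byte value, never a decoded character.
        return self.raw[index]

    def decode(self) -> str:
        # The caller converts to code points only when it needs them.
        return self.raw.decode(self.encoding)


# "na\u00efve" has 5 code points but 6 bytes in UTF-8,
# because U+00EF is encoded as the two bytes C3 AF.
s = EncodedString("na\u00efve".encode("utf-8"), "utf-8")
byte_at_3 = s[3]            # 0xAF, the second byte of the encoded U+00EF
char_at_3 = s.decode()[3]   # "v", the code point at position 3
```

Position 3 names a continuation byte in one view and the letter "v" in the
other, which is the ambiguity described above.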

There's a variation we didn't explore, which is that implementations can
supply the strings in whatever form they want, as long as they make the
encoding available. This allows an implementation to provide, say, UTF-16
always, if it chooses.

I'm in favor of the simplest possible thing here. So, for example, if you
guys have a UTF-16 constraint, then I'd be happy just picking that as the
encoding that is always used by the infoset.

...mike

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel:  781-810-2125  | 100 Fifth Ave., 4th Floor, Waltham MA 02451 |
mbeckerle.dfdl@gmail.com

  _____  

From: Alan Powell [mailto:alan_powell@uk.ibm.com] 
Sent: Tuesday, May 05, 2009 11:14 AM
To: DFDL
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org; Steve Hanson
Subject: Re: [DFDL-WG] Infoset codepage

Isn't choice 2 the most flexible? The caller can convert to what they need. 

Alan Powell

MP 211, IBM UK Labs, Hursley,  Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM     email: alan_powell@uk.ibm.com  
Tel: +44 (0)1962 815073                  Fax: +44 (0)1962 816898

From: DFDL <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Alan Powell/UK/IBM@IBMGB, "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>,
"dfdl-wg-bounces@ogf.org" <dfdl-wg-bounces@ogf.org>
Date: 05/05/2009 15:35
Subject: Re: [DFDL-WG] Infoset codepage

  _____  

How about we specify Unicode codepoints, but allow implementations to
limit the numeric range of codepoints they support?
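
As a rough illustration of such a limit, the Python sketch below reports the
codepoints in a string that fall outside an implementation's supported
range. The function name and the default limit of 0xFFFF (the Basic
Multilingual Plane) are assumptions chosen for the example, not something
this thread or the spec prescribes.

```python
from typing import List


def codepoints_over_limit(text: str, max_codepoint: int = 0xFFFF) -> List[int]:
    """Return the code points in text that exceed the (assumed) limit."""
    return [ord(ch) for ch in text if ord(ch) > max_codepoint]


# U+1F600 lies outside the Basic Multilingual Plane, so it is reported.
rejected = codepoints_over_limit("A\U0001F600")
accepted = codepoints_over_limit("abc")
```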

Reason: keeps us out of the codepoints vs. encodings morass. 

...mikeb 

On May 5, 2009, at 10:20 AM, Steve Hanson <smh@uk.ibm.com> wrote:

There is a 4th option - remain silent and leave it up to the implementation.

Reason: Within IBM we have different products that will embed a DFDL
parser/unparser. WMB requires strings in UTF-16, but that is not always
the case for the others.

Regards

Steve Hanson
Programming Model Architect
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
Sent by: dfdl-wg-bounces@ogf.org
Date: 05/05/2009 14:09
Please respond to: mbeckerle.dfdl@gmail.com
To: Alan Powell/UK/IBM@IBMGB, <dfdl-wg@ogf.org>
Subject: [DFDL-WG] Infoset codepage

4. Infoset codepage and encoding 

The spec does not say what codepage and encoding are used for string fields.

I wanted to comment on this. 

There are three choices here:
1. Unicode codepoints - we may need to preserve the mapping table
   (from representation encoding to Unicode) as part of the infoset.
2. "As Encoded" codepoints - we must add the encoding to the infoset.
3. Both

In favor of Unicode codepoints - simplicity. A minor issue is that some
mappings lose information, making perfect round-tripping of string
contents impossible.
E.g., EBCDIC has two different line-endings both of which normally are
translated to ASCII/Unicode linefeed. Hence, translating back is ambiguous. 
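
A small Python sketch of that ambiguity follows. The mapping table is a
hand-built, illustrative fragment (EBCDIC NL at 0x15, LF at 0x25, and the
letters A and B), not a real codec table; it assumes both line endings are
mapped to U+000A as described above.

```python
from typing import List

# Illustrative fragment of an EBCDIC-to-Unicode mapping table (an assumption
# for this example): both EBCDIC line endings collapse to U+000A.
EBCDIC_NL = 0x15
EBCDIC_LF = 0x25
TO_UNICODE = {EBCDIC_NL: "\n", EBCDIC_LF: "\n", 0xC1: "A", 0xC2: "B"}


def to_unicode(data: bytes) -> str:
    """Decode bytes using the illustrative table."""
    return "".join(TO_UNICODE[b] for b in data)


def candidates(ch: str) -> List[int]:
    """Return every byte value that maps to ch; more than one means ambiguity."""
    return sorted(b for b, u in TO_UNICODE.items() if u == ch)


with_nl = to_unicode(bytes([0xC1, EBCDIC_NL]))
with_lf = to_unicode(bytes([0xC1, EBCDIC_LF]))
linefeed_sources = candidates("\n")   # two possible bytes for one code point
```

Both inputs decode to the same string, so the infoset alone cannot say which
byte to write back unless the original mapping is preserved.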

In favor of "as encoded" - simplicity. We just add an encoding attribute to
the string infoset object which returns the information that the
dfdl:encoding representation property contained. Note that the encoding
information is already available via the schema component associated
with the string, so there is some redundancy here. Also, there's the issue
of whether one wants codepoints or raw access to the bytes. E.g., if the
encoding is UTF-8 or Shift JIS, then a character takes up 1 or more bytes.
Do you want the bytes, the interpreted codepoints, or both?
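
The gap between the two views can be seen with Python's built-in "utf-8" and
"shift_jis" codecs. This is only a sketch; the helper name and the sample
string are made up for the example.

```python
from typing import Tuple


def codepoints_and_bytes(text: str, encoding: str) -> Tuple[int, int]:
    """Return (number of code points, number of encoded bytes) for text."""
    return len(text), len(text.encode(encoding))


# "A" followed by U+3042 (HIRAGANA LETTER A).
sample = "A\u3042"
utf8_counts = codepoints_and_bytes(sample, "utf-8")      # U+3042 takes 3 bytes
sjis_counts = codepoints_and_bytes(sample, "shift_jis")  # U+3042 takes 2 bytes
```

The same two codepoints occupy four bytes in UTF-8 and three in Shift JIS,
so a byte-level index and a codepoint-level index disagree as soon as a
multi-byte character appears.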

In favor of "both" - complexity, but eliminates all the ambiguity. 

My suggestion: keep it simple for v1.0. Choose number 1, because we can
always expand the capabilities later by providing access to the as-encoded
representation one way or another.

If you badly need infoset-level contents which expose the actual
representation character codes, you can always model this as an array of
bytes instead of a character string. 

...mike 

Mike Beckerle | OGF DFDL WG Co-Chair | CTO | Oco, Inc.
Tel:  781-810-2125  | 100 Fifth Ave., 4th Floor, Waltham MA 02451 |
mbeckerle.dfdl@gmail.com
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg

  _____  

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU 
