I think we've got a fix for this.

I found an official reference that has no "greyed out" codepoints: all 256 values are mapped.
The table at the FTP URL below officially defines the mapping from 8859-1 to Unicode/ISO 10646.

The table includes all 256 codepoints. Some are specified as just <control>, i.e., they have no specific meaning, but each 8859-1 codepoint maps one-to-one and onto the Unicode/10646 codepoint with the same value.

Note that this property holds for 8859-1. It does not hold for 8859-2 to 8859-16, as these have character codes substituted into them that map to other places in the iso10646 codepoint space.
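
A quick way to sanity-check this property is sketched below in Python. It assumes Python's built-in "iso-8859-1" and "iso-8859-2" codecs agree with the published mapping tables; that is my understanding, but the codecs are a stand-in here, not the official table itself.

```python
# Sketch: check that every byte value N decodes to code point N under iso-8859-1.
all_bytes = bytes(range(256))
text = all_bytes.decode("iso-8859-1")

identity_8859_1 = all(ord(ch) == n for n, ch in enumerate(text))

# For contrast, iso-8859-2 substitutes characters in the upper half,
# e.g. byte 0xA1 is LATIN CAPITAL LETTER A WITH OGONEK (U+0104), not U+00A1.
a1_in_8859_2 = ord(bytes([0xA1]).decode("iso-8859-2"))

print(identity_8859_1)    # True
print(hex(a1_in_8859_2))  # 0x104
```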

Here's the correspondence table: 

ftp://ftp.unicode.org/Public/MAPPINGS/ISO8859/8859-1.TXT

If we reference this mapping table in the references of the DFDL spec, then I believe we can say that, with encoding="iso-8859-1", you can treat binary data as textual, use patterns, etc., and the relationship to/from the infoset always ensures preservation of the values of the bytes (parsing), and creation of bytes whose values exactly match the string codepoints (unparsing).
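
As an illustration of the parsing/unparsing relationship, here is a minimal Python sketch. The function names parse_bytes and unparse_string are hypothetical, and Python's "iso-8859-1" codec is assumed to behave like the mapping table; nothing here is DFDL API.

```python
def parse_bytes(data: bytes) -> str:
    """Parsing direction: each byte N becomes the character with code point N."""
    return data.decode("iso-8859-1")


def unparse_string(value: str) -> bytes:
    """Unparsing direction: each character with code point N becomes byte N.

    Raises UnicodeEncodeError if any code point is above 0xFF.
    """
    return value.encode("iso-8859-1")


# Includes values from the ranges 00-1F and 7F-9F that worried us earlier.
raw = bytes([0x00, 0x01, 0x7F, 0x80, 0x9F, 0xFF])
infoset_string = parse_bytes(raw)
round_tripped = unparse_string(infoset_string)

print([hex(ord(ch)) for ch in infoset_string])
print(round_tripped == raw)
```

The round trip should give back exactly the original bytes, including the control-range values.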

This language can be added to the section on lengthKind="pattern" and binary data: 

Binary data can be handled using some of the conveniences of text by treating it as text with encoding="iso-8859-1". In this case literal text, such as length patterns, is interpreted in the iso-8859-1 character encoding, and the correspondence of byte values in the data to a string in the DFDL infoset is one to one. That is, a byte with value N produces an infoset character with character code N. [reference to above FTP site].
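
To show what a length pattern over binary data could look like in practice, here is a rough Python sketch. The record layout (a payload terminated by byte 0xFF) is made up for illustration, and Python's re module is only a stand-in for whatever regex engine a DFDL processor would use.

```python
import re

# Hypothetical record: a variable-length binary payload terminated by byte 0xFF,
# followed by other data.
data = bytes([0x01, 0x80, 0x00, 0x9F, 0xFF, 0x41, 0x42])

# Treat the binary data as text: byte N becomes the character with code point N.
text = data.decode("iso-8859-1")

# Length pattern: everything up to, but not including, the first 0xFF.
# A negated character class also matches newlines and NUL, so no flags are needed.
length_pattern = re.compile(r"[^\xFF]*")
match = length_pattern.match(text)

field_value = match.group(0)                     # infoset string
field_bytes = field_value.encode("iso-8859-1")   # the same values, as bytes

print(len(field_value))
print(field_bytes == data[:4])
```

Because the mapping is one to one, the length of the matched string in characters is also the length of the field in bytes.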

On Thu, Feb 11, 2010 at 5:32 AM, Steve Hanson <smh@uk.ibm.com> wrote:

Mike

In the wikipedia entry for ISO 10646 it says "The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms."  If those code points are below 256 then we have the same problem as 8859?  I can't find an actual map of the 10646 code points - you have to buy it from ISO.

Regards

Steve Hanson
Programming Model Architect, WebSphere Message Broker,
OGF DFDL WG Co-Chair,
Hursley, UK,
Internet: smh@uk.ibm.com,
Phone (+44)/(0) 1962-815848



From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 11/02/2010 00:38
Subject: [DFDL-WG] 8-bit-ascii for dealing with binary data in text-like manner - problematic
Sent by: dfdl-wg-bounces@ogf.org






Every "8-bit-ascii" encoding I can find has holes in the code page. That is, values that don't have a corresponding character codepoint assigned.

Example: iso-8859-X are a bunch of 8-bit ascii-based encodings that are popular.

If you look up iso-8859-1 it has this language:

Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1.

The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0 subset of the ISO 646 US variant (commonly known as ASCII), ...

They're saying 7-bit ascii is included, and some other codes are there, but they don't assign a codepoint generally.

So, to me suggesting use of any particular code page for this purpose is somewhat ambiguous. E.g., what does &#x01 mean in a string if the encoding is iso-8859-1? There appears to be a set of translation tables that assign this to unicode in standard ways that one can find on the web. But the codepoint doesn't have an assigned meaning in iso-8859-X standards.

Two possible clarifications:
1) for all ascii-based character sets, we say that bytes 0x00 to 0xFF all map to exactly those codepoints in ISO 10646 for the infoset, and vice versa.

2) define dfdl:encoding="bytes" as a special character set name which has the above property.

Personally, I prefer 2. It is simpler to explain what is going on, and when people are depending on bytes it will be clearer that they are.
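
To make option 2 concrete, here is a rough Python sketch of the behavior a "bytes" character set would have. The function names are hypothetical and this is only an illustration of the intended property, not a proposed implementation.

```python
def bytes_to_infoset(data: bytes) -> str:
    """Byte with value N becomes the ISO 10646 code point N."""
    return "".join(chr(b) for b in data)


def infoset_to_bytes(value: str) -> bytes:
    """Code point N becomes byte N; anything above 0xFF is an error."""
    out = bytearray()
    for ch in value:
        code_point = ord(ch)
        if code_point > 0xFF:
            raise ValueError(f"code point U+{code_point:04X} does not fit in one byte")
        out.append(code_point)
    return bytes(out)


every_byte = bytes(range(256))
print(infoset_to_bytes(bytes_to_infoset(every_byte)) == every_byte)
```

The mapping is defined by arithmetic alone, so it does not depend on what any particular code page assigns to a given value.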

...mike

--
 dfdl-wg mailing list
 dfdl-wg@ogf.org
 
http://www.ogf.org/mailman/listinfo/dfdl-wg






