Mike
The Wikipedia entry for ISO 10646 says: "The system deliberately leaves many code points not assigned to characters, even in the BMP. It does this to allow for future expansion or to minimize conflicts with other encoding forms." If any of those unassigned code points are below 256, don't we have the same problem as with 8859? I can't find an actual map of the 10646 code points; you have to buy the standard from ISO.
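One way to check this without the ISO document is to look at the Unicode Character Database, which is kept code-for-code in step with ISO 10646. The following is only a sketch: it assumes Python's bundled `unicodedata` module is an acceptable stand-in for the standard's own tables, and the variable names are mine.

```python
import unicodedata

# General category "Cn" means "unassigned" in the Unicode Character Database;
# "Cc" means a C0/C1 control code.
unassigned = [cp for cp in range(0x100)
              if unicodedata.category(chr(cp)) == "Cn"]
controls = [cp for cp in range(0x100)
            if unicodedata.category(chr(cp)) == "Cc"]

print("unassigned code points below 256:", [hex(cp) for cp in unassigned])
print("control codes below 256:", len(controls))
```

If the list of unassigned code points comes back empty, the ranges that 8859-1 leaves open (00-1F and 7F-9F) are at least given code points in 10646, as control codes rather than as graphic characters.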
Regards
Steve Hanson
Programming Model Architect, WebSphere Message Broker,
OGF DFDL WG Co-Chair,
Hursley, UK,
Internet: smh@uk.ibm.com,
Phone (+44)/(0) 1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 11/02/2010 00:38
Subject: [DFDL-WG] 8-bit-ascii for dealing with binary data in text-like manner - problematic
Sent by: dfdl-wg-bounces@ogf.org
Every "8-bit-ascii" encoding I can find has holes in the code page. That is, values that don't have a corresponding character codepoint assigned.
Example: iso-8859-X are a bunch of 8-bit ascii-based encodings that are popular.
If you look up iso-8859-1 it has this language:
Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1.
The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0 subset of the ISO 646 US variant (commonly known as ASCII), ...
They're saying 7-bit ASCII is included, and that the other code values exist, but the standard doesn't assign characters to them.
So, to me, suggesting use of any particular code page for this purpose is somewhat ambiguous. E.g., what does  mean in a string if the encoding is iso-8859-1? There appears to be a set of translation tables, which one can find on the web, that map such values to Unicode in standard ways. But the code value doesn't have an assigned meaning in the iso-8859-X standards.
Two possible clarifications:
1) for all ASCII-based character sets, we say that bytes 0x00 to 0xFF map to exactly the same-numbered codepoints in ISO 10646 for the infoset, and vice versa.
2) define dfdl:encoding="bytes" as a special character set name which has the above property.
Personally, I prefer 2. It is simpler to explain what is going on, and when people are depending on bytes it will be clearer that they are.
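To make the property in both options concrete, here is a minimal sketch of the byte-to-codepoint identity mapping. The function names are hypothetical and not part of any DFDL specification or implementation; the comparison with Python's "latin-1" codec is only there because that codec happens to map all 256 byte values to U+0000 through U+00FF.

```python
def bytes_to_infoset_string(data: bytes) -> str:
    """Map each byte value 0x00..0xFF to the codepoint with the same number."""
    return "".join(chr(b) for b in data)


def infoset_string_to_bytes(text: str) -> bytes:
    """Inverse mapping; fails for any codepoint above U+00FF."""
    for ch in text:
        if ord(ch) > 0xFF:
            raise ValueError(f"codepoint U+{ord(ch):04X} has no single-byte form")
    return bytes(ord(ch) for ch in text)


all_bytes = bytes(range(256))
text = bytes_to_infoset_string(all_bytes)

# Every byte value round-trips, with no holes in the table.
assert infoset_string_to_bytes(text) == all_bytes
# Python's "latin-1" codec implements the same 256-entry mapping.
assert text == all_bytes.decode("latin-1")
```

Because every byte value has exactly one codepoint and back again, arbitrary binary data survives a trip through the infoset unchanged, which is the point of naming the encoding explicitly.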
...mike