8-bit-ascii for dealing with binary data in a text-like manner - problematic

Every "8-bit-ascii" encoding I can find has holes in the code page. That is, there are byte values that don't have a corresponding character codepoint assigned.

Example: iso-8859-X is a family of popular 8-bit, ascii-based encodings. If you look up iso-8859-1, it has this language:

    Code values 00–1F, 7F–9F are not assigned to characters by ISO/IEC 8859-1. The lower range 20 to 7E (the G0 subset) maps exactly to the same coded G0 subset of the ISO 646 US variant (commonly known as ASCII <http://en.wikipedia.org/wiki/ASCII>), ...

They're saying that the printable part of 7-bit ascii is included, and that the other code values exist, but the standard doesn't assign them a codepoint.

So, to me, suggesting use of any particular code page for this purpose is somewhat ambiguous. E.g., what does a byte in one of those unassigned ranges mean in a string if the encoding is iso-8859-1? There appears to be a set of translation tables, easily found on the web, that map these values to Unicode in standard ways. But the codepoint doesn't have an assigned meaning in the iso-8859-X standards themselves.

Two possible clarifications:

1) For all ascii-based character sets, we say that bytes 0x00 to 0xFF all map to exactly those codepoints in ISO 10646 for the infoset, and vice versa.

2) Define dfdl:encoding="bytes" as a special character set name which has the above property.

Personally, I prefer 2. It is simpler to explain what is going on, and when people are depending on bytes it will be clearer that they are.

...mike
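To make the proposed property concrete, here is a minimal sketch in Python of the mapping that option 2 would name: byte value N maps to ISO 10646 codepoint N for every N from 0x00 to 0xFF, and back again. The function names bytes_to_infoset and infoset_to_bytes are hypothetical, made up for this illustration; they are not part of DFDL or of any implementation.

```python
# Hypothetical helpers illustrating the proposed "bytes" encoding:
# byte value N <-> codepoint U+00NN, for every N in 0x00..0xFF.

def bytes_to_infoset(data: bytes) -> str:
    """Map each byte to the codepoint with the same numeric value."""
    return "".join(chr(b) for b in data)


def infoset_to_bytes(text: str) -> bytes:
    """Inverse mapping; codepoints above U+00FF have no byte and are rejected."""
    for ch in text:
        if ord(ch) > 0xFF:
            raise ValueError(f"U+{ord(ch):04X} has no byte in the 'bytes' encoding")
    return bytes(ord(ch) for ch in text)


# Every possible byte value, including the ranges 00-1F and 7F-9F
# that ISO/IEC 8859-1 leaves unassigned.
all_bytes = bytes(range(256))
text = bytes_to_infoset(all_bytes)

# No holes: each byte became the codepoint with the same number ...
assert [ord(c) for c in text] == list(range(256))
# ... and the mapping round-trips exactly.
assert infoset_to_bytes(text) == all_bytes

# Python's own "latin-1" codec happens to use this same total mapping,
# which is one of the translation-table conventions mentioned above.
assert all_bytes.decode("latin-1") == text
```

The point of giving the mapping its own name is that nothing here depends on what a particular code page standard says about a given value; the round trip holds for all 256 bytes by definition.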
Mike Beckerle