7-bit ascii packed together

I have a data format in front of me that has 64 7-bit ASCII characters, but the format has them bit-packed, i.e., 448 = 7 * 64 bits, so the character codes aren't octet/byte aligned.
Furthermore, the 'string' either uses up the entire 64-character maximum length OR it has a terminating character, which is a 0x7F character code.
I believe I was the advocate for a position that character codes should always be 8-bit aligned. That would be because I had never seen anything like this.
I am told there are also 6-bit ASCII variations, similarly packed together to save space.
BTW: This occurs in a specific US MIL STD message header format, so it's not like it's some obscure unused corner case.
Right now, the best I think I can do is to model this data not as a string at all, but as an array of integers, each one having 7-bit length, and not aligned (that is, aligned to 1-bit). Doing that I can use occursCountKind='parsed', and an assertion to deal with the optional termination by 0x7F value.
To handle this as a string, we'd need to be able to specify that the character codes are not aligned, and the width of the bit-fields making up each character code. Or I suppose we could just say this is a special kind of character set encoding "ASCII-7-bit-packed" or something.
Having that, I could deal with the termination via a choice of either the terminated flavor, or the fixed length flavor (which excludes the terminator) by way of a choice of two strings each having a lengthKind="pattern".
Comments?
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
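For readers who want to see the "array of 7-bit integers" model concretely, here is a minimal Python sketch of it. It is not DFDL and not taken from the MIL STD; the function name `unpack_7bit_codes` is hypothetical, and it assumes the codes are packed most-significant-bit first, which the message above does not state.

```python
def unpack_7bit_codes(data: bytes, max_chars: int = 64, terminator: int = 0x7F) -> list[int]:
    """Read up to max_chars unaligned 7-bit codes, stopping before a terminator.

    ASSUMPTION: codes are packed most-significant-bit first. The thread does
    not say which bit order the real format uses.
    """
    bits = int.from_bytes(data, "big")  # treat the whole buffer as one big integer
    total_bits = len(data) * 8
    codes = []
    for i in range(max_chars):
        end = (i + 1) * 7  # bit offset just past the i-th code
        if end > total_bits:
            break  # ran out of data before reaching max_chars
        code = (bits >> (total_bits - end)) & 0x7F
        if code == terminator:
            break  # optional 0x7F terminator: stop and exclude it
        codes.append(code)
    return codes


# 'H' (0x48), 'i' (0x69), then the 0x7F terminator, padded with three zero bits.
print(unpack_7bit_codes(bytes([0x91, 0xA7, 0xF8])))  # [72, 105]
```

A full-length field of 64 codes occupies exactly 56 bytes (448 bits) and is returned whole, since no terminator is present.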

Hi Mike
I thought this would come up at some point, and my assumption has always been that we would handle it using special enums of dfdl:encoding, and for fixed length use lengthUnits 'characters'. That means we can continue with our existing rules for when you use lengthUnits 'bits' and not have to extend them to xs:string. We would disallow lengthUnits 'bytes'.
I would suggest that a DFDL parser takes the 7-bits and pads to 8-bits before calling ICU, and the reverse after calling ICU when unparsing. That way we don't need to get ICU to handle this. (I'm assuming they don't, Tim is going to find out).
Regards
Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848
In reply to: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: dfdl-wg@ogf.org, Date: 19/10/2012 20:44 Subject: [DFDL-WG] 7-bit ascii packed together Sent by: dfdl-wg-bounces@ogf.org
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
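The pad-to-8-bits idea can be sketched in Python as follows. This is only an illustration under stated assumptions: Python's built-in "ascii" codec stands in for ICU, the function names are hypothetical, and most-significant-bit-first packing is assumed because the thread does not specify the bit order.

```python
def decode_packed_ascii(data: bytes, n_chars: int) -> str:
    """Parse direction: widen each 7-bit code to an octet, then call the codec."""
    total_bits = len(data) * 8
    if n_chars * 7 > total_bits:
        raise ValueError("not enough data for the requested number of characters")
    bits = int.from_bytes(data, "big")  # ASSUMPTION: MSB-first packing
    padded = bytearray()
    for i in range(n_chars):
        shift = total_bits - (i + 1) * 7
        padded.append((bits >> shift) & 0x7F)  # high bit of each octet is 0
    return padded.decode("ascii")  # stand-in for the ICU call


def encode_packed_ascii(text: str) -> bytes:
    """Unparse direction: call the codec, then drop the pad bit of each octet."""
    padded = text.encode("ascii")  # raises UnicodeEncodeError on non-ASCII input
    bits = 0
    for code in padded:
        bits = (bits << 7) | code
    n_bits = 7 * len(padded)
    fill = (-n_bits) % 8  # zero bits needed to reach a whole number of bytes
    return (bits << fill).to_bytes((n_bits + fill) // 8, "big")


packed = encode_packed_ascii("Hi")
print(packed.hex())                    # 91a4
print(decode_packed_ascii(packed, 2))  # Hi
```

Because the character count is passed in explicitly, this matches the fixed-length case with lengthUnits 'characters'; terminator handling would sit in front of the decode step.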

In reply to Steve Hanson <smh@uk.ibm.com>, Mon, Oct 22, 2012 at 6:32 AM:
On our call today we discussed this and agreed we should just pick a name for this encoding, which uses the US-ASCII 7-bit codepoints but, instead of using a full octet per codepoint, uses exactly 7 bits per codepoint.
I suggest this name should be US-ASCII-7-bit-packed. US-ASCII is 'preferred' by IANA over just "ASCII" on the Internet, because it avoids the "ascii family" connotation.
Note that, according to IANA, encoding names are not case sensitive.
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
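As a small illustration of how a named encoding plus lengthUnits 'characters' fixes the field size, here is a hedged Python sketch. The table `BITS_PER_CODEPOINT` and the function `fixed_length_in_bits` are hypothetical and are not part of any DFDL implementation; the only name taken from the thread is the proposed US-ASCII-7-bit-packed.

```python
# Hypothetical lookup table: bits consumed per codepoint for each encoding name.
# Keys are stored lower-cased because encoding names are not case sensitive.
BITS_PER_CODEPOINT = {
    "us-ascii": 8,
    "us-ascii-7-bit-packed": 7,  # name proposed in this thread
}


def fixed_length_in_bits(encoding_name: str, n_chars: int) -> int:
    """Return the size in bits of a fixed-length string of n_chars characters."""
    key = encoding_name.lower()  # case-insensitive match on the encoding name
    if key not in BITS_PER_CODEPOINT:
        raise LookupError(f"unknown encoding: {encoding_name}")
    return BITS_PER_CODEPOINT[key] * n_chars


print(fixed_length_in_bits("US-ASCII-7-bit-packed", 64))  # 448
print(fixed_length_in_bits("us-ascii-7-BIT-PACKED", 64))  # 448
```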
participants (2)
- Mike Beckerle
- Steve Hanson