I went over this issue again mentally.

Here's what I came up with. Note that I am using a fixed-width font because of some ASCII art in this email.

One thing we realized the other day is that the proposal needs at least the following amendment.

Changing what xs:hexBinary means when dfdl:lengthUnits='bits' would be binary incompatible. Right now there are schemas with xs:hexBinary in them where dfdl:lengthUnits='bits' is in scope, but is being ignored because DFDL v1.0 says it doesn't apply to hexBinary.

So at minimum we need a property to switch on bits-centric behavior for xs:hexBinary.

Next, we know that XSD constrains things. The length facets are applicable to hexBinary and are always measured in bytes.
Hence, lexically, there should only EVER be an even number of hex digits in a hexBinary, and if the facets are present, then the length units cannot be bits or the values of the facets would be misleading.

So even if the number of bits is 17, you should get 6 hex digits, not 5. 

(I think XML validators may fail on odd number of hex digits. Not necessarily all of them, but some may.)
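The digit-count rule above can be sketched in a few lines of Python. This is only an illustration; the function name hex_digit_count is made up for this example and is not part of DFDL or any implementation.

```python
def hex_digit_count(n_bits: int) -> int:
    """Hex digits needed when every started byte is written as two digits."""
    n_bytes = (n_bits + 7) // 8  # round the bit length up to whole bytes
    return 2 * n_bytes


print(hex_digit_count(17))  # 6, not 5
print(hex_digit_count(32))  # 8
```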

Third, there's no debate that bitOrder matters. The question is only about whether byteOrder should matter.

Given that, I think there are two possible interpretations of hexBinary.
I'll call them the "byte string" way, and the "binary number" way.

THE BYTE STRING WAY

The following would be invariants:

* byte order doesn't matter, ever
* if the hexBinary's representation is aligned to an 8-bit boundary and is a multiple of 8 bits long, then the logical value is the same regardless of bitOrder.

Consider this data stream as hex bytes DE AD BE EF.

Regardless of bit order, if all 32 bits are taken together, starting on a byte boundary, the only hexBinary representation would be <foo>DEADBEEF</foo>

Now consider we start at bit 5 (1-based numbering) and proceed for only 24 bits. So we're not going to consume the first 4 bits, nor the last 4 bits, where "first" and "last" are relative to the bitOrder.

When bitOrder is MSBF, we would want the data to be <foo>EADBEE</foo>

When bitOrder is LSBF, we would want the data to be <foo>DDEAFB</foo>  (Write the whole bytes backwards, drop first and last nibble, then reverse again).

Now consider we start at bit 6 and proceed for 22 bits.

When bitOrder is MSBF, we would want the data to be ....
  D    E    A    D    B    E    E    F
  1101 1110 1010 1101 1011 1110 1110 1111
  xxxx x110 1010 1101 1011 1110 111x xxxx
        D    5    B    7    D    C
<foo>D5B7DC</foo>
Note that to get the final C, we had to extend the final byte with 2 zero bits; this is done by shifting left, i.e. padding on the right (least significant side).

When bitOrder is LSBF, we would want the data to be....
  D    E    A    D    B    E    E    F
  1101 1110 1010 1101 1011 1110 1110 1111
reverse the bytes (not the nibbles, the bytes)
  E    F    B    E    A    D    D    E
  1110 1111 1011 1110 1010 1101 1101 1110
  xxxx x111 1011 1110 1010 1101 110x xxxx
         3    D    F    5    6    E
Now reverse the bytes again
<foo>6EF53D</foo>
Note that to get the 3 in the final byte, we had to assume 2 zero bits on the left (most significant side).

In the above, we're effectively treating hexBinary as a sequence of 8-bit integers, followed by a less-than-8-bit integer if the length is not a multiple of 8 bits, and this less-than-8-bit integer gets adjusted to be a full byte in a bitOrder-aware way. We don't need byte order because we're never considering a number that occupies more than 8 bits at a time.
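The byte-string interpretation can be written down as a short Python sketch. It is an illustration under stated assumptions, not how any DFDL processor actually does it: the function name byte_string_hex and the 'MSBF'/'LSBF' strings are invented for this example, and start_bit is 1-based to match the numbering used in the examples.

```python
def byte_string_hex(data: bytes, start_bit: int, n_bits: int, bit_order: str) -> str:
    """Extract n_bits starting at 1-based start_bit and render them as hexBinary.

    bit_order is 'MSBF' or 'LSBF' (hypothetical labels for this sketch).
    """
    def bit_at(pos: int) -> int:
        # pos is a 0-based position in the stream, counted in bitOrder.
        k = pos % 8
        shift = 7 - k if bit_order == "MSBF" else k
        return (data[pos // 8] >> shift) & 1

    out = bytearray()
    first = start_bit - 1
    for chunk_start in range(0, n_bits, 8):
        chunk_len = min(8, n_bits - chunk_start)
        value = 0
        for i in range(chunk_len):
            bit = bit_at(first + chunk_start + i)
            if bit_order == "MSBF":
                # First bit is most significant, so a short final chunk is
                # left-justified: zero padding lands on the right.
                value |= bit << (7 - i)
            else:
                # First bit is least significant, so a short final chunk
                # gets its zero padding on the left.
                value |= bit << i
        out.append(value)
    return out.hex().upper()


data = bytes.fromhex("DEADBEEF")
print(byte_string_hex(data, 5, 24, "MSBF"))  # EADBEE
print(byte_string_hex(data, 5, 24, "LSBF"))  # DDEAFB
print(byte_string_hex(data, 6, 22, "MSBF"))  # D5B7DC
print(byte_string_hex(data, 6, 22, "LSBF"))  # 6EF53D
```

Byte order never appears as a parameter, because each output byte is assembled from at most 8 bits.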

THE BINARY NUMBER WAY

The second way to do hexBinary would be to effectively treat it as a minor variation on a xs:nonNegativeInteger with binaryNumberRep='binary'.

In this case, if the bytes are DEADBEEF, and the byte order is bigEndian, the string is <foo>DEADBEEF</foo>, but if byteOrder is littleEndian the string is <foo>EFBEADDE</foo>

In this case byteOrder matters. Bit order didn't matter because we were dealing with whole bytes.
We are always going to represent 2 hex digits for each byte of length (rounding up for the final byte). So for 3 bytes, it is as if the textNumberPattern was "000000".
So there will sometimes be leading zeros. (Also, we use hex digits, which goes without saying.)

If we consider the first example above, DEADBEEF where we remove first and last nibbles, then

when bitOrder MSBF and byteOrder bigEndian - no change from above
when bitOrder MSBF and byteOrder littleEndian - <foo>EEDBEA</foo> (reversed from above)
when bitOrder LSBF and byteOrder littleEndian - <foo>FBEADD</foo> (reversed from above)
when bitOrder LSBF and byteOrder bigEndian (Not allowed in DFDL now) - no change from above.

Revisiting the 22-bit long examples from above, but adding byteOrder to them,

when bitOrder MSBF and byteOrder bigEndian - no change from above
when bitOrder MSBF and byteOrder littleEndian - <foo>DCB7D5</foo> (reversed from above)
when bitOrder LSBF and byteOrder littleEndian - <foo>3DF56E</foo> (reversed from above)
when bitOrder LSBF and byteOrder bigEndian (Not allowed in DFDL now) - no change from above.
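The relationship between the two interpretations in the lists above can be sketched in Python: under littleEndian the binary-number string is the byte-string result with its bytes reversed, and under bigEndian it is unchanged. The function name binary_number_hex is hypothetical, and the sketch starts from the byte-string hex rather than from the raw bits.

```python
def binary_number_hex(byte_string_hex: str, byte_order: str) -> str:
    """Turn a byte-string-way result into the binary-number-way result.

    byte_order is 'bigEndian' or 'littleEndian'.
    """
    raw = bytes.fromhex(byte_string_hex)
    if byte_order == "littleEndian":
        raw = raw[::-1]  # reverse whole bytes, not nibbles
    return raw.hex().upper()


# 24-bit examples
print(binary_number_hex("EADBEE", "littleEndian"))  # EEDBEA (MSBF)
print(binary_number_hex("DDEAFB", "littleEndian"))  # FBEADD (LSBF)
# 22-bit examples
print(binary_number_hex("D5B7DC", "littleEndian"))  # DCB7D5 (MSBF)
print(binary_number_hex("6EF53D", "littleEndian"))  # 3DF56E (LSBF)
```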

My evaluation of this is that the numeric treatment here is actually a bit problematic because a hexBinary is not a number represented in base 16 - conceptually it is a byte array.

If I look at the first (leftmost) pair of hex digits in the XML infoset, I expect to be able to look at the data stream and find that bit pattern. True, I must know the bitOrder. But if I throw byte order into the mix, I potentially have to go to the end of the hexBinary (and these can be quite big; it could be screenfuls or megabytes of data away) to find the hex digits that correspond to the current location in the data stream.

This is no different than for a base 10 number, but because those are base 10 I'm never going to be doing that for a giant base 10 number.

Conclusion.

I see no advantage to the BINARY NUMBER way over the BYTE STRING way. It changes what you get based on byte order which seems unnecessary.  I think the added flexibility is not required.




Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy



On Tue, Dec 4, 2018 at 4:10 AM Steve Hanson <smh@uk.ibm.com> wrote:
I agree that bitOrder is needed, not byteOrder.  If you want to parse the data as an integer, then fine, but that is not the case here: you are parsing the data as hexBinary. The analogy is with your parsing of text strings where the encoding is one where the character size is not a multiple of 8 bits; you use bitOrder but not byteOrder.

Regards
 
Steve Hanson

IBM Hybrid Integration, Hursley, UK
Architect,
IBM DFDL
Co-Chair,
OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday




From:        Stephen Lawrence <slawrence@tresys.com>
To:        Steve Hanson <smh@uk.ibm.com>, "mbeckerle.dfdl@gmail.com" <mbeckerle.dfdl@gmail.com>
Cc:        DFDL-WG <dfdl-wg@ogf.org>
Date:        30/11/2018 18:10
Subject:        Re: [DFDL-WG] Action 292 - version 2 proposal for hexBinary with lengthUnits bits





As an example of why I feel bitOrder and byteOrder apply if supporting
hexBinary with non-byte size lengths or starting on non-byte boundaries,
let's say we had the following data:

 11011111 11010001 = 0xDFD1

And we want to model this as one 12-bit unsigned int followed by one
4-bit unsigned int, all with bitOrder=LSBF and byteOrder=LE. We would
have a schema like so:

 <dfdl:format
   lengthKind="explicit"
   lengthUnits="bits"
   bitOrder="leastSignificantBitFirst"
   byteOrder="littleEndian" />

 <xs:sequence>
   <xs:element name="foo" dfdl:length="12" type="xs:unsignedInt" />
   <xs:element name="bar" dfdl:length="4" type="xs:unsignedInt" />
 </xs:sequence>

The above data would parse as:

 <foo>479</foo> <!-- binary: 000111011111, hex 0x1DF -->
 <bar>13</bar> <!-- binary: 1101, hex 0xD -->

This is because, given the bit/byteOrder, "foo" is made up of the last
four bits of the second byte (0001) followed by all eight bits of the
first byte (11011111), resulting in a value of 479. The bitPosition
after "foo" is consumed is 12. Then "bar" consumes the remaining bits,
which are the first four of the second byte, resulting in a value of 13.

This all follows the specification as-is.
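The arithmetic above can be checked with a small Python sketch. With bitOrder LSBF and byteOrder LE, the whole buffer can be treated as one little-endian integer, and a field is then a shift and a mask. The function name parse_lsbf_le_uint is invented for this example, and start_bit here is 0-based.

```python
def parse_lsbf_le_uint(data: bytes, start_bit: int, n_bits: int) -> int:
    """Read an unsigned field with bitOrder LSBF and byteOrder littleEndian.

    start_bit is a 0-based bit position counted least significant bit first.
    """
    whole = int.from_bytes(data, "little")  # byte 0 holds the lowest bits
    return (whole >> start_bit) & ((1 << n_bits) - 1)


data = bytes([0b11011111, 0b11010001])  # 0xDFD1
foo = parse_lsbf_le_uint(data, 0, 12)
bar = parse_lsbf_le_uint(data, 12, 4)
print(foo, hex(foo))  # 479 0x1df
print(bar, hex(bar))  # 13 0xd
```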


Now, let's assume we instead wanted to represent "foo" as xs:hexBinary
that has a non-byte size length, e.g.:

 <xs:sequence>
   <xs:element name="foo" dfdl:length="12" type="xs:hexBinary" />
   <xs:element name="bar" dfdl:length="4" type="xs:unsignedInt" />
 </xs:sequence>

If we ignored bitOrder/byteOrder when parsing "foo" and just read the
first 12 bits (essentially BE MSBF), the result would be:

 <foo>0DFD</foo>

But just like before, the bitPosition after "foo" is consumed is 12. And
because the bit/byteOrder is LSBF LE, the bits that "bar" will consume
are again the first four of the second byte, with the result

 <bar>13</bar>

But this means that the last four bits in the data (0001) were never
consumed, and the first four bits in the second byte were consumed
twice, which must be wrong (a similar issue occurs when starting on a
non-byte boundary). So bitOrder/byteOrder must be taken into account
somehow in order to support hexBinary with non-bytesize lengths or
starting on a non-byte boundary, primarily because of how bitOrder=LSBF
works (which I believe was the original use-case for non-byte size
non-byte boundary hexBinary).
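The double consumption can be made visible with a small Python sketch that records which physical bits each element touches. Each bit is identified by (byte index, bit significance), where significance 7 is the leftmost bit of a byte as written. The function name consumed and the 'MSBF'/'LSBF' labels are made up for this illustration.

```python
def consumed(first_pos: int, n_bits: int, bit_order: str) -> set:
    """Return the (byte index, bit significance) pairs read by a field.

    first_pos is the 0-based bit position at which the field starts.
    """
    cells = set()
    for pos in range(first_pos, first_pos + n_bits):
        k = pos % 8
        sig = 7 - k if bit_order == "MSBF" else k
        cells.add((pos // 8, sig))
    return cells


foo_cells = consumed(0, 12, "MSBF")   # hexBinary read while ignoring LSBF
bar_cells = consumed(12, 4, "LSBF")   # unsignedInt read per the format
all_cells = {(b, s) for b in range(2) for s in range(8)}

twice = sorted(foo_cells & bar_cells)
never = sorted(all_cells - foo_cells - bar_cells)
print("read twice:", twice)  # [(1, 4), (1, 5), (1, 6), (1, 7)]
print("never read:", never)  # [(1, 0), (1, 1), (1, 2), (1, 3)]
```

In the second byte 11010001, significance 4 to 7 is the 1101 that is read twice, and significance 0 to 3 is the 0001 that is never read.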

If instead we do not ignore bit/byteOrder, there must be some way to
determine how to get those bits into a hexBinary representation. There
are probably a few different ways to handle this, but after some
discussions and interpretations of the XSD spec, we determined that the
best way to handle it was to just read the bits as if they were a
nonNegativeInteger (which does take into account bit/byteOrder) and then
convert those bits to a hex representation. For BE MSBF the result is
exactly the same. For LE MSBF, it results in the hexBinary being
flipped, which is where the Daffodil implementation is inconsistent with
spec.
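As a rough Python sketch of "read as a nonNegativeInteger, then convert to hex": the integer value is formatted in base 16. The zero-padded width of two hex digits per started byte is an assumption of this sketch, and hex_from_uint is a made-up name, not a Daffodil function.

```python
def hex_from_uint(value: int, n_bits: int) -> str:
    """Format an already-parsed unsigned value as an upper-case hex string.

    Assumes two hex digits per started byte, with leading zeros.
    """
    n_digits = 2 * ((n_bits + 7) // 8)
    return format(value, "0{}X".format(n_digits))


# 479 is the value of the 12-bit "foo" from the LSBF LE example.
print(hex_from_uint(479, 12))  # 01DF
```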




On 11/29/18 10:19 AM, Steve Hanson wrote:
> Mike
>
> I'm a bit lost on this now.  The concept of applying lengthUnits='bits' to
> xs:hexBinary is straightforward. It just counts bits. Bit order or byte order is
> irrelevant, in the same way that it is irrelevant when counting bytes for a hex
> binary. The only thing to note is that the fillByte needs to be used to make up
> whole bytes.
>
> I'm missing something here.
>
> Regards
>
> Steve Hanson
>
> IBM Hybrid Integration, Hursley, UK
> Architect, IBM DFDL <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> smh@uk.ibm.com
> tel:+44-1962-815848
> mob:+44-7717-378890
> Note: I work Tuesday to Friday
>
>
>
> From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
> To: DFDL-WG <dfdl-wg@ogf.org>
> Date: 20/11/2018 17:33
> Subject: [DFDL-WG] Action 292 - version 2 proposal for hexBinary with lengthUnits bits
> Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
>
> --------------------------------------------------------------------------------
>
>
>
> Users want a way to express an arbitrary unaligned string of bits, with the
> appearance in the infoset being hexadecimal, not base 10.
>
> Right now the only way I can see to meet this requirement while retaining
> backward compatibility would be a new DFDL property.
>
> So here's the new idea:
>
> Property dfdl:hexBinaryRep with values 'bytes' or 'bits'. New property, so
> defaulting (with suppressible warning) to 'bytes' for backward compatibility in
> schemas not having the property.
>
> When set to 'bits', then type xs:hexBinary would behave just like
> xs:nonNegativeInteger, and all properties relevant to that type would be
> applicable, and any use of XSD length facets on such elements would be an SDE.  
> The hexBinary string would be exactly same as if you took the numeric value for
> a nonNegativeInteger and instead of presenting it as base 10 digits, you use
> base 16 digits.
>
>
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
> Please note: Contributions to the DFDL Workgroup's email discussions are subject
> to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
> --
>   dfdl-wg mailing list
>   dfdl-wg@ogf.org
>   https://www.ogf.org/mailman/listinfo/dfdl-wg
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
>
>



