binaryNumberRep limitations, xs:decimal and binaryDecimalVirtualPoint

Rather long email, apologies in advance.

We are encountering many issues where DFDL is too limited in its binary number representations.

6 points are made in this email.

To not bury the lead, which is Point 6, I am proposing that we make dfdl:binaryNumberRep an extensible enum, allowing QNames as values so that implementations can extend the set of supported binary number representations. E.g., dfdl:binaryNumberRep='dfdlx:onesComplement', or dfdl:binaryNumberRep='daffodil:signedASN1BERVariableLengthInteger'

Point 1: I think this is just a needed clarification. But Points 1 and 2 drove the motivation for this whole discussion.

The description of dfdl:binaryDecimalVirtualPoint says it is allowed on types whose *base* is xs:decimal.

That term 'base' is confusing. It can mean that the property is allowed on any type derived from xs:decimal, as all the signed and unsigned integer types are. Or it can mean literally <restriction base="xs:decimal"> .... </restriction>. That is, the base attribute of the simple type restriction must be xs:decimal, or type="xs:decimal" must appear directly on the element.

I believe the latter is the only thing that makes sense. If dfdl:binaryDecimalVirtualPoint is positive, a type such as the one below is nonsense: the type says it is an unsigned integer, but the binaryDecimalVirtualPoint says to divide by 100, so that 36000 would become 360.00, a non-integer. The infoset <angle>360.00</angle> makes no sense for an integer type, and would not validate when treating the DFDL schema as an XSD.

    <element name="angle" type="with2DecimalFractionDigits"/>
    <simpleType name="with2DecimalFractionDigits"
                dfdl:binaryDecimalVirtualPoint="2"> <!-- will divide by 100 -->
      <restriction base="xs:unsignedShort"> <!-- makes no sense. Has to be xs:decimal -->
        <maxInclusive value="360"/>
      </restriction>
    </simpleType>

So based on that reasoning, I think we need to improve the clarity and say that dfdl:binaryDecimalVirtualPoint applies to type xs:decimal alone.
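To make the arithmetic concrete, here is a minimal Python sketch. It assumes, as the schema comment above does, that a virtual point of N simply divides the stored integer by 10^N. The helper names are hypothetical and not part of DFDL or any implementation. The same arithmetic also gives the 16-bit limits that come up in Point 2.

```python
from decimal import Decimal


def apply_virtual_point(raw: int, virtual_point: int) -> Decimal:
    """Scale a stored integer by 10**-virtual_point (assumed semantics)."""
    return Decimal(raw).scaleb(-virtual_point)


def max_raw(bits: int, signed: bool) -> int:
    """Largest stored integer for a plain binary field of the given width."""
    return (1 << (bits - 1)) - 1 if signed else (1 << bits) - 1


# 36000 with two fraction digits is 360.00, which is not an integer value
# in the lexical sense an xs:unsignedShort infoset item would require.
angle = apply_virtual_point(36000, 2)

# Largest representable values for a 16-bit field with virtual point 2.
signed_limit = apply_virtual_point(max_raw(16, signed=True), 2)
unsigned_limit = apply_virtual_point(max_raw(16, signed=False), 2)
```

Under these assumptions a signed 16-bit field tops out below 360, while an unsigned one does not.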
Removing the word 'base' from the property description would eliminate the subtype confusion.

That brings me to the next related point.

Point 2: Need for binaryNumberRep='unsignedBinary'

We have a use case we cannot express.

An angle from 0 to 360 degrees is represented by an unsigned 16-bit binary integer which, if dfdl:binaryDecimalVirtualPoint works with that, can represent 000.00 to 360.00.

    <simpleType name="angle360" dfdl:lengthKind="explicit" dfdl:length="16"
                dfdl:binaryNumberRep="binary" dfdl:binaryDecimalVirtualPoint="2">
      <restriction base="xs:decimal">
        <minInclusive value="0"/>
        <maxInclusive value="360"/>
      </restriction>
    </simpleType>

However, we have no way to say that the above is to be an unsigned binary integer representation. dfdl:binaryNumberRep='binary' means unsigned binary if the type is unsigned, and means signed twos-complement if the type is signed, which xs:decimal is.

So my definition of angle360 above is no good, as the maximum positive value is 327.67, which is insufficient.

I think we need to revise dfdl:binaryNumberRep to allow for distinguishing binary unsigned from twosComplement signed types, as well as the packed types.

Note that the packed types allow 'bcd', which is an unsigned representation, so there is sort of a precedent there for allowing the binaryNumberRep to be unsigned even if the type is signed-capable.

There is already a proposal to add 'offsetBinary' as a signed binary integer representation. https://github.com/OpenGridForum/DFDL/issues/7 So I'm suggesting adding 'unsignedBinary' as well.

So the complete set (so far) would be dfdl:binaryNumberRep:
- 'twosComplementBinary' (with legacy 1.0 name 'binary')
- 'unsignedBinary'
- 'offsetBinary'
- 'packed'
- 'bcd'
- 'ibm4690Packed'

Point 3: There are other binary integer representations that will be needed.

This table comes from a format specification we use:

[image: image.png]

Ignore the 'Logical' column above, that's about enums. Ignore the "*", which only marks the value suggested for use when one must be reserved as an in-band null indicator.

What is called 'Mod Twos Complement' here is what our existing proposed DFDL 2.0 feature calls 'offsetBinary'.

So this table suggests the need for 'unsignedBinary' (already mentioned), but also two others: 'signPlusMagnitudeBinary', and 'onesComplementBinary'.

Point 4: Zig Zag Integer representation is getting popular

There's one other representation I know of which is more recent/modern, called zig-zag integers, popularized by Google Protocol Buffers. It's a clever representation and seemingly used in many places now.

    Binary Value    Zig Zag
    000              0
    001             -1
    010              1
    011             -2
    100              2
    101             -3
    110              3
    111             -4

Point 5: Variable Length Binary Integers

There are also variable-length integer formats that are not just strings of bits. A common one I have seen is used by the ASN.1 BER representation: each byte contributes 7 bits to the value, and an MSB of 1 indicates that the integer extends into an additional byte. Unsigned integers are just the concatenation of these bits.

Signed integers are handled after the bits are concatenated together. If the first bit of the concatenation is 1, the value is a twos-complement negative value. Hence, if a positive value would have a first bit of 1, then an additional byte containing 10000000 must be used as the most significant byte so that the first bit will not be 1.

There is no way in DFDL to represent such a variable-length integer representation and get an integer in the infoset. You have to use a hexBinary byte array.

There is a need for a variable-length integer like this to support not only explicit length (used by ASN.1 BER), but implicit length as well. In this case the last byte of the variable-length integer does not have the MSB set. Hence, a single byte can represent signed -64 to +63, or unsigned 0 to 127. Outside that range multiple bytes must be used, each byte contributing 7 bits.
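As a rough illustration only, the zig-zag mapping from Point 4 and the implicit-length variable-length scheme from Point 5 could be decoded along the following lines. This is a minimal Python sketch, not a proposal for how a DFDL implementation should do it. The function names are hypothetical, and the byte order (most significant 7-bit group first) is an assumption taken from the "most significant byte" wording above.

```python
from typing import Tuple


def zigzag_decode(n: int) -> int:
    """Map an unsigned zig-zag code to its signed value (0, -1, 1, -2, ...)."""
    return (n >> 1) ^ -(n & 1)


def zigzag_encode(value: int, bits: int) -> int:
    """Map a signed value to its zig-zag code in a field of `bits` bits."""
    return ((value << 1) ^ (value >> (bits - 1))) & ((1 << bits) - 1)


def decode_varint(data: bytes, signed: bool) -> Tuple[int, int]:
    """Decode one variable-length integer from the start of `data`.

    Each byte contributes its low 7 bits; an MSB of 1 means another byte
    follows. Groups are assumed to arrive most significant first.
    Returns (value, number_of_bytes_consumed).
    """
    value = 0
    nbits = 0
    for index, byte in enumerate(data):
        value = (value << 7) | (byte & 0x7F)
        nbits += 7
        if not byte & 0x80:  # MSB clear: this was the last byte
            # For signed values, a leading 1 bit in the concatenation
            # marks a twos-complement negative number.
            if signed and value & (1 << (nbits - 1)):
                value -= 1 << nbits
            return value, index + 1
    raise ValueError("input ended before the final byte of the integer")
```

For example, decode_varint(bytes([0x40]), signed=True) yields -64, whereas +64 needs the extra leading byte: decode_varint(bytes([0x80, 0x40]), signed=True) yields 64 from two bytes.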
This suggests a need for several additional dfdl:binaryNumberRep enums.

Point 6: Extensibility by implementations is needed here

There are many other representations out there as well.

I think we should have a convention where there is a core set that all DFDL implementations must provide, and a convention by which DFDL implementations can provide additional support.

To me, a good way to do this is to allow the enum values for dfdl:binaryNumberRep to be not only regular enums (all of which are reserved) but also QNames, where the prefix can be for a namespace recognized by an implementation for providing an extended set of binary number representations. (Perhaps the dfdlx: prefix and namespace, or maybe we just allow implementation-specific namespaces?)

This means of extending enums for existing properties is not part of our existing 'experimental features' conventions, but I propose that it should be added.

To me, this is a good way to generally allow property enums to be extended with experimental features in DFDL implementations. It applies to other places such as dfdl:binaryCalendarRep, and numerous other properties where we are finding a need for additional enums and want to add them as experimental features.

That was long. Thanks for your consideration.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com

DFDL WG agree to
- clarify the meaning of 'base simple type' for the property dfdl:binaryDecimalVirtualPoint (added to https://github.com/OpenGridForum/DFDL/issues/28)
- extend the experimental feature syntax to allow additional enums on existing properties (action 328)
- add extra enums to dfdl:binaryNumberRep as a DFDL 1.0 experimental feature (action 329)

Regards
Steve Hanson
IBM Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-7717-378890
Note: I work Tuesday to Friday

-----Original Message-----
From: Mike Beckerle <mbeckerle@apache.org>
Reply-To: mbeckerle@apache.org
To: DFDL-WG <dfdl-wg@ogf.org>
Subject: [EXTERNAL] [DFDL-WG] binaryNumberRep limitations, xs:decimal and binaryDecimalVirtualPoint
Date: Wed, 08 Jun 2022 13:59:16 -0400

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless otherwise stated above:
IBM United Kingdom Limited
Registered in England and Wales with number 741598
Registered office: PO Box 41, North Harbour, Portsmouth, Hants. PO6 3AU

We need the same extensibility mechanism for binaryCalendarRep. The existing binarySeconds value is signed. We recently found we needed a 32-bit unsigned version, and it is quite hard to work around this without adding lots of user-defined functions, such as date/time/datetime constructors that take integer arguments.

I've not yet seen a need to extend binaryFloatRep.

The binaryBooleanTrueRep and binaryBooleanFalseRep properties are not analogous, as they are values, not enums of kinds of representations.

On Thu, Jul 14, 2022 at 12:43 PM Steve Hanson <smh@uk.ibm.com> wrote:
DFDL WG agree to
- clarify the meaning of 'base simple type' for the property dfdl:binaryDecimalVirtualPoint (added to https://github.com/OpenGridForum/DFDL/issues/28)
- extend the experimental feature syntax to allow additional enums on existing properties (action 328)
- add extra enums to dfdl:binaryNumberRep as a DFDL 1.0 experimental feature (action 329)
participants (2)
- Mike Beckerle
- Steve Hanson