ICU and dfdl:textNumberPattern issue

We've run into an issue with dfdl:textNumberPattern, which is an ICU number pattern. I'll discuss here, and then suggest this is a fix needed in DFDL generally, but we should discuss that hypothesis.

The motivating example is fixed length 5 character integer text data. The data ranges from -9999 to 99999. Note that the minus sign uses up one of the 5 characters that can be a digit for positive values.

Consider the value -123 and textNumberPattern of "00000;-0". The value unparses as -00123 which is length 6 so too long.

The padding feature of ICU number patterns can be used to "fix" this. Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as the padding character to replace the "#" when needed. Now the value 123 unparses as 00123 but ... here's the problem.... -123 unparses as 0-123. Notice how the zero padding is before the minus sign when we wanted it to appear after.

This problem is caused by ICU taking nearly all the information from the positive part of the textNumberPattern. The negative part of the pattern, if it exists, is used only to define the affix (prefix or suffix or both) that indicate negative values.

The problem is that positive numbers commonly have no affix, so the position of padding characters relative to the affix cannot always be determined from the positive pattern alone.

Hence, if textNumberPattern specifies a pad character before the number pattern and without a positive prefix, then ICU defaults to a pad position of PAD_BEFORE_PREFIX with no way to change it with just the pattern.

This behavior is reasonable for most cases, like when the pad character is a space. However, if the pad character in textNumberPattern is '0', then negative numbers are padded with a '0' before the negative sign. So we get the errant behavior where a pattern of "*0####0" unparses -123 to "0-123". This is very unlikely to be what the user wants with this pattern.

Now suppose the positive pattern required a prefix "+" sign. The textNumberPattern of "+*0####0" works properly because ICU determines that the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is after the "+" prefix.

The proposed fix to this issue that we're implementing in Daffodil is this: If both negative and positive patterns define padding on the same affix, and the positive pattern has an empty string for that affix, then we use the pad position from the negative pattern. In all other cases, the pad character in the negative pattern is ignored following usual ICU behavior.

For example, a textNumberPattern of "*0####0;-*00" formats a negative number with zero padding after the hyphen, whereas normal ICU behavior would ignore the negative pattern and zero pad before the hyphen.

I would suggest this is something that is a needed fix in general for DFDL.

The workaround when you have the fixed length number use case (-9999 to 99999) is very ugly, treating the minus sign as an initiator and creating separate elements for positive and negative values, or punting on this integer and treating the whole thing as a string.

Arguably, this might be considered an ICU bug, but ICU's API can be used to specify the PAD_AFTER_PREFIX behavior, so it's not like ICU doesn't let you achieve the needed behavior, it's just not something that can be achieved using only the ICU pattern string. But ICU maintainers may or may not consider this to be a bug.

Mike Beckerle
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com
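
To make the pad-position behavior concrete, here is a small Python sketch. It is a toy model only, not ICU and not Daffodil code: the names format_padded and choose_pad_position are made up for illustration, the format width and minimum digit count are passed in by hand rather than derived from a pattern string, and only prefix padding is modeled. The enum names simply mirror ICU's PAD_BEFORE_PREFIX and PAD_AFTER_PREFIX.

```python
from enum import Enum


class PadPosition(Enum):
    # Names mirror ICU's pad-position constants; the logic is a toy model.
    BEFORE_PREFIX = "before_prefix"
    AFTER_PREFIX = "after_prefix"


def format_padded(value, min_digits, width, pad_char, pad_position, neg_prefix="-"):
    """Format an integer, then pad it out to `width` at the given position."""
    prefix = neg_prefix if value < 0 else ""
    digits = str(abs(value)).zfill(min_digits)
    padding = pad_char * max(0, width - len(prefix) - len(digits))
    if pad_position is PadPosition.BEFORE_PREFIX:
        return padding + prefix + digits
    return prefix + padding + digits


def choose_pad_position(positive_prefix, positive_pad, negative_pad):
    """Hypothetical rendering of the proposed rule, prefix padding only.

    positive_pad / negative_pad are the pad positions written in each
    subpattern, or None when that subpattern has no pad specifier.
    """
    both_pad = positive_pad is not None and negative_pad is not None
    if both_pad and positive_prefix == "":
        return negative_pad  # positive pattern cannot disambiguate
    return positive_pad      # usual behavior: negative pad is ignored


BEFORE, AFTER = PadPosition.BEFORE_PREFIX, PadPosition.AFTER_PREFIX

# "00000;-0": five mandatory digits, no padding, so the sign adds a 6th character.
print(format_padded(-123, 5, 5, "0", BEFORE))  # -00123

# "*0####0": pad lands before the (empty or "-") prefix.
print(format_padded(123, 1, 5, "0", BEFORE))   # 00123
print(format_padded(-123, 1, 5, "0", BEFORE))  # 0-123

# "*0####0;-*00" under the proposed rule: take the negative pattern's position.
fixed = choose_pad_position("", BEFORE, AFTER)
print(format_padded(-123, 1, 5, "0", fixed))   # -0123
```

The point of the sketch is only that the same digits, width and pad character give "0-123" or "-0123" depending on one pad-position setting, and that the positive pattern alone cannot express that setting when its prefix is empty.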

Ok, after a week agonizing over this problem, a simple textNumberPattern has been found (not by me) which works for this 5-character -9999 to 99999 problem.

dfdl:textNumberPattern="*0#0000"

This works because the pad character ICU uses is zero, but it is only needed as a pad character for positive numbers. Negative numbers will be 4 digits to start with, the minus sign takes the position reserved by the "#", and no padding is needed.

The suggested fix described in this thread is still needed for obscure situations where the padding is a non-numeric character, such as spaces after the sign, so for numbers like "+ 1234.56", or even "+ xxxx1234.56", where 'x' as padding needs to be removed on number parsing.

Follow up question though. Has anyone ever seen *data* like that, with the "x" for leading numeric unused digits? (I don't mean printed on a check - we've all seen that, I mean in data.)

This representation is clearly possible. I just don't know of any real-world example of it. I made it up based on having seen checks printed that way.

I have definitely seen "+ 1234.56" where the sign is first in the field regardless of the length of the number. I've also seen " 1234.56+" where the sign is trailing. But I can't say I've seen "+ xxxx1234.56" in data.

(In reply to Mike Beckerle's message of Thu, Jan 11, 2024 at 1:25 PM.)
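
As a sanity check on the "*0#0000" pattern, here is a rough Python sketch that walks the whole -9999 to 99999 range. It is a hand-written model and not ICU: the function name unparse_model is hypothetical, and it assumes the pattern means a minimum of 4 integer digits, a format width of 5, and zero padding placed before the prefix.

```python
def unparse_model(value):
    """Toy model of "*0#0000" (assumed: min 4 digits, width 5, pad '0' before prefix)."""
    prefix = "-" if value < 0 else ""
    digits = str(abs(value)).zfill(4)   # the four mandatory "0" digit positions
    text = prefix + digits
    return text.rjust(5, "0")           # pad on the left, i.e. before any prefix


outputs = {v: unparse_model(v) for v in range(-9999, 100000)}

# Every value in range fits in exactly 5 characters.
assert all(len(s) == 5 for s in outputs.values())

# Negative values already fill the width, so no pad ever lands before the sign.
assert all(s.startswith("-") for v, s in outputs.items() if v < 0)

# The text converts back to the same integer.
assert all(int(s) == v for v, s in outputs.items())

print(unparse_model(123))    # 00123
print(unparse_model(-123))   # -0123
print(unparse_model(-9999))  # -9999
print(unparse_model(99999))  # 99999
```

Under those assumptions the pad position never matters for this range, which matches the explanation above: padding is only ever applied to positive values.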

I seem to recall hitting the same problem with X12 fixed length numerics which could be positive or negative. If I get time before our next call I'll re-familiarise myself with the X12 issue and what I did to work round it.

Regards
Steve

(In reply to Mike Beckerle's message of Thu, Jan 11, 2024 at 6:25 PM.)
-- dfdl-wg mailing list dfdl-wg@lists.ogf.org https://lists.ogf.org/mailman/listinfo/dfdl-wg
participants (2)
- Mike Beckerle
- Steve Hanson