I seem to recall hitting the same problem with X12 fixed length numerics which could be positive or negative. If I get time before our next call I'll re-familiarise myself with the X12 issue and what I did to work round it.

Regards
Steve

On Thu, Jan 11, 2024 at 6:25 PM Mike Beckerle <mbeckerle@apache.org> wrote:
We've run into an issue with dfdl:textNumberPattern, which is an ICU number pattern. I'll describe it here and then suggest that a fix is needed in DFDL generally, though we should discuss that hypothesis.

The motivating example is fixed-length, 5-character integer text data. The data ranges from -9999 to 99999. Note that the minus sign uses up one of the 5 characters that would otherwise hold a digit for positive values.

Consider the value -123 and a textNumberPattern of "00000;-0". The value unparses as -00123, which is 6 characters long, so one character too many.
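
To make the length problem concrete, here is a minimal Python sketch. It does not call ICU; format_fixed_digits is a hypothetical helper that only imitates how the pattern "00000;-0" behaves as described here, assuming the digit count comes from the positive subpattern and the negative subpattern contributes just its prefix.

```python
def format_fixed_digits(value: int, min_digits: int = 5, neg_prefix: str = "-") -> str:
    """Imitate "00000;-0": the positive subpattern fixes the digit count,
    the negative subpattern supplies only the prefix (assumed behavior)."""
    digits = str(abs(value)).zfill(min_digits)
    return (neg_prefix if value < 0 else "") + digits


for v in (123, -123):
    text = format_fixed_digits(v)
    print(f"{v} -> {text!r} (length {len(text)})")
```

The positive value fits in 5 characters, while the negative value picks up the sign on top of the 5 digits and ends up 6 characters long.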

The padding feature of ICU number patterns can be used to "fix" this. Consider textNumberPattern="*0####0". The "*0" notation means to use 0 as the pad character, filling the output out to the width of the pattern (5 characters here) when the formatted number is shorter.
Now the value 123 unparses as 00123, but here's the problem: -123 unparses as 0-123. Notice that the zero padding appears before the minus sign, when we wanted it to appear after.

This problem is caused by ICU taking nearly all the information from the positive part of the textNumberPattern. The negative part of the pattern, if it exists, is used only to define the affixes (prefix, suffix, or both) that indicate negative values.

The problem is that positive numbers commonly have no affix, so the position of padding characters relative to the affix cannot always be determined from the positive pattern alone. 

Hence, if textNumberPattern specifies a pad character before the number pattern and without a positive prefix, then ICU defaults to a pad position of PAD_BEFORE_PREFIX with no way to change it with just the pattern. 

This behavior is reasonable for most cases, like when the pad character is a space. However, if the pad character in textNumberPattern is '0', then negative numbers are padded with a '0' before the negative sign. So we get the errant behavior where a pattern of "*0####0" unparses -123 to "0-123". This is very unlikely to be what the user wants with this pattern.
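
The two pad positions can be imitated in a few lines of Python. This is only a sketch of the behavior described above, not ICU itself: pad_number and its parameters are hypothetical names, and the width of 5 is assumed from the "####0" part of the pattern.

```python
def pad_number(value: int, width: int = 5, pad: str = "0",
               pad_after_prefix: bool = False) -> str:
    """Pad a signed integer to a fixed width.

    pad_after_prefix=False imitates PAD_BEFORE_PREFIX (pad, then sign);
    pad_after_prefix=True imitates PAD_AFTER_PREFIX (sign, then pad).
    """
    prefix = "-" if value < 0 else ""
    digits = str(abs(value))
    # The sign counts toward the width, so it reduces the padding needed.
    fill = pad * max(0, width - len(prefix) - len(digits))
    if pad_after_prefix:
        return prefix + fill + digits
    return fill + prefix + digits


print(pad_number(-123))                         # pad before the sign
print(pad_number(-123, pad_after_prefix=True))  # pad after the sign
print(repr(pad_number(-123, pad=" ")))          # space pad before the sign
```

With padding before the prefix, -123 comes out as 0-123; with padding after the prefix it comes out as -0123, which is what the fixed-length use case needs. With a space as the pad character, padding before the prefix gives " -123", which is why that default is reasonable there.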

Now suppose the positive pattern required a prefix "+" sign. The textNumberPattern of "+*0####0" works properly because ICU determines that the padding is PAD_AFTER_PREFIX from the positive pattern where the "*0" is after the "+" prefix. 

The fix for this issue that we're implementing in Daffodil is this: if both the negative and positive patterns define padding on the same affix, and the positive pattern has an empty string for that affix, then we use the pad position from the negative pattern. In all other cases, the pad character in the negative pattern is ignored, following usual ICU behavior.

For example, a textNumberPattern of "*0####0;-*00" formats a negative number with zero padding after the hyphen, whereas normal ICU behavior would ignore the negative pattern and zero pad before the hyphen.
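
One possible reading of the rule above, written as a small Python decision function. This is a sketch for discussion, not Daffodil's actual code: PadPosition, affix_kind and effective_pad_position are hypothetical names, and the inputs are assumed to have already been extracted from the two subpatterns.

```python
from enum import Enum
from typing import Optional


class PadPosition(Enum):
    # Names mirror the ICU pad positions mentioned in this thread.
    BEFORE_PREFIX = 0
    AFTER_PREFIX = 1
    BEFORE_SUFFIX = 2
    AFTER_SUFFIX = 3


def affix_kind(pos: PadPosition) -> str:
    """Return which affix a pad position is attached to."""
    if pos in (PadPosition.BEFORE_PREFIX, PadPosition.AFTER_PREFIX):
        return "prefix"
    return "suffix"


def effective_pad_position(positive_pad: PadPosition,
                           positive_prefix: str,
                           positive_suffix: str,
                           negative_pad: Optional[PadPosition] = None) -> PadPosition:
    """Choose the pad position under the proposed rule (assumed reading)."""
    if negative_pad is not None and affix_kind(negative_pad) == affix_kind(positive_pad):
        affix = positive_prefix if affix_kind(positive_pad) == "prefix" else positive_suffix
        if affix == "":
            # The positive pattern cannot say where padding sits relative to
            # an empty affix, so take the position from the negative pattern.
            return negative_pad
    # Usual ICU behavior: padding in the negative pattern is ignored.
    return positive_pad


# "*0####0;-*00": positive pad before an empty prefix, negative pad after "-".
print(effective_pad_position(PadPosition.BEFORE_PREFIX, "", "", PadPosition.AFTER_PREFIX))
```

For "*0####0;-*00" this yields the after-prefix position, so the zeros follow the hyphen. When the positive pattern has a non-empty prefix, as in "+*0####0", or when the negative pattern specifies no padding, the position from the positive pattern stands.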


I would suggest that this fix is needed in DFDL generally.


The workarounds for the fixed-length number use case (-9999 to 99999) are very ugly: either treating the minus sign as an initiator and creating separate elements for positive and negative values, or punting on the integer type and treating the whole thing as a string.


Arguably, this might be considered an ICU bug. However, ICU's API can be used to specify the PAD_AFTER_PREFIX behavior, so ICU does let you achieve the needed behavior; it just cannot be achieved using only the ICU pattern string. ICU maintainers may or may not consider this to be a bug.



Mike Beckerle 
Apache Daffodil PMC | daffodil.apache.org
OGF DFDL Workgroup Co-Chair | www.ogf.org/ogf/doku.php/standards/dfdl/dfdl
Owl Cyber Defense | www.owlcyberdefense.com


--
  dfdl-wg mailing list
  dfdl-wg@lists.ogf.org
  https://lists.ogf.org/mailman/listinfo/dfdl-wg