I verified that Java's representation of Unicode code points above U+FFFF is
(a) an int, for a Unicode character code handled outside of a String, and
(b) a pair of surrogate code units (a surrogate pair) when represented in a Java String.
There are now a variety of methods that take or return an int code point, which can be as large as U+10FFFF and which corresponds to either 1 or 2 char code units of a String.
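For example, a minimal self-contained sketch of the JDK methods I mean (the class name is just for illustration; U+1F600 is an arbitrary supplementary character):

    public class CodePointDemo {
        public static void main(String[] args) {
            int cp = 0x1F600;                            // a code point above U+FFFF, held as an int

            char[] units = Character.toChars(cp);        // its UTF-16 form: 2 chars (a surrogate pair)
            System.out.println(units.length);            // 2

            String s = new String(units);
            System.out.println(s.codePointAt(0) == cp);  // true: reads the full int code point back out
        }
    }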
For UTF-16: a character requiring more than 16 bits is represented as the 2 code units of a surrogate pair, and each of those becomes a Java char.
When utf16Width is 'variable', such a surrogate pair counts as 1 Unicode character when length is measured in 'characters'. This is the feature I believe should be optional in DFDL.
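Here is a sketch of the two counting views, assuming (as I understand the property) that utf16Width='fixed' corresponds to counting 16-bit code units and utf16Width='variable' corresponds to counting code points:

    public class Utf16WidthDemo {
        public static void main(String[] args) {
            // One supplementary character (U+1F600) followed by "A".
            String s = new String(Character.toChars(0x1F600)) + "A";

            // Counting each 16-bit code unit as a character (the 'fixed'-like view):
            System.out.println(s.length());                        // 3

            // Counting the surrogate pair as one character (the 'variable'-like view):
            System.out.println(s.codePointCount(0, s.length()));   // 2
        }
    }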
For UTF-8: a character requiring more than 16 bits (4 bytes in the UTF-8 encoding) is also represented as the 2 code units of a surrogate pair once decoded into a Java String.
However, we have no property for indicating that this surrogate pair counts as 1 Unicode character when length is measured in 'characters' units.
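A sketch of the UTF-8 case (the 4 bytes below are the UTF-8 encoding of U+1F600; class name is illustrative only):

    import java.nio.charset.StandardCharsets;

    public class Utf8SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 encoded as 4 bytes in UTF-8.
            byte[] utf8 = { (byte) 0xF0, (byte) 0x9F, (byte) 0x98, (byte) 0x80 };
            String s = new String(utf8, StandardCharsets.UTF_8);

            System.out.println(s.length());                        // 2 - surrogate pair in the String
            System.out.println(s.codePointCount(0, s.length()));   // 1 - one Unicode character
        }
    }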
For UTF-32: same issue. A single code point that is 1 code unit in UTF-32 may have to be represented as a surrogate pair in a Java String.
Again, we have no property for indicating that this surrogate pair counts as 1 Unicode character.
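The same demonstration for UTF-32, assuming the JDK in use provides the UTF-32BE charset (it is in the extended charsets of typical JDKs, not in StandardCharsets):

    import java.nio.charset.Charset;

    public class Utf32SurrogateDemo {
        public static void main(String[] args) {
            // U+1F600 as a single 4-byte code unit in UTF-32BE.
            byte[] utf32 = { 0x00, 0x01, (byte) 0xF6, 0x00 };
            String s = new String(utf32, Charset.forName("UTF-32BE"));

            System.out.println(s.length());                        // 2 - surrogate pair in the String
            System.out.println(s.codePointCount(0, s.length()));   // 1 - one UTF-32 code unit / character
        }
    }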
...mikeb