I verified that Java's representation of Unicode codepoints above U+FFFF is

(a) an int, for a Unicode character code handled outside of a String.
(b) a pair of surrogate code units when represented in a Java String.

There are now a variety of methods that take or return an int codepoint, which can be as large as U+10FFFF, and which interact with either 1 or 2 chars of a String.
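For example, a quick sketch using the standard java.lang codepoint APIs (my illustration; the choice of U+1D11E, MUSICAL SYMBOL G CLEF, is arbitrary):

public class CodepointDemo {
    public static void main(String[] args) {
        int cp = 0x1D11E; // a codepoint above U+FFFF, held as an int

        System.out.println(Character.charCount(cp));                // 2: needs a surrogate pair
        System.out.println(Character.isSupplementaryCodePoint(cp)); // true

        String s = new String(Character.toChars(cp));               // String form: 2 Java chars
        System.out.println(s.length());                             // 2
        System.out.println(Integer.toHexString(s.codePointAt(0)));  // "1d11e": read from both chars

        System.out.println(Character.isHighSurrogate(s.charAt(0))); // true
        System.out.println(Character.isLowSurrogate(s.charAt(1)));  // true
    }
}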

For UTF-16: a character requiring more than 16 bits is represented as the 2 code units of a surrogate pair, and each of those code units becomes a Java char.

When utf16Width is 'variable', this surrogate pair counts as 1 Unicode character for purposes of length in 'characters' units. This is the feature I believe should be optional in DFDL.
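To make the two counting conventions concrete, here is a sketch of the contrast (my example, not normative DFDL text):

public class LengthDemo {
    public static void main(String[] args) {
        String s = "a\uD834\uDD1Eb"; // 'a', then U+1D11E as a surrogate pair, then 'b'

        // Counting in Java chars (UTF-16 code units): the pair counts as 2.
        System.out.println(s.length());                       // 4

        // Counting in Unicode characters (codepoints): the pair counts as 1.
        System.out.println(s.codePointCount(0, s.length()));  // 3
    }
}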

For UTF-8: a character requiring more than 16 bits is likewise held in a Java String as the 2 code units of a surrogate pair, even though the UTF-8 encoding itself expresses it directly as a single 4-byte sequence.

However, we have no property for indicating that this surrogate pair counts as 1 Unicode character for purposes of length in 'characters' units.
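A sketch of the mismatch (again my illustration): the supplementary character is one 4-byte sequence in the UTF-8 encoded data, but still occupies two chars on the Java side:

import java.nio.charset.StandardCharsets;

public class Utf8WidthDemo {
    public static void main(String[] args) {
        String s = new String(Character.toChars(0x1D11E)); // 1 codepoint, 2 Java chars

        System.out.println(s.getBytes(StandardCharsets.UTF_8).length); // 4 bytes in UTF-8
        System.out.println(s.length());                                 // 2 Java chars
        System.out.println(s.codePointCount(0, s.length()));            // 1 Unicode character
    }
}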

For UTF-32: same issue. A single UTF-32 codepoint may have to be represented as a surrogate pair in a Java String.

However, we have no property for indicating that this surrogate pair counts as 1 Unicode character.
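The same demonstration for UTF-32, where every codepoint is exactly 4 bytes in the encoded form (the UTF-32BE charset is not a StandardCharsets constant, but is available by name on standard JDKs):

import java.nio.charset.Charset;

public class Utf32WidthDemo {
    public static void main(String[] args) {
        Charset utf32be = Charset.forName("UTF-32BE");
        String s = new String(Character.toChars(0x1D11E)); // 1 codepoint, 2 Java chars

        byte[] encoded = s.getBytes(utf32be);
        System.out.println(encoded.length);                  // 4: one fixed-width codepoint
        System.out.println(s.length());                      // 2: surrogate pair in the String
        System.out.println(s.codePointCount(0, s.length())); // 1 Unicode character
    }
}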

...mikeb


Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy