
I verified that Java's representation of Unicode codepoints above U+FFFF is:

(a) an int, for a Unicode character code handled outside of a String;
(b) a pair of surrogate code units when represented in a Java String.

There are now a variety of methods that take or return an int codepoint, which can be as large as U+10FFFF, and which correspond to either 1 or 2 chars of a String. (A short Java sketch at the end of this message illustrates this.)

For UTF-16: a character requiring more than 16 bits is represented as the 2 code units of a surrogate pair, and each of those code units becomes one Java char. When utf16Width is 'variable', this surrogate pair counts as 1 Unicode character for length-in-'characters' purposes. This is the feature I believe should be optional in DFDL.

For UTF-8: a character requiring more than 16 bits likewise ends up as the 2 code units of a surrogate pair in the Java String. However, we have no property for indicating that this surrogate pair counts as 1 Unicode character for length-in-'characters' purposes.

For UTF-32: same issue. A single UTF-32 code unit for a codepoint above U+FFFF must be represented as a surrogate pair in a Java String, yet we have no property for indicating that this surrogate pair counts as 1 Unicode character.

...mikeb

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
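
P.S. For concreteness, here is a minimal, self-contained sketch (plain Java SE; the class name SupplementaryDemo and the choice of U+10400 are just illustrative, and UTF-32BE is looked up by name since it is not among the guaranteed StandardCharsets) showing the int-vs-surrogate-pair duality and the code-units-vs-characters length distinction described above:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) {
            // A supplementary-plane codepoint held as an int (can be up to U+10FFFF).
            int cp = 0x10400; // DESERET CAPITAL LETTER LONG I

            // In a String it becomes a surrogate pair: two Java chars, one character.
            String s = new String(Character.toChars(cp));
            System.out.println(s.length());                       // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));  // 1 (Unicode characters)
            System.out.println(Character.charCount(cp));          // 2 (chars needed for cp)

            // The same single character, encoded three ways:
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 4 bytes
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 4 bytes = 2 code units
            System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 4 bytes = 1 code unit
        }
    }

The code-unit counts differ per encoding, but codePointCount reports 1 in every case, which is the length-in-'characters' notion the missing properties would need to pin down.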