
I verified that Java's representation of Unicode codepoints above U+FFFF is:

(a) an int, for a Unicode character code handled outside of a String;
(b) a pair of surrogate code units when represented in a Java String.

There are now a variety of methods that take or return an int codepoint, which can be as large as U+10FFFF, and which correspond to either 1 or 2 chars of a String. (A short Java sketch at the end of this message illustrates this.)

For UTF-16: a character requiring more than 16 bits is represented as the 2 code units of a surrogate pair, and each of those code units becomes one Java char. When utf16Width is 'variable', this surrogate pair counts as 1 Unicode character for length-in-'characters' purposes. This is the feature I believe should be optional in DFDL.

For UTF-8: a character requiring more than 16 bits likewise ends up as the 2 code units of a surrogate pair in the Java String. However, we have no property for indicating that this surrogate pair counts as 1 Unicode character for length-in-'characters' purposes.

For UTF-32: same issue. A single UTF-32 code unit for a codepoint above U+FFFF must be represented as a surrogate pair in a Java String, yet we have no property for indicating that this surrogate pair counts as 1 Unicode character.

...mikeb

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
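
P.S. For concreteness, here is a minimal, self-contained sketch (plain Java SE; the class name SupplementaryDemo and the choice of U+10400 are just illustrative, and UTF-32BE is looked up by name since it is not among the guaranteed StandardCharsets) showing the int-vs-surrogate-pair duality and the code-units-vs-characters length distinction described above:

    import java.nio.charset.Charset;
    import java.nio.charset.StandardCharsets;

    public class SupplementaryDemo {
        public static void main(String[] args) {
            // A supplementary-plane codepoint held as an int (can be up to U+10FFFF).
            int cp = 0x10400; // DESERET CAPITAL LETTER LONG I

            // In a String it becomes a surrogate pair: two Java chars, one character.
            String s = new String(Character.toChars(cp));
            System.out.println(s.length());                       // 2 (UTF-16 code units)
            System.out.println(s.codePointCount(0, s.length()));  // 1 (Unicode characters)
            System.out.println(Character.charCount(cp));          // 2 (chars needed for cp)

            // The same single character, encoded three ways:
            System.out.println(s.getBytes(StandardCharsets.UTF_8).length);      // 4 bytes
            System.out.println(s.getBytes(StandardCharsets.UTF_16BE).length);   // 4 bytes = 2 code units
            System.out.println(s.getBytes(Charset.forName("UTF-32BE")).length); // 4 bytes = 1 code unit
        }
    }

The code-unit counts differ per encoding, but codePointCount reports 1 in every case, which is the length-in-'characters' notion the missing properties would need to pin down.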