Q – What is the current thinking on which character set
encodings MUST be implemented by a conforming DFDL processor for version 1?
I have been performing tests with length-prefixed strings
and with terminated strings, to see what issues affect detecting the boundary
between a string and the binary data or terminator sequence that immediately
follows it.
For length-prefixed strings you need to be able either to
decode the byte array and iterate over the string character by character, or
to perform byte counting using only the byte stream and the bit patterns of
the bytes themselves.
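As a concrete illustration of the character-by-character approach, here is a
rough sketch using java.nio.charset.CharsetDecoder. It assumes the length
prefix counts UTF-16 code units (surrogate pairs would need extra care), and
the class and method names are just placeholders:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.Charset;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CoderResult;

    public class PrefixedStringScan {
        // Decode exactly charCount characters and report how many bytes
        // they occupied, so the parser can step past the string.
        static int bytesForChars(ByteBuffer in, int charCount, Charset cs) {
            CharsetDecoder dec = cs.newDecoder();
            CharBuffer out = CharBuffer.allocate(charCount);
            int start = in.position();
            CoderResult cr = dec.decode(in, out, true);
            if (cr.isError() || out.hasRemaining())
                throw new IllegalArgumentException("malformed or truncated input");
            return in.position() - start;
        }
    }

Because the decoder advances the input buffer's position as it converts, the
byte length falls out of the position delta with no per-encoding knowledge
required.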
Issue #1 – It will not be trivial to create all of the
test cases needed to fully exercise the corner cases for each encoding.
Obviously, the fewer encodings that have to be supported initially, the better
in terms of implementation effort.
Issue #2 - There is no current support for byte counting in
Java or ICU. For encodings that are fixed-width (purely single-byte, or a
constant number of bytes per character) the end of the string can be found by
examining the byte stream itself without performing any character decoding.
The available classes all convert entire buffers (or series of buffers), and
they also consume large amounts of the byte stream.
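One workaround I have been experimenting with is to starve the decoder with a
one-character output buffer, so each call consumes only the bytes for a single
character. A sketch with placeholder names (with the caveat that supplementary
characters would need a two-character buffer):

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;

    public class StepDecoder {
        // Decode one character and return the number of bytes it consumed;
        // repeated calls walk the string without converting, or consuming,
        // the whole buffer in one shot.
        static int nextCharWidth(ByteBuffer in, CharsetDecoder dec) {
            CharBuffer one = CharBuffer.allocate(1);
            int start = in.position();
            dec.decode(in, one, false); // returns OVERFLOW after one char
            return in.position() - start;
        }
    }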
For some encodings (e.g. UTF-8) an algorithmic process can
examine byte values and determine if a character consumes 1, 2 or more bytes.
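For UTF-8 the width follows directly from the high bits of the lead byte;
something along these lines (the names are mine):

    public class Utf8Width {
        // Width in bytes of a UTF-8 character, derived from its lead byte.
        static int width(byte lead) {
            int b = lead & 0xFF;
            if (b < 0x80)           return 1; // 0xxxxxxx
            if ((b & 0xE0) == 0xC0) return 2; // 110xxxxx
            if ((b & 0xF0) == 0xE0) return 3; // 1110xxxx
            if ((b & 0xF8) == 0xF0) return 4; // 11110xxx
            throw new IllegalArgumentException("not a lead byte: " + b);
        }
    }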
Still other encodings will need custom processes written,
either to decode and iterate over the string or to perform byte counting using
a specially designed table.
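In the table-driven case the table maps each possible lead byte to a character
width. The double-byte ranges below are the Shift_JIS/CP932 lead-byte ranges,
used purely to illustrate the approach; a real table would have to be built
per encoding:

    import java.util.Arrays;

    public class TableWidth {
        // WIDTHS[b] = width in bytes of a character whose lead byte is b.
        static final int[] WIDTHS = new int[256];
        static {
            Arrays.fill(WIDTHS, 1);                           // single-byte default
            for (int b = 0x81; b <= 0x9F; b++) WIDTHS[b] = 2; // double-byte leads
            for (int b = 0xE0; b <= 0xFC; b++) WIDTHS[b] = 2; // double-byte leads
        }

        static int byteLength(byte[] data, int offset, int charCount) {
            int pos = offset;
            for (int i = 0; i < charCount; i++)
                pos += WIDTHS[data[pos] & 0xFF];
            return pos - offset;
        }
    }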
As with Issue #1, the fewer encodings needing special
handling that must be supported initially, the fewer problems for
implementers.
Issue #3 – Some encodings have multiple possible byte
representations for the same character. If a terminator is specified as ‘END’
in a DFDL property, it must be converted to the proper encoding before it can
be searched for. The easiest way to do this is to encode it, convert the
encoded value to a byte array, and then search the input byte stream for a
match. But the binary file could contain bytes that express one encoding of
the character while the Java code is searching for the character using another
byte representation.
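The encode-and-search approach looks something like the following (placeholder
names; note it finds only the single representation produced by the encoder,
which is exactly the hazard just described):

    import java.nio.charset.Charset;

    public class TerminatorSearch {
        // Encode the terminator in the given charset and scan the raw
        // input for that exact byte sequence.
        static int indexOf(byte[] data, String terminator, Charset cs) {
            byte[] term = terminator.getBytes(cs);
            outer:
            for (int i = 0; i + term.length <= data.length; i++) {
                for (int j = 0; j < term.length; j++)
                    if (data[i + j] != term[j]) continue outer;
                return i;
            }
            return -1;
        }
    }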
Q – Does the DFDL spec need to allow a terminator to
be specified as a hex byte array, so that the exact byte sequence to search
for is unambiguous?
Issue #4 – If a string can be specified with one
encoding and its terminator with a different encoding, is it possible for the
terminator byte sequence to also be a valid string byte sequence, even though
the characters being represented are different? I haven’t been able to
determine systematically whether this can happen, but the well-known Shift_JIS
trail-byte overlap (sketched below) looks like one instance.
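In Shift_JIS the second byte of many double-byte characters falls in the ASCII
range. A small demonstration, assuming the terminator is specified in US-ASCII
while the string data is Shift_JIS:

    import java.nio.charset.Charset;

    public class TrailByteDemo {
        public static void main(String[] args) {
            // Shift_JIS encodes U+8868 as 0x95 0x5C; 0x5C is also the
            // US-ASCII byte for '\'. A byte-level scan for an ASCII '\'
            // terminator would fire inside this character.
            byte[] data = "\u8868".getBytes(Charset.forName("Shift_JIS"));
            byte[] term = "\\".getBytes(Charset.forName("US-ASCII"));
            System.out.printf("data = %02X %02X, terminator = %02X%n",
                              data[0], data[1], term[0]);
        }
    }

This prints data = 95 5C, terminator = 5C, so a naive byte scan for the
terminator matches in the middle of a character.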
Q – Does the DFDL spec need to disallow different
encodings for strings and terminators for version 1? Or are you confident that
this corner case is unlikely to be an issue?
I have been in contact with Addison Phillips, the current chair
of the W3C Internationalization Core WG, and he ran into many of the above
issues when implementing character set providers for webMethods (since
acquired by Software AG). He also referred me to a contact at ICU, and I hope
to hear from them in the next week or two.
Meanwhile, any thoughts or suggestions you have on the above
would be appreciated.
While I am waiting for feedback from ICU and Addison, I am
trying to determine an effective way to set up an automated test harness that
can generate different combinations of strings, terminators and encodings and
perform volume testing. Mike suggested using the test example he provided, but
it only showed one data string for input. That might be adequate for simple
tests, but because test cases may need to be shared by multiple test XSD
files, it may not scale to volume testing or to testing multiple cases.
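What I have in mind is roughly the following: enumerate the cross product of
sample strings, terminators and encodings, generate the raw bytes for each
combination, and feed each one to the parser. All names and sample values here
are placeholders:

    import java.io.ByteArrayOutputStream;
    import java.io.IOException;
    import java.nio.charset.Charset;

    public class HarnessSketch {
        public static void main(String[] args) throws IOException {
            String[] strings     = { "abc", "END", "\u8868" };
            String[] terminators = { ";", "END" };
            String[] encodings   = { "US-ASCII", "UTF-8", "UTF-16BE", "Shift_JIS" };

            for (String enc : encodings) {
                Charset cs = Charset.forName(enc);
                for (String s : strings) {
                    for (String t : terminators) {
                        // Skip combinations the charset cannot represent.
                        if (!cs.newEncoder().canEncode(s + t)) continue;
                        ByteArrayOutputStream data = new ByteArrayOutputStream();
                        data.write(s.getBytes(cs));
                        data.write(t.getBytes(cs));
                        // data.toByteArray() is one generated test input;
                        // feed it to the parser and compare the recovered
                        // string against s.
                    }
                }
            }
        }
    }

Extending this with a separate terminator encoding would also cover the Issue
#4 combinations.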
Rick