Thanks for the response re encodings and issues. Very helpful. I'm trying to help identify spec issues that may impact implementation efforts, so we are on the same codepage! (Sorry, I used to belong to Mensa.)

The current spec mentions these encodings: UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE. Since Java 1.6 supports 160 encodings using 686 aliases, I've no doubt you see the reason for my question about which encodings need initial support. ICU supports even more encodings, and requiring some of these could implicitly require implementors to support ICU. That is not an issue if it is truly needed, but that requirement alone could dissuade some from participating in the project.

The encodings I have examined or tested so far are: US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047, IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM and X-UTF-32LE-BOM. I have not run across any issues with any of them. ICU includes 175 UCM files, of which 135 are for SBCS encodings. I have not tested or examined all of these but would not expect them to be an issue either. I have also not examined the 27 UCM files for MBCS encodings; a brief review shows that many of them should not be an issue, but BIG5 or GB18030 could definitely be, and there are several others like them that might require a custom effort to support. That is fine if you really need them, but better deferred initially if you don't. I'm glad to know we don't need to visit these in the short term.

I'm sure we want implementers to concentrate on the DFDL aspect of things rather than become encoding experts (which I'm certainly not). But when a business user says "I want to support various encodings", my role as a sanity checker is to make sure I know whether we are talking about 160 encodings, including the oddballs, or a small number of basic ones. Then I can assure my tech manager that we know the scope (and resource) limits. I can't be sure, but I strongly suspect creeping scope issues had a pronounced impact on earlier implementation efforts. I'm certainly not trying to needlessly limit your scope; just trying to make sure we know what it is.

RE #1: creating test cases. I don't agree with your comment that for 'All other cases you don't need to know anything other than the character set width'. To do any robust testing you need to create a set of test cases for each encoding that are specific to that encoding. For UTF-8 you need byte arrays that contain characters encoded in 1, 2, 3 and 4 bytes. For UTF-16 you need strings with and without byte order marks, and characters both from the basic plane and supplementary characters that require surrogate pairs. I don't even know yet to what extent recovery will be possible if a binary file contains invalid string data. There may be certain types of errors that prevent recovery, so that the remaining bytes of the file cannot be processed at all because the processor is permanently out of sync. Also see the 'Test Cases' section at the end, where I discuss the need to share test cases among multiple XSD files.

RE #3: I don't understand your comment 'You can't just search the data for a pattern as you may get a false match on binary data.' If you know where the string starts, and that it is not allowed to include the binary byte sequence that matches the terminator, then why can't you search for the terminator byte sequence from the beginning of the string? There can't be any binary data in the middle of the string, can there?
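To make concrete what I mean by "search from the beginning of the string", here is a rough Java sketch; the method name and the assumption that the terminator byte sequence cannot legally appear inside the string data are mine, not anything from the spec:

    // Scan forward from the first byte of the string for the terminator
    // byte sequence. Assumes the string data itself is not allowed to
    // contain that sequence (my assumption, per the discussion above).
    static int findTerminator(byte[] data, int stringStart, byte[] terminator) {
        for (int i = stringStart; i <= data.length - terminator.length; i++) {
            boolean match = true;
            for (int j = 0; j < terminator.length; j++) {
                if (data[i + j] != terminator[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;  // the string occupies data[stringStart] .. data[i - 1]
            }
        }
        return -1;  // no terminator found before the end of the data
    }

If that assumption does not hold, I can see how a false match on binary data would be possible, which may be the point you were making.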
RE #4: I agree that it is not clear whether using different encodings for a string and its terminator will cause a problem. Is there a specific use case you have that makes you want to include this in version 1? If not, why not defer it to a later version and, for version 1, disallow different encodings?

Testing

There are at least three types of tests that need to be done: validation tests of DFDL schemas, tests of the basic ability to parse and unparse data, and tests of various schema syntax combinations such as scoping (appliesTo) and nested property inheritance. Testing might include specifying byte order or other properties using different scoping, to verify that the proper 'flattened' properties are being used or that data errors are being reported because the wrong properties are being used.

Since DFDL itself is new, there will need to be a set of 'validation' tests that can be used to test sample XSD schemas. These tests would not actually use any data files. I would expect to find areas where property or other annotations either overlap or present conflicting validation issues. Some types of schema validation problems will need to be fixed before some of the more complex data tests can even be conducted.

Creating Test Cases (of course this is my opinion only, but it is what I require and do for unit and integration testing):

1. Here is an initial guesstimate of what you need just for the primitives.
   A. A list of primitives: byte, short, etc., both signed and unsigned, and with each possible byte order.
   B. A list of primitive terminators: other primitives, specified length and end of data. Specified length means that there may be unused 'pad' space after the data item.
   C. A unique test for each primitive and combination by itself: that is, a file consisting of one data element (a small generator sketch follows this section).
   D. Tests for each of the terminating conditions.
   E. Tests for various 'format' definitions. An 'int' might have an implied decimal position or leading/trailing zeros.
   F. Test files that include invalid data, for example three bytes provided for an int when there should be four. These 'invalid' tests are needed to make sure the exception handling and error reporting work properly.
   G. For testing 'parse' functionality you need the basic binary files just described, plus XML test files that contain the expected results of the parse operation.
   H. For testing 'unparse' functionality you need test files in XML infoset format and expected result files in binary format.
2. For strings:
   A. Tests for length-prefixed strings.
   B. Tests for fixed-length strings.
   C. Tests for strings with terminator sequences.
   D. Various combinations of alignment and padding.
3. Arrays and the like:
   A. Tests with various numbers of elements.
   B. Tests with different types of elements.
4. Bit testing.

So one reason a test case is not just 'a combination of data and schema' is the need to perform both 'parse' and 'unparse' testing. Testing for these two directions needs to be done independently. The typical order for testing is: use case, test file for input, result file for output. The input file for the test would be created based on the use case, and you would manually create the expected XML output file. Then you would do a walk-through of the parse using the XSD schema file for reference. Once this looks to be in order, you would run the test using the test harness to verify that the results are as expected. After that the test can be added to the test suite and included in automated test runs.
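As an aside on item 1.C above, this is the sort of throwaway Java I have been using to create the single-element primitive files; the file names and value are just examples, and a real generator would cover every primitive type from item 1.A:

    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class MakeIntTestFiles {
        public static void main(String[] args) throws Exception {
            int value = 1234567;  // arbitrary example value
            // One file per byte order, each containing exactly one 4-byte int.
            writeInt("int_big_endian.bin", value, ByteOrder.BIG_ENDIAN);
            writeInt("int_little_endian.bin", value, ByteOrder.LITTLE_ENDIAN);
        }

        static void writeInt(String fileName, int value, ByteOrder order) throws Exception {
            ByteBuffer buf = ByteBuffer.allocate(4).order(order);
            buf.putInt(value);
            FileOutputStream out = new FileOutputStream(fileName);
            try {
                out.write(buf.array());  // the file is just the raw 4 bytes
            } finally {
                out.close();
            }
        }
    }

The expected XML result file for the parse test (item 1.G) and the infoset input for the unparse test (item 1.H) would then be written by hand against the same value.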
It is advantageous to keep test data and files separate from schema files for several reasons. It allows test data to be created independently from schema creation: anyone could be creating test data files even now, before any actual schemas can be written, let alone executed. It also allows simple test cases to be combined. A simple test file containing only an 'int' and a simple test file containing only a 'double' can be concatenated to create a more complex test file. In a lot of cases this lets the more complex test data be created from combinations of simpler test files. All of these tests can then be inventoried and controlled a lot more easily if they are distinct entities.

Even for a single DFDL XSD schema file you need to conduct both 'parse' and 'unparse' tests, and for each test you need to provide proper input data as well as expected output data. This is necessary in order to perform automated testing and test reporting. The set of tests needed for the complex types (length-prefixed strings, terminated strings, fixed-width strings, various padding combinations and the like) will be similar. Then there will be combination tests.

Some test cases will be used by multiple schema files. One schema might provide a default value for a required field (with a test file that is missing the required field) and another might have no default value. The first should parse fine and the second should give an error of some sort. Another reason some test cases need to be shared by multiple XSD files is for thorough testing of files with strings that may or may not contain BOMs. There will be schema files that specify different property settings but process the same set of test files. A test file containing only ASCII should be able to be processed by a schema file that specifies an encoding of US-ASCII, ISO-8859-1 or UTF-8.

Some UTF encodings allow, but do not require, a byte order mark (BOM) to be present in the byte stream. These must be detected and accounted for in order to properly position the stream pointer to the character after the string itself. Here's a spec twister for you: if an encoding requires a BOM but there is none in the byte stream, is it a parse error? Or, since the XSD writer can specify LITTLE-ENDIAN, should it NOT be an error? On unparse a BOM will be generated if required.
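To illustrate the kind of BOM detection I have in mind (this is only a sketch of the detection step, not a claim about what the spec requires):

    // Inspect the leading bytes of a string field for a Unicode byte order mark
    // and return how many bytes to skip (0 if no BOM is present). The 4-byte
    // marks are checked first so a UTF-32LE BOM is not mistaken for UTF-16LE.
    static int bomLength(byte[] b, int offset) {
        int remaining = b.length - offset;
        if (remaining >= 4 && b[offset] == 0x00 && b[offset + 1] == 0x00
                && b[offset + 2] == (byte) 0xFE && b[offset + 3] == (byte) 0xFF) {
            return 4;  // UTF-32BE
        }
        if (remaining >= 4 && b[offset] == (byte) 0xFF && b[offset + 1] == (byte) 0xFE
                && b[offset + 2] == 0x00 && b[offset + 3] == 0x00) {
            return 4;  // UTF-32LE
        }
        if (remaining >= 3 && b[offset] == (byte) 0xEF && b[offset + 1] == (byte) 0xBB
                && b[offset + 2] == (byte) 0xBF) {
            return 3;  // UTF-8
        }
        if (remaining >= 2 && b[offset] == (byte) 0xFE && b[offset + 1] == (byte) 0xFF) {
            return 2;  // UTF-16BE
        }
        if (remaining >= 2 && b[offset] == (byte) 0xFF && b[offset + 1] == (byte) 0xFE) {
            return 2;  // UTF-16LE
        }
        return 0;  // no BOM
    }

The stream pointer would then be advanced past the returned length before decoding the string, and on unparse the writer would emit the appropriate BOM when the chosen encoding requires one. Whether a missing-but-required BOM is a parse error is exactly the spec question above.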