Thanks for the response re encodings and issues. Very helpful. I'm trying to help identify spec issues that may impact implementation efforts, so we are on the same codepage! (Sorry, I used to belong to Mensa.)

The current spec mentions these encodings: UTF-8, ebcdic-cp-us (IBM037), and UTF-16BE. Since Java 1.6 supports 160 encodings using 686 aliases, I've no doubt you see the reason for my question about which encodings need initial support. ICU supports even more encodings, and requiring some of these could implicitly require implementors to support ICU. That is not an issue if it is truly needed, but that requirement alone could dissuade some from participating in the project.

The encodings I have examined or tested so far are: US-ASCII, ISO-8859-1, UTF-8, UTF-16, UTF-16BE, UTF-16LE, UTF-32, UTF-32BE, UTF-32LE, IBM1047, IBM500, IBM037, x-UTF-16LE-BOM, X-UTF-32BE-BOM and X-UTF-32LE-BOM. I have not run across any issues with any of them. ICU includes 175 UCM files, of which 135 are for SBCS encodings. I have not tested or examined all of these but would not expect them to be an issue either. I have also not examined the 27 UCM files for MBCS encodings; a brief review shows that many of them should not be an issue, but BIG5 or GB18030 could definitely be, and there are several others like them that might require a custom effort to support. That is fine if you really need them, but better deferred initially if you don't. I'm glad to know we don't need to visit these in the short term.

I'm sure we want implementers to concentrate on the DFDL aspect of things rather than become encoding experts (which I'm certainly not). But when a business user says "I want to support various encodings", my role as a sanity checker is to make sure I know whether we are talking about 160 encodings, including the oddballs, or a small number of basic ones. Then I can assure my tech manager that we know the scope (and resource) limits. I can't be sure, but I strongly suspect creeping scope issues had a pronounced impact on earlier implementation efforts. I'm certainly not trying to needlessly limit your scope; just trying to make sure we know what it is.

RE #1: creating test cases. I don't agree with your comment that for 'All other cases you don't need to know anything other than the character set width'. To do any robust testing you need to create a set of test cases for each encoding that are specific to that encoding. For UTF-8 you need byte arrays that contain characters encoded in 1, 2, 3 and 4 bytes. For UTF-16 you need strings with and without byte order marks, and characters both from the basic plane and supplementary characters that require surrogate pairs. I don't even know yet to what extent recovery will be possible if a binary file contains invalid string data. There may be certain types of errors that prevent recovery, so that the remaining bytes of the file cannot be processed at all because the processor is permanently out of sync. Also see the 'Test Cases' section at the end, where I discuss the need to share test cases among multiple XSD files.

RE #3: I don't understand your comment 'You can't just search the data for a pattern as you may get a false match on binary data.' If you know where the string starts, and that it is not allowed to include the binary byte sequence that matches the terminator, then why can't you search for the terminator byte sequence from the beginning of the string? There can't be any binary data in the middle of the string, can there?
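To make concrete what I mean by "search from the beginning of the string", here is a rough Java sketch; the method name and the assumption that the terminator byte sequence cannot legally appear inside the string data are mine, not anything from the spec:

    // Scan forward from the first byte of the string for the terminator
    // byte sequence. Assumes the string data itself is not allowed to
    // contain that sequence (my assumption, per the discussion above).
    static int findTerminator(byte[] data, int stringStart, byte[] terminator) {
        for (int i = stringStart; i <= data.length - terminator.length; i++) {
            boolean match = true;
            for (int j = 0; j < terminator.length; j++) {
                if (data[i + j] != terminator[j]) {
                    match = false;
                    break;
                }
            }
            if (match) {
                return i;  // the string occupies data[stringStart] .. data[i - 1]
            }
        }
        return -1;  // no terminator found before the end of the data
    }

If that assumption does not hold, I can see how a false match on binary data would be possible, which may be the point you were making.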
RE #4: I agree that it is not clear whether using different encodings for a string and its terminator will cause a problem. Is there a specific use case you have that makes you want to include this in version 1? If not, why not defer it to a later version and, for version 1, disallow different encodings?

Testing

There are at least three types of tests that need to be done: validation tests of DFDL schemas, tests of the basic ability to parse and unparse data, and tests of various schema syntax combinations such as scoping (appliesTo) and nested property inheritance. Testing might include specifying byte order or other properties using different scoping, to verify that the proper 'flattened' properties are being used or that data errors are being reported because the wrong properties are being used.

Since DFDL itself is new, there will need to be a set of 'validation' tests that can be used to test sample XSD schemas. These tests would not actually use any data files. I would expect to find areas where property or other annotations either overlap or present conflicting validation issues. Some types of schema validation problems will need to be fixed before some of the more complex data tests can even be conducted.

Creating Test Cases (of course this is my opinion only, but it is what I require and do for unit and integration testing):

1. Here is an initial guesstimate of what you need just for the primitives.
   A. A list of primitives: byte, short, etc., both signed and unsigned, and with each possible byte order.
   B. A list of primitive terminators: other primitives, specified length and end of data. Specified length means that there may be unused 'pad' space after the data item.
   C. A unique test for each primitive and combination by itself: that is, a file consisting of one data element (a small generator sketch follows this section).
   D. Tests for each of the terminating conditions.
   E. Tests for various 'format' definitions. An 'int' might have an implied decimal position or leading/trailing zeros.
   F. Test files that include invalid data, for example three bytes provided for an int when there should be four. These 'invalid' tests are needed to make sure the exception handling and error reporting work properly.
   G. For testing 'parse' functionality you need the basic binary files just described, plus XML test files that contain the expected results of the parse operation.
   H. For testing 'unparse' functionality you need test files in XML infoset format and expected result files in binary format.
2. For strings:
   A. Tests for length-prefixed strings.
   B. Tests for fixed-length strings.
   C. Tests for strings with terminator sequences.
   D. Various combinations of alignment and padding.
3. Arrays and the like:
   A. Tests with various numbers of elements.
   B. Tests with different types of elements.
4. Bit testing.

So one reason a test case is not just 'a combination of data and schema' is the need to perform both 'parse' and 'unparse' testing. Testing for these two directions needs to be done independently. The typical order for testing is: use case, test file for input, result file for output. The input file for the test would be created based on the use case, and you would manually create the expected XML output file. Then you would do a walk-through of the parse using the XSD schema file for reference. Once this looks to be in order, you would run the test using the test harness to verify that the results are as expected. After that the test can be added to the test suite and included in automated test runs.
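As an aside on item 1.C above, this is the sort of throwaway Java I have been using to create the single-element primitive files; the file names and value are just examples, and a real generator would cover every primitive type from item 1.A:

    import java.io.FileOutputStream;
    import java.nio.ByteBuffer;
    import java.nio.ByteOrder;

    public class MakeIntTestFiles {
        public static void main(String[] args) throws Exception {
            int value = 1234567;  // arbitrary example value
            // One file per byte order, each containing exactly one 4-byte int.
            writeInt("int_big_endian.bin", value, ByteOrder.BIG_ENDIAN);
            writeInt("int_little_endian.bin", value, ByteOrder.LITTLE_ENDIAN);
        }

        static void writeInt(String fileName, int value, ByteOrder order) throws Exception {
            ByteBuffer buf = ByteBuffer.allocate(4).order(order);
            buf.putInt(value);
            FileOutputStream out = new FileOutputStream(fileName);
            try {
                out.write(buf.array());  // the file is just the raw 4 bytes
            } finally {
                out.close();
            }
        }
    }

The expected XML result file for the parse test (item 1.G) and the infoset input for the unparse test (item 1.H) would then be written by hand against the same value.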
It is advantageous to keep test data and files separate from schema files for several reasons. It allows test data to be created independently from schema creation: anyone could be creating test data files even now, before any actual schemas can be written, let alone executed. It also allows simple test cases to be combined. A simple test file containing only an 'int' and a simple test file containing only a 'double' can be concatenated to create a more complex test file. In a lot of cases this lets the more complex test data be created from combinations of simpler test files. All of these tests can then be inventoried and controlled a lot more easily if they are distinct entities.

Even for a single DFDL XSD schema file you need to conduct both 'parse' and 'unparse' tests, and for each test you need to provide proper input data as well as expected output data. This is necessary in order to perform automated testing and test reporting. The set of tests needed for the complex types (length-prefixed strings, terminated strings, fixed-width strings, various padding combinations and the like) will be similar. Then there will be combination tests.

Some test cases will be used by multiple schema files. One schema might provide a default value for a required field (with a test file that is missing the required field) and another might have no default value. The first should parse fine and the second should give an error of some sort. Another reason some test cases need to be shared by multiple XSD files is for thorough testing of files with strings that may or may not contain BOMs. There will be schema files that specify different property settings but process the same set of test files. A test file containing only ASCII should be able to be processed by a schema file that specifies an encoding of US-ASCII, ISO-8859-1 or UTF-8.

Some UTF encodings allow, but do not require, a byte order mark (BOM) to be present in the byte stream. These must be detected and accounted for in order to properly position the stream pointer to the character after the string itself. Here's a spec twister for you: if an encoding requires a BOM but there is none in the byte stream, is it a parse error? Or, since the XSD writer can specify LITTLE-ENDIAN, should it NOT be an error? On unparse a BOM will be generated if required.
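To illustrate the kind of BOM detection I have in mind (this is only a sketch of the detection step, not a claim about what the spec requires):

    // Inspect the leading bytes of a string field for a Unicode byte order mark
    // and return how many bytes to skip (0 if no BOM is present). The 4-byte
    // marks are checked first so a UTF-32LE BOM is not mistaken for UTF-16LE.
    static int bomLength(byte[] b, int offset) {
        int remaining = b.length - offset;
        if (remaining >= 4 && b[offset] == 0x00 && b[offset + 1] == 0x00
                && b[offset + 2] == (byte) 0xFE && b[offset + 3] == (byte) 0xFF) {
            return 4;  // UTF-32BE
        }
        if (remaining >= 4 && b[offset] == (byte) 0xFF && b[offset + 1] == (byte) 0xFE
                && b[offset + 2] == 0x00 && b[offset + 3] == 0x00) {
            return 4;  // UTF-32LE
        }
        if (remaining >= 3 && b[offset] == (byte) 0xEF && b[offset + 1] == (byte) 0xBB
                && b[offset + 2] == (byte) 0xBF) {
            return 3;  // UTF-8
        }
        if (remaining >= 2 && b[offset] == (byte) 0xFE && b[offset + 1] == (byte) 0xFF) {
            return 2;  // UTF-16BE
        }
        if (remaining >= 2 && b[offset] == (byte) 0xFF && b[offset + 1] == (byte) 0xFE) {
            return 2;  // UTF-16LE
        }
        return 0;  // no BOM
    }

The stream pointer would then be advanced past the returned length before decoding the string, and on unparse the writer would emit the appropriate BOM when the chosen encoding requires one. Whether a missing-but-required BOM is a parse error is exactly the spec question above.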