
The way I view physical rep information is as functions that can be applied to types and fields. Writing the data out to a blocked/segmented format does not fall into this category. It is an orthogonal operation that applies to the data as a whole, and as such is much more akin to encryption and compression. For example, I have a COBOL structure that ends up in an MQSeries queue and in a QSAM file. It has a logical structure and a physical representation. In the QSAM case, a further transform has taken place to block/segment the structure. I would not expect to see the physical rep properties of the types and elements change.
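To make the separation concrete, here is a rough Python sketch. It assumes a made-up two-field record and a simplified length-prefixed blocking scheme; this is not the real QSAM VB layout, and the names `serialize_record`, `block` and `unblock` are hypothetical. The point is only that the field-level representation is decided in one place and the blocking is applied afterwards to the whole byte stream.

```python
import struct
from typing import List


def serialize_record(account_id: int, balance: int) -> bytes:
    # Field-level physical rep (assumed for illustration):
    # 4-byte big-endian unsigned int, then 8-byte big-endian signed int.
    return struct.pack(">Iq", account_id, balance)


def block(records: List[bytes], max_block: int) -> bytes:
    # Whole-stream transform: frame each record with a 2-byte length,
    # pack framed records into blocks of at most max_block bytes,
    # and prefix each block with its own 2-byte length.
    blocks = []
    current = b""
    for rec in records:
        framed = struct.pack(">H", len(rec)) + rec
        if current and len(current) + len(framed) > max_block:
            blocks.append(current)
            current = b""
        current += framed
    if current:
        blocks.append(current)
    return b"".join(struct.pack(">H", len(b)) + b for b in blocks)


def unblock(data: bytes) -> List[bytes]:
    # Inverse of block(): strip block and record framing and return
    # the record bytes unchanged.
    records = []
    pos = 0
    while pos < len(data):
        (block_len,) = struct.unpack_from(">H", data, pos)
        pos += 2
        end = pos + block_len
        while pos < end:
            (rec_len,) = struct.unpack_from(">H", data, pos)
            pos += 2
            records.append(data[pos:pos + rec_len])
            pos += rec_len
    return records
```

The same `serialize_record` output could go to a queue as-is or through `block` to a file; in neither case do the field properties change.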
I think we've been talking about DFDL as always going TO the XML schema and have considered the process of going FROM the XML to a new serialization as 'inverse DFDL'. Towards that end, we've discussed being able to mark transforms as invertible and/or allowing an inverse method to be registered as part of the transform definition. We also talked about the potential requirement of having multiple output streams: if I read x and y dimensions and then pixels, but my output XML model is just the pixel sequence, I will need to record x and y somewhere to allow inversion, so the user (or DFDL) might want to specify x and y in some separate 'provenance' file that could be used during inversion. I'm not sure that this is the best model, but I don't think we've come up with a good way to describe going from the XML model except as the inverse of the 'to' process.
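The x/y/pixels case can be sketched in a few lines of Python. Everything here is an assumption made for illustration: the input layout (two 2-byte big-endian dimensions followed by one byte per pixel), the function names `parse_image` and `unparse_image`, and the use of a plain dict to stand in for the separate 'provenance' output.

```python
import struct
from typing import Dict, List, Tuple


def parse_image(raw: bytes) -> Tuple[List[int], Dict[str, int]]:
    # 'To' direction: read x and y, then x*y one-byte pixels.
    # The logical model is just the pixel list; x and y go to provenance.
    x, y = struct.unpack_from(">HH", raw, 0)
    pixels = list(raw[4:4 + x * y])
    if len(pixels) != x * y:
        raise ValueError("input shorter than x*y pixels")
    return pixels, {"x": x, "y": y}


def unparse_image(pixels: List[int], provenance: Dict[str, int]) -> bytes:
    # 'From' direction: the pixel list alone cannot recover x and y
    # (6 pixels could be 3x2, 2x3, 1x6, ...), so read them from provenance.
    x, y = provenance["x"], provenance["y"]
    if len(pixels) != x * y:
        raise ValueError("pixel count does not match recorded dimensions")
    return struct.pack(">HH", x, y) + bytes(pixels)
```

Without the provenance record the inverse is ambiguous, which is the reason for the second output stream.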
Mike's idea of a schema level 'stream' rep property sounds ok in principle for parsing, but what other metadata is needed when serialising? How are we informed of the rules for VB blocking or for IMS segmentation? Are they fixed or user-defined? If these rules end up requiring extra metadata at the type/element level then I am not comfortable with this, because we are mixing two sets of physical information.
I think that whatever principles we apply to DFDL including/excluding encryption and compression we should also apply to these formats. What is the current proposal in this area? The cheapest option would be to provide a flexible user-defined transform capability.
We planned to have a user-defined transform capability that would appear in the same way as DFDL-standard transforms. I think one can easily put something like zip into the same form Alan has used for the basic int-from-ASCII and int-from-binary transforms, as a byte-sequence to byte-sequence transform. I think I'd vote for just including zip since it will be used in a number of formats, but one could imagine a user adding a de-pig-latinizer as needed. (Pig latin and things like run-length encoding are examples we've used to point out that not all compression/encryption-type algorithms will run on the raw input stream. Both of these require some level of parsing before you can use them: to find words, or to get the <value, # of repeats> pairs from the initial bytes.)
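As a rough illustration of both points, here is a Python sketch of a transform registry where a user-defined transform is registered exactly like a built-in one, with an optional inverse. The registry, the `register` helper and the one-byte value / one-byte count RLE layout are all hypothetical; `zlib` stands in for "zip" only because it is a readily available byte-to-byte compressor, not because it is the zip archive format.

```python
import zlib
from typing import Callable, Dict, List, NamedTuple, Optional, Tuple

ByteTransform = Callable[[bytes], bytes]


class Transform(NamedTuple):
    forward: ByteTransform             # applied when parsing (toward the XML model)
    inverse: Optional[ByteTransform]   # applied when serialising, if invertible


REGISTRY: Dict[str, Transform] = {}


def register(name: str, forward: ByteTransform,
             inverse: Optional[ByteTransform] = None) -> None:
    REGISTRY[name] = Transform(forward, inverse)


def parse_rle_pairs(raw: bytes) -> List[Tuple[int, int]]:
    # RLE cannot run on the raw stream blindly: the bytes must first be
    # parsed into <value, # of repeats> pairs.
    if len(raw) % 2:
        raise ValueError("RLE input must be a whole number of pairs")
    return [(raw[i], raw[i + 1]) for i in range(0, len(raw), 2)]


def rle_decode(raw: bytes) -> bytes:
    return b"".join(bytes([value]) * count for value, count in parse_rle_pairs(raw))


def rle_encode(data: bytes) -> bytes:
    out = bytearray()
    i = 0
    while i < len(data):
        run = 1
        # Cap runs at 255 so the count fits in one byte.
        while i + run < len(data) and data[i + run] == data[i] and run < 255:
            run += 1
        out += bytes([data[i], run])
        i += run
    return bytes(out)


# A "standard" transform and a user-defined one are registered the same way.
register("zip", zlib.decompress, zlib.compress)
register("rle", rle_decode, rle_encode)
```

Note that the inverse of "zip" here restores the logical content on a round trip, but recompressing is not guaranteed to reproduce the original compressed bytes exactly.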