In analyzing the valueLength and contentLength email discussion threads, we have not converged on whether these functions, when measuring in units of 'characters', are allowed to compute the length without checking for decode errors when (a) the encoding is fixed width, so a character unit is just an alias for some number of bytes, and (b) encodingErrorPolicy is 'error'.
I think we need to clarify that just because encodingErrorPolicy is 'error' doesn't mean all data in scope of that property will be scanned to be sure there are no decode errors.
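Just to make that concrete (a hypothetical Java sketch, not any particular implementation's code; the class name and the use of Java's charset API are only illustrative): the fixed-width shortcut is just dividing a byte count by the code-unit width, while a variable-width encoding forces an actual decode, which is where an error analogous to encodingErrorPolicy='error' can surface.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: how a processor might compute a length in 'characters'.
    final class CharLengthSketch {

        // Fixed-width encoding: the character count is just the byte count
        // divided by the code-unit width. Nothing is decoded, so no decode
        // error can be detected here.
        static long fixedWidthLength(int byteCount, int bytesPerChar) {
            return byteCount / bytesPerChar;
        }

        // Variable-width encoding: the bytes must actually be decoded, and
        // reporting errors (the analogue of encodingErrorPolicy='error')
        // turns a malformed sequence into an exception.
        static long decodedLength(byte[] data, Charset cs) throws CharacterCodingException {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(data))
                     .length();
        }

        public static void main(String[] args) throws Exception {
            byte[] bad = { 'a', (byte) 0xFF, 'b' };              // 0xFF is malformed UTF-8
            System.out.println(fixedWidthLength(bad.length, 1)); // 3, error never noticed
            System.out.println(decodedLength(bad, StandardCharsets.UTF_8)); // throws
        }
    }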
Another feature with a similar issue is regex pattern asserts. In that case the regex is matching against text, and the match may or may not encounter a decode error, but the entire scope of data the pattern applies to is NOT going to be decoded just to ensure there is no chance of a decode error.
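For instance (again only a hypothetical Java sketch, not a statement about any real processor), a pattern match can be fed characters on demand, so a malformed byte beyond the point where the match is decided is simply never decoded:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    // Sketch only: decode just enough characters for the pattern to match.
    public class LazyPatternSketch {
        public static void main(String[] args) {
            byte[] data = { 'A', 'B', 'C', (byte) 0xFF, 'D' }; // 0xFF is malformed UTF-8
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            ByteBuffer in = ByteBuffer.wrap(data);
            CharBuffer out = CharBuffer.allocate(3);  // only as many chars as the match needs
            dec.decode(in, out, false);               // stops with OVERFLOW after 'C'
            out.flip();
            // The malformed 0xFF byte was never decoded, so no error was raised,
            // even though error reporting (the 'error' policy analogue) is in force.
            System.out.println(Pattern.matches("ABC", out)); // true
        }
    }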
Can
we go so far as to say DFDL implementations can (or maybe even must)
optimize performance by avoiding character decoding when possible? This
means that some character decode errors may not be detected even though
dfdl:encodingErrorPolicy is 'error'.
I would suggest the language should say that only character decoding that results in a character being placed into the DFDL Infoset is guaranteed to cause an error should that character not be decodable. (Similarly for unparsing: it is only if we actually unparse an unmappable character from the infoset to the output stream that an encoding error is guaranteed to occur.)
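A sketch of that unparse side (hypothetical Java again; the string and encoder choice are just an assumed example): merely asking about the value's length in characters touches no encoder, while actually encoding it to the output stream is what raises the error.

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: the error is tied to actually encoding the character.
    public class UnparseSideSketch {
        public static void main(String[] args) {
            String infosetValue = "caf\u00E9";          // U+00E9 has no US-ASCII mapping

            // Measuring the value's length in characters needs no encoder,
            // so no encoding error can possibly be raised here.
            System.out.println(infosetValue.length());  // 4

            // Actually unparsing the value, with errors reported, does fail.
            try {
                StandardCharsets.US_ASCII.newEncoder()
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .encode(CharBuffer.wrap(infosetValue));
            } catch (CharacterCodingException e) {
                System.out.println("encoding error: " + e);
            }
        }
    }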
I might not have worded it well, but I think the above is what we're trying to allow: implementations are free to exploit fixed character width, just jumping around the bytes and not decoding/encoding anything, whenever they can, because we all expect, and think our users will expect, that level of performance.
If you use UTF-8, certainly a common thing (or any other variable-width encoding), then you are likely to get some cases where an implementation says "can't do that with UTF-8" because it's just a limitation of that implementation. One may also get cases where switching your data from UTF-16 to UTF-8 changes the behavior of the processor, because UTF-16 wouldn't detect some decode errors thanks to its fixed width, whereas UTF-8 has to measure length in characters by decoding and so will detect the error.
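To make that difference concrete, here is one more hypothetical sketch (the byte values are an assumed example: an unpaired surrogate, ill-formed in both encodings). Treating UTF-16 as fixed width never notices the problem, while measuring the UTF-8 form in characters requires a decode that reports it.

    import java.nio.ByteBuffer;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: same ill-formed content, different detection behavior.
    public class WidthDifferenceSketch {
        public static void main(String[] args) throws Exception {
            // An unpaired high surrogate (U+D800), ill-formed in both encodings.
            byte[] utf16 = { (byte) 0xD8, 0x00 };                     // UTF-16BE form
            byte[] utf8  = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 }; // UTF-8 form

            // Fixed-width shortcut for UTF-16: 2 bytes / 2 bytes per code unit
            // = 1 character, and the broken surrogate is never noticed.
            System.out.println(utf16.length / 2); // 1

            // UTF-8 is variable width, so length in characters means decoding,
            // and with errors reported the same content now fails.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8)); // throws MalformedInputException
        }
    }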
Basically, we've tried to take a consistent position on a messy area: the schema contains a complex type element whose content is a mixture of character encodings and binary data, and the corresponding data stream may contain decode errors (or, for unparsing, the infoset may contain characters that have no mapping in the representation's encoding).
Given this mess, there are ways that a schema can look at it and, perhaps foolishly, treat it as characters. Asserts with test patterns are one. Specified length with units of 'characters' is another, and the dfdl:contentLength and dfdl:valueLength functions are yet another, since one can specify the units 'characters' as their second argument.
The only consistent position is that taking the dfdl:contentLength or dfdl:valueLength of an element with units 'characters' does NOT necessarily imply those characters will be decoded/encoded.
Anyway, it's moot in the scenario where an earlier OVC (outputValueCalc) wants the dfdl:contentLength of something later, if, when we unparse the later thing, we get the decode error at that point anyway. We're just getting what is arguably an incorrect OVC computation, followed by a later decode error.
Comments?