In analyzing the valueLength and contentLength email discussion threads, we have not converged on whether these functions, when measuring in units of 'characters', are allowed to compute the length without checking for decode errors when (a) the encoding is fixed width, so a character unit is just an alias for some number of bytes, and (b) encodingErrorPolicy is 'error'.
I think we need to clarify that just because encodingErrorPolicy is 'error' doesn't mean all data in scope of that property will be scanned to be sure there are no decode errors.
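Just to make that concrete (a hypothetical Java sketch, not any particular implementation's code; the class name and the use of Java's charset API are only illustrative): the fixed-width shortcut is just dividing a byte count by the code-unit width, while a variable-width encoding forces an actual decode, which is where an error analogous to encodingErrorPolicy='error' can surface.

    import java.nio.ByteBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.Charset;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: how a processor might compute a length in 'characters'.
    final class CharLengthSketch {

        // Fixed-width encoding: the character count is just the byte count
        // divided by the code-unit width. Nothing is decoded, so no decode
        // error can be detected here.
        static long fixedWidthLength(int byteCount, int bytesPerChar) {
            return byteCount / bytesPerChar;
        }

        // Variable-width encoding: the bytes must actually be decoded, and
        // reporting errors (the analogue of encodingErrorPolicy='error')
        // turns a malformed sequence into an exception.
        static long decodedLength(byte[] data, Charset cs) throws CharacterCodingException {
            return cs.newDecoder()
                     .onMalformedInput(CodingErrorAction.REPORT)
                     .onUnmappableCharacter(CodingErrorAction.REPORT)
                     .decode(ByteBuffer.wrap(data))
                     .length();
        }

        public static void main(String[] args) throws Exception {
            byte[] bad = { 'a', (byte) 0xFF, 'b' };              // 0xFF is malformed UTF-8
            System.out.println(fixedWidthLength(bad.length, 1)); // 3, error never noticed
            System.out.println(decodedLength(bad, StandardCharsets.UTF_8)); // throws
        }
    }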
Another feature with a similar issue is regex pattern asserts. In that case the regex is matching against text, and the match may or may not encounter a decode error, but the entire scope of data the pattern applies to is NOT going to be decoded just to ensure there is no chance of a decode error.
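For instance (again only a hypothetical Java sketch, not a statement about any real processor), a pattern match can be fed characters on demand, so a malformed byte beyond the point where the match is decided is simply never decoded:

    import java.nio.ByteBuffer;
    import java.nio.CharBuffer;
    import java.nio.charset.CharsetDecoder;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;
    import java.util.regex.Pattern;

    // Sketch only: decode just enough characters for the pattern to match.
    public class LazyPatternSketch {
        public static void main(String[] args) {
            byte[] data = { 'A', 'B', 'C', (byte) 0xFF, 'D' }; // 0xFF is malformed UTF-8
            CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT);
            ByteBuffer in = ByteBuffer.wrap(data);
            CharBuffer out = CharBuffer.allocate(3);  // only as many chars as the match needs
            dec.decode(in, out, false);               // stops with OVERFLOW after 'C'
            out.flip();
            // The malformed 0xFF byte was never decoded, so no error was raised,
            // even though error reporting (the 'error' policy analogue) is in force.
            System.out.println(Pattern.matches("ABC", out)); // true
        }
    }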
Can
we go so far as to say DFDL implementations can (or maybe even must)
optimize performance by avoiding character decoding when possible? This
means that some character decode errors may not be detected even though
dfdl:encodingErrorPolicy is 'error'.
I would suggest the language should say that only character decoding that results in a character being placed into the DFDL Infoset is guaranteed to cause an error should that character not be decodable. (Similarly for unparsing: it is only if we actually unparse an unmappable character from the infoset to the output stream that an encoding error is guaranteed to occur.)
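A sketch of that unparse side (hypothetical Java again; the string and encoder choice are just an assumed example): merely asking about the value's length in characters touches no encoder, while actually encoding it to the output stream is what raises the error.

    import java.nio.CharBuffer;
    import java.nio.charset.CharacterCodingException;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: the error is tied to actually encoding the character.
    public class UnparseSideSketch {
        public static void main(String[] args) {
            String infosetValue = "caf\u00E9";          // U+00E9 has no US-ASCII mapping

            // Measuring the value's length in characters needs no encoder,
            // so no encoding error can possibly be raised here.
            System.out.println(infosetValue.length());  // 4

            // Actually unparsing the value, with errors reported, does fail.
            try {
                StandardCharsets.US_ASCII.newEncoder()
                        .onUnmappableCharacter(CodingErrorAction.REPORT)
                        .encode(CharBuffer.wrap(infosetValue));
            } catch (CharacterCodingException e) {
                System.out.println("encoding error: " + e);
            }
        }
    }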
I might not have worded it well, but I think the above is what we're trying to allow: implementations are free to exploit fixed character width, just jumping around the bytes and not decoding/encoding anything, whenever they can, because we all expect, and think our users will expect, that level of performance.
If you use UTF-8, certainly a common thing (or any other variable-width encoding), then you are likely to get some cases where an implementation says "can't do that with UTF-8" because it's just a limitation of that implementation. One may also get cases where switching your data from UTF-16 to UTF-8 changes the behavior of the processor, because UTF-16 wouldn't detect some decode errors thanks to its fixed width, whereas UTF-8 has to measure length in characters by decoding and so will detect the error.
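To make that difference concrete, here is one more hypothetical sketch (the byte values are an assumed example: an unpaired surrogate, ill-formed in both encodings). Treating UTF-16 as fixed width never notices the problem, while measuring the UTF-8 form in characters requires a decode that reports it.

    import java.nio.ByteBuffer;
    import java.nio.charset.CodingErrorAction;
    import java.nio.charset.StandardCharsets;

    // Sketch only: same ill-formed content, different detection behavior.
    public class WidthDifferenceSketch {
        public static void main(String[] args) throws Exception {
            // An unpaired high surrogate (U+D800), ill-formed in both encodings.
            byte[] utf16 = { (byte) 0xD8, 0x00 };                     // UTF-16BE form
            byte[] utf8  = { (byte) 0xED, (byte) 0xA0, (byte) 0x80 }; // UTF-8 form

            // Fixed-width shortcut for UTF-16: 2 bytes / 2 bytes per code unit
            // = 1 character, and the broken surrogate is never noticed.
            System.out.println(utf16.length / 2); // 1

            // UTF-8 is variable width, so length in characters means decoding,
            // and with errors reported the same content now fails.
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(utf8)); // throws MalformedInputException
        }
    }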
Basically, we've tried to take a consistent position on a messy area: the schema contains a complex type element whose content is a mixture of character encodings and binary data, and the corresponding data stream may contain decode errors (or, for unparsing, the infoset may contain characters that have no mapping in the representation's encoding).
Given this mess, there are ways that a schema can look at it and, perhaps foolishly, treat it as characters. Asserts with test patterns are one. Specified length with units of 'characters' is another, and the dfdl:contentLength and dfdl:valueLength functions are yet another, since one can specify the units 'characters' as their second argument.
The only consistent position is that taking the dfdl:contentLength or dfdl:valueLength of an element with units 'characters' does NOT necessarily imply those characters will be decoded/encoded.
Anyway, it's moot in the scenario where an earlier OVC (outputValueCalc) wants the dfdl:contentLength of something later, if, when we unparse the later thing, we get the decode error at that point anyway. We're just getting what is arguably an incorrect OVC computation, followed by a later decode error.
Comments?