This can be summarised by saying that a
performance optimisation in a DFDL implementation must not change a successful
parse into a failure (or vice versa), nor change the DFDL infoset
when the parse is successful. I think that goes without saying, but we could
be explicit and add it to the spec somewhere.
If an implementation pre-decodes
data when parsing, then it needs to be sure that whatever it tries to decode
(a) does not run beyond the end of the data (possible with streaming input),
and (b) is legitimately in that encoding. If an implementation does some
analysis of the schema, and realises that the data will always be entirely
UTF-8 text, then pre-decoding is a possible optimisation. If the data is
a mixture of text and binary, then pre-decoding is not a safe optimisation,
unless there is also a fallback path that the code drops into after a decode
error, in which it does not pre-decode.
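The fast-path-plus-fallback shape described above can be sketched as follows. This is purely illustrative and not from any actual DFDL implementation; the function name and structure are my own invention, using Python's standard incremental decoder:

```python
# Hypothetical sketch of speculative pre-decoding with a fallback.
# Fast path: decode a whole chunk at once. If that hits a decode error
# (e.g. because binary data follows the text), fall back to incremental,
# byte-at-a-time decoding that stops cleanly at the first invalid sequence.
import codecs

def decode_up_to(data: bytes, max_chars: int, encoding: str = "utf-8") -> str:
    """Decode at most max_chars characters, never raising for bytes
    beyond the first invalid sequence."""
    try:
        # Fast path: assume the whole chunk is text in the encoding.
        return data.decode(encoding)[:max_chars]
    except UnicodeDecodeError:
        # Fallback: incremental decode, so a decode error past the
        # characters we actually need is never surfaced.
        dec = codecs.getincrementaldecoder(encoding)(errors="strict")
        out = []
        for b in data:
            try:
                ch = dec.decode(bytes([b]))
            except UnicodeDecodeError:
                break  # stop at the first byte that cannot be text
            if ch:
                out.append(ch)
                if len(out) >= max_chars:
                    break
        return "".join(out)
```

With this shape, text followed by arbitrary binary decodes up to the text boundary without a spurious error, while the all-text case still takes the fast path.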
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM
DFDL
Co-Chair, OGF
DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 08/10/2018 17:16
Subject: [DFDL-WG] clarification on behavior of DFDL encodingErrorPolicy='error' and pre-decoding by implementations
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
The DFDL spec isn't clear on when encodingErrorPolicy
'error' is allowed to cause an error, and when an error must be suppressed,
if the implementation pre-decodes data into characters.
Example:
Suppose you have what turns out to be 8 characters of
text, followed by some binary data.
Suppose a DFDL implementation happens to always try to
fill a buffer of 64 decoded characters, just for efficiency reasons.
Depending on what is in the binary data, this may parse
the 8 characters of text without error, but subsequently hit a decode error,
because it has strayed into binary data past the text.
There is no actual decode error in the data stream, because
parsing should determine there are only 8 characters of text, and then
switch to parsing the binary data using binary means.
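The hazard described above is easy to demonstrate with a plain eager decode. This is illustrative only; the field contents and byte values are made up:

```python
# Illustrative only: 8 bytes of ASCII text followed by binary data that
# is not valid UTF-8. Eagerly decoding a large buffer raises, even though
# the parse only ever needed the first 8 characters.
stream = b"ITEM0042" + b"\x00\x01\xfe\xff\x80\x80"  # text field + binary field

try:
    stream[:64].decode("utf-8")          # eager pre-decode of a 64-byte buffer
    eager_failed = False
except UnicodeDecodeError:
    eager_failed = True                  # spurious error from the binary tail

text_field = stream[:8].decode("utf-8")  # what the parse actually requires
```

The eager decode fails on the binary tail while the 8-character text field itself decodes cleanly, which is exactly the spurious-error scenario in question.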
The DFDL spec doesn't say this isn't allowed to cause
a decode error. Perhaps it is implied somewhere? But I didn't find it.
The DFDL spec does point out that for asserts/discriminators
with testKind 'pattern', pattern matching may cause decode errors. But
again, suppose the regex matching library an implementation uses happens
to pre-fetch and pre-decode a bunch of characters, but then finds a match
that is quite short, stopping well before the pre-decoded characters
that caused a decode error.
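One way an implementation could avoid surfacing such errors is to match only against the portion of the buffer that decodes cleanly. This is a hypothetical sketch (the helper name is my own), using the error position Python reports on a failed decode:

```python
# Hypothetical sketch: run the pattern match against only the cleanly
# decodable prefix of the buffer, so pre-fetched bytes past the eventual
# match can never surface as a decode error.
import re

def match_prefix(pattern: str, data: bytes, encoding: str = "utf-8"):
    try:
        text = data.decode(encoding)
    except UnicodeDecodeError as e:
        # e.start is the offset of the first invalid byte sequence;
        # everything before it is valid text in the encoding.
        text = data[:e.start].decode(encoding)
    return re.match(pattern, text)

# A short match followed by undecodable bytes succeeds without error.
m = match_prefix(r"[A-Z]+\d+", b"ITEM0042" + b"\xfe\xff")
```

Under this approach the decode error past the match is swallowed by construction, which is the behaviour the question suggests the spec should permit (or require).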
It would seem to me that this sort of pre-decoding should
not cause decode errors, but the DFDL spec doesn't state that explicitly.
Comments?
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology
| www.tresys.com
Please note: Contributions to the DFDL Workgroup's email
discussions are subject to the OGF
Intellectual Property Policy