This can be summarised by saying that a performance optimisation by a DFDL implementation should not change a successful parse into a failure (or vice versa), nor should it change the DFDL infoset if the parse is successful. I think that goes without saying, but we could be explicit and add it somewhere.

If an implementation is pre-decoding when parsing, then it needs to be sure that whatever it tries to decode a) does not go beyond the end of the data (possible if the input is streaming), and b) is legitimately in that encoding. If an implementation does some analysis of the schema and realises that the data will always be entirely UTF-8 text, then pre-decoding is a possible optimisation. If the data is a mixture of text and binary, then pre-decoding is not a possible optimisation, unless there is also a fallback that the implementation drops into after a decode error, in which it does not pre-decode.
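To make that fallback idea concrete, here is a minimal Java sketch (class and method names are hypothetical, not taken from IBM DFDL or any other implementation): decode a whole buffer eagerly, but record where the first decode error occurred instead of failing at that point.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

/**
 * Sketch of speculative pre-decoding with a deferred error: decode a whole
 * buffer eagerly, but remember where the first decode error occurred rather
 * than failing immediately. (Illustrative only; names are hypothetical.)
 */
final class SpeculativeDecoder {

    /** Characters that decoded cleanly, plus the byte offset of the first
     *  decode error, or -1 if the whole buffer decoded without error. */
    record Decoded(CharBuffer chars, int firstErrorByteOffset) {}

    private final CharsetDecoder decoder;

    SpeculativeDecoder(Charset charset) {
        // REPORT rather than REPLACE so we can see exactly where decoding fails.
        this.decoder = charset.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
    }

    Decoded decodeEagerly(ByteBuffer bytes) {
        CharBuffer out = CharBuffer.allocate(
                (int) Math.ceil(bytes.remaining() * decoder.maxCharsPerByte()));
        CoderResult result = decoder.decode(bytes, out, /* endOfInput = */ false);
        // On a malformed/unmappable sequence, the input buffer's position is
        // left at the first bad byte; everything before it decoded cleanly.
        int errorOffset = result.isError() ? bytes.position() : -1;
        out.flip();
        decoder.reset();
        return new Decoded(out, errorOffset);
    }
}

The parsing layer would then raise a processing error only if it actually needs characters at or beyond firstErrorByteOffset; if the text region ends before that point, the recorded error is discarded and parsing continues (e.g. with binary parsing) from that offset.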

Regards
 
Steve Hanson

IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday




From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: DFDL-WG <dfdl-wg@ogf.org>
Date: 08/10/2018 17:16
Subject: [DFDL-WG] clarification on behavior of DFDL encodingErrorPolicy='error' and pre-decoding by implementations
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>





The DFDL spec isn't clear on when encodingErrorPolicy 'error' is allowed to cause an error, and when one must be suppressed, if the implementation pre-decodes data into characters.

Example:

Suppose you have what turns out to be 8 characters of text, followed by some binary data.

Suppose a DFDL implementation happens to always try to fill a buffer of 64 decoded characters, just for efficiency reasons.

Depending on what is in the binary data, this may decode the 8 characters of text without error but then hit a decode error, because the decoder has strayed into the binary data past the text.

There is no actual decode error in the data stream, because parsing should determine there are only 8 characters of text, and then switch to parsing the binary data using binary means.
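For concreteness, here is a small self-contained Java demo of that scenario (hypothetical, using only the standard java.nio charset API, not any particular DFDL implementation): the eager 64-character fill reports a malformed-input error at byte offset 8, even though the 8 characters of text decode cleanly.

import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

/** Demo: 8 bytes of UTF-8 text followed by binary data that is not valid
 *  UTF-8. An eager decode into a 64-character buffer stops at the first bad
 *  byte, but the 8 real characters are intact. */
public class EagerDecodeDemo {
    public static void main(String[] args) {
        ByteBuffer data = ByteBuffer.allocate(16);
        data.put("8 chars!".getBytes(StandardCharsets.UTF_8));   // the text part
        data.put(new byte[] { (byte) 0xFF, (byte) 0xFE, 0x00, 0x01,
                              (byte) 0xC0, 0x7F, 0x12, 0x34 });  // binary, not UTF-8
        data.flip();

        CharsetDecoder dec = StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        CharBuffer out = CharBuffer.allocate(64);                // eager 64-char fill
        CoderResult r = dec.decode(data, out, false);
        out.flip();

        System.out.println("decoded cleanly: \"" + out + "\"");   // "8 chars!"
        System.out.println("error? " + r.isError()
                + " at byte offset " + data.position());          // error at offset 8
        // Only the first 8 characters were ever part of the text field, so the
        // question is whether the reported error at offset 8 may fail the parse.
    }
}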

The DFDL spec doesn't say this isn't allowed to cause a decode error. Perhaps it is implied somewhere? But I didn't find it.

The DFDL spec does point out that, for asserts/discriminators with testKind 'pattern', pattern matching may cause decode errors. But again, suppose the regex matching library an implementation uses happens to pre-fetch and pre-decode a bunch of characters, yet then finds a match that is quite short and stops well before the pre-decoded characters that caused the decode error.
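One way an implementation could keep such pre-decoding from leaking out as a spurious error is sketched below (illustrative Java only; the helper and its parameters are hypothetical): run the pattern against just the cleanly decoded prefix, and treat a decode error beyond it as relevant only if the regex engine actually needed more characters, which java.util.regex exposes via Matcher.hitEnd().

import java.util.regex.Matcher;
import java.util.regex.Pattern;

/**
 * Illustrative sketch: evaluate a testKind="pattern" assert against only the
 * characters that decoded cleanly. A decode error beyond that prefix becomes
 * a processing error only if the regex engine actually wanted more input.
 */
final class PatternAssertSketch {

    /**
     * @param cleanPrefix            characters that pre-decoded without error
     * @param decodeErrorAfterPrefix true if pre-decoding past the prefix failed
     */
    static boolean patternMatches(String cleanPrefix,
                                  boolean decodeErrorAfterPrefix,
                                  Pattern pattern) {
        Matcher m = pattern.matcher(cleanPrefix);
        boolean matched = m.lookingAt();   // anchored match at the start
        if (m.hitEnd() && decodeErrorAfterPrefix) {
            // The engine tried to read past the clean prefix, and the data
            // there failed to decode: with dfdl:encodingErrorPolicy='error'
            // this would legitimately be a processing error.
            throw new IllegalStateException(
                    "decode error within the region the pattern needed");
        }
        // Otherwise the decode error (if any) lies beyond anything the match
        // examined, so it is ignored and the match result stands.
        return matched;
    }

    public static void main(String[] args) {
        // A short match that never looks at the characters past it.
        System.out.println(
                patternMatches("ABCDEFGH", true, Pattern.compile("[A-Z]{3}")));
        // prints: true
    }
}

Whether suppressing the error in that case is the right reading of the spec is exactly the question; the sketch only shows that it is implementable.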

It would seem to me that this sort of pre-decoding should not cause decode errors, but the DFDL spec doesn't state that explicitly.

Comments?






Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy