[DFDL-WG] proposed wording: scannable and 'results are not predictable' improvement - was Fwd: issue: scannable and 'results are not predictable'

23 Jul 2013

The upshot of this whole thread is that we need to fix the description of
testPattern for asserts/discriminators. That motivates another correction
in the prose description of encodingErrorPolicy.

Proposed rewording:

This paragraph in the testPattern description for asserts/discriminators:

   - In order for a testPattern to be used, the data subject to the pattern
   must be scannable using a DFDL regular expression otherwise the results are
   not predictable.

Change to:

   - In order for a testPattern to be used, the data subject to the pattern
   must be scannable using a DFDL regular expression. If the pattern regular
   expression reads data that cannot be decoded into characters of the current
   encoding, then the behavior is controlled by the dfdl:encodingErrorPolicy
   property. See Section 11.2.1, Property dfdl:encodingErrorPolicy, for
   details.
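
As a rough illustration of the two outcomes the reworded paragraph points to, the Python sketch below decodes bytes that are not valid UTF-8. Python's errors="strict" and errors="replace" are used here only as analogies for encodingErrorPolicy 'error' and 'replace'; this is not DFDL, Python's `re` module stands in for a DFDL regular expression, and the sample bytes are made up.

```python
import re

data = b"AB\xffCD"  # made-up bytes; 0xFF is never valid in UTF-8

# Analogy for encodingErrorPolicy 'error': decoding fails outright.
try:
    data.decode("utf-8", errors="strict")
    strict_failed = False
except UnicodeDecodeError:
    strict_failed = True

# Analogy for encodingErrorPolicy 'replace': the bad byte becomes U+FFFD.
replaced = data.decode("utf-8", errors="replace")

# A pattern applied to the replaced text simply stops before U+FFFD here,
# because U+FFFD is not in the class [A-Z].
prefix = re.match(r"[A-Z]+", replaced).group(0)
```

In both cases the outcome is defined by the decoding policy rather than left unpredictable, which is the point of the proposed wording.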

In addition, consider this paragraph in section 11.2.1.2:

   - The Unicode Replacement Character must not appear in any delimiter, pad
   character, nil value, regular expression, number pattern or calendar
   pattern, or in any other DFDL property value where the Unicode Replacement
   Character would be expected in the data being parsed. It is a schema
   definition error if the Unicode Replacement Character appears in any of
   these locations of a DFDL schema, or is part of the value of an expression
   that returns a string to be used as the value of a DFDL property.

I believe the above paragraph is a mistake. It precludes a very useful
technique, which is to use a negated character class in the regex, such as
[^\uFFFD]. This regex matches any character except the Unicode replacement
character. I suggest the above paragraph be dropped.

This sentence (same section) can be modified:

   - Schema authors are advised that bounded length regular expressions can
   help in this case. E.g., ".{0,50}" says to match any character (including
   Unicode Replacement Characters), but only up to length 50.

Change to:

   - Schema authors are advised that bounded length regular expressions and
   negated character classes can improve the schema. E.g., "[^\uFFFD].{0,50}"
   says to match any character (excluding specifically Unicode
   Replacement Characters U+FFFD), but only up to length 50.
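
To make the effect of the negated character class concrete, here is a small Python sketch comparing three patterns on a made-up string that contains U+FFFD. Python's `re` module is only a stand-in for DFDL regular expressions, and the sample text is an assumption for illustration.

```python
import re

text = "abc\ufffddef"  # sample text with a replacement character at index 3

# Bounded "any character" pattern: U+FFFD is matched like any other character.
any_char = re.match(r".{0,50}", text).group(0)

# Negated class applied to every character: the match stops before U+FFFD.
no_replacement = re.match(r"[^\uFFFD]{0,50}", text).group(0)

# Negated class on the first character only, followed by "any character".
first_only = re.match(r"[^\uFFFD].{0,50}", text).group(0)
```

In this sketch, "[^\uFFFD].{0,50}" restricts only the first character, so the match still runs past the replacement character. A pattern along the lines of "[^\uFFFD]{0,50}" is one way to exclude U+FFFD throughout, if that is the intent of the wording.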

---------- Forwarded message ----------
From: Steve Hanson <smh@uk.ibm.com>
Date: Tue, Jul 23, 2013 at 9:02 AM
Subject: Re: [DFDL-WG] issue: scannable and 'results are not predictable'
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Tim Kimber <KIMBERT@uk.ibm.com
...
Mike, would you like to attempt some words to improve the 'results not
predictable' sentence?

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:        Steve Hanson/UK/IBM@IBMGB,
Cc:        Tim Kimber/UK/IBM@IBMGB, dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date:        11/07/2013 18:18
Subject:        Re: [DFDL-WG] issue: scannable and 'results are not
predictable'
------------------------------

OK, let me summarize. I think this is clear now:

lengthKind 'pattern' requires everything to be statically known to be text.
Binary data can be faked as iso-8859-1 text, since that encoding accepts all
byte values.

A pattern assert just tries to decode in the current encoding. Binary data
might or might not cause decode errors, depending on what is in the actual
data. Use encoding iso-8859-1 to preclude this possibility; otherwise it is
user-beware/know-the-data.

The behavior of decode errors is controlled by encodingErrorPolicy, which
clearly states that if the policy is 'error' then a processing error is
issued. It specifically states that this applies in all situations
including lengthKind pattern, and pattern asserts. The description does not
leave any wiggle room here.

(There's a separate email thread on asserts with expressions that get
errors but it does not yet discuss pattern asserts.)

So the wording that says "results are not predictable" should instead
explain and provide a reference to the description of encodingErrorPolicy.

On Thu, Jul 11, 2013 at 5:57 AM, Steve Hanson <smh@uk.ibm.com> wrote:
In the original DFDL 1.0 spec this is what we used to say about lengthKind
'pattern'.
   12.3.5.1 Pattern-Based Lengths - Scanability

   Any element (complex, simple text, simple binary) may have a
   dfdl:lengthKind 'pattern' as long as the bytes in the content region of the
   element are legal in the stated encoding of that element. Where a complex
   element has children with binary representation in practice this means an
   8-bit ASCII encoding.

   Binary data can be handled by way of treating it as text with
   encoding='iso-8859-1'. In this case the text is interpreted as in the
   iso-8859-1 character encoding, and the correspondence of byte values in the
   data to a string in the DFDL infoset is one to one. That is, byte with
   value N, produces an infoset character with character code N.
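
The one-to-one correspondence described above can be checked with a short Python sketch. It uses Python's built-in iso-8859-1 codec and is only an illustration of the byte-to-character mapping, not of any DFDL processor.

```python
# Every possible byte value, 0 through 255.
all_bytes = bytes(range(256))

# Decoding as iso-8859-1 never fails: each byte maps to exactly one character.
text = all_bytes.decode("iso-8859-1")

# The character code of each decoded character equals the original byte value.
codes = [ord(ch) for ch in text]

# Encoding back gives the original bytes, so no information is lost.
round_trip = text.encode("iso-8859-1")
```

Because no byte sequence is illegal in this encoding, a regular expression applied to such text cannot hit a decode error.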

This was changed by errata 3.9 back in the 4th revision of the Errata
document. At the time, the same limit was applied to asserts &
discriminators as well. Here is the original errata wording.
   3.9. Section 12.3.5.1. The spec currently allows lengthKind ‘pattern’ to
   be used when the representation of the current element, or of a child
   element, is binary, but imposes restrictions on the encoding that can be in
   force. However encoding is not necessarily examined for binary elements, so
   this would introduce another reason for needing encoding.

   Change the spec so that lengthKind ‘pattern’ is only applicable to
      - elements of simple type with representation 'text'
      - elements of complex type

   For an element of complex type:
   1. all simple child elements must have representation 'text' and
      have the same encoding as the parent complex element, and
   2. all complex child elements must themselves follow 1 and 2
      (recursively).

   Similar wording to apply to dfdl:assert testKind="pattern" in section 7.3.1.
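
The recursive rule in errata 3.9 can be sketched as a small static check. The `Element` class and both function names below are hypothetical, invented for this illustration; they are not part of any DFDL implementation, and the sketch compares each simple child only with its immediate parent, which is one reading of rule 1.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical, simplified model of a schema element."""
    name: str
    encoding: str
    representation: str = "text"                # only meaningful for simple types
    children: Optional[List["Element"]] = None  # None means simple type

    @property
    def is_complex(self) -> bool:
        return self.children is not None


def children_allow_pattern(parent: Element) -> bool:
    """Apply rules 1 and 2 of errata 3.9 to the children of a complex element."""
    for child in parent.children or []:
        if child.is_complex:
            # Rule 2: complex children must themselves follow rules 1 and 2.
            if not children_allow_pattern(child):
                return False
        elif child.representation != "text" or child.encoding != parent.encoding:
            # Rule 1: simple children must be text with the parent's encoding.
            return False
    return True


def pattern_length_allowed(elem: Element) -> bool:
    """True if lengthKind 'pattern' would be applicable under errata 3.9."""
    if elem.is_complex:
        return children_allow_pattern(elem)
    return elem.representation == "text"
```

For example, a complex element whose nested child contains a binary field fails the check, while the same structure with only text fields in one encoding passes.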

In the 11th revision of the Errata document, the last sentence was changed
to...
   Note that the same restrictions do not apply to testKind="pattern" on
   asserts and discriminators

This was done because an assert or discriminator with testKind 'pattern' is
peeking ahead into the data stream from the start of the representation of
the object (element / sequence / choice). This was recorded by action 190.

   190
   Clarify rules for assert/discriminator testKind 'pattern' (All)
   23/10: Need to be clear on data position and whether it is just for text
   representations.
   30/10: Closed. To comply with the timing rules being proposed in action
   186, where these things are executed first before a 'format' annotation,
   the data position must be the beginning of the representation (note warning
   useful when alignment present). As these things can be used on various
   objects, the only rule regarding text is that dfdl:encoding must have a
   value in scope. Errata taken.

Personally I am happy for DFDL 1.0 to stick with the current errata, and
improve the wording in the testPattern description.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From:        Tim Kimber/UK/IBM@IBMGB
To:        dfdl-wg@ogf.org,
Date:        11/07/2013 10:08
Subject:        Re: [DFDL-WG] issue: scannable and 'results are not
predictable'
Sent by:        dfdl-wg-bounces@ogf.org
------------------------------

There was a time when we disallowed lengthKind='delimited' when
representation is 'binary'. Binary data can, in general, contain any
sequence of bytes so it might contain the terminating markup. In other
words it is not guaranteed to be 'scannable'. We relaxed that rule because
we found that there are industry formats out there which contain non-text
delimited fields. In other words, the general rule (binary data is not
scannable) does not always apply in specific formats.
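
A small Python sketch of why binary data is not guaranteed to be scannable, and why specific data can still be fine. The comma delimiter, the `scan_delimited` helper, and the field values are all made up for illustration.

```python
import struct

DELIM = b","  # assumed terminator for this example (byte 0x2C)


def scan_delimited(stream: bytes) -> bytes:
    """Return the bytes before the first delimiter (hypothetical helper)."""
    idx = stream.find(DELIM)
    return stream if idx < 0 else stream[:idx]


# A 2-byte big-endian integer whose first byte happens to be 0x2C (",").
unlucky = struct.pack(">H", 0x2C01)
# A 2-byte integer that contains no delimiter byte.
lucky = struct.pack(">H", 0x0102)

# The scan stops at the byte inside the field, so the field is cut short.
unlucky_field = scan_delimited(unlucky + DELIM + b"next")
# The scan finds the real terminator and returns the whole field.
lucky_field = scan_delimited(lucky + DELIM + b"next")
```

Whether scanning works therefore depends on the values that actually occur in the data, not only on the declared representation.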

I think that point is relevant to this discussion. Just because the DFDL
properties indicate that the data is not *guaranteed* to be scannable, that
does not mean that the actual data is not scannable. I believe we should:
- define the term 'scannable'
- acknowledge that when a complex type is not 'scannable' according to the
definition, the data still might be parse-able in a reliable way
- not prohibit the use of lengthKind='pattern' (i.e. not issue an SDE)
just because the element is not 'scannable'.

It may well be appropriate for an implementation to issue a warning when
lengthKind is 'delimited' or 'pattern' and the element's content is not
'scannable'.

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From:        Mike Beckerle <mbeckerle.dfdl@gmail.com>
...
To:        dfdl-wg@ogf.org,
Date:        10/07/2013 18:22
Subject:        [DFDL-WG] issue: scannable and 'results are not predictable'
Sent by:        dfdl-wg-bounces@ogf.org
------------------------------

I was editing the definition of scannable into the glossary and when I
looked at usage of 'scannable' in testPattern I found this:

In the box for testPattern it says if the data is not scannable "the
results are not predictable".

Is that sufficient?

We can sometimes statically determine that the schema says the data should
all be scannable (e.g., no change of encoding, no binary elements), and
that would rule out one non-predictability. So, if data is non-scannable
in the sense that the schema contains, say, a binary element, we can issue
an SDE if lengthKind is pattern or a testPattern assert is being used.

We could also SDE if runtime-valued encoding properties are used and the
encoding changes inside a scannable context.

Well, I guess testKind pattern asserts/discriminators are an issue because
they may look only at the first part of the data of a complex component, so
they don't require everything to be scannable, only the part the regex
actually examines. So in this case it's user-beware, and if non-scannable I
suppose we could issue a warning.

But the spec does not say this is an SDE or warning currently. It just says
results are not predictable.

There is also the fact that the data might be broken, i.e., the schema
might say the data is scannable, but at parse time character decode errors
occur.  I believe our policy on this is that these cause processing errors.
This really is orthogonal to scannable, which is a property of a schema
component.

Comments?

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com <http://www.tresys.com/>
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU