Re: [DFDL-WG] issue: scannable and 'results are not predictable'

11 Jul 2013

There was a time when we disallowed lengthKind='delimited' when 
representation was 'binary'. Binary data can, in general, contain any 
sequence of bytes, so it might contain the terminating markup. In other 
words, it is not guaranteed to be 'scannable'. We relaxed that rule because 
we found that there are industry formats out there which contain non-text 
delimited fields. That is, the general rule (binary data is not 
scannable) does not always apply in specific formats.
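
To make the risk concrete, here is a minimal Python sketch of a delimited scan over a binary field. The helper scan_delimited, the ',' terminator and the 4-byte big-endian integer layout are all assumptions made for illustration; none of them come from the DFDL spec or from any particular industry format.

```python
import struct


def scan_delimited(data: bytes, terminator: bytes) -> bytes:
    """Return the bytes before the first occurrence of the terminator."""
    end = data.find(terminator)
    if end < 0:
        raise ValueError("terminator not found")
    return data[:end]


TERMINATOR = b","  # hypothetical terminating markup, byte value 0x2C

# 7 encodes as 00 00 00 07: no byte collides with the terminator.
safe = struct.pack(">I", 7) + TERMINATOR

# 44 encodes as 00 00 00 2C: the last content byte IS the terminator byte,
# so a naive scan stops one byte early.
unsafe = struct.pack(">I", 44) + TERMINATOR

safe_field = scan_delimited(safe, TERMINATOR)      # all 4 content bytes
unsafe_field = scan_delimited(unsafe, TERMINATOR)  # only 3 content bytes
```

A format that guarantees its binary values never contain 0x2C would scan reliably here, which is the case the relaxed rule is meant to allow.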

I think that point is relevant to this discussion. Just because the DFDL 
properties indicate that the data is not *guaranteed* to be scannable, 
that does not mean that the actual data is not scannable. I believe we 
should:
- define the term 'scannable'
- acknowledge that when a complex type is not 'scannable' according to the 
definition, the data still might be parseable in a reliable way
- not prohibit the use of lengthKind='pattern' (i.e. not issue an SDE) 
just because the element is not 'scannable'.

It may well be appropriate for an implementation to issue a warning when 
lengthKind is 'delimited' or 'pattern' and the element's content is not 
'scannable'.

regards,

Tim Kimber, DFDL Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 37246742

From:   Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:     dfdl-wg@ogf.org, 
Date:   10/07/2013 18:22
Subject:        [DFDL-WG] issue: scannable and 'results are not 
predictable'
Sent by:        dfdl-wg-bounces@ogf.org

I was editing the definition of scannable into the glossary and when I 
looked at usage of 'scannable' in testPattern I found this:

The box for testPattern says that if the data is not scannable, "the 
results are not predictable".

Is that sufficient? 

We can sometimes statically determine that the schema says the data should 
all be scannable (e.g., no change of encoding, no binary elements), and 
that would rule out one source of unpredictability. So, if the data is 
non-scannable in the sense that the schema contains, say, a binary element, 
we can issue an SDE if lengthKind is 'pattern' or a testPattern assert is 
being used.

We could also issue an SDE if runtime-valued encoding properties are used 
and the encoding changes inside a scannable context.
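
As a rough sketch of what such a static check might look like, the Python below walks a toy schema model and collects the reasons a component cannot be guaranteed scannable. The Element class, its fields and the function non_scannable_reasons are hypothetical names invented for this example; they are not taken from the spec or from any DFDL implementation, and whether a non-empty result becomes an SDE or a warning is left open.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Toy stand-in for a schema component (hypothetical model)."""
    name: str
    representation: str = "text"       # "text" or "binary"
    encoding: Optional[str] = "UTF-8"  # None models a runtime-valued encoding
    children: List["Element"] = field(default_factory=list)


def non_scannable_reasons(elem: Element,
                          context_encoding: Optional[str] = None) -> List[str]:
    """Collect every reason the component is not guaranteed scannable."""
    reasons = []
    if elem.representation != "text":
        reasons.append(f"{elem.name}: representation is {elem.representation}")
    if elem.encoding is None:
        reasons.append(f"{elem.name}: encoding is only known at runtime")
    elif context_encoding is not None and elem.encoding != context_encoding:
        reasons.append(
            f"{elem.name}: encoding changes from {context_encoding} "
            f"to {elem.encoding}")
    # Children are checked against the nearest statically known encoding.
    inherited = elem.encoding if elem.encoding is not None else context_encoding
    for child in elem.children:
        reasons.extend(non_scannable_reasons(child, inherited))
    return reasons


record = Element("record", children=[
    Element("id"),
    Element("count", representation="binary"),
    Element("note", encoding="UTF-16"),
])
reasons = non_scannable_reasons(record)
```

An empty list means the schema says everything is scannable; it says nothing about whether the actual data at parse time is well formed.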

Well, I guess testKind pattern asserts/discriminators are an issue because 
they may look only at the first part of the data of a complex component, 
so they don't require everything to be scannable, only the part the regex 
actually examines. So in this case it's user-beware, and if the component 
is non-scannable, I suppose we could issue a warning.
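
A small Python sketch of that distinction follows. It uses a byte-level regex as a simplification (a real DFDL processor would match against decoded characters), and the record layout and the helper matched_length are invented for illustration.

```python
import re

# Hypothetical record: a 4-character text header followed by binary bytes.
record = b"HDR1" + bytes([0x00, 0xFF, 0x2C, 0x80])


def matched_length(pattern: bytes, data: bytes) -> int:
    """Return how many bytes the regex consumes from the start, or -1."""
    m = re.match(pattern, data, re.DOTALL)
    return m.end() if m else -1


# Examines only the text header, so the binary tail is irrelevant.
header_only = matched_length(rb"HDR\d", record)

# Runs on into the binary bytes, where the result depends on whatever
# those bytes happen to be.
whole_record = matched_length(rb"HDR\d.*", record)
```

Here header_only is 4 and whole_record is 8. Only the second pattern needs the whole component to be scannable, which is why a blanket SDE would be too strict for asserts and discriminators.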

But the spec currently does not say whether this is an SDE or a warning. 
It just says results are not predictable.

There is also the fact that the data might be broken, i.e., the schema 
might say the data is scannable, but at parse time character decode errors 
occur.  I believe our policy on this is that these cause processing 
errors. This really is orthogonal to scannable, which is a property of a 
schema component. 
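
For illustration only, the parse-time case could be sketched in Python as below. ProcessingError and decode_text_field are hypothetical names, and strict decoding is an assumption about the policy described above, not a statement of what the spec requires.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a parse-time processing error."""


def decode_text_field(raw: bytes, encoding: str = "utf-8") -> str:
    """Decode a text field strictly; broken data raises ProcessingError."""
    try:
        return raw.decode(encoding, errors="strict")
    except UnicodeDecodeError as exc:
        raise ProcessingError(
            f"cannot decode byte at offset {exc.start} as {encoding}"
        ) from exc


good = decode_text_field(b"abc")

try:
    decode_text_field(b"ab\xffcd")  # 0xFF is never valid in UTF-8
    outcome = "decoded"
except ProcessingError as err:
    outcome = str(err)
```

The error depends only on the bytes seen at parse time. No static analysis of the schema could predict it, which is the sense in which it is orthogonal to scannability.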

Comments?

-- 
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com