In the original DFDL 1.0 spec this is what we used to say about lengthKind 'pattern'.

12.3.5.1 Pattern-Based Lengths - Scanability
Any element (complex, simple text, simple
binary) may have a dfdl:lengthKind 'pattern' as long as the bytes in the
content region of the element are legal in the stated encoding
of that element. Where a complex element has children with binary representation,
in practice this means an 8-bit ASCII encoding.
Binary data can be handled by way of treating it as text with encoding='iso-8859-1'.
In this case the text is interpreted as in the iso-8859-1 character encoding,
and the correspondence of byte values in the data to a string in the DFDL
infoset is one to one. That is, byte with value N, produces an infoset
character with character code N.
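
The one-to-one correspondence described above can be illustrated with a short Python sketch. This is only an illustration and not part of the spec: Python's built-in 'iso-8859-1' codec stands in for a DFDL processor's decoder, and the sample record and regex are made up.

```python
import re

# Every possible byte value, 0..255.
data = bytes(range(256))

# Decoding as iso-8859-1 never fails: byte N becomes the character with code N.
text = data.decode("iso-8859-1")
codes = [ord(ch) for ch in text]

# Encoding back gives the original bytes, so nothing is lost.
round_trip = text.encode("iso-8859-1")

# Because of this, a regex can be applied to binary content treated as text.
# Hypothetical pattern: everything up to and including the first 0xFF byte.
record = bytes([0x01, 0x80, 0xFF, 0x02]).decode("iso-8859-1")
match = re.match(r"[\x00-\xfe]*\xff", record)
matched_length = len(match.group(0))
```

Here matched_length counts characters, which equals the number of bytes only because the encoding is single-byte.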
This was changed by erratum 3.9 in the 4th revision of the Errata document.
At the time, the same limit was applied to asserts and discriminators as well.
Here is the original errata wording.
3.9. Section 12.3.5.1.
The spec currently allows lengthKind ‘pattern’ to be used when the representation
of the current element, or of a child element, is binary, but imposes restrictions
on the encoding that can be in force. However encoding is not necessarily
examined for binary elements, so this would introduce another reason for
needing encoding.
Change the spec so that lengthKind ‘pattern’ is only applicable to:
- elements of simple type with representation 'text'
- elements of complex type
For an element of complex type:
1. all simple child elements must have representation 'text' and have the same
encoding as the parent complex element, and
2. all complex child elements must themselves follow 1 and 2 (recursively).
Similar wording to apply to dfdl:assert testKind="pattern" in section 7.3.1.
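
The recursive rule for complex types in the erratum can be sketched as a small Python check. This is a sketch under stated assumptions, not how any DFDL processor does it: the Element class and pattern_allowed function are hypothetical, encodings are compared as plain strings, and the rule is read literally (each complex element is compared only with its own simple children).

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Element:
    """Hypothetical, minimal model of a DFDL element declaration."""
    name: str
    encoding: str
    # 'text' or 'binary' for simple elements; None marks a complex element.
    representation: Optional[str] = None
    children: List["Element"] = field(default_factory=list)

    @property
    def is_complex(self) -> bool:
        return self.representation is None


def pattern_allowed(elem: Element) -> bool:
    """Return True if lengthKind 'pattern' is allowed on elem per erratum 3.9."""
    if not elem.is_complex:
        # Simple elements must have representation 'text'.
        return elem.representation == "text"
    for child in elem.children:
        if child.is_complex:
            # Rule 2: complex children must themselves follow rules 1 and 2.
            if not pattern_allowed(child):
                return False
        elif child.representation != "text" or child.encoding != elem.encoding:
            # Rule 1: simple children must be text in the parent's encoding.
            return False
    return True


all_text = Element("rec", "ascii", children=[
    Element("a", "ascii", "text"),
    Element("grp", "ascii", children=[Element("b", "ascii", "text")]),
])
has_binary = Element("rec", "ascii", children=[Element("n", "ascii", "binary")])
mixed_encoding = Element("rec", "ascii", children=[Element("a", "utf-8", "text")])
```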
In the 11th revision of the Errata document, the last sentence was changed to:
Note that the same restrictions do not apply to testKind="pattern" on asserts
and discriminators.
This was done because an assert or discriminator
with testKind 'pattern' is peeking ahead into the data stream from the
start of the representation of the object (element / sequence / choice).
This was recorded by action 190.
Action 190: Clarify rules for assert/discriminator testKind 'pattern' (All)
23/10: Need to be clear on data position and whether it is just for text
representations.
30/10: Closed. To comply with the timing rules being proposed in action 186,
where these things are executed first before a 'format' annotation, the data
position must be the beginning of the representation (note warning useful when
alignment present). As these things can be used on various objects, the only
rule regarding text is that dfdl:encoding must have a value in scope. Errata
taken.
Personally I am happy for DFDL 1.0 to
stick with the current errata, and improve the wording in the testPattern
description.
Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org
Date: 11/07/2013 10:08
Subject: Re: [DFDL-WG] issue: scannable and 'results are not predictable'
Sent by: dfdl-wg-bounces@ogf.org
There was a time when we disallowed
lengthKind='delimited' when representation is 'binary'. Binary data can,
in general, contain any sequence of bytes so it might contain the terminating
markup. In other words it is not guaranteed to be 'scannable'. We relaxed
that rule because we found that there are industry formats out there which
contain non-text delimited fields. In other words, the general rule (binary
data is not scannable) does not always apply in specific formats.
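
A small Python sketch of why binary data is not guaranteed to be scannable, and why a specific format can still be safe. The record layout here (two 16-bit big-endian integers followed by a comma) is invented purely for illustration and is not taken from any industry format.

```python
import struct

DELIMITER = b","  # terminating markup, 0x2C


def build_record(a: int, b: int) -> bytes:
    """Two 16-bit big-endian binary integers followed by the delimiter."""
    return struct.pack(">HH", a, b) + DELIMITER


def scan_length(record: bytes) -> int:
    """Length of the content as a delimiter scan would see it."""
    return record.index(DELIMITER)


# Values whose bytes never equal 0x2C scan correctly.
safe = build_record(1, 2)            # b'\x00\x01\x00\x02,'
safe_length = scan_length(safe)

# 44 == 0x2C == ord(','), so the scan stops inside the first integer.
unsafe = build_record(44, 2)         # b'\x00,\x00\x02,'
unsafe_length = scan_length(unsafe)
```

If a format guarantees that its binary fields never contain the delimiter byte, the scan is reliable even though the schema alone cannot prove it.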
I think that point is relevant to this discussion. Just because the DFDL
properties indicate that the data is not *guaranteed* to be scannable,
that does not mean that the actual data is not scannable. I believe we
should:
- define the term 'scannable'
- acknowledge that when a complex type is not 'scannable' according to
the definition, the data still might be parse-able in a reliable way
- not prohibit the use of lengthKind='pattern' (i.e. not issue an SDE)
just because the element is not 'scannable'.
It may well be appropriate for an implementation to issue a warning when
lengthKind is 'delimited' or 'pattern' and the element's content is not
'scannable'.
regards,
Tim Kimber, DFDL Team,
Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 10/07/2013 18:22
Subject: [DFDL-WG] issue: scannable and 'results are not predictable'
Sent by: dfdl-wg-bounces@ogf.org
I was editing the definition of scannable into the glossary and when I
looked at usage of 'scannable' in testPattern I found this:
In the box for testPattern it says if the data is not scannable "the
results are not predictable".
Is that sufficient?
We can sometimes statically determine that the schema says the data should
all be scannable (e.g., no change of encoding, no binary elements), and
that would rule out one non-predictability. So, if data is non-scannable
in the sense that the schema contains, say, a binary element, we can issue
an SDE if lengthKind is pattern or a testPattern assert is being used.
We could also issue an SDE if runtime-valued encoding properties are used and the
encoding changes inside a scannable context.
Well, I guess testKind pattern asserts/discriminators are an issue because
they may look only at the first part of the data of a complex component,
so they don't require everything to be scannable, only the part the regex
actually examines. So in this case it's user-beware, and if non-scannable
I suppose we could issue a warning.
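
To illustrate the point that a testKind 'pattern' check only depends on the data the regex actually examines, here is a minimal Python sketch. It is an assumption-laden simplification: peek_matches is a hypothetical helper, the whole remainder is decoded eagerly (a real processor would presumably decode only as far as needed), and undecodable bytes are replaced rather than reported.

```python
import re


def peek_matches(data: bytes, position: int, pattern: str, encoding: str) -> bool:
    """Test a regex against the data starting at `position`.

    The position is not advanced: this only peeks ahead, as an assert or
    discriminator would from the start of the representation.
    """
    # Assumption: bytes that are not legal in the encoding become U+FFFD.
    lookahead = data[position:].decode(encoding, errors="replace")
    return re.match(pattern, lookahead) is not None


# A text header followed by bytes that are not legal ASCII.
data = b"xxHDR:" + bytes([0xFF, 0xFE, 0x00, 0x01])

# The regex only examines the header, so the binary tail does not matter.
header_ok = peek_matches(data, 2, r"HDR:", "ascii")

# A regex that reaches into the binary part has nothing reliable to match.
digits_ok = peek_matches(data, 2, r"HDR:\d", "ascii")
```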
But the spec does not say this is an SDE or warning currently. It just
says results are not predictable.
There is also the fact that the data might be broken, i.e., the schema
might say the data is scannable, but at parse time character decode errors
occur. I believe our policy on this is that these cause processing
errors. This really is orthogonal to scannable, which is a property of
a schema component.
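
The distinction between "the schema says it is scannable" and "the data actually decodes" can be sketched in Python as follows. ProcessingError and decode_content are hypothetical names used only for this illustration; the mapping of decode failures to processing errors follows the policy stated above, not a confirmed rule.

```python
class ProcessingError(Exception):
    """Hypothetical stand-in for a DFDL processing error."""


def decode_content(data: bytes, encoding: str) -> str:
    """Decode content that the schema declares to be text in `encoding`.

    The schema can be perfectly scannable and this can still fail at parse
    time, because the failure depends on the data, not on the schema.
    """
    try:
        return data.decode(encoding)
    except UnicodeDecodeError as exc:
        raise ProcessingError(
            f"invalid {encoding} data at byte offset {exc.start}"
        ) from exc


good = decode_content(b"caf\xc3\xa9", "utf-8")  # well-formed UTF-8

try:
    decode_content(b"abc\xff", "utf-8")          # 0xFF is never valid in UTF-8
    broken_raised = False
except ProcessingError:
    broken_raised = True
```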
Comments?
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU