[DFDL-WG] Fw: Action 148: pattern based lengths - suggested revised language

20 Sep 2011

      For discussion on today's call...

Regards

Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 20/09/2011 11:01 -----

From:
Steve Hanson/UK/IBM
To:
Tim Kimber/UK/IBM@IBMGB, mbeckerle.dfdl@gmail.com, 
Date:
20/09/2011 10:17
Subject:
Re: Action 148: pattern based lengths - suggested revised language

I'd like to discuss this on the WG call today. I think the conservative 
approach I outline below is consistent with what we do for complex 
elements and specified lengths, and I'd prefer to stick with that for 1.0.

Regards

Steve Hanson

From:
Tim Kimber/UK/IBM
To:
Steve Hanson/UK/IBM@IBMGB
Date:
25/08/2011 16:23
Subject:
Re: Action 148: pattern based lengths - suggested revised language

I think we can afford to be a little less conservative, actually. Let's 
suppose that we allow patterns regardless of dfdl:representation and 
regardless of the encoding. That will provide users with maximum 
flexibility, at the (not very large) risk that they will occasionally do 
something silly. We can put a note into the specification to the effect 
that patterns should usually be used only with character data, but can 
(with care) be used to match bytes if that is the only way to achieve the 
desired result. I may be missing something, but I don't see what harm we 
can cause ourselves or our users by doing this.

My concern is that we could take away a lot of the power that patterns 
provide (particularly for discriminators/asserts) and then end up 
regretting it when some strange format pops up.

regards,

Tim Kimber, Common Transformation Team,
Hursley, UK
Internet:  kimbert@uk.ibm.com
Tel. 01962-816742 
Internal tel. 246742

From:   Steve Hanson/UK/IBM
To:     Tim Kimber/UK/IBM@IBMGB
Date:   25/08/2011 15:00
Subject:        Action 148: pattern based lengths - suggested revised language

Hi Tim

Please could you have a think about my conservative proposal below? 

Firstly, can we get away with restricting patterns to text, or will we 
need to use patterns to grab large amounts of data that may include binary 
content?

Secondly, are we able to apply the same validation criteria to use of 
testKind pattern on an assert or discriminator as we are to use of 
lengthKind pattern?

Regards

Steve Hanson
----- Forwarded by Steve Hanson/UK/IBM on 25/08/2011 14:44 -----

From:
Mike Beckerle <mbeckerle.dfdl@gmail.com>
To:
Steve Hanson/UK/IBM@IBMGB
Date:
27/07/2011 15:00
Subject:
Re: pattern based lengths - suggested revised language

I support what you call the conservative approach. I.e. require text when 
patterns are used.

Mike
-------------------------------------------------------------------------------------------
On Jul 27, 2011 5:53 AM, "Steve Hanson" <smh@uk.ibm.com> wrote:
Hi Mike

I don't think we can reduce the wording that much. The second paragraph is 
needed because it covers the binary case, where encoding is not actually 
used.

I think we either need to be conservative and disallow the combination of 
binary & pattern, or leave the second paragraph as-is and effectively say 
that if you use binary with pattern then that is the behaviour. 

If we are to be conservative then: 
- For a simple element or simple type, disallow lengthKind="pattern" with 
binary rep.

- For a complex element with lengthKind = "pattern", all children must 
have lengthUnits = "characters" (so text only) and the encoding of the 
children must be the same as the encoding of the parent. (We already have 
a similar rule for complex elements with specified length and lengthUnits 
= "characters"). 
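
The two conservative rules above could be checked mechanically. The following is only a minimal sketch in Python: the `Element` class and `check_pattern_rules` function are hypothetical names invented for illustration, not part of DFDL or any implementation, and the sketch assumes "children" means direct children only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical, simplified stand-in for a DFDL element declaration."""
    name: str
    length_kind: str = "implicit"      # models dfdl:lengthKind
    representation: str = "text"       # models dfdl:representation
    length_units: str = "characters"   # models dfdl:lengthUnits
    encoding: str = "utf-8"            # models dfdl:encoding
    children: List["Element"] = field(default_factory=list)

    @property
    def is_complex(self) -> bool:
        # Assumption for this sketch: an element with children is complex.
        return bool(self.children)


def check_pattern_rules(elem: Element) -> List[str]:
    """Collect violations of the conservative proposal for lengthKind 'pattern'."""
    errors: List[str] = []
    if elem.length_kind == "pattern":
        if not elem.is_complex:
            # Rule 1: simple element, no pattern with binary representation.
            if elem.representation == "binary":
                errors.append(f"{elem.name}: lengthKind 'pattern' with binary rep")
        else:
            # Rule 2: children must be character-based and share the parent's encoding.
            for child in elem.children:
                if child.length_units != "characters":
                    errors.append(f"{child.name}: lengthUnits must be 'characters'")
                if child.encoding.lower() != elem.encoding.lower():
                    errors.append(f"{child.name}: encoding differs from parent {elem.name}")
    # Apply the same checks to nested elements.
    for child in elem.children:
        errors.extend(check_pattern_rules(child))
    return errors
```

An empty list means the element tree satisfies both rules; each string in a non-empty list corresponds to what would be a schema definition error under this proposal.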

We also allow asserts and discriminators to carry patterns which are 
applied straight at the current position in the data stream. It would be 
difficult to police the conservative rules here. But we need to say what 
encoding is used and we currently do not. I would say it must be the 
encoding of the element or group that carries the assert/discriminator. 
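
To make the encoding question concrete, here is a rough Python sketch of applying a pattern at the current position. The function name `match_at_position` and its parameters are hypothetical, Python's `re` module merely stands in for DFDL regular expressions, and the sketch assumes the data from the current position onward is decoded with the encoding of the element or group carrying the assert/discriminator.

```python
import re
from typing import Optional


def match_at_position(data: bytes, pos: int, pattern: str, encoding: str) -> Optional[int]:
    """Apply `pattern` at byte offset `pos`; return the match length in bytes, or None.

    Assumption: `encoding` is the encoding of the carrying element or group.
    """
    # Decoding raises UnicodeDecodeError if the remaining bytes are not valid text.
    text = data[pos:].decode(encoding)
    # re.match only matches at the start of the string, i.e. the current position.
    m = re.compile(pattern).match(text)
    if m is None:
        return None
    # Convert the matched characters back to a byte length in the same encoding.
    return len(m.group(0).encode(encoding))
```

For example, `match_at_position(b"HDR|12345|rest", 4, r"[0-9]+", "ascii")` matches the five digits that start at offset 4. The choice of encoding changes both whether decoding succeeds and how many bytes a matched character occupies, which is why the specification needs to state it.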

I said on the call that we had extended DFDL regular expressions so that 
raw hex bytes could be specified. However I don't see any evidence of this 
in the DFDL spec. This facility was something we added to IBM MRM for a 
retail format called TLOG which consists of delimited packed decimal data 
with hex indicator bytes, so we needed a way to match the hex indicator 
bytes as part of the regexp. However, I think this was only necessary 
because MRM has neither speculation nor discriminators, and in a DFDL 
version of TLOG I would use a discriminator. So I think my statement was 
in error, and I don't believe raw hex in DFDL regexps is needed. 

Regards

Steve Hanson
...
From:
"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
To:
Steve Hanson/UK/IBM@IBMGB
Date:
26/07/2011 17:30
Subject:
pattern based lengths - suggested revised language
I suggest this language to tighten up this whole section (replace both 
paragraphs). Given the concerns of Tim, that we make sure DFDL 
implementations don’t have to reimplement regexp matching, I think this
...
sufficient.
1.1.1.1 Pattern Based Lengths - Scanability
Any element (complex, simple text, simple binary) may have a 
dfdl:lengthKind 'pattern'. When an element contains binary data, and 
lengthKind='pattern' is used, then it is a schema definition error if the 
character set encoding is not iso-8859-1.

(Possible generalization 1: allow other character sets, e.g., iso-8859-15, 
as well. This is OK because 8859-15 still maps all 256 codepoints. But 
this is a slippery slope.)

(Possible generalization 2: allow any character set: ASCII, EBCDIC, 
utf-16be, etc. Note that using any character encoding other than one 
which maps a valid character to any 8-bit byte creates ambiguity: e.g., 
the regexp “.” is one we normally think of as meaning “any character”. But 
do we really mean “any byte”? If the character set encoding doesn’t have 
a given byte as a codepoint, then this question really matters.)
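
The point about encodings that do or do not map every byte can be illustrated with a short Python sketch. This uses Python's built-in codecs purely as an illustration; `decodable_byte_count` is a hypothetical helper, and a DFDL implementation's character set tables may differ from Python's.

```python
def decodable_byte_count(encoding: str) -> int:
    """Count how many of the 256 byte values decode to a character on their own."""
    count = 0
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            # This byte has no character in the encoding (or is incomplete alone).
            continue
        count += 1
    return count


# Compare the encodings mentioned in the discussion above.
for name in ("iso-8859-1", "iso-8859-15", "ascii", "utf-8"):
    print(name, decodable_byte_count(name))
```

With CPython's codecs, the two ISO 8859 encodings accept all 256 byte values, whereas ascii and utf-8 accept only the 128 values below 0x80 as standalone bytes. For the latter pair, a regexp such as “.” has no character to match for the remaining bytes, which is the ambiguity described above.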

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number 
741598. 
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU