From: Steve Hanson/UK/IBM
To: Tim Kimber/UK/IBM@IBMGB, mbeckerle.dfdl@gmail.com,
Date: 20/09/2011 10:17
Subject: Re: Action 148: pattern based lengths - suggested revised language

From: Tim Kimber/UK/IBM
To: Steve Hanson/UK/IBM@IBMGB
Date: 25/08/2011 16:23
Subject: Re: Action 148: pattern based lengths - suggested revised language

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 27/07/2011 15:00
Subject: Re: pattern based lengths - suggested revised language

On Jul 27, 2011 5:53 AM, "Steve Hanson" <smh@uk.ibm.com> wrote:
Hi Mike
I don't think we can reduce the wording that much. The second paragraph
is needed because it covers the binary case, where encoding is not actually
used.
I think we either need to be conservative and disallow the combination
of binary & pattern, or leave the second paragraph as-is and effectively
say that if you use binary with pattern then that is the behaviour.
If we are to be conservative then:
- For a simple element or simple type, disallow lengthKind="pattern"
with binary rep.
- For a complex element with lengthKind = "pattern", all children
must have lengthUnits = "characters" (so text only) and the encoding
of the children must be the same as the encoding of the parent. (We already
have a similar rule for complex elements with specified length and lengthUnits
= "characters").
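
The two conservative rules above can be sketched as a schema check. This is only an illustration: the Element class and check_pattern_rules function are hypothetical names invented here, not part of any DFDL implementation, and the check looks at direct children only.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Element:
    """Hypothetical, much-simplified stand-in for a DFDL element declaration."""
    name: str
    length_kind: str = "delimited"
    representation: str = "text"      # "text" or "binary"
    length_units: str = "characters"  # "characters" or "bytes"
    encoding: str = "utf-8"
    children: List["Element"] = field(default_factory=list)


def check_pattern_rules(elem: Element) -> List[str]:
    """Return schema definition errors under the conservative rules."""
    errors: List[str] = []
    if elem.length_kind != "pattern":
        return errors  # the rules only constrain lengthKind="pattern"

    if not elem.children:
        # Rule 1: a simple element may not combine pattern with binary rep.
        if elem.representation == "binary":
            errors.append(f"{elem.name}: lengthKind 'pattern' with binary representation")
        return errors

    # Rule 2: a complex element's children must be measured in characters
    # and must share the parent's encoding.
    for child in elem.children:
        if child.length_units != "characters":
            errors.append(f"{child.name}: lengthUnits must be 'characters'")
        if child.encoding.lower() != elem.encoding.lower():
            errors.append(f"{child.name}: encoding differs from parent {elem.name}")
    return errors


parent = Element(
    "record", length_kind="pattern", encoding="ascii",
    children=[
        Element("id", encoding="ascii"),
        Element("note", encoding="utf-16be"),
        Element("flags", encoding="ascii", length_units="bytes"),
    ],
)
for message in check_pattern_rules(parent):
    print(message)
```

In this sketch the "note" child is rejected for its encoding and the "flags" child for its length units, while "id" passes.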
We also allow asserts and discriminators to carry patterns which are applied straight at the current position in the data stream. It would be difficult to police the conservative rules here. But we need to say what encoding is used and we currently do not. I would say it must be the encoding of the element or group that carries the assert/discriminator.
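
To illustrate why the encoding for an assert/discriminator pattern has to be stated, here is a small sketch using Python's re module and standard codecs as a stand-in for a DFDL processor (which they are not). The helper name pattern_matches_at_start is made up for this example. The same three bytes match or fail to match the same pattern depending on which encoding is used to turn them into characters.

```python
import re


def pattern_matches_at_start(data: bytes, pattern: str, encoding: str) -> bool:
    """Decode `data` with `encoding`, then test `pattern` at the current position."""
    text = data.decode(encoding)
    return re.match(pattern, text) is not None


raw = b"\xc1\xc2\xc3"  # "ABC" in EBCDIC code page 037

print(pattern_matches_at_start(raw, r"[A-Z]+", "cp037"))       # True
print(pattern_matches_at_start(raw, r"[A-Z]+", "iso-8859-1"))  # False
```

Under cp037 the bytes decode to "ABC"; under iso-8859-1 they decode to accented letters outside the A-Z range, so the result of the assert depends entirely on the encoding chosen.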
I said on the call that we had extended DFDL regular expressions so that raw hex bytes could be specified. However I don't see any evidence of this in the DFDL spec. This facility was something we added to IBM MRM for a retail format called TLOG which consists of delimited packed decimal data with hex indicator bytes, so we needed a way to match the hex indicator bytes as part of the regexp. However, I think this was only necessary because MRM has neither speculation nor discriminators, and in a DFDL version of TLOG I would use a discriminator. So I think my statement was in error, and I don't believe raw hex in DFDL regexps is needed.
Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
>
> From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
> To: Steve Hanson/UK/IBM@IBMGB
> Date: 26/07/2011 17:30
> Subject: pattern based lengths - suggested revised language
>
> I suggest this language to tighten up this whole section (replace both
> paragraphs). Given Tim's concern that DFDL implementations should not
> have to reimplement regexp matching, I think this is sufficient.
>
> 1.1.1.1 Pattern-Based Lengths - Scanability
> Any element (complex, simple text, simple binary) may have
> dfdl:lengthKind 'pattern'. When an element contains binary data and
> dfdl:lengthKind 'pattern' is used, it is a schema definition error if
> the character set encoding is not iso-8859-1.
>
> (Possible generalization 1: allow other character sets, e.g. iso-8859-15,
> as well. This is OK because 8859-15 still maps all 256 codepoints. But
> this is a slippery slope.)
>
> (Possible generalization 2: allow any character set: ASCII, EBCDIC,
> utf-16be, etc. Note that using any character encoding other than one
> that maps every 8-bit byte to a valid character creates ambiguity. For
> example, we normally think of the regexp “.” as meaning “any character”,
> but do we really mean “any byte”? If the character set encoding doesn’t
> have a given byte as a codepoint, then this question really matters.)
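
The "any character" versus "any byte" question can be made concrete with a short sketch. It uses Python's codecs and re module purely as an illustration, which is an assumption of this example: a DFDL processor may use a different regexp engine with different rules for ".". The helper undecodable_bytes is a name invented here.

```python
import re


def undecodable_bytes(encoding: str) -> list:
    """Return the single-byte values that have no character in `encoding`."""
    bad = []
    for value in range(256):
        try:
            bytes([value]).decode(encoding)
        except UnicodeDecodeError:
            bad.append(value)
    return bad


print(len(undecodable_bytes("iso-8859-1")))  # 0
print(len(undecodable_bytes("ascii")))       # 128

# With iso-8859-1 every byte becomes exactly one character, so a regexp
# "any character" is also "any byte" (DOTALL makes "." accept newline too).
text = bytes(range(256)).decode("iso-8859-1")
any_char = re.compile(".", re.DOTALL)
print(all(any_char.fullmatch(ch) for ch in text))  # True
```

With ASCII, half of the possible byte values have no character at all, so a pattern over characters cannot say anything about them. Note also that in Python's regexp flavour "." skips newline unless DOTALL is set, which is one more way "any character" can differ from "any byte".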
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU