From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Date: 27/07/2011 15:00
Subject: Re: pattern based lengths - suggested revised language

On Jul 27, 2011 5:53 AM, "Steve Hanson" <smh@uk.ibm.com> wrote:
> Hi Mike
>
> I don't think we can reduce the wording that much. The second paragraph
> is needed because it covers the binary case, where encoding is not
> actually used.
>
> I think we either need to be conservative and disallow the combination of
> binary & pattern, or leave the second paragraph as-is and effectively say
> that if you use binary with pattern then that is the behaviour.
>
> If we are to be conservative then:
>
> - For a simple element or simple type, disallow lengthKind="pattern" with
> binary rep.
>
> - For a complex element with lengthKind = "pattern", all children must
> have lengthUnits = "characters" (so text only) and the encoding of the
> children must be the same as the encoding of the parent. (We already have
> a similar rule for complex elements with specified length and lengthUnits
> = "characters").
>
> We also allow asserts and discriminators to carry patterns which are
> applied directly at the current position in the data stream. It would be
> difficult to police the conservative rules here. But we need to say what
> encoding is used and we currently do not. I would say it must be the
> encoding of the element or group that carries the assert/discriminator.
>
> I said on the call that we had extended DFDL regular expressions so that
> raw hex bytes could be specified. However I don't see any evidence of this
> in the DFDL spec. This facility was something we added to IBM MRM for a
> retail format called TLOG which consists of delimited packed decimal data
> with hex indicator bytes, so we needed a way to match the hex indicator
> bytes as part of the regexp. However, I think this was only necessary
> because MRM has neither speculation nor discriminators, and in a DFDL
> version of TLOG I would use a discriminator. So I think my statement was
> in error, and I don't believe raw hex in DFDL regexps is needed.
>
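For readers unfamiliar with what "raw hex bytes in a regexp" looks like in practice, here is a minimal sketch using Python's bytes patterns. It is not DFDL or MRM syntax. The record layout (an 0xFF delimiter followed by an indicator byte of 0x0C or 0x0D) and the name `indicator_byte` are invented for illustration and are not taken from the real TLOG format.

```python
import re

# Hypothetical layout (an assumption, not real TLOG): some packed-decimal
# bytes, a 0xFF delimiter, then a single indicator byte (0x0C or 0x0D).
INDICATOR = re.compile(rb"[^\xff]*\xff([\x0c\x0d])", re.DOTALL)


def indicator_byte(data: bytes):
    """Return the indicator byte value after the first delimiter, or None."""
    match = INDICATOR.match(data)
    if match is None:
        return None
    return match.group(1)[0]


record = bytes([0x12, 0x34, 0x5C, 0xFF, 0x0C, 0x99])
print(indicator_byte(record))
```

Because the pattern is a bytes pattern, the `\xNN` escapes match byte values directly and no character set encoding is involved. That is the facility the paragraph above concludes DFDL does not need once discriminators are available.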
> Regards
>
> Steve Hanson
> Architect, Data Format Description Language (DFDL)
> Co-Chair, OGF DFDL Working Group
> IBM SWG, Hursley, UK
> smh@uk.ibm.com
> tel:+44-1962-815848
>
>
>
> From: "Mike Beckerle" <mbeckerle.dfdl@gmail.com>
> To: Steve Hanson/UK/IBM@IBMGB
> Date: 26/07/2011 17:30
> Subject: pattern based lengths - suggested revised language
>
>
>
> I suggest this language to tighten up this whole section (replace both
> paragraphs). Given the concerns of Tim, that we make sure DFDL
> implementations don’t have to reimplement regexp matching, I think this is
> sufficient.
>
> 1.1.1.1 Pattern-Based Lengths - Scannability
> Any element (complex, simple text, simple binary) may have a
> dfdl:lengthKind 'pattern'. When an element contains binary data, and
> lengthKind='pattern' is used, then it is a schema definition error if the
> character set encoding is not iso-8859-1.
>
>
> (Possible generalization 1: allow other character sets, e.g., iso-8859-15
> as well. This is ok because 8859-15 still maps all 256 codepoints. But
> this is a slippery slope.)
>
> (Possible generalization 2: allow any character set, ASCII, EBCDIC,
> utf-16be, etc. Note that using any character encoding other than one which
> maps every 8-bit byte to a valid character creates ambiguity: e.g., the
> regexp “.” is one where we normally think it means “any character”. But
> do we really mean “any byte”? If the character set encoding doesn’t have
> a given byte as a codepoint, then this question really matters.)
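The point about iso-8859-1 can be sketched in Python. This is an illustration only, not DFDL processor behaviour: `decodes_every_byte` and `pattern_length` are hypothetical helper names, and Python's `re` module stands in for whatever regexp engine an implementation would reuse.

```python
import re

ALL_BYTES = bytes(range(256))


def decodes_every_byte(encoding: str) -> bool:
    """Return True if each of the 256 byte values decodes in `encoding`."""
    try:
        for value in ALL_BYTES:
            bytes([value]).decode(encoding)
    except UnicodeDecodeError:
        return False
    return True


def pattern_length(data: bytes, pattern: str,
                   encoding: str = "iso-8859-1") -> int:
    """Length in bytes of the regexp match anchored at the start of `data`."""
    # Decoding raises UnicodeDecodeError if a byte has no mapping.
    text = data.decode(encoding)
    # DOTALL so that "." also matches the newline character.
    match = re.compile(pattern, re.DOTALL).match(text)
    if match is None:
        raise ValueError("pattern does not match at the current position")
    return len(match.group(0).encode(encoding))


print(decodes_every_byte("iso-8859-1"))
print(decodes_every_byte("ascii"))
print(pattern_length(b"\x12\x3c\xff;rest", "[^;]*"))
```

With iso-8859-1 every byte becomes exactly one character, so "any character" and "any byte" coincide and an off-the-shelf character regexp engine can scan binary data. With ASCII the decode step fails on any byte of 0x80 or above, which is the ambiguity described in generalization 2.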
>
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number
> 741598.
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU