Comments from Steve and Ian:
1) The subset proposed is basically
lifted from the IBM MRM parser help. If I ever knew what the rationale
for the subset was, I don't know it now. What features have we excluded?
2) IBM MRM parser has extended the xsd
regular expression syntax to allow hexadecimal characters using the following
syntax:
    \xNN
where each N is a hexadecimal digit in the range 0 to F.
MRM makes much wider use of regular
expressions, as an alternative to speculative parsing, so I can see why
MRM needed this (one concrete use case was for TLOG retail messages). Do
we need to support this in DFDL?
3) If we don't add the hex support,
what are the use cases for using a dfdl:lengthPattern versus using an xsd
pattern facet? It looks like pattern facets apply to all supported
schema simple types, so not clear why dfdl:lengthPattern would be needed.
The only use case I can think of is where we have length on a complex element
or sequence or choice. If this is the only use case perhaps dfdl:lengthPattern
should only be used in those cases? MRM allows this use. (It might
also answer 2 as it allows embedded binary data to appear). Or is
there a distinction between validation and parsing?
4) What is the behaviour on unparsing?
I believe that MRM simply takes the value presented to it and outputs
it (it does not attempt to match it against the pattern), so the DFDL equivalent
would be to output the infoset value.
5) For a repeating element, presumably
we would consume only as much as the number of occurs dictates.
6) Should state explicitly that DFDL
entity references are not allowed. The XML character reference (&#xNN;) is used
instead.
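
To make the question in 2) concrete, here is a minimal sketch in Python of how \xNN escapes could be rewritten into literal characters before a pattern is handed to an engine that only understands the XSD syntax. The helper name expand_hex_escapes is hypothetical and not taken from MRM or the proposal. The sketch assumes exactly two hex digits per escape and leaves every other backslash escape untouched.

```python
import re

# Matches any backslash escape; group 1 is set only for \xNN (two hex digits).
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|.)", re.DOTALL)


def expand_hex_escapes(pattern: str) -> str:
    # Hypothetical helper: rewrite each \xNN into the character it denotes,
    # escaped so that it is always treated as a literal.
    def replace(match):
        if match.group(1) is None:
            return match.group(0)  # another escape such as \d or \\: keep as is
        return re.escape(chr(int(match.group(1), 16)))

    return _ESCAPE.sub(replace, pattern)


# Unit separator (0x1F) followed by one or more digits.
pattern = expand_hex_escapes(r"\x1F\d+")
print(re.fullmatch(pattern, "\x1f123") is not None)  # True
```

Python's own re module already accepts \xNN, so the rewrite is only needed for engines that follow the XSD syntax strictly.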
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
"Mike Beckerle"
<mbeckerle.dfdl@gmail.com>
Sent by: dfdl-wg-bounces@ogf.org
09/04/2008 01:34
Please respond to: mbeckerle.dfdl@gmail.com
To: Alan Powell/UK/IBM@IBMGB, <dfdl-wg@ogf.org>
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal
Suggest adding to “lengthPattern”
that the longest possible match is taken. This is the usual behavior for
regular expressions, but it’s a clarification I’ve seen made in other places.
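
As an illustration of the longest-match point, here is a minimal sketch in Python. The function name length_by_pattern is made up for the example, and Python's re module only stands in for whatever engine a DFDL processor would use. Greedy quantifiers do take as much as they can, but a backtracking engine such as this one picks the first alternative that succeeds rather than the longest one, which may be why the clarification is worth stating.

```python
import re
from typing import Optional


def length_by_pattern(data: str, pattern: str) -> Optional[int]:
    # Hypothetical helper: length of the match anchored at the start of data,
    # or None when the pattern does not match there.
    match = re.compile(pattern).match(data)
    return match.end() if match else None


print(length_by_pattern("12345,rest", r"\d+"))  # 5: greedy, takes all the digits
print(length_by_pattern("ab", r"a|ab"))         # 1: first alternative wins, not the longest
```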
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org]
On Behalf Of Alan Powell
Sent: Thursday, April 03, 2008 12:44 PM
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] DFDL Regular Expression proposal
Attached is the proposal for the regular expression syntax used to determine
element length.
Highlights
- Based on the XML Schema regular expression
subset used by WebSphere Message Broker.
- Only applies to representation = text
- Uses a LengthPattern property rather than
decorated syntax to distinguish between literals and regular expressions. As
it is only used in one place, this avoids having to escape the decoration
character everywhere else, and we are running out of decoration characters.
- Assumes the pattern is converted to
the data code page before matching against the data stream.
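
One way to read the code page assumption, as a sketch only: the literal characters in the pattern are converted to the data code page, the regex operators are left alone, and the result is matched against the raw bytes. The example uses Python with the cp037 EBCDIC codec. The helper name and the single-byte restriction are assumptions made for the illustration, not part of the proposal.

```python
import re


def compile_up_to_delimiter(delimiter: str, code_page: str):
    # Hypothetical helper: build the byte pattern "anything except the delimiter".
    raw = delimiter.encode(code_page)
    if len(raw) != 1:
        raise ValueError("sketch only handles delimiters that encode to a single byte")
    # Only the literal is converted; the operators [^ ]* are regex syntax, not data.
    return re.compile(b"[^" + re.escape(raw) + b"]*")


data = "HELLO,WORLD".encode("cp037")       # EBCDIC bytes, comma is 0x6B
match = compile_up_to_delimiter(",", "cp037").match(data)
print(match.end())                          # 5
print(data[:match.end()].decode("cp037"))   # HELLO
```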
Comments and improvements as soon as possible please.
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg