Re: [DFDL-WG] DFDL Regular Expression proposal

Steve

1) A good summary of full XML Schema regular expressions is here:
http://www.xmlschemareference.com/regularExpression.html

Quick summary of subset

XML Schema - Regular Expression -

  Meta Characters - all supported
  Normal Characters - no restrictions
  Single Character Escape Sequence - all supported
  Multiple Character Escape Sequences -

  Sequence  Description
  .         Any character except '\n' (newline) and '\r' (return).
  \s        Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline)
            and '\r' (return).
  \S        Any character except those matched by '\s'.
  \i        The first character in an XML identifier. Specifically, any
            letter, the character '_', or the character ':'. See the XML
            Recommendation for the complex specification of a letter. This
            character represents a subset of letter that might appear in '\c'.
  \I        Any character except those matched by '\i'.
  \c        Any character that might appear in the built-in NMTOKEN datatype.
            See the XML Recommendation for the complex specification of a
            NameChar.
  \C        Any character except those matched by '\c'.
  \d        Any decimal digit. A shortcut for '\p{Nd}'.
  \D        Any character except those matched by '\d'.
  \w        Any character that might appear in a word. A shortcut for
            '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the
            set of "punctuation", "separator", and "other" characters).
  \W        Any character except those matched by '\w'.

  Character Categories

  Category  Description                  Notes
  L         Letter, Any
  Lu        Letter, Uppercase
  Ll        Letter, Lowercase
  Lt        Letter, Titlecase
  Lm        Letter, Modifier
  Lo        Letter, Other
  L         Letter, uppercase,           Optional in The Unicode Standard; not
            lowercase, and titlecase     supported by the Schema
            letters (Lu, Ll, and Lt)     Recommendation.
  M         Mark, Any
  Mn        Mark, Nonspacing
  Mc        Mark, Spacing Combining
  Me        Mark, Enclosing
  N         Number, Any
  Nd        Number, Decimal Digit
  Nl        Number, Letter
  No        Number, Other
  P         Punctuation, Any
  Pc        Punctuation, Connector
  Pd        Punctuation, Dash
  Ps        Punctuation, Open
  Pe        Punctuation, Close
  Pi        Punctuation, Initial quote   May behave like Ps or Pe, depending
                                         on usage.
  Pf        Punctuation, Final quote     May behave like Ps or Pe, depending
                                         on usage.
  Po        Punctuation, Other
  S         Symbol, Any
  Sm        Symbol, Math
  Sc        Symbol, Currency
  Sk        Symbol, Modifier
  So        Symbol, Other
  Z         Separator, Any
  Zs        Separator, Space
  Zl        Separator, Line
  Zp        Separator, Paragraph
  C         Other, Any
  Cc        Other, Control
  Cf        Other, Format
  Cs        Other, Surrogate             Explicitly not supported by Schema
                                         Recommendation.
  Co        Other, Private Use
  Cn        Other, Not Assigned          No characters in the file have this
                                         property.

  Character Blocks - None supported (only really meaningful in Unicode)
  XML Character References - supported

2) - 6) see below

Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM
email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898

From: Steve Hanson/UK/IBM
To: Alan Powell/UK/IBM
Cc: dfdl-wg@ogf.org, mbeckerle.dfdl@gmail.com
Date: 09/04/2008 13:58
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal

Comments from Steve and Ian:

1) The subset proposed is basically lifted from the IBM MRM parser help. If I ever knew what the rationale for the subset was, I don't know it now. What features have we excluded?

2) IBM MRM parser has extended the xsd regular expression syntax to allow hexadecimal characters using the following syntax:

  \xNN  hexadecimal digits in the range 0 to F

MRM makes much wider use of regular expressions, as an alternative to speculative parsing, so I can see why MRM needed this (one concrete use case was for TLOG retail messages). Do we need to support this in DFDL? Is this documented?
I think we need to allow hex characters.

3) If we don't add the hex support, what are the use cases for using a dfdl:lengthPattern versus using an xsd pattern facet? It looks like pattern facets apply to all supported schema simple types, so it is not clear why dfdl:lengthPattern would be needed. The only use case I can think of is where we have a length on a complex element, sequence or choice. If this is the only use case, perhaps dfdl:lengthPattern should only be used in those cases? MRM allows this use. (It might also answer 2, as it allows embedded binary data to appear.) Or is there a distinction between validation and parsing?

The xsd:pattern operates on the logical contents; lengthPattern operates on the physical contents, including markup.

4) What is the behaviour on unparsing? I believe that MRM simply takes the value presented to it and outputs it (it does not attempt to match it against the pattern), so the DFDL equivalent would be to output the infoset value.

Agree.

5) For a repeating element, presumably we would consume only as much as the number of occurs dictates.

Good question. I had assumed lengthPattern had the same semantics as length.

6) Should state explicitly that DFDL entity references are not allowed. The XML character reference is used instead NN;

Need to support DFDL entities to allow x00

Regards, Steve

Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848

"Mike Beckerle" <mbeckerle.dfdl@gmail.com>
Sent by: dfdl-wg-bounces@ogf.org
09/04/2008 01:34
Please respond to mbeckerle.dfdl@gmail.com
To: Alan Powell/UK/IBM@IBMGB, <dfdl-wg@ogf.org>
cc:
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal

Suggest adding to "lengthPattern" that the longest possible match is taken. This is the usual behavior for regular expressions, but it's a clarification I've seen in other places.
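The "longest possible match" suggestion can be illustrated with a minimal sketch. It uses Python's re module as a stand-in, which is an assumption: Python's syntax is not the XML Schema regular expression subset discussed in this thread, and the helper name longest_match_length is hypothetical. The point it shows is that a backtracking engine takes the first alternative that succeeds, not necessarily the longest one, so stating the longest-match rule explicitly does change the element length in some cases.

```python
import re


def longest_match_length(pattern, data):
    """Return the length of the longest prefix of `data` that the whole
    pattern matches, or None if no prefix (not even the empty one) matches.

    Hypothetical helper for illustration only; it tries every prefix, so it
    is quadratic in the worst case and not meant for production parsing.
    """
    compiled = re.compile(pattern)
    # Try the longest candidate prefix first and shrink until one matches.
    for end in range(len(data), -1, -1):
        if compiled.fullmatch(data, 0, end):
            return end
    return None


if __name__ == "__main__":
    data = "abc"
    # Python's own matcher stops at the first alternative that succeeds.
    print(re.match(r"a|ab", data).end())            # 1
    # The longest-prefix rule picks the second alternative instead.
    print(longest_match_length(r"a|ab", data))      # 2
    # For a simple greedy pattern both approaches agree.
    print(longest_match_length(r"\d+", "12345,rest"))  # 5
```

For patterns without alternation the greedy default already yields the longest match, which is presumably why the rule is usually left unstated.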
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of Alan Powell
Sent: Thursday, April 03, 2008 12:44 PM
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] DFDL Regular Expression proposal

Attached is the proposal for the regular expression syntax used to determine element length.

Highlights

- Based on the XML Schema regular expression subset used by WebSphere Message Broker.
- Only applies to representation = text.
- Uses a LengthPattern property rather than a decorated syntax to distinguish regular expressions from literals. As it is only used in one place, this avoids everywhere else having to escape the decoration character, and we are running out of decoration characters.
- Assumes the pattern is converted to the data code page before matching against the data stream.

Comments and improvements as soon as possible please.

Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM
email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg