Re: [DFDL-WG] DFDL Regular Expression proposal
Steve
1)
A good summary of the full XML Schema regular expression syntax is here:
http://www.xmlschemareference.com/regularExpression.html
Quick summary of subset
XML Schema - Regular Expression -
Meta Characters - all supported
Normal Characters - no restrictions
Single Character Escape Sequence - all supported
Multiple Character Escape Sequences -
Sequence  Description
.         Any character except '\n' (newline) and '\r' (return).
\s        Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline) and
          '\r' (return).
\S        Any character except those matched by '\s'.
\i        The first character in an XML identifier. Specifically, any letter,
          the character '_', or the character ':'. See the XML Recommendation
          for the complex specification of a letter. The characters matched
          are a subset of those matched by '\c'.
\I        Any character except those matched by '\i'.
\c        Any character that might appear in the built-in NMTOKEN datatype.
          See the XML Recommendation for the complex specification of a
          NameChar.
\C        Any character except those matched by '\c'.
\d        Any decimal digit. A shortcut for '\p{Nd}'.
\D        Any character except those matched by '\d'.
\w        Any character that might appear in a word. A shortcut for
          '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the
          set of "punctuation", "separator", and "other" characters).
\W        Any character except those matched by '\w'.
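To make the '\d' and '\w' definitions above concrete, here is a minimal sketch in Python that tests single characters against the Unicode general categories. The function names are made up for illustration and are not part of any DFDL or XML Schema API. Note that under this definition '_' is not a word character (it is in category Pc), which differs from Perl-style engines.

```python
import unicodedata

def matches_xsd_d(ch: str) -> bool:
    """XSD \\d as described above: any character in category Nd."""
    return unicodedata.category(ch) == "Nd"

def matches_xsd_w(ch: str) -> bool:
    """XSD \\w as described above: everything except P*, Z* and C*."""
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")

# ASCII digit and ARABIC-INDIC DIGIT THREE are both category Nd.
print(matches_xsd_d("7"), matches_xsd_d("\u0663"))
# A letter and a math symbol are word characters; '_' (Pc) and ' ' (Zs) are not.
print(matches_xsd_w("a"), matches_xsd_w("+"), matches_xsd_w("_"), matches_xsd_w(" "))
```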
Character Categories
Category  Description
L         Letter, Any
Lu        Letter, Uppercase
Ll        Letter, Lowercase
Lt        Letter, Titlecase
Lm        Letter, Modifier
Lo        Letter, Other
L&        Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and Lt)
          Notes: Optional in The Unicode Standard; not supported by the Schema
          Recommendation.
M         Mark, Any
Mn        Mark, Nonspacing
Mc        Mark, Spacing Combining
Me        Mark, Enclosing
N         Number, Any
Nd        Number, Decimal Digit
Nl        Number, Letter
No        Number, Other
P         Punctuation, Any
Pc        Punctuation, Connector
Pd        Punctuation, Dash
Ps        Punctuation, Open
Pe        Punctuation, Close
Pi        Punctuation, Initial quote (may behave like Ps or Pe, depending on
          usage)
Pf        Punctuation, Final quote (may behave like Ps or Pe, depending on
          usage)
Po        Punctuation, Other
S         Symbol, Any
Sm        Symbol, Math
Sc        Symbol, Currency
Sk        Symbol, Modifier
So        Symbol, Other
Z         Separator, Any
Zs        Separator, Space
Zl        Separator, Line
Zp        Separator, Paragraph
C         Other, Any
Cc        Other, Control
Cf        Other, Format
Cs        Other, Surrogate (not supported by Schema Recommendation).
          Notes: Explicitly not supported by Schema Recommendation.
Co        Other, Private Use
Cn        Other, Not Assigned (no characters in the file have this property).
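For anyone wanting to check which category a given character falls into, a small sketch using Python's standard unicodedata module follows. The helper name in_category is an assumption for illustration only. One-letter names such as 'L' are treated as the "Any" rows of the table, i.e. a prefix match on the two-letter category.

```python
import unicodedata

def in_category(ch: str, category: str) -> bool:
    """True if ch belongs to the given one- or two-letter category name."""
    actual = unicodedata.category(ch)  # always a two-letter code, e.g. 'Lu'
    if len(category) == 1:
        return actual.startswith(category)  # 'L' means Letter, Any
    return actual == category

# Print the two-letter category of a few sample characters.
for ch in ["A", "a", "\u01c5", "5", "-", "$", " ", "\t"]:
    print(repr(ch), unicodedata.category(ch))
```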
Character Blocks - None supported (only really meaningful in Unicode)
XML Character References - supported
2) - 6) see below
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073 Fax: +44 (0)1962 816898
From: Steve Hanson/UK/IBM
To: Alan Powell/UK/IBM
Cc: dfdl-wg@ogf.org, mbeckerle.dfdl@gmail.com
Date: 09/04/2008 13:58
Subject: Re: [DFDL-WG] DFDL Regular Expression proposal
Comments from Steve and Ian:
1) The subset proposed is basically lifted from the IBM MRM parser help.
If I ever knew what the rationale for the subset was, I don't know it now.
What features have we excluded?
2) The IBM MRM parser has extended the xsd regular expression syntax to
allow hexadecimal characters using the following syntax:
\xNN
where each N is a hexadecimal digit in the range 0 to F.
MRM makes much wider use of regular expressions, as an alternative to
speculative parsing, so I can see why MRM needed this (one concrete use
case was for TLOG retail messages). Do we need to support this in DFDL?
Is this documented? I think we need to allow hex characters.
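If the hex form is allowed, one way to layer it on top of an engine that only knows the xsd syntax would be to expand '\xNN' into the literal character before compiling. The following is only a sketch of that idea in Python, not how MRM does it; expand_hex_escapes is a hypothetical helper, and re.escape quotes for Python's regex dialect, which is an assumption that may not hold for another engine.

```python
import re

# Match any backslash escape; group 2 is set only for the \xNN form.
_ESCAPE = re.compile(r"\\(x([0-9A-Fa-f]{2})|.)", re.DOTALL)

def expand_hex_escapes(pattern: str) -> str:
    """Replace each \\xNN with the (regex-quoted) character it denotes.

    Other escapes, including an escaped backslash followed by 'x41',
    are copied through unchanged.
    """
    def replace(match):
        hex_digits = match.group(2)
        if hex_digits is not None:
            return re.escape(chr(int(hex_digits, 16)))
        return match.group(0)
    return _ESCAPE.sub(replace, pattern)

# A pattern framed by two control characters given in hex.
framed = expand_hex_escapes(r"\x02\d{3}\x03")
print(re.fullmatch(framed, "\x02123\x03") is not None)
```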
3) If we don't add the hex support, what are the use cases for using a
dfdl:lengthPattern versus using an xsd pattern facet? It looks like
pattern facets apply to all supported schema simple types, so not clear
why dfdl:lengthPattern would be needed. The only use case I can think of
is where we have length on a complex element or sequence or choice. If
this is the only use case perhaps dfdl:lengthPattern should only be used
in those cases? MRM allows this use. (It might also answer 2 as it allows
embedded binary data to appear). Or is there a distinction between
validation and parsing?
The xsd:pattern operates on the logical contents; dfdl:lengthPattern
operates on the physical contents, including markup.
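A rough illustration of that distinction, sketched in Python with made-up data: the length pattern is applied to the raw data at the current position to decide how much the element occupies (markup included), and the pattern facet is then checked against the value left after the markup is removed. The function names, the '[' and ']' markup, and the choice of returning 0 on no match are all assumptions for the example.

```python
import re

def length_by_pattern(data: str, offset: int, length_pattern: str) -> int:
    """Length of the physical element: longest match anchored at offset."""
    match = re.compile(length_pattern).match(data, offset)
    return len(match.group(0)) if match else 0  # 0 on no match is assumed

def valid_by_facet(value: str, facet: str) -> bool:
    """Pattern facet check: the whole logical value must match."""
    return re.fullmatch(facet, value) is not None

data = "[0042]tail"
n = length_by_pattern(data, 0, r"\[\d+\]")  # works on physical data
physical = data[:n]                         # includes the '[' ... ']' markup
logical = physical[1:-1]                    # markup stripped
print(n, physical, logical, valid_by_facet(logical, r"\d{4}"))
```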
4) What is the behaviour on unparsing? I believe that MRM simply takes
the value presented to it and outputs it (it does not attempt to match it
against the pattern), so the DFDL equivalent would be to output the infoset
value.
Agree
5) For a repeating element, presumably we would consume only as much as
the number of occurrences dictates.
Good question. I had assumed lengthPattern had the same semantics as
length.
6) Should state explicitly that DFDL entity references are not allowed.
The XML character reference (&#NN;) is used instead.
Need to support DFDL entities to allow x00
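On the x00 point: XML 1.0 does not permit a character reference to U+0000, so a NUL cannot be written in a schema that way. A quick check with Python's standard XML parser shows this; the helper name parses is only for the example.

```python
import xml.etree.ElementTree as ET

def parses(xml_text: str) -> bool:
    """True if the text is accepted as well-formed XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# A reference to 'A' is fine; a reference to NUL is rejected by the parser.
print(parses("<p>&#x41;</p>"))
print(parses("<p>&#x00;</p>"))
```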
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848