

A good summary of full XML schema regular expressions is here

Quick summary of subset

XML Schema - Regular Expression -

Meta Characters   - all supported

Normal Characters - no restrictions

Single Character Escape Sequence - all supported

Multiple Character Escape Sequences -
. Any character except '\n' (newline) and '\r' (return).
\s Whitespace, specifically ' ' (space), '\t' (tab), '\n' (newline) and '\r' (return).
\S Any character except those matched by '\s'.
\i The first character in an XML identifier. Specifically, any letter, the character '_', or the character ':'. See the XML Recommendation for the complex specification of a letter. The characters matched by '\i' are a subset of those matched by '\c'.
\I Any character except those matched by '\i'.
\c Any character that might appear in the built-in NMTOKEN datatype. See the XML Recommendation for the complex specification of a NameChar.
\C Any character except those matched by '\c'.
\d Any Decimal digit. A shortcut for '\p{Nd}'.
\D Any character except those matched by '\d'.
\w Any character that might appear in a word. A shortcut for '[#x0000-#x10FFFF]-[\p{P}\p{Z}\p{C}]' (all characters except the set of "punctuation", "separator", and "other" characters).
\W Any character except those matched by '\w'.
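
As a rough illustration of how these escapes differ from the similarly named ones in other regex engines, the sketch below re-implements '\s', '\d' and '\w' as Python predicates, working only from the definitions in the list above. The function names are invented for this example, and Python's unicodedata tables may be a different Unicode version from the one a given schema processor uses.

```python
import unicodedata

def xsd_s(ch: str) -> bool:
    # '\s' as listed above: only space, tab, newline and return.
    return ch in (" ", "\t", "\n", "\r")

def xsd_d(ch: str) -> bool:
    # '\d' as listed above: a shortcut for \p{Nd}, any decimal digit.
    return unicodedata.category(ch) == "Nd"

def xsd_w(ch: str) -> bool:
    # '\w' as listed above: every character except those in the
    # punctuation (P), separator (Z) and other (C) categories.
    return unicodedata.category(ch)[0] not in ("P", "Z", "C")
```

On these definitions '_' (category Pc) is not matched by '\w' and form feed is not matched by '\s', whereas Python's own \w and \s match both. That is the kind of difference to watch for if an existing regex library is reused for the subset.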

Character Categories
Character Category
L Letter, Any  
Lu Letter, Uppercase  
Ll Letter, Lowercase  
Lt Letter, Titlecase  
Lm Letter, Modifier  
Lo Letter, Other  
L& Letter, uppercase, lowercase, and titlecase letters (Lu, Ll, and Lt). Optional in The Unicode Standard; not supported by the Schema Recommendation.
M Mark, Any  
Mn Mark, Nonspacing  
Mc Mark, Spacing Combining  
Me Mark, Enclosing  
N Number, Any  
Nd Number, Decimal Digit  
Nl Number, Letter  
No Number, Other  
P Punctuation, Any  
Pc Punctuation, Connector  
Pd Punctuation, Dash  
Ps Punctuation, Open  
Pe Punctuation, Close  
Pi Punctuation, Initial quote (may behave like Ps or Pe, depending on usage)  
Pf Punctuation, Final quote (may behave like Ps or Pe, depending on usage)  
Po Punctuation, Other  
S Symbol, Any  
Sm Symbol, Math  
Sc Symbol, Currency  
Sk Symbol, Modifier  
So Symbol, Other  
Z Separator, Any  
Zs Separator, Space  
Zl Separator, Line  
Zp Separator, Paragraph  
C Other, Any  
Cc Other, Control  
Cf Other, Format  
Cs Other, Surrogate (explicitly not supported by the Schema Recommendation).
Co Other, Private Use  
Cn Other, Not Assigned (no characters in the file have this property).  
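
For anyone wanting to check which category a particular character falls in, here is a small sketch using Python's standard unicodedata module. The helper name in_category is hypothetical; it accepts either a one-letter group such as "L" or a two-letter category such as "Lu", mirroring the two forms in the list above.

```python
import unicodedata

def in_category(ch: str, name: str) -> bool:
    # Hypothetical helper: True if ch belongs to the Unicode general
    # category `name` ("Lu", "Nd", ...) or to the one-letter group
    # `name` ("L", "N", ...), which covers all its sub-categories.
    cat = unicodedata.category(ch)
    if len(name) == 2:
        return cat == name
    return cat[0] == name

# Print the two-letter category of a few sample characters.
for ch in ["A", "a", "5", "_", "-", "$", " "]:
    print(repr(ch), unicodedata.category(ch))
```

This is only a lookup aid, not a \p{...} implementation; a regex engine would compile the category test into the pattern itself.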

Character Blocks  - None supported (only really meaningful in Unicode)

XML Character References - supported

2) - 6) see below

Comments from Steve and Ian:

1) The subset proposed is basically lifted from the IBM MRM parser help.  If I ever knew what the rationale for the subset was, I don't know it now. What features have we excluded?

2) IBM MRM parser has extended the xsd regular expression syntax to allow characters to be specified in hexadecimal, using the following syntax:
\xNN where each N is a hexadecimal digit in the range 0 to F

MRM makes much wider use of regular expressions, as an alternative to speculative parsing, so I can see why MRM needed this (one concrete use case was for TLOG retail messages). Do we need to support this in DFDL?  

Is this documented? I think we need to allow hex characters.
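
To make the two notations concrete, here is a sketch of a helper that rewrites MRM-style \xNN escapes as XML character references of the form &#xNN;. The helper name is hypothetical and this is not how MRM or any DFDL processor is known to work; it only shows that the mapping is mechanical. Two caveats: XML 1.0 does not allow a character reference to U+0000 or to most other control characters, and a reference such as &#x2E; is resolved by the XML parser before the regex engine sees it, so it arrives as a bare '.' metacharacter.

```python
import re

# Matches any backslash escape. Group 1 is set only for \xNN, so a
# literal backslash followed by "x41" (written \\x41) is left alone.
_ESCAPE = re.compile(r"\\(?:x([0-9A-Fa-f]{2})|.)", re.DOTALL)

def hex_escapes_to_char_refs(pattern: str) -> str:
    # Hypothetical helper: rewrite each \xNN as &#xNN; and keep every
    # other escape (\d, \s, \\ and so on) exactly as written.
    def repl(m):
        if m.group(1) is None:
            return m.group(0)
        return "&#x" + m.group(1).upper() + ";"
    return _ESCAPE.sub(repl, pattern)

print(hex_escapes_to_char_refs(r"\x0a[A-Z]+\x3B"))
```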

3) If we don't add the hex support, what are the use cases for using a dfdl:lengthPattern versus using an xsd pattern facet?  It looks like pattern facets apply to all supported schema simple types, so not clear why dfdl:lengthPattern would be needed. The only use case I can think of is where we have length on a complex element or sequence or choice. If this is the only use case perhaps dfdl:lengthPattern should only be used in those cases?  MRM allows this use. (It might also answer 2 as it allows embedded binary data to appear).  Or is there a distinction between validation and parsing?

The xsd:pattern operates on the logical contents; dfdl:lengthPattern operates on the physical contents, including markup.
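
A sketch of that distinction, under invented assumptions: the data uses '[' and ']' as markup around each value, the length pattern is matched at the current position in the physical stream, and the facet is checked afterwards against the value with the markup stripped. Python's re module stands in for the XSD regex engine, which is safe only because these two patterns use syntax common to both.

```python
import re

LENGTH_PATTERN = r"\[[A-Z]*\]"   # assumed lengthPattern-style regex (physical data)
FACET_PATTERN = r"[A-Z]{3}"      # assumed xsd:pattern-style facet (logical value)

def parse_one(stream: str):
    # The length is found by matching at the start of the physical stream.
    m = re.match(LENGTH_PATTERN, stream)
    if m is None:
        raise ValueError("length pattern did not match")
    physical = m.group(0)
    logical = physical[1:-1]     # strip the assumed '[' and ']' markup
    # Validation is a separate step and must match the whole logical value.
    valid = re.fullmatch(FACET_PATTERN, logical) is not None
    return physical, logical, valid, stream[m.end():]

first = parse_one("[ABC][DE]")
second = parse_one(first[3])
print(first)
print(second)
```

The second element parses, because its length can be determined, but then fails validation. That may be one answer to the parsing versus validation question in 3).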

4) What is the behaviour on unparsing?  I believe that MRM simply takes the value presented to it and outputs it (it does not attempt to match it against the pattern), so the DFDL equivalent would be to output the infoset value.


5) For a repeating element, presumably we would consume only as much as the number of occurrences dictates.

Good question. I had assumed lengthPattern had the same semantics as length.

6) Should state explicitly that DFDL entity references are not allowed. The XML character reference is used instead: &#xNN;

Need to support DFDL entities to allow x00

Suggest adding to “lengthPattern” a statement that the longest possible match is taken. This is the usual behavior for regular expressions, but it’s a clarification I’ve seen in other places.
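
One reason the clarification may be worth making: backtracking engines take the first alternative that succeeds, which is not always the longest. The sketch below, with a hypothetical helper name, finds the longest prefix of the data that the whole pattern matches by brute force, and compares it with what Python's re.match returns for the same pattern. It is quadratic and meant only to pin down the intended semantics.

```python
import re

def longest_prefix_match(pattern: str, data: str):
    # Try the longest prefix first and shrink until the pattern matches
    # the whole prefix. Returns the matched length, or None if no
    # prefix (not even the empty one) matches.
    for end in range(len(data), -1, -1):
        if re.fullmatch(pattern, data[:end]):
            return end
    return None

print(longest_prefix_match(r"a|ab", "abc"))   # longest-match semantics
print(re.match(r"a|ab", "abc").end())         # leftmost-first alternation
```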

From: [] On Behalf Of Alan Powell
Sent: Thursday, April 03, 2008 12:44 PM
Subject: [DFDL-WG] DFDL Regular Expression proposal


Attached is the proposal for the regular expression syntax used to determine element length.


Comments and improvements as soon as possible please.

