We discussed the correct XML Schema type for DFDL String Literal on the last WG call. I read up on xs:NMTOKEN: it is not appropriate, as it is basically a name and so does not allow the full range of characters we need. Then I looked at restricting xs:token, but I could not work out from the XML Schema 1.0 spec how the whitespace facet is handled when other facets are present. So I asked Sandy, and got the very useful clarification below. Please review for the next call.
 
Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 17/07/2013 16:01 -----

From:        Sandy Gao/Toronto/IBM@IBMCA
To:        Steve Hanson/UK/IBM@IBMGB,
Date:        17/07/2013 13:33
Subject:        Re: DFDL String Literal type



Hi Steve,

Yes, that should work. All other facet checking, including pattern, happens *after* whitespace handling.

This was made clearer in Schema 1.1, where "whitespace" is called a pre-lexical facet, and "pattern" etc. are called lexical facets.

Thanks,
Sandy Gao
Source Code Monitoring (SCMon)
IBM Canada

sandygao@ca.ibm.com



From:        Steve Hanson/UK/IBM@IBMGB
To:        Sandy Gao/Toronto/IBM@IBMCA,
Date:        2013-07-17 06:07 AM
Subject:        DFDL String Literal type



Hi Sandy

Please can I ask your advice on use of the whitespace facet in conjunction with the pattern facet?  This is in order to model the correct data type for DFDL String Literals. A DFDL String Literal is defined as:


DFDL String Literal
DFDL String Literals represent a sequence of literal bytes or characters which appear in the data stream. This presents the following challenges:

-        the literal characters in the data stream might not be in the same encoding as the DFDL schema

-        it may be necessary to specify a literal character which is not valid in an XML document

-        it may be necessary to specify one or more raw byte values

A DFDL string literal can describe any of the following types of literal data in any combination:

-        a single literal character in any encoding

-        a string of literal characters in any encoding

-        a bi-directional character string

-        one or more characters from a set of related characters (e.g. end-of-line characters)

-        a literal byte value

A DFDL string literal is therefore able to describe any arbitrary sequence of bytes and characters.

Empty Strings: Empty string is not allowed as a DFDL string literal value unless explicitly stated otherwise in the description of a property. In this case the use of empty string provides some property specific behavior different from simply using the empty string as a value. When the empty string is to be used as a value, the entity %ES; must be used in the corresponding DFDL string literal.

Whitespace: When whitespace must be used as part of a property value, the DFDL string literal must use entities (such as %WSP;) to represent the whitespace. (This allows a property to represent lists of DFDL string literals by using literal spaces to separate list elements.)
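
To illustrate why whitespace has to be written as an entity, here is a minimal Python sketch of how a property value holding a list of string literals could be split on literal whitespace. The function name split_literal_list is hypothetical and is not taken from the DFDL spec; the only point is that an entity such as %WSP; survives the split as part of a list element, while a literal space acts as a separator.

```python
import re

# XML whitespace characters: space, tab, line feed, carriage return.
XML_WS = " \t\n\r"


def split_literal_list(value: str) -> list[str]:
    """Split a property value into individual string literals.

    Hypothetical helper: literal XML whitespace separates list elements,
    so whitespace inside a literal must be written as an entity (%WSP; etc.).
    """
    trimmed = value.strip(XML_WS)
    if not trimmed:
        return []
    return re.split(r"[ \t\n\r]+", trimmed)


if __name__ == "__main__":
    # Three list elements; the entities are left untouched by the split.
    print(split_literal_list("%WSP; , %NL;"))
```

The sketch does not resolve the entities themselves; that would be a later step applied to each element.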

The nearest match to an XSDL built-in type is xs:token, but we require the additional constraint that no whitespace can appear.  My thought is to define a restriction of xs:token that applies a pattern facet to disallow use of #x20, given that the whitespace 'collapse' implied by xs:token would have replaced #x9, #xA, #xD with #x20, collapsed contiguous #x20, and trimmed leading/trailing #x20.  Does that sound right?
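
As a rough sketch of the processing order I am assuming, the Python below first applies the 'collapse' normalization and only then checks a pattern that disallows #x20. The pattern "[^ ]*" and the function names are my own assumptions for illustration, not the final type definition; whether the empty string should also be rejected is a separate question and is not addressed here.

```python
import re


def collapse_whitespace(raw: str) -> str:
    """Mimic the whiteSpace='collapse' normalization implied by xs:token."""
    # Step 1: replace tab, line feed and carriage return with a space.
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    # Step 2: collapse contiguous spaces into a single space.
    collapsed = re.sub(r" {2,}", " ", replaced)
    # Step 3: trim leading and trailing spaces.
    return collapsed.strip(" ")


# Assumed pattern facet: no #x20 anywhere in the normalized value.
# XSD patterns match the whole value, hence fullmatch below.
NO_SPACE = re.compile(r"[^ ]*")


def is_valid_string_literal(raw: str) -> bool:
    """Check the pattern against the value after whitespace handling."""
    return NO_SPACE.fullmatch(collapse_whitespace(raw)) is not None


if __name__ == "__main__":
    print(is_valid_string_literal("  %WSP;\n"))  # surrounding whitespace is trimmed first
    print(is_valid_string_literal("a\tb"))       # interior tab becomes a space
```

If the pattern were applied before whitespace handling, the first value would be rejected because of its leading spaces, so the order of the two steps is what makes the approach work.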

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair,
OGF DFDL Working Group
IBM SWG, Hursley, UK

smh@uk.ibm.com
tel:+44-1962-815848

