
We discussed the correct XML schema type for DFDL String Literal on the last WG call. I read up on xs:NMTOKEN: it is not appropriate, as it is basically a name and so does not allow the full range of characters we need. Then I looked at restricting xs:token, but I could not work out from the XML Schema 1.0 spec how whitespace facets were handled when other facets were present. So I asked Sandy, and got the very useful clarification below.

Please review for next call.

Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 17/07/2013 16:01 -----

From: Sandy Gao/Toronto/IBM@IBMCA
To: Steve Hanson/UK/IBM@IBMGB,
Date: 17/07/2013 13:33
Subject: Re: DFDL String Literal type

Hi Steve,

Yes, that should work. All other facet checking, including pattern, happens *after* whitespace handling. This was made clearer in Schema 1.1, where "whitespace" is called a pre-lexical facet, and "pattern" etc. are called lexical facets.

Thanks,
Sandy Gao
Source Code Monitoring (SCMon)
IBM Canada
sandygao@ca.ibm.com

From: Steve Hanson/UK/IBM@IBMGB
To: Sandy Gao/Toronto/IBM@IBMCA,
Date: 2013-07-17 06:07 AM
Subject: DFDL String Literal type

Hi Sandy

Please can I ask your advice on use of the whitespace facet in conjunction with the pattern facet? This is in order to model the correct data type for DFDL String Literals. These are defined as:

DFDL String Literal

DFDL String Literals represent a sequence of literal bytes or characters which appear in the data stream.
This presents the following challenges:
- the literal characters in the data stream might not be in the same encoding as the DFDL schema
- it may be necessary to specify a literal character which is not valid in an XML document
- it may be necessary to specify one or more raw byte values

A DFDL string literal can describe any of the following types of literal data in any combination:
- a single literal character in any encoding
- a string of literal characters in any encoding
- a bi-directional character string
- one or more characters from a set of related characters (e.g. end-of-line characters)
- a literal byte value

A DFDL string literal is therefore able to describe any arbitrary sequence of bytes and characters.

Empty Strings: Empty string is not allowed as a DFDL string literal value unless explicitly stated otherwise in the description of a property. In this case the use of empty string provides some property specific behavior different from simply using the empty string as a value. When the empty string is to be used as a value, the entity %ES; must be used in the corresponding DFDL string literal.

Whitespace: When whitespace must be used as part of a property value, the DFDL string literal must use entities (such as %WSP;) to represent the whitespace. (This allows a property to represent lists of DFDL string literals by using literal spaces to separate list elements.)

The nearest match to an XSDL built-in type is xs:token, but we require the additional constraint that no whitespace can appear. My thought is to define a restriction of xs:token that applies a pattern facet to disallow use of #x20, given that the whitespace 'collapse' implied by xs:token would have replaced #x9, #xA, #xD with #x20, collapsed contiguous #x20, and trimmed leading/trailing #x20.

Does that sound right?

Regards
Steve Hanson
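
The processing order discussed in this thread (whitespace 'collapse' first, pattern check second) can be illustrated with a minimal Python sketch. It is only a simulation of the two steps, not an XML Schema validator. The names collapse_whitespace, is_valid_string_literal and NO_SPACE_PATTERN are hypothetical, and the pattern `[^ ]+` is an assumption: the thread only says the pattern should disallow #x20, whereas this pattern also rejects the empty string.

```python
import re


def collapse_whitespace(raw: str) -> str:
    """Simulate whitespace 'collapse' as described in the thread:
    replace #x9, #xA, #xD with #x20, collapse contiguous #x20,
    then trim leading/trailing #x20."""
    replaced = re.sub(r"[\t\n\r]", " ", raw)
    collapsed = re.sub(r" {2,}", " ", replaced)
    return collapsed.strip(" ")


# Assumed pattern facet: one or more characters, none of them #x20.
# Schema patterns match the whole value, hence fullmatch() below.
NO_SPACE_PATTERN = re.compile(r"[^ ]+")


def is_valid_string_literal(raw: str) -> bool:
    """Apply the pattern to the value *after* whitespace handling."""
    normalized = collapse_whitespace(raw)
    return NO_SPACE_PATTERN.fullmatch(normalized) is not None


if __name__ == "__main__":
    for candidate in ["%WSP;", "  %NL;\t\n", "a b", "a\tb", ""]:
        print(repr(candidate), is_valid_string_literal(candidate))
```

Under these assumptions, a value with only leading or trailing whitespace is accepted, because that whitespace is trimmed before the pattern is checked, while any interior whitespace survives as a single #x20 and is rejected by the pattern.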