
We discussed the correct XML schema type for DFDL String Literal on the last WG call. I read up on xs:NMTOKEN - not appropriate as it is basically a name so does not allow the full range of characters we need. Then I looked at restricting xs:token, but I could not work out from the XML Schema 1.0 spec how whitespace facets were handled when other facets were present. So I asked Sandy, and got the very useful clarification below. Please review for next call.

Regards
Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

----- Forwarded by Steve Hanson/UK/IBM on 17/07/2013 16:01 -----

From: Sandy Gao/Toronto/IBM@IBMCA
To: Steve Hanson/UK/IBM@IBMGB
Date: 17/07/2013 13:33
Subject: Re: DFDL String Literal type

Hi Steve,

Yes, that should work. All other facet checking, including pattern, happens *after* whitespace handling.

This was made clearer in Schema 1.1, where "whitespace" is called a pre-lexical facet, and "pattern" etc. are called lexical facets.

Thanks,
Sandy Gao
Source Code Monitoring (SCMon)
IBM Canada
sandygao@ca.ibm.com

From: Steve Hanson/UK/IBM@IBMGB
To: Sandy Gao/Toronto/IBM@IBMCA
Date: 2013-07-17 06:07 AM
Subject: DFDL String Literal type

Hi Sandy

Please can I ask your advice on use of the whitespace facet in conjunction with the pattern facet? This is in order to model the correct data type for DFDL String Literals. This is defined as:

DFDL String Literal

DFDL String Literals represent a sequence of literal bytes or characters which appear in the data stream. This presents the following challenges
- the literal characters in the data stream might not be in the same encoding as the DFDL schema
- it may be necessary to specify a literal character which is not valid in an XML document
- it may be necessary to specify one or more raw byte values

A DFDL string literal can describe any of the following types of literal data in any combination:
- a single literal character in any encoding
- a string of literal characters in any encoding
- a bi-directional character string
- one or more characters from a set of related characters (e.g. end-of-line characters)
- a literal byte value

A DFDL string literal is therefore able to describe any arbitrary sequence of bytes and characters.

Empty Strings: Empty string is not allowed as a DFDL string literal value unless explicitly stated otherwise in the description of a property. In this case the use of empty string provides some property specific behavior different from simply using the empty string as a value. When the empty string is to be used as a value, the entity %ES; must be used in the corresponding DFDL string literal.

Whitespace: When whitespace must be used as part of a property value, the DFDL string literal must use entities (such as %WSP;) to represent the whitespace. (This allows a property to represent lists of DFDL string literals by using literal spaces to separate list elements.)

The nearest match to an XSDL built-in type is xs:token, but we require the additional constraint that no whitespace can appear. My thought is to define a restriction of xs:token that applies a pattern facet to disallow use of #x20, given that the whitespace 'collapse' implied by xs:token would have replaced #x9, #xA, #xD with #x20, collapsed contiguous #x20, and trimmed leading/trailing #x20. Does that sound right?

Regards
Steve Hanson

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
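The order of operations Sandy confirms (whitespace handling first, pattern check second) can be illustrated with a minimal Python sketch. This is not an XML Schema processor; the function names xsd_collapse and is_string_literal are hypothetical, and the pattern `[^ ]+` is only an assumption about what "disallow use of #x20" might look like, with the non-empty requirement taken from the Empty Strings rule quoted above.

```python
import re

def xsd_collapse(value: str) -> str:
    """Mimic XML Schema whiteSpace='collapse' normalization."""
    # Replace step: #x9, #xA and #xD each become #x20.
    replaced = re.sub(r"[\t\n\r]", " ", value)
    # Collapse step: runs of #x20 shrink to one; leading/trailing #x20 are trimmed.
    return re.sub(r" +", " ", replaced).strip(" ")

def is_string_literal(value: str) -> bool:
    """Hypothetical pattern check, applied only AFTER whitespace handling."""
    collapsed = xsd_collapse(value)
    # Assumed pattern: one or more characters, none of them #x20.
    return re.fullmatch(r"[^ ]+", collapsed) is not None
```

Under this sketch, leading and trailing whitespace around a literal is harmless, while any interior whitespace survives as a single #x20 and is rejected by the pattern.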

Well, it's looking to me like xml/xsd just doesn't have the right pre-defined whitespace-handling concepts that DFDL needs for DFDL String Literal nor for DFDL Expression.

The whitespace-separated list of DFDL String literals works, but this is almost by accident.

If xml/xsd aren't going to have the right thing for us, I think we should state our own rules, and we should avoid deriving from the behavior of xs:token because it collapses even quoted whitespace inside expressions, which is very undesirable.

To me given this value " { ../foo eq ' . ' } " the whitespace everywhere except between the single quotes is insignificant and can be collapsed, but collapsing shouldn't mess with a schema author's quoted strings. Yes we have dfdl:decodeDFDLEntities('%SP;%SP;.%SP;%SP;') which could be plugged in instead. But I think this is a hack.

So to me, from the XSD schema of DFDL annotations point of view, DFDL expression is a whitespace-preserving string, and DFDL String Literal is as well.

The DFDL implementation must then provide the behavior for removal of insignificant whitespace. For DFDL Expressions, all whitespace is insignificant except that between quotation marks, which is significant. For DFDL String Literals, no whitespace is allowed, and DFDL Character Entities must be used.

In reply to: Steve Hanson <smh@uk.ibm.com>, Wed, Jul 17, 2013 at 11:06 AM
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
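The rule Mike proposes for DFDL Expressions (whitespace is insignificant everywhere except between quotation marks) could be implemented along the following lines. This is a rough Python sketch, not taken from any DFDL implementation; the name collapse_outside_quotes is hypothetical, and it assumes quoted strings are delimited by matching single or double quotes with no other escape mechanism.

```python
def collapse_outside_quotes(expr: str) -> str:
    """Collapse whitespace outside quoted strings; keep quoted text verbatim."""
    out = []
    quote = None            # the quote character we are currently inside, if any
    pending_space = False   # whitespace seen outside quotes, not yet emitted
    for ch in expr:
        if quote is not None:
            # Inside a quoted string: copy every character unchanged.
            out.append(ch)
            if ch == quote:
                quote = None
        elif ch in " \t\n\r":
            pending_space = True
        else:
            # Emit at most one space between tokens, none at the very start.
            if pending_space and out:
                out.append(" ")
            pending_space = False
            out.append(ch)
            if ch in "'\"":
                quote = ch
    # Trailing whitespace is dropped because pending_space is never flushed.
    return "".join(out)
```

For comparison, the 'collapse' behaviour of xs:token would also shrink the spaces between the quotation marks, which is the effect Mike wants to avoid.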

I agree with all of that. The best way to specify the type of a DFDL string literal in the 'schema for DFDL annotations' would be:
- define a global simple type called 'DFDLStringLiteral' that is a restriction of xs:string (not xs:token) and contains a pattern facet that describes its lexical space.
- define a separate global simple type 'ListOfDFDLStringLiteral' that is a list of DFDLStringLiteral

regards,
Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742 Internal tel. 37246742

In reply to: Mike Beckerle <mbeckerle.dfdl@gmail.com>, 17/07/2013 20:35, Re: [DFDL-WG] Fw: DFDL String Literal type
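Tim's two-type proposal can be sketched in Python as follows. The helper names is_literal and split_literal_list are hypothetical, and the pattern used here (one or more non-whitespace characters) is only an assumed stand-in for the real lexical space of a DFDL string literal. Because the base type is xs:string, no whitespace normalization happens before the pattern is checked, so even leading or trailing whitespace fails.

```python
import re

XSD_WS = " \t\n\r"                       # the four XML whitespace characters
LITERAL = re.compile(r"[^ \t\n\r]+")     # assumed pattern: no whitespace at all

def is_literal(value: str) -> bool:
    """Check a single literal as a restriction of xs:string (whitespace preserved)."""
    return LITERAL.fullmatch(value) is not None

def split_literal_list(value: str) -> list:
    """Split a list value into items, the way an XSD list type separates items."""
    stripped = value.strip(XSD_WS)
    if not stripped:
        return []
    items = re.split(r"[ \t\n\r]+", stripped)
    # Each item must itself be a valid literal.
    if not all(is_literal(item) for item in items):
        raise ValueError("invalid DFDL string literal in list")
    return items
```

With this sketch, a single-valued property written with surrounding whitespace would be rejected unless the whitespace is removed first, which is the element-form concern raised in the reply that follows.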

I agree that xs:string is necessary for DFDL Expressions and DFDL Regexs, that's what I recommended in the other thread. But I'm not seeing what is wrong with using xs:token as the base type for a DFDL String Literal. The replace/collapse algorithm:
a) Removes leading/trailing whitespace, which we want to happen to handle element form
b) Does not lose the fact that whitespace was there - you just end up with a single space. Which we can then detect as illegal.

Regards
Steve Hanson

In reply to: Tim Kimber/UK/IBM@IBMGB, 18/07/2013 09:06, Re: [DFDL-WG] Fw: DFDL String Literal type

Based on the notes mail chain - I think we would have

DFDLStringLiteral derived from xsd:token and restricted with a pattern facet which does not allow whitespace at all.

DFDLExpression derived from xsd:string with a pattern facet that does not allow whitespace before and after the curly braces.

Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia@ca.ibm.com
For info on Message broker http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...

In reply to: Steve Hanson <smh@uk.ibm.com>, 07/18/2013 05:21 AM, Re: [DFDL-WG] Fw: DFDL String Literal type

For DFDL expression, that won't work. We have to be able to allow the user to format expressions across multiple lines. Any leading/trailing whitespace must be stripped by the DFDL implementation. Here's an example from DFDL spec section 7.3: <dfdl:assert message="Precondition violation.” > {../x le 0 and ../y ne "-->" and ..y ne "<!—" } </dfdl:assert> Regards Steve Hanson Architect, IBM Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848 From: Suman Kalia <kalia@ca.ibm.com> To: Steve Hanson/UK/IBM@IBMGB, Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Tim Kimber/UK/IBM@IBMGB Date: 18/07/2013 14:40 Subject: Re: [DFDL-WG] Fw: DFDL String Literal type Based on the notes mail chain - I think we would have DFDLStringLiteral derived from xsd:token and restricted with a pattern facet which does not allow whitespace at all.. DFDLExpression derived from xsd:string with a pattern facet that does not allow whitespace before and after the curly braces.. Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com For info on Message broker http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht... From: Steve Hanson <smh@uk.ibm.com> To: Tim Kimber <KIMBERT@uk.ibm.com>, Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org Date: 07/18/2013 05:21 AM Subject: Re: [DFDL-WG] Fw: DFDL String Literal type Sent by: dfdl-wg-bounces@ogf.org I agree that xs:string is necessary for DFDL Expressions and DFDL Regexs, that's what I recommended in the other thread. But I'm not seeing what is wrong with using xs:token as the base type for a DFDL String Literal. The replace/collapse algorithm: a) Removes leading/trailing whitespace, which we want to happen to handle element form b) Does not lose the fact that whitespace was there - you just end up with a single space. Which we can then detect as illegal. 
Regards,
Steve Hanson

From: Tim Kimber/UK/IBM@IBMGB
To: dfdl-wg@ogf.org,
Date: 18/07/2013 09:06
Subject: Re: [DFDL-WG] Fw: DFDL String Literal type
Sent by: dfdl-wg-bounces@ogf.org

I agree with all of that. The best way to specify the type of a DFDL string literal in the 'schema for DFDL annotations' would be:
- define a global simple type called 'DFDLStringLiteral' that is a restriction of xs:string (not xs:token) and contains a pattern facet that describes its lexical space
- define a separate global simple type 'ListOfDFDLStringLiteral' that is a list of DFDLStringLiteral

regards,
Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB,
Cc: dfdl-wg@ogf.org
Date: 17/07/2013 20:35
Subject: Re: [DFDL-WG] Fw: DFDL String Literal type
Sent by: dfdl-wg-bounces@ogf.org

Well, it's looking to me like xml/xsd just doesn't have the right pre-defined whitespace-handling concepts that DFDL needs for DFDL String Literal or for DFDL Expression. The whitespace-separated list of DFDL String Literals works, but this is almost by accident.

If xml/xsd aren't going to have the right thing for us, I think we should state our own rules, and we should avoid deriving from the behavior of xs:token because it collapses even quoted whitespace inside expressions, which is very undesirable. To me, given this value

" { ../foo eq ' . ' } "

the whitespace everywhere except between the single quotes is insignificant and can be collapsed, but collapsing shouldn't mess with a schema author's quoted strings. Yes, we have dfdl:decodeDFDLEntities('%SP;%SP;.%SP;%SP;') which could be plugged in instead. But I think this is a hack.
So to me, from the XSD schema of DFDL annotations point of view, DFDL expression is a whitespace-preserving string, and DFDL String Literal is as well. The DFDL implementation must then provide the behavior for removal of insignificant whitespace. For DFDL Expressions, all whitespace is insignificant except that between quotation marks which is significant. For DFDL String Literals, no whitespace is allowed, and DFDL Character Entities must be used.

-- Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
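Mike Beckerle's rule for DFDL Expressions (whitespace is insignificant everywhere except between quotation marks) could be implemented along the following lines. This is a minimal sketch under stated assumptions, not behaviour mandated by the DFDL spec. The function name `normalize_expression` is hypothetical. "Insignificant" is interpreted here as "collapse to a single space and trim the ends", so that tokens such as `eq` stay separated. Both single and double quotes are assumed to delimit strings.

```python
def normalize_expression(text: str) -> str:
    """Collapse whitespace outside quoted strings; copy quoted strings verbatim."""
    out = []
    quote = None           # the quote character we are currently inside, if any
    pending_space = False  # whitespace seen outside quotes but not yet emitted
    for ch in text:
        if quote is not None:
            # Inside a quoted string: keep every character, watch for the close.
            out.append(ch)
            if ch == quote:
                quote = None
        elif ch in " \t\n\r":
            pending_space = True
        else:
            # Emit at most one space between tokens, and none at the very start.
            if pending_space and out:
                out.append(" ")
            pending_space = False
            out.append(ch)
            if ch in "'\"":
                quote = ch
    return "".join(out)


expr = " {  ../foo   eq '  .  '  } "
print(normalize_expression(expr))  # {  ../foo eq '  .  ' } with outer runs collapsed
print(" ".join(expr.split()))      # xs:token-style collapse also alters the quoted string
```

The second `print` shows the objection to deriving from xs:token: a collapse that is unaware of quoting turns `'  .  '` into `' . '`, whereas the quote-aware version leaves the author's string untouched. A doubled quote used as an escape inside a string closes and immediately reopens the quoted state, so it is also copied through unchanged.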
participants (4)
- Mike Beckerle
- Steve Hanson
- Suman Kalia
- Tim Kimber