proposal: DFDL needs additional function dfdl:characterCode

An important use case for DFDL is converting legacy data to/from XML. XML 1.0 disallows a number of characters in strings. If the data contains those characters, the question arises of what to turn them into so that the information content is preserved but the result is legal XML, allowing the DFDL infoset to be converted into XML without violating XML's constraints.

The natural thing to do is create an element containing the character code of the illegal character, as an integer. E.g., the character U+0001 would become an element such as:

    <ccode>1</ccode>

This could be done using a hidden element that is a string; the element ccode above would have an inputValueCalc that converts the offending character of that string into an integer. But for that we need a function:

    dfdl:characterCode(str, pos) : int

The arguments would be a string and a position (base 1) within that string, and the result would be the character code of the character in the string at that position. If pos is out of the bounds of the string (i.e., is negative, 0, or too large), then a processing error would occur.

For unparsing, the inverse function would also be needed:

    dfdl:character(intArg) : string

This would return a string containing one character whose codepoint is intArg.

Example

Consider this data:

    123<0>456<1>789<2>123l

where <0> means just one character with codepoint 0, etc. In hex that would be

    313233 00 343536 01 373839 02 313233

The best I can think of for modeling this while preserving all information would end up with XML looking like this:

    <nonXMLString>
      <fragment><stringData>123</stringData></fragment>
      <fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
      <fragment><stringData>456</stringData></fragment>
      <fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
      <fragment><stringData>789</stringData></fragment>
      <fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
      <fragment><stringData>123</stringData></fragment>
    </nonXMLString>

So our nonXMLString is of a type which is an array of fragment, and a fragment is a choice of either (legal XML) stringData or a nonXMLChar. The nonXMLChar has a child element because it needs to convert to/from a string, using inputValueCalc and outputValueCalc to do so; it therefore needs to be a sequence so that it can hold the other hidden elements needed to pull this off.

stringData would have lengthKind="pattern" and a pattern that allows any sequence of XML-allowed characters.

nonXMLChar would have a hidden first child element of type string with explicit length 1 and an assertion that the string matches a pattern for any one of the illegal characters. The charCode child element would use inputValueCalc to get the character code of that character. For 8-bit encodings a table lookup in XPath would be OK, but for Unicode we'd need a function that returns a character code.

If you have just one embedded illegal character, like NUL, then you could model it as a separator, which would simplify things considerably (and is possible in a someday XML 1.1 future, since NUL is then the only disallowed character). But for XML 1.0's illegal characters, we need to be able to convert to/from some non-string representation if we are to preserve information content. Hence we need these additional functions.
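To make the proposed semantics concrete, here is a minimal sketch in Python (not DFDL, since the functions do not exist there). The names character_code, character, to_fragments and from_fragments are hypothetical stand-ins chosen for illustration; IndexError stands in for a DFDL processing error, and the regular expression follows the Char production of XML 1.0. It illustrates the idea only and is not a proposed implementation.

```python
import re

# Runs of characters that XML 1.0 allows (the Char production).
_XML10_RUN = re.compile(
    "[\t\n\r\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]+"
)


def character_code(s: str, pos: int) -> int:
    """Stand-in for the proposed dfdl:characterCode(str, pos); pos is 1-based."""
    if pos < 1 or pos > len(s):
        # Assumption: IndexError plays the role of a processing error.
        raise IndexError(f"position {pos} is outside the string")
    return ord(s[pos - 1])


def character(code: int) -> str:
    """Stand-in for the proposed dfdl:character(intArg)."""
    return chr(code)


def to_fragments(s: str) -> list:
    """Split s into ('stringData', text) and ('charCode', int) fragments."""
    fragments = []
    i = 0
    while i < len(s):
        m = _XML10_RUN.match(s, i)
        if m:
            fragments.append(("stringData", m.group()))
            i = m.end()
        else:
            # One illegal character becomes one charCode fragment.
            fragments.append(("charCode", character_code(s, i + 1)))
            i += 1
    return fragments


def from_fragments(fragments: list) -> str:
    """Inverse of to_fragments, as unparsing would need."""
    return "".join(
        value if kind == "stringData" else character(value)
        for kind, value in fragments
    )
```

Applied to data like the example (three-digit runs separated by codepoints 0, 1 and 2), to_fragments produces the same alternation of stringData and charCode fragments as the XML above, and from_fragments restores the original string.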
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412

From WG call minutes 2012-10-30:

"Beyond the scope of DFDL 1.0. Assumption for now is that infoset needs post-processing."

Mike has observed that other software systems "map the illegal characters to/from the Unicode Private Use Area."

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

In reply to:
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 04/10/2012 23:39
Subject: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
Sent by: dfdl-wg-bounces@ogf.org

--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
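The Private Use Area mapping mentioned in the minutes could be sketched roughly as follows, in Python. The offset U+E000 and the names to_xml_safe and from_xml_safe are assumptions made for this illustration and are not taken from any particular product; only the C0 control characters that XML 1.0 forbids are handled. Data that legitimately contains U+E000..U+E01F would not round-trip under this scheme.

```python
# Assumed base of the Private Use Area range used for the mapping.
PUA_BASE = 0xE000

# C0 controls that XML 1.0 forbids: everything below U+0020
# except TAB (0x09), LF (0x0A) and CR (0x0D).
_ILLEGAL_C0 = [c for c in range(0x20) if c not in (0x09, 0x0A, 0x0D)]

_TO_PUA = {c: PUA_BASE + c for c in _ILLEGAL_C0}
_FROM_PUA = {pua: c for c, pua in _TO_PUA.items()}


def to_xml_safe(s: str) -> str:
    """Replace forbidden C0 controls with Private Use Area characters."""
    return s.translate(_TO_PUA)


def from_xml_safe(s: str) -> str:
    """Undo to_xml_safe, restoring the original control characters."""
    return s.translate(_FROM_PUA)
```

The string keeps its shape as a single string, which is simpler than splitting it into fragments, at the cost of reserving part of the Private Use Area.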

Shouldn't we be using entity references for XML syntactic characters found in text/binary data when creating the infoset, and vice versa?

Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia@ca.ibm.com
For info on Message broker http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...

In reply to:
From: Steve Hanson <smh@uk.ibm.com>
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: dfdl-wg@ogf.org
Date: 11/01/2012 07:57 AM
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
Sent by: dfdl-wg-bounces@ogf.org

If you use XML-specific entity references then you have forced all consumers of the DFDL Infoset to be XML aware. The DFDL infoset is (deliberately) not an XML infoset. If I am parsing a string that contains x'08' there is nothing intrinsically wrong with that code point. It's only a problem if it is subsequently serialised as XML. (If we wanted to have an XML focus to the DFDL infoset then we would have gone down the XDM route, an approach which was rejected).

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

In reply to:
From: Suman Kalia <kalia@ca.ibm.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Mike Beckerle <mbeckerle.dfdl@gmail.com>
Date: 01/11/2012 12:47
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
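A side note on the entity-reference idea, separate from the layering argument above: in XML 1.0 a numeric character reference to a forbidden character is itself not well-formed, so references alone would not carry x'08' through. The following Python sketch checks this with the standard library parser; the helper name parses is hypothetical.

```python
import xml.etree.ElementTree as ET


def parses(doc: str) -> bool:
    """Return True if doc is accepted by the standard library XML parser."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# Syntactic characters such as '&' are handled fine by entity references.
ampersand_ok = parses("<s>a &amp; b</s>")

# A reference to TAB is legal, but a reference to U+0008 is rejected,
# as is the literal control character.
tab_ok = parses("<s>&#x9;</s>")
backspace_ref_ok = parses("<s>&#x8;</s>")
backspace_literal_ok = parses("<s>\x08</s>")
```

So entity references address characters like '<' and '&', but the characters discussed in this thread need some other representation.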

I have an xsd element of type string; the text data pertaining to it is as in Mike's example. What I infer from your note is that in the DFDL infoset the string will appear as such, i.e. 123<0>456<1>789<2>123l. It is up to the user to parse this string and handle syntactic characters if he wants to render this in XML?

Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia@ca.ibm.com
For info on Message broker http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...

In reply to:
From: Steve Hanson <smh@uk.ibm.com>
To: Suman Kalia/Toronto/IBM@IBMCA
Cc: dfdl-wg@ogf.org
Date: 11/01/2012 09:07 AM
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode

Not sure what you mean by 123<0>456<1>789<2>1231. I assume you mean x0031x0032x0033x0000x0034x0035x0036x0001x0037.... as that is what the DFDL infoset will contain if the data is modelled as a single xs:string. If that is to go out as an XML string then clearly some processing must be done on it to make it legal.

All I am saying is it is not the job of the DFDL parser to ensure that the data is in a form suitable for XML, or any other format for that matter. XML interoperability is an important use case for DFDL but it's not the only one.

IBM WMB works in the same way. Data is parsed into the MB tree in Unicode. It is the job of the MB XML serializer to decide what to do with characters that are illegal in XML.

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

In reply to:
From: Suman Kalia <kalia@ca.ibm.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: dfdl-wg@ogf.org
Date: 01/11/2012 13:25
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode

I think all implementations will have to solve this problem. XML interchange is an important use case.
The question is just whether the DFDL *standard* says exactly how to do it or we leave it up to implementations and standardize an approach later.
...mike
On Thu, Nov 1, 2012 at 9:25 AM, Suman Kalia <kalia@ca.ibm.com> wrote:
I have an xsd element of type string; the text data pertaining to it is as in Mike's example. What I infer from your note is that in the DFDL infoset the string will appear as such, i.e. 123<0>456<1>789<2>123. Is it up to the user to parse this string and handle syntactic characters if he wants to render this in XML?
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
From: Steve Hanson <smh@uk.ibm.com> To: Suman Kalia/Toronto/IBM@IBMCA, Cc: dfdl-wg@ogf.org Date: 11/01/2012 09:07 AM Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode ------------------------------
If you use XML-specific entity references then you have forced all consumers of the DFDL Infoset to be XML aware. The DFDL infoset is (deliberately) not an XML infoset. If I am parsing a string that contains x'08' there is nothing intrinsically wrong with that code point. It's only a problem if it is subsequently serialised as XML. (If we wanted to have an XML focus to the DFDL infoset then we would have gone down the XDM route, an approach which was rejected).
Regards
Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/> IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848
From: Suman Kalia <kalia@ca.ibm.com> To: Steve Hanson/UK/IBM@IBMGB, Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org, Mike Beckerle < mbeckerle.dfdl@gmail.com> Date: 01/11/2012 12:47 Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode ------------------------------
Shouldn't we be using entity references for XML syntactic characters found in text/binary data while creating the infoset, and vice versa?
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker: http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.html
From: Steve Hanson <smh@uk.ibm.com> To: Mike Beckerle <mbeckerle.dfdl@gmail.com>, Cc: dfdl-wg@ogf.org Date: 11/01/2012 07:57 AM Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode Sent by: dfdl-wg-bounces@ogf.org ------------------------------
From WG call minutes 2012-10-30:
"Beyond the scope of DFDL 1.0. Assumption for now is that infoset needs post-processing."
Mike has observed that other software systems "map the illegal characters to/from the Unicode Private Use Area."
Regards
Steve Hanson Architect, Data Format Description Language (DFDL) Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/> IBM SWG, Hursley, UK smh@uk.ibm.com tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: dfdl-wg@ogf.org, Date: 04/10/2012 23:39 Subject: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode Sent by: dfdl-wg-bounces@ogf.org ------------------------------
An important use case for DFDL is converting legacy data to/from XML.
XML 1.0 disallows a bunch of string characters.
If the data contains those characters, then the question arises of what to turn them into that both preserves information content, but also is legal in XML so that you can convert the DFDL infoset into XML without violating XML's constraints.
The natural thing to do is create an element containing the character code of the illegal character, as an integer.
E.g., character code U+0001 would become an element such as <ccode>1</ccode>.
This could be done using a hidden element that is a string, and the element ccode above would have an inputValueCalc that converts the offending character of that string into an integer.
But we need a function dfdl:characterCode(str, pos) : int
The arguments would be a string, and a position (base 1) within that string, and the return result would be the character code of the character in the string at that position. If pos is out of the bounds of the string (i.e., is negative, 0, or too large), then a processing error would occur.
For unparsing the inverse function would also be needed: dfdl:character(intArg) : string. This would return a string containing one character whose codepoint is the intArg.
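To make the intended semantics of the two proposed functions concrete, here is a minimal sketch in Python rather than in DFDL. The names character_code and character are stand-ins for dfdl:characterCode and dfdl:character, and ValueError stands in for a DFDL processing error; none of this is taken from a DFDL implementation.

```python
def character_code(s: str, pos: int) -> int:
    """Return the code point of the character at 1-based position pos in s."""
    if pos < 1 or pos > len(s):
        # The proposal calls this case a processing error;
        # ValueError is only a stand-in for that here.
        raise ValueError(f"position {pos} is outside 1..{len(s)}")
    return ord(s[pos - 1])


def character(code: int) -> str:
    """Return a one-character string whose code point is code."""
    return chr(code)
```

Applied to the ccode example, character_code("\x01", 1) gives 1 when parsing, and character(1) rebuilds the original one-character string when unparsing.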
Example
Consider this data: 123<0>456<1>789<2>123 where <0> means just one character with codepoint 0, etc.
In hex that would be 313233 00 343536 01 373839 02 313233
The best I can think of for modeling this while preserving all information would end up with XML looking like this:
<nonXMLString>
  <fragment><stringData>123</stringData></fragment>
  <fragment><nonXMLChar><charCode>0</charCode></nonXMLChar></fragment>
  <fragment><stringData>456</stringData></fragment>
  <fragment><nonXMLChar><charCode>1</charCode></nonXMLChar></fragment>
  <fragment><stringData>789</stringData></fragment>
  <fragment><nonXMLChar><charCode>2</charCode></nonXMLChar></fragment>
  <fragment><stringData>123</stringData></fragment>
</nonXMLString>
So our nonXMLString is of a type which is array of fragment, a fragment is a choice of either (legal XML) stringData, or a nonXMLChar.
The nonXMLChar has a child element because it needs to convert to/from a string, and so will use inputValueCalc and outputValueCalc to do so. It therefore needs to be a sequence, so that it can hold the other hidden elements needed to pull this off.
stringData would have lengthKind="pattern" and a pattern that allows any sequence of XML-allowed characters.
nonXMLChar would have a hidden first child element of type string of explicit length 1, with an assertion that the string matches a pattern accepting exactly one of the illegal characters. The charCode child element would use inputValueCalc to get the character code of that character. For 8-bit encodings a table lookup in XPath would be OK, but for Unicode we'd need a function that returns a character code.
If you just have one embedded illegal character, like NUL, then you could just model it as a separator, which would simplify things considerably (and is possible in a someday XML 1.1 future since NUL is then the only disallowed character.)
But for XML 1.0's illegal characters, we need to be able to convert to/from some non-string representation if we are to preserve information content. Hence we need these additional functions.
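The fragment model described above can be tried out in plain Python. This is only an illustrative sketch, not DFDL: the helper names (to_fragments, from_fragments, to_xml) are made up for the example, and the character class is the complement of the XML 1.0 Char production. It splits the example data into stringData and charCode fragments and shows that the split is reversible.

```python
import re
from xml.sax.saxutils import escape

# Anything outside the XML 1.0 Char production:
# #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF]
ILLEGAL = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def to_fragments(data):
    """Split data into ("stringData", text) and ("charCode", int) fragments."""
    fragments = []
    last = 0
    for m in ILLEGAL.finditer(data):
        if m.start() > last:
            fragments.append(("stringData", data[last:m.start()]))
        fragments.append(("charCode", ord(m.group())))
        last = m.end()
    if last < len(data):
        fragments.append(("stringData", data[last:]))
    return fragments


def from_fragments(fragments):
    """Inverse of to_fragments: rebuild the original string."""
    return "".join(v if k == "stringData" else chr(v) for k, v in fragments)


def to_xml(fragments):
    """Render the fragments in the nonXMLString layout from the example."""
    lines = ["<nonXMLString>"]
    for kind, value in fragments:
        if kind == "stringData":
            inner = f"<stringData>{escape(value)}</stringData>"
        else:
            inner = f"<nonXMLChar><charCode>{value}</charCode></nonXMLChar>"
        lines.append(f"  <fragment>{inner}</fragment>")
    lines.append("</nonXMLString>")
    return "\n".join(lines)


# The example bytes, decoded with a single-byte encoding.
data = bytes.fromhex("313233 00 343536 01 373839 02 313233").decode("latin-1")
fragments = to_fragments(data)
print(to_xml(fragments))
```

Because every illegal character becomes an integer and every legal run stays a string, from_fragments(to_fragments(data)) returns the original data unchanged.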
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412
-- dfdl-wg mailing list dfdl-wg@ogf.org https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above: IBM United Kingdom Limited - Registered in England and Wales with number 741598. Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412

Agree. XML is an important use case. We certainly want to provide guidance to the user on how to do this. If we can standardize, it would be great.
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
From: Mike Beckerle <mbeckerle.dfdl@gmail.com> To: Suman Kalia/Toronto/IBM@IBMCA, Cc: Steve Hanson <smh@uk.ibm.com>, dfdl-wg@ogf.org Date: 11/01/2012 11:08 AM Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
I think all implementations will have to solve this problem. XML interchange is an important use case. The question is just whether the DFDL standard says exactly how to do it or we leave it up to implementations and standardize an approach later. ...mike

I'll tell you what I'm planning to do within a few weeks in the daffodil project. We have a community with lots of binary data containing illegal XML chars, and they view XML interchange as their primary use case, since complex validation follows DFDL processing and that validation occurs on XML data.
I'm planning to provide two functions in the function library:
daffodil:translateIllegalXMLCharsToPUA(arg) : xs:string
daffodil:translatePUAToIllegalXMLChars(arg) : xs:string
The first takes any of the illegal XML codepoints between 0x00 and 0x20 and adds 0xE000 to its codepoint value, which puts it into the private use area (PUA) codepoints so as to preserve its information content. The second reverses that mapping.
These functions can be used inside of inputValueCalc and outputValueCalc to deal with strings that have these illegal XML codepoints in them.
I might also provide an implementation-specific property:
daffodil:encodingModifiers="translateIllegalXMLCharsToPUA"
which is a list of implementation-specific flag/modifier strings. This one would mean that all strings are to be treated in this way when parsing, and the inverse when unparsing.
For now, this will work only for single-byte encodings. It will be a schema-def error if you use it with a multi-byte encoding or a variable-width encoding like utf-8.
On Thu, Nov 1, 2012 at 11:13 AM, Suman Kalia <kalia@ca.ibm.com> wrote:
Agree. XML is an important use case. We certainly want to provide guidance to the user on how to do this. If we can standardize, it would be great.
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
-- Mike Beckerle | OGF DFDL WG Co-Chair Tel: 781-330-0412

Turns out the XML char entities are not an escape scheme for putting illegal chars in. E.g., a numeric character reference to an illegal character (such as &#x1;) is illegal even expressed that way. The char entities are essentially an internationalization hack so you can enter and render any legal character using only a small charset.
On Nov 1, 2012 8:48 AM, "Suman Kalia" <kalia@ca.ibm.com> wrote:
Shouldn't we be using entity references for XML syntactic characters found in text/binary data while creating the infoset, and vice versa?
Suman Kalia IBM Canada Lab WMB Toolkit Architect and Development Lead Tel: 905-413-3923 T/L 313-3923 Email: kalia@ca.ibm.com
For info on Message broker
http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
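The point that character references do not rescue illegal characters is easy to check with any conforming XML 1.0 parser. The sketch below uses Python's standard xml.etree.ElementTree (backed by expat); the helper name is_well_formed is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc):
    """Return True if the parser accepts doc as well-formed XML 1.0."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# A reference to a legal character (U+0041, "A") is accepted...
legal = is_well_formed("<ccode>&#x41;</ccode>")
# ...but a reference to U+0001 is rejected, even though it is "escaped".
illegal = is_well_formed("<ccode>&#x1;</ccode>")
```

Here legal comes out True and illegal comes out False, which is why the thread looks at representations such as an integer charCode element or a Private Use Area mapping instead.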

Correct - there is no way at all of using illegal characters in an XML document. Not CDATA, not character entities. They simply must not appear anywhere.

I agree with Steve that XML compatibility is not the only requirement for a DFDL info set - we should not do anything to make it XML-specific in a way that harms its generality. To balance that, I also think that the DFDL Working Group should be paying attention to the issues around XML compatibility, given that DFDL is based on XML Schema and many potential adopters of DFDL will want to know about XML compatibility.

Mike's proposal of mapping illegal characters into the Unicode Private Use Area sounds like a reasonable approach for implementers to use.

regards,
Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Suman Kalia <kalia@ca.ibm.com>
Cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Date: 01/11/2012 15:14
Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
Sent by: dfdl-wg-bounces@ogf.org

Turns out the XML char entities are not an escape scheme for putting illegal chars in. E.g., a control character such as U+0001 is illegal even expressed that way. The char entities are essentially an internationalization hack so you can enter and render any legal character using only a small charset.

On Nov 1, 2012 8:48 AM, "Suman Kalia" <kalia@ca.ibm.com> wrote:

Shouldn't we be using entity references for XML syntactic characters found in text/binary data while creating the info set, and vice versa...

Suman Kalia
IBM Canada Lab
WMB Toolkit Architect and Development Lead
Tel: 905-413-3923 T/L 313-3923
Email: kalia@ca.ibm.com
For info on Message broker http://www.ibm.com/developerworks/websphere/zones/businessintegration/wmb.ht...
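The point made in this exchange, that neither character references nor CDATA sections admit characters that XML 1.0 disallows, can be checked with any conforming parser. The following is a small sketch using Python's standard-library parser; the helper name `is_well_formed` is made up for the example.

```python
import xml.etree.ElementTree as ET


def is_well_formed(doc: str) -> bool:
    """Return True if the parser accepts doc as well-formed XML 1.0."""
    try:
        ET.fromstring(doc)
        return True
    except ET.ParseError:
        return False


# U+0001 written three different ways, plus two legal references for contrast.
attempts = {
    "raw character": "<a>\x01</a>",
    "character reference": "<a>&#1;</a>",
    "CDATA section": "<a><![CDATA[\x01]]></a>",
    "legal reference (A)": "<a>&#65;</a>",
    "legal reference (tab)": "<a>&#9;</a>",
}
results = {name: is_well_formed(doc) for name, doc in attempts.items()}
```

Only the last two entries are accepted. All three ways of writing U+0001 are rejected, which is why some non-string representation (or a remapping of the character) is needed.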
From: Steve Hanson <smh@uk.ibm.com> To: Mike Beckerle <mbeckerle.dfdl@gmail.com>, Cc: dfdl-wg@ogf.org Date: 11/01/2012 07:57 AM Subject: Re: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode Sent by: dfdl-wg-bounces@ogf.org
From WG call minutes 2012-10-30:
"Beyond the scope of DFDL 1.0. Assumption for now is that infoset needs post-processing."

Mike has observed that other software systems "map the illegal characters to/from the Unicode Private Use Area."

Regards
Steve Hanson
Architect, Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: dfdl-wg@ogf.org
Date: 04/10/2012 23:39
Subject: [DFDL-WG] proposal: DFDL needs additional function dfdl:characterCode
Sent by: dfdl-wg-bounces@ogf.org
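The Private Use Area mapping mentioned in the WG call minutes could look roughly like the sketch below. The thread does not say which part of the Private Use Area is used or which characters are covered, so the base offset `PUA_BASE = 0xE000`, the restriction to the C0 control characters, and the function names are all assumptions made for illustration.

```python
PUA_BASE = 0xE000  # Assumed offset; the thread does not specify one.


def _is_illegal_c0(code_point: int) -> bool:
    """C0 controls other than tab, LF and CR are not allowed in XML 1.0."""
    return code_point < 0x20 and code_point not in (0x09, 0x0A, 0x0D)


def to_pua(text: str) -> str:
    """Replace illegal C0 controls with Private Use Area characters."""
    return "".join(
        chr(PUA_BASE + ord(ch)) if _is_illegal_c0(ord(ch)) else ch
        for ch in text
    )


def from_pua(text: str) -> str:
    """Inverse mapping, applied before unparsing."""
    restored = []
    for ch in text:
        offset = ord(ch) - PUA_BASE
        if 0 <= offset < 0x20 and _is_illegal_c0(offset):
            restored.append(chr(offset))
        else:
            restored.append(ch)
    return "".join(restored)
```

Unlike the fragment model, this keeps the data as a single string, but it only round-trips if the original data never contains the Private Use Area characters used as replacements.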
participants (4)
- Mike Beckerle
- Steve Hanson
- Suman Kalia
- Tim Kimber