Yes, it is regrettable that we need that second escape sequence, but I did some real digging into this, and I believe the W3C groups would not be receptive to reopening the issue. It's not a syntactic thing; it's about C language API access to the XML Infoset. No string from XML data can have a NUL (zero) character code in its content, so XML string content is compatible with C's NUL-terminated string convention. Basically, this restriction lets fairly ordinary C code manipulate XML infoset content. If the NUL restriction were not there, you'd need a whole non-standard string library in C to deal with XML infoset contents, and I don't think that is feasible. The other character codes they disallow are about the non-characters of Unicode. I.e., the XML infoset can't contain surrogate-pair fragments; rather, the infoset contents must contain the full characters those represent. Finally, the byte-order-mark codepoints are not considered valid contents for the infoset either.
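
As a rough illustration of which code points are affected, here is a minimal Python sketch of the XML 1.0 Char production as I read it. The function names are made up for this note and are not part of any XML or DFDL API.

```python
def is_xml10_char(cp: int) -> bool:
    """True if code point cp falls in the ranges of the XML 1.0 Char production."""
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # everything below the surrogate block
        or 0xE000 <= cp <= 0xFFFD      # excludes surrogates, #xFFFE and #xFFFF
        or 0x10000 <= cp <= 0x10FFFF   # supplementary planes
    )


def needs_dfdl_escape(cp: int) -> bool:
    """Hypothetical helper: True if cp cannot be written as an XML character reference."""
    return not is_xml10_char(cp)


if __name__ == "__main__":
    # NUL, a lone surrogate, and #xFFFE (the byte-swapped form of the byte order mark)
    for cp in (0x00, 0x2C, 0xD800, 0xFFFE):
        print(f"#x{cp:04X} needs DFDL escape: {needs_dfdl_escape(cp)}")
```

With this check, #x00 and #xD800 come out as needing the DFDL form, while #x2C can be written as an ordinary XML character reference.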

Note that for DFDL, since the data we're describing can contain NUL in strings, a C-callable API to a DFDL-described 'infoset' must allow for the strings to contain NUL bytes. As a prescriptive format for data, XML could get away with solving the problem by avoidance. As a descriptive approach, DFDL does not have this luxury, so dealing with DFDL-described string data from C means dealing with string data in C where the string-terminating NUL convention isn't necessarily respected. Similarly, we must be able to deal with things that contain surrogate pairs (even unpaired ones), byte order marks, etc.

I considered whether we could get away with just the 'bytes' escape sequence (the \%xHH; format in my proposed syntax), but I concluded we also need the '\#xHH;' syntax for dealing with UTF-16 and UTF-32 character sets, which I expect will become more common over time, particularly UTF-16.
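
A small Python sketch of why the bytes form alone gets awkward for wide encodings: the same code point turns into a different byte sequence depending on the encoding and the byte order, so a schema author using only byte escapes would have to spell each case out by hand. The helper name `encoded_forms` is hypothetical, and the codec names are Python's, not DFDL's.

```python
def encoded_forms(cp, encodings=("utf-8", "utf-16-be", "utf-16-le", "utf-32-be")):
    """Return {encoding name: bytes} for a single Unicode code point."""
    return {enc: chr(cp).encode(enc) for enc in encodings}


if __name__ == "__main__":
    # NUL and the byte order mark code point, each in several encodings
    for cp in (0x00, 0xFEFF):
        for enc, raw in encoded_forms(cp).items():
            print(f"#x{cp:04X} in {enc}: {raw.hex()}")
```

One code-point escape covers all of these representations, whereas the byte escape would need a different literal for each.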

Also, one of the strong arguments for the '\#xHH;' syntax is the 'foolish consistency' argument. You can always use '\#xHH;'; you don't have to think about whether XML allows the codepoint or not. So it simplifies life for DFDL schema authors. DFDL authors can just uniformly use '\#xHHHHHH;' (up to 6 hex digits, maximum value 10FFFF) when they want to specify a Unicode character code, and forget about XML character references entirely if they want.

BTW: I consider the '\#' as similar to '&#' in XML syntax. I.e., you can specify hex via the 'x' or decimal without the 'x', and up to 6 hex digits or 7 decimal digits are allowed (max value 1114111 decimal, 0x10FFFF hex). This is different from the '\%xHH;' syntax, which requires exactly one or two hex digits. That restriction is needed to deal with endianness issues. I.e., in a single-byte charset, if you allowed '\%xFEFF;' then which byte appears at the lower address in the data, the #xFE or the #xFF? The restriction avoids all of that at the cost of requiring you to explicitly specify '\%xFE;\%xFF;'.
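
To make the digit limits concrete, here is a minimal Python sketch of a tokenizer for the two DFDL escape forms described above. It is only an illustration of the proposal, not a settled grammar: the function name is hypothetical, and treating the 'x' in the byte form as optional is my assumption, since the minutes quoted below write that form without it.

```python
import re

# Assumed grammar: \#xH{1,6}; or \#D{1,7}; for code points, \%[x]H{1,2}; for one raw byte.
_ESCAPE = re.compile(
    r"\\#x(?P<hex>[0-9A-Fa-f]{1,6});"
    r"|\\#(?P<dec>[0-9]{1,7});"
    r"|\\%x?(?P<byte>[0-9A-Fa-f]{1,2});"
)


def tokenize(literal):
    """Split a literal into ('char', code_point) and ('byte', value) tokens."""
    tokens = []
    pos = 0
    while pos < len(literal):
        m = _ESCAPE.match(literal, pos)
        if m is None:
            # An escape introducer that does not match the grammar is rejected.
            if literal.startswith(("\\#", "\\%"), pos):
                raise ValueError(f"malformed escape at offset {pos}")
            tokens.append(("char", ord(literal[pos])))
            pos += 1
            continue
        if m.group("byte") is not None:
            tokens.append(("byte", int(m.group("byte"), 16)))
        else:
            if m.group("hex") is not None:
                cp = int(m.group("hex"), 16)
            else:
                cp = int(m.group("dec"))
            if cp > 0x10FFFF:
                raise ValueError(f"code point out of range at offset {pos}")
            tokens.append(("char", cp))
        pos = m.end()
    return tokens


if __name__ == "__main__":
    print(tokenize("abc\\#x00;def"))
    print(tokenize("\\%xFE;\\%xFF;"))
```

Under these assumptions '\%xFEFF;' is rejected outright, so the author has to write the two bytes in the order they appear in the data.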

As for the choice of syntax, '\#0xHH;' or '\%xHH;': I used a percent sign because it's used for hex byte specifications in the standard URL format, which is a very, very minor precedent. I am a bit concerned that people will expect '\#xHH;' and '\#0xHH;' to be synonyms. I.e., the use of the leading zero seems rather subtle. I'd prefer an entirely different escape sequence indicator to a subtle distinction like this. What's going on is quite significantly different: one is talking about Unicode codepoints, the other about representation bytes. I think they should be loudly separated.

Mike Beckerle
Architect, Scalable Computing
IBM Software Group
Information Integration Solutions
Westborough, MA

Steve Hanson <smh@uk.ibm.com>
Sent by: owner-dfdl-wg@ggf.org

09/15/2005 04:56 AM

To	dfdl-wg@gridforum.org
cc
Subject	Fw: [dfdl-wg] minutes of call 2005-09-14

Hi Mike

Re the three escape mechanisms. It is unfortunate that we need a separate mechanism for the second case (as discussed in the draft spec section 26).

Re the third mechanism for escaping hex literals. An alternative syntax could be \#0x2C; using 0x to indicate hex. &#0x2C; would be even better - but does it cause XML problems?

Regards, Steve

----- Forwarded by Steve Hanson/UK/IBM on 15/09/2005 09:33 -----

Mike Beckerle <beckerle@us.ibm.com>
Sent by: owner-dfdl-wg@ggf.org

14/09/2005 22:18

To	dfdl-wg@gridforum.org
cc
Subject	[dfdl-wg] minutes of call 2005-09-14

Who: Jim Myers, Mike Beckerle, Tara Talbott, Steve Hanson, Geoff Judd, Bob McGrath

Agenda:

GGF plans - we will have F2F at GGF15. The schedule of our WG sessions at the GGF15 are not yet set, but we'll have F2F meetings during the non-scheduled times. We're counting on the fact that most of our DFDL WG members are focused on DFDL and won't mind the conflict with the rest of GGF15 much. People should make travel plans for all 3 days M, T, W.

GridForge Forums - these seem to be working now. Will try them when one of our subcommittees reports back to the broader WG. E.g., scoping or arrays.

Issues list - we made it through items 5 to 15

Issue 5 - resolution. No built-in set of defaults. If you need a property and your configuration doesn't have one specified then it is an error. There is a small set of named configurations provided with DFDL each of which is a self-consistent set of properties. This is probably the set that we find useful in the primer and other WG docs.
Issue 6 - resolution. change attribute name from 'base' to 'extends'
Issue 7 - belongs on same tracker as 8, 9, 10
Issue 11 - resolution: add an attribute 'byteOrderMarkPolicy' values are: required, notAllowed, optionalButGenerateOnOutput, optionalButOmitOnOutput (the generate on output control aspect was not discussed on the call. I thought of that just now while typing.)
Issue 12 - fix diagram - omission of these types was unintentional. Better diagram next draft.
Issue 13 and 14 - new tracker item - Steve Hanson, Geoff Judd and someone from MikeB's team at IBM to address.
Issue 15 - resolution: each delimiter will grow an extra rep-property for its regexp variant. E.g., separator will have a corresponding separatorRegexp property. Only one of the two may be specified.

We discussed that separator (or any delimiter) can be a text string, and we had previously decided that the separatorEncoding attribute would go away, to be replaced by a syntax for expressing hex bytes (not hex character codes, real non-charset-transformed hex bytes) as part of delimiter string literals. This same way of putting hex bytes would also be supported as part of the regexp language so that the regexp language remains able to include any non-regular expression for a delimiter into a regular expression.

(Editor's Note: this is a bit tricky. We now have 3 different escape mechanisms in string literals.

1) XML character code specifiers, e.g., 'abc&#x2C;def', which is the normal XML way of specifying the Unicode #x2C character code. (This is an XML standard escape convention.) Used inside a delimiter string this means: take the Unicode #x2C code point, figure out what character it corresponds to in the charset of the data element, and use that character. Note that this is the same thing that 'a' means. Take the 'a' Unicode code point, figure out what character corresponds to 'a' in the charset of the data element, and look for that.

2) Non-XML character code specifiers, e.g., 'abc\#x00;def', which is the way we allow you to put XML-disallowed Unicode character codes like #x00 into a string literal. (This is a DFDL convention.) The interpretation here is exactly like the above. I.e., '\#x2c;' is exactly equivalent to '&#x2c;' and the charset mapping applies as above. This rule is only needed because of the XML restriction disallowing certain Unicode character codes from the XML infoset.

3) Non-character byte specifiers, e.g., 'abc\%01;\%02;def', which is a proposed escape syntax that means: put the bytes #x01 and #x02 into the string literal, bypassing any considerations of charset, i.e., without considering them to be character codes in any character set. That is, these byte values have nothing to do with Unicode codepoint values. (This is a DFDL convention.)

The other characters of the string would be treated as per (1) and/or (2) above.)
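
The difference between the character cases (1) and (2) and the byte case (3) can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name is hypothetical, an already-parsed literal is modelled as a list of `str` pieces (characters, mapped through the data's charset) and `bytes` pieces (raw bytes, passed through untouched), and Python's `cp037` codec stands in for an EBCDIC data charset.

```python
def delimiter_bytes(parts, charset):
    """Resolve a parsed delimiter literal to the bytes to look for in the data.

    parts   -- list of str (character content) and bytes (raw byte escapes)
    charset -- Python codec name for the charset of the data element
    """
    out = bytearray()
    for part in parts:
        if isinstance(part, bytes):
            out += part                  # case 3: no charset mapping at all
        else:
            out += part.encode(charset)  # cases 1 and 2: map via the charset
    return bytes(out)


if __name__ == "__main__":
    literal = ["a,", b"\x01\x02", "b"]
    for charset in ("ascii", "cp037"):
        print(charset, delimiter_bytes(literal, charset).hex())
```

The character pieces change representation with the charset (the comma is #x2C in ASCII but #x6B in cp037), while the two raw bytes stay #x01 and #x02 in both.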