clarification or maybe restriction needed: delimited hexBinary question

In this example:

    <xs:element name="foo" type="xs:hexBinary" dfdl:lengthKind="delimited" dfdl:separator="年" dfdl:encoding="UTF-8" />

Note that the separator 年 is U+5E74, which requires 3 bytes in UTF-8: E5 B9 B4.

The spec doesn't really say what happens here. It talks about scanning BCD and packed data for delimiters (I think the use case is TLOG format?), but doesn't really say what that means. I think it means that the data is raw bytes and no character decoding happens at runtime. The delimiters are converted to bytes, and the scan looks for those bytes.

However, that rules out delimiters that use character class entities, like separator="年%WSP+;", since WSP+ means one or more repeats of any of several whitespace characters. It can't just be converted to a single sequence of bytes.

I did not find a restriction in the DFDL spec on what the delimiters can contain when used for delimited binary data. E.g., only raw bytes, or no char class entities. Perhaps there is such a restriction and I just didn't find it? If not, perhaps we need such a restriction, just to simplify implementors' lives and avoid features nobody needs.

Easy implementation tricks don't generalize. Consider this:

    <xs:element name="foo" type="xs:hexBinary" dfdl:lengthKind="delimited" dfdl:separator="å¹´" dfdl:encoding="iso-8859-1" />

The separator now contains the 3 bytes of the UTF-8 character, but as individual characters in iso-8859-1, where byte values and Unicode codepoints are the same. It doesn't work in general because char class entities like WSP+ remain problematic.

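To make the byte-level reading above concrete, here is a minimal Python sketch. It is not taken from the DFDL spec or from any implementation: the function name `scan_for_separator` and the sample payload bytes are made up for illustration. It shows the separator being converted to bytes and searched for in raw data, the iso-8859-1 "trick" giving the same offset, and why a WSP member has no single fixed byte sequence in UTF-8.

```python
def scan_for_separator(data: bytes, separator: str, encoding: str) -> int:
    """Return the byte offset of the first encoded separator, or -1.

    Hypothetical helper: the data is treated as raw bytes and is never
    decoded; only the separator is converted to bytes.
    """
    return data.find(separator.encode(encoding))


# Made-up payload: 4 binary bytes, then the UTF-8 bytes of U+5E74, then 2 more.
data = bytes.fromhex("DEADBEEF") + "\u5e74".encode("utf-8") + bytes.fromhex("CAFE")

# Byte-level scan: the separator becomes E5 B9 B4 and is searched for directly.
offset = scan_for_separator(data, "\u5e74", "utf-8")

# The iso-8859-1 trick: re-express the UTF-8 bytes as three Latin-1 characters,
# decode the data as iso-8859-1 (every byte maps to exactly one character),
# and do an ordinary character scan.
latin1_separator = "\u5e74".encode("utf-8").decode("iso-8859-1")
offset_latin1 = data.decode("iso-8859-1").find(latin1_separator)

# Why WSP+ does not fit this model: its members encode to different
# numbers of bytes in UTF-8 (SPACE, NEL, IDEOGRAPHIC SPACE shown here).
wsp_lengths = {len(ch.encode("utf-8")) for ch in ("\u0020", "\u0085", "\u3000")}
```

Because iso-8859-1 maps each byte to the codepoint of the same value, character offsets in the decoded string are byte offsets in the data, which is what lets the two scans agree for a literal separator.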
In UTF-8, WSP+ allows repeats of any of the byte sequences corresponding to these Unicode characters:

    U+0009-U+000D (control characters)
    U+0020 SPACE
    U+0085 NEL
    U+00A0 NBSP
    U+1680 OGHAM SPACE MARK
    U+180E MONGOLIAN VOWEL SEPARATOR
    U+2000-U+200A (different sorts of spaces)
    U+2028 LSP
    U+2029 PSP
    U+202F NARROW NBSP
    U+205F MEDIUM MATHEMATICAL SPACE
    U+3000 IDEOGRAPHIC SPACE

I can't express, with a separator, a repeating disjunction of the byte sequences corresponding to the above.

Now, I think all this complexity adds no value for anyone. To avoid all this, I would propose these restrictions on delimited binary data:

1) Only SBCS (single-byte character set) encodings can be used.
2) No repeating char-class entities (WSP*, WSP+) are allowed.

With those restrictions I believe the "trick" above of using an iso-8859-1 "decoder", and converting the delimiters into iso-8859-1 character sequences, can be made to work.

Comments?

...mikeb

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com

Please note: Contributions to the DFDL Workgroup's email discussions are subject to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>