Re: [DFDL-WG] Fw: clarification or maybe restriction needed: delimited hexBinary question

5 Jul 2016

The WG agreed that an implementation note should be added to the relevant 
section to explain the implications for the delimiter scanner if an 
implementation supports the 'delimited binary' or 'raw byte entity in 
delimiters' optional features.

https://redmine.ogf.org/issues/315

Regards

Steve Hanson
IBM Integration Bus, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890

From:   Steve Hanson/UK/IBM
To:     mbeckerle.dfdl@gmail.com
Cc:     DFDL-WG <dfdl-wg@ogf.org>, Andrew Edwards/UK/IBM@IBMGB
Date:   16/06/2016 16:02
Subject:        Re: Fw: [DFDL-WG] clarification or maybe restriction 
needed: delimited hexBinary question

Mike

Your example below should use dfdl:terminator not dfdl:separator but that 
does not materially change anything.

I think it means that the data is raw bytes, no character decoding is 
happening at runtime. The delimiters are converted to bytes, and the scan 
is to find those bytes.

Correct. 

However, that leaves off the ability to use delimiters with character 
class entities, like separator="年%WSP+;", since WSP+ means one or more 
repeats of any of several whitespace characters. It can't just be 
converted to a sequence of bytes. 

The IBM DFDL scanner converts everything to bytes, including WSP+ and WSP* 
delimiters, even when matching characters, so the scan is purely 
byte-based. Originally this was so that a scan could handle raw byte 
entities and parent delimiters with different encodings. This was extended 
to handle the delimited binary elements erratum. The only difference is 
that bytes read from the input stream are not converted from an encoding.
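
As an illustration of a purely byte-based scan that still honours WSP+, here 
is a minimal Python sketch. It is not the IBM DFDL scanner: the names 
WSP_CODE_POINTS, wsp_plus_pattern and delimiter_pattern are hypothetical, and 
the whitespace set is assumed to be the one listed in the quoted message 
below. The idea is to encode each whitespace character in the delimiter's 
encoding and match a repeating alternation of the resulting byte sequences.

```python
import re

# Whitespace code points as listed in the quoted message (assumed WSP set).
WSP_CODE_POINTS = [
    *range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]

def wsp_plus_pattern(encoding):
    """Byte-level regex fragment matching one or more encoded whitespace characters."""
    alternatives = []
    for cp in WSP_CODE_POINTS:
        try:
            alternatives.append(re.escape(chr(cp).encode(encoding)))
        except UnicodeEncodeError:
            continue  # this character has no representation in the encoding
    return b"(?:" + b"|".join(alternatives) + b")+"

def delimiter_pattern(literal, encoding):
    """Compile a byte regex for a delimiter of the form: literal followed by WSP+."""
    return re.compile(re.escape(literal.encode(encoding)) + wsp_plus_pattern(encoding))

# Raw binary content, then the delimiter: 年, SPACE, TAB, IDEOGRAPHIC SPACE.
data = b"\x01\x02" + "年".encode("utf-8") + b" \t" + "\u3000".encode("utf-8") + b"\x03"
match = delimiter_pattern("年", "utf-8").search(data)
print(match.start(), match.end())
```

The input bytes are never decoded; only the delimiter is converted, which is 
the property described above.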

Regards

Steve Hanson

----- Forwarded by Steve Hanson/UK/IBM on 16/06/2016 09:34 -----

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 15/06/2016 21:41
Subject: [DFDL-WG] clarification or maybe restriction needed: delimited 
hexBinary question
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org> 

In this example:

<xs:element name="foo" type="xs:hexBinary" dfdl:lengthKind="delimited" 
dfdl:separator="年" dfdl:encoding="UTF-8" />

Note that 年 is U+5E74, which requires 3 bytes in UTF-8: E5 B9 B4

The spec doesn't really say what happens here. It talks about scanning BCD 
and packed data for delimiters (I think the use case is TLOG format?), but 
doesn't really say what that means.

I think it means that the data is raw bytes, no character decoding is 
happening at runtime. The delimiters are converted to bytes, and the scan 
is to find those bytes.
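
A minimal Python sketch of that reading, for illustration only. The function 
name scan_delimited is hypothetical and the sample bytes are made up. The 
delimiter is encoded once, and the data is searched as raw bytes with no 
character decoding.

```python
from typing import Tuple

def scan_delimited(data: bytes, delimiter: str, encoding: str) -> Tuple[bytes, int]:
    """Return (content, position just past the delimiter) using a raw byte search."""
    # Only the delimiter is converted to bytes; the data is never decoded.
    delimiter_bytes = delimiter.encode(encoding)
    index = data.find(delimiter_bytes)
    if index < 0:
        raise ValueError("delimiter not found")
    return data[:index], index + len(delimiter_bytes)

# Four bytes of binary content, the 3-byte UTF-8 delimiter, then more data.
stream = bytes.fromhex("DEADBEEF") + "年".encode("utf-8") + bytes.fromhex("CAFE")
content, next_position = scan_delimited(stream, "年", "utf-8")
print(content.hex().upper(), next_position)
```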

However, that leaves off the ability to use delimiters with character 
class entities, like separator="年%WSP+;", since WSP+ means one or more 
repeats of any of several whitespace characters. It can't just be 
converted to a sequence of bytes. 

I did not find a restriction in the DFDL spec on what the delimiters can 
contain when used for delimited binary data. E.g., only raw bytes or no 
char class entities.
Perhaps there is such a restriction and I just didn't find it?

If not, perhaps we need such a restriction, just to simplify implementors' 
lives, and avoid features nobody needs. 

Consider this: easy implementation tricks like the following don't generalize:

<xs:element name="foo" type="xs:hexBinary" dfdl:lengthKind="delimited" 
dfdl:separator="å¹´" dfdl:encoding="iso-8859-1" />

The separator now contains the 3 bytes of the UTF-8 character, but as 
individual characters in iso-8859-1, where byte values and Unicode 
code points are the same.
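
The trick can be checked with a short Python sketch (illustrative only; the 
variable names and sample bytes are made up). Decoding the UTF-8 bytes of 年 
as iso-8859-1 yields the three characters used in the separator above, and a 
character search over iso-8859-1 decoded data then behaves like a byte search.

```python
utf8_bytes = "年".encode("utf-8")               # E5 B9 B4
latin1_view = utf8_bytes.decode("iso-8859-1")  # one character per byte
print(latin1_view)

# Binary data containing the delimiter bytes, decoded byte-for-byte.
stream = bytes.fromhex("0102") + utf8_bytes + bytes.fromhex("03")
decoded = stream.decode("iso-8859-1")
index = decoded.find(latin1_view)
print(index)
```

Because iso-8859-1 maps every byte value to the code point of the same value, 
the decode never fails and character offsets equal byte offsets.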

It doesn't work because char class entities like WSP+ remain problematic. 
In UTF-8, WSP+ allows repeats of any of the byte sequences corresponding 
to these Unicode characters: 
U+0009-U+000D (Control characters) 
U+0020 SPACE 
U+0085 NEL 
U+00A0 NBSP 
U+1680 OGHAM SPACE MARK 
U+180E MONGOLIAN VOWEL SEPARATOR 
U+2000-U+200A (different sorts of spaces) 
U+2028 LSP 
U+2029 PSP 
U+202F NARROW NBSP 
U+205F MEDIUM MATHEMATICAL SPACE 
U+3000 IDEOGRAPHIC SPACE

I can't express, with a separator, a repeating disjunction of the byte 
sequences corresponding to the above. 
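
To make the problem concrete, this small Python sketch (illustrative, with a 
hypothetical WSP_CODE_POINTS list copied from the characters above) groups 
the UTF-8 encodings of those characters by length. They come out as a mix of 
1-, 2- and 3-byte sequences, so no single fixed byte string can stand in for 
WSP+.

```python
# Code points taken from the list above.
WSP_CODE_POINTS = [
    *range(0x0009, 0x000E), 0x0020, 0x0085, 0x00A0, 0x1680, 0x180E,
    *range(0x2000, 0x200B), 0x2028, 0x2029, 0x202F, 0x205F, 0x3000,
]

# Map encoded length -> list of hex byte sequences of that length.
by_length = {}
for cp in WSP_CODE_POINTS:
    encoded = chr(cp).encode("utf-8")
    by_length.setdefault(len(encoded), []).append(encoded.hex().upper())

for length, sequences in sorted(by_length.items()):
    print(length, sequences)
```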

Now, I think all this complexity adds no value for anyone. 

To avoid all this, I would propose these restrictions on delimited binary 
data:
1) can only use SBCS encodings
2) no repeating char-class entities (WSP*, WSP+) are allowed.

With those restrictions I believe the "trick" above of using an iso-8859-1 
"decoder", and converting the delimiters into iso-8859-1 character 
sequences, can be made to work. 
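
A rough Python sketch of how the two proposed restrictions and the conversion 
might be checked. Everything here is an assumption for illustration: the 
function names, the allow-list of single-byte encodings, and the simple 
pattern used to spot WSP* and WSP+ (it ignores DFDL's %% escaping).

```python
import codecs
import re

# Hypothetical allow-list of SBCS encodings, normalised to Python's codec names.
SINGLE_BYTE_CODECS = {codecs.lookup(n).name for n in ("us-ascii", "iso-8859-1", "cp037")}
# Simplified detection of repeating character class entities.
REPEATING_CLASS_ENTITY = re.compile(r"%WSP[*+];")

def check_binary_delimiter(delimiter, encoding):
    """Return a list of violations of the two proposed restrictions."""
    problems = []
    if codecs.lookup(encoding).name not in SINGLE_BYTE_CODECS:
        problems.append("encoding is not a single-byte character set")
    if REPEATING_CLASS_ENTITY.search(delimiter):
        problems.append("repeating char class entity (WSP* or WSP+) is not allowed")
    return problems

def delimiter_as_latin1(delimiter, encoding):
    """Re-express the delimiter's bytes as iso-8859-1 characters, one per byte."""
    return delimiter.encode(encoding).decode("iso-8859-1")

print(check_binary_delimiter("|", "iso-8859-1"))
print(check_binary_delimiter("年%WSP+;", "utf-8"))
```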

Comments?

...mikeb

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | 
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are 
subject to the OGF Intellectual Property Policy