Re: [DFDL-WG] clarification needed for scan and scannable

Action 217 raised to decide new terminology for the regex scanning encoding requirement.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/07/2013 01:01
Subject: Re: clarification needed for scan and scannable

I reviewed the spec. I have to reverse my prior statement: what you describe is definitely allowed by the current spec. The 'scannable' restriction is reserved for lengthKind='pattern'.

In my view this is a terrible idea, but it is where it is. That is, lengthKind='delimited' does not require the 'scannable' characteristic throughout everything contained within it. This will not allow us to really assist users in identifying what are very likely to be mistakes in their schemas, because what we call 'delimited' is much too permissive.

I am not going to argue to change this at the current time, simply because of the need to converge on the standard. My preference would be that lengthKind='delimited' require scannability (as uniform text), and that some new keyword be used to mean the algorithm as currently specified.

I do suggest that we rename the characteristic 'scannability' to 'scannable-as-text', to make the requirement clearer, and that we clarify that it is a requirement on the schema component.

I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be deprecated and/or augmented by other, more specific keywords, such as "delimitedText", meaning scanning for delimiters over text in a single encoding only (the 'scannable' restriction), and others like "delimitedBinary" or "delimitedMixed", meaning formats that admit the more powerful and complex behaviors.

My argument is simple: if something like delimiters in distinct encodings appears in a schema, it is most likely due to human error (the author missed one place when editing the encoding property), rather than to something as obscure as delimiters-in-a-different-encoding being a real characteristic of the data. An SDE here will help the author find the error.

Furthermore, if you truly want a string to be terminated by either an EBCDIC comma (x6B) or an ASCII linefeed (x0A), then you have a few alternatives, all consistent with scannability, before you have to resort to the now-optional 'rawbyte' feature.

First, use encoding ascii, and specify terminators of %LF; and 'k' (or %#x6B;, since 'k' is x6B in ASCII), in which case the string is assumed to contain only ASCII code units. In that case, if EBCDIC 'a' (x81, illegal in ASCII) is encountered in the data, you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only ASCII-legal code units, then this is a good solution.
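A minimal sketch of that first alternative (the element name is hypothetical):

    <!-- 'k' is x6B in ASCII, i.e. the byte an EBCDIC comma would occupy -->
    <xs:element name="str1" type="xs:string"
        dfdl:encoding="US-ASCII"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF; k"/>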
I would note that your example (earlier in the thread) did not specify an encoding for the innermost string elements. Did you intend for those to be EBCDIC or ASCII?

Alternatively, you can specify encoding ebcdic-cp-us, with terminators of comma (code point x6B) and %#x0A;, which is the RPT control code in EBCDIC but not an illegal character. In that case the string can contain any legal EBCDIC code point. However, if code unit x70 is encountered (it corresponds to an ASCII 'p', but is unmapped in ebcdic-cp-us), you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only legal EBCDIC code units, then this is a good solution.

Finally, you can specify encoding iso-8859-1, and terminators %#x6B; (or 'k', which is x6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then any code units at all will be legal, because every byte has a corresponding character code point in iso-8859-1. If you have no idea what the data is, but just want some sort of string out of it, and know only that the terminators are these bytes, then this is a good solution. If your data contains, for example, packed-decimal numbers, then this is the only way to safely scan past them as a string, because both EBCDIC and ASCII leave unmapped some of the code units that can appear in packed-decimal bytes.
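Hedged sketches of those other two alternatives (element names are hypothetical; the terminator lists simply transcribe the prose above):

    <!-- EBCDIC: ',' encodes as x6B; any legal EBCDIC code point may appear in the data -->
    <xs:element name="str2" type="xs:string"
        dfdl:encoding="ebcdic-cp-us"
        dfdl:lengthKind="delimited"
        dfdl:terminator=", %#x0A;"/>

    <!-- ISO-8859-1: every byte maps to a character, so no decode error is possible -->
    <xs:element name="str3" type="xs:string"
        dfdl:encoding="ISO-8859-1"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%#x6B; %#x0A;"/>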
All of the above are consistent with implementations that convert data from code units (bytes) to Unicode code points using a character set decoder, then apply any scanning/searching only in the Unicode realm, and work in one pass over the data. I currently think that implementing both rawbytes and encodingErrorPolicy will require two passes over the data. I expect this overhead to be negligible, so I'm not really worried about it.

...mike

On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT@uk.ibm.com> wrote:

Hi Mike,

Good - that was almost exactly the reply that I was expecting. I now understand exactly where you are coming from, and how we arrived at this position.

First, a few statements that I think are true. I want to establish some basic ground rules before we decide how to go forward:

a) It is desirable for DFDL (more accurately, a valid subset of DFDL) to be implementable using well-known parsing techniques. That pretty much means regular expressions and BNF-style grammars, and it implies that it might be possible to implement a DFDL parser using one of the well-known parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm not claiming that it *is* possible, but I think it would be a good thing if it was.

b) It is technically feasible to implement a DFDL parser that handles the mixed-encoding example using regex technology. However, it would not be a very efficient implementation, because the scanner would have to scan the data once per encoding, and it would have to do that for every character after the end of field1.

c) It is possible to produce an efficient scanner that handles mixed encodings. Such a scanner cannot use regex technology for scanning - in fact, I think the only efficient implementation is to convert all terminating markup into byte sequences and then perform all scanning in the byte domain. This is what the IBM implementation does.

The scenario in my example is not entirely far-fetched - it is conceivable that the encoding might change mid-way through a document, and I think Steve came up with a real-world format that required this (I have a hazy memory of discussing it a couple of years ago).

The requirement for the raw byte entity is not for this case - it is for the case where the delimiters are genuinely not characters (e.g. a UTF-16 data stream terminated by a single null byte). However, it is not easy to come up with realistic examples where the raw byte entity could not be translated into a character before the processor uses it.

I think that's where some of the confusion has arisen. We have already agreed to make the raw byte entity an optional feature. We should consider disallowing mixed encodings when lengthKind is 'delimited'. If we cannot do that, then I agree with Mike that the descriptions and definitions in the spec need to be made a bit clearer.

regards,

Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 26/07/2013 22:49
Subject: Re: clarification needed for scan and scannable

Clearly we have to rework the description and definitions, because my understanding of the current spec would say your schema is not valid: it is not scannable. Everything in the scope of the enclosing delimited element must be scannable, and that means a uniform encoding. It is exactly to rule out this mixed-encoding ambiguity that we have the restriction. I have no idea how to implement delimiters without it. I think the case you have here is what raw bytes are for - though I am no longer clear on how to implement rawbytes either.

On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT@uk.ibm.com> wrote:

Suppose I have this DFDL XSD:

    <xs:element name="documentRoot" dfdl:encoding="US-ASCII"
        dfdl:lengthKind="delimited" dfdl:terminator="]]]">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="delimitedRecord" maxOccurs="unbounded"
              dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited"
              dfdl:terminator="%LF;">
            <xs:complexType>
              <xs:sequence dfdl:separator=","
                  dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
                  dfdl:encoding="EBCDIC-US">
                <xs:element name="field1" type="xs:string"
                    dfdl:lengthKind="delimited"/>
                <xs:element name="field2" type="xs:string"
                    dfdl:lengthKind="delimited" minOccurs="0"/>
                <xs:element name="field3" type="xs:string"
                    dfdl:lengthKind="delimited" minOccurs="0"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

...which will parse some data like this:

    field1Value,field2Value,field3Value
    field1Value,
    field1Value,field2Value]]]

...except that the commas that delimit the fields are in EBCDIC and the linefeeds that delimit the records are in ASCII.

I have purposely constructed this example with separatorSuppressionPolicy set to 'suppressedAtEndLax'. This means that field1 and field2 could each be terminated by either a) an EBCDIC comma or b) an ASCII line feed.

This is a very artificial example, but it is valid DFDL and should not produce a schema definition error. If we use the term 'scan' when discussing lengthKind='delimited' then we need to be careful how we define the term 'scannable' - otherwise we might appear to prohibit things that are actually valid. I think this is the same point that Steve was making.

Most implementations will do the same as Daffodil and use a regex engine to implement all types of scanning, including lengthKind='delimited'. It's the most natural solution, and it can be made to work as long as a) the implementation takes extra care if/when the encoding changes within a component, and b) the implementation either does not support the raw byte entity, or supports it only when it can be translated to a valid character in the component's encoding.

There is huge scope for confusion in this area - most readers of the DFDL specification will assume that delimited scanning can be implemented using regex technology plus an EBNF grammar.
It may be possible (I'm not claiming that it is, btw), but the grammar would be decidedly non-trivial, so the DFDL specification should take extra care to be unambiguous when discussing how to scan for terminating markup.

regards,

Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 25/07/2013 19:04
Subject: clarification needed for scan and scannable

If a user has read the Glossary and noted the definition of 'scannable', then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may think this implies that data being read by that lengthKind must be 'scannable', and it is not so. That's how I read the spec, hence my original comment. I find it confusing to define 'scan' and then not define 'scannable' as 'able to be scanned'.

OK, so I am rethinking how we express what we mean by scan.

scan - verb. The action of attempting to find something in characters. It implies decoding the data from code units to character codes. A scan can succeed on real data, or fail (no match, not found), but both are normal behaviors for a scan. However, this is predicated on the data at least being meaningfully decodable into characters of a single character set encoding, because our regex technology is not expected to be able to shift encodings mid-match/scan, nor to jump over or around things that are not textual so as not to misinterpret them as code units of characters.

So saying a schema component is scannable means that the schema expresses the expectation that the data is all textual - characters in a uniform encoding - so that it is meaningful to talk about scanning it with the kind of regex technology available today. That is, there is no need to consider encoding changes mid-match, nor what data to jump over or around. This is the sense of "scan"-able in which we use the term. The expectation expressed in the schema is that, so long as the data decodes without error, the scan will either succeed or fail normally. Actual data is scannable if it meets this expectation.

Contrast this with what happens for an assert with testKind='pattern'. In that case we scan, but DFDL processors are not required to provide any guarantee that the data is scannable, which means the designer of that regex pattern must really know the data, and know what they are doing, so as to avoid decode errors.

Why do we do this? Because we don't want to force a complex type element to be refactored into two sequences, one being the stuff at the front that an assert with a test pattern examines, and the other being the stuff after it. In some situations it may be very awkward or impossible to separate out the part at the front that we need to scrutinize. Instead we just allow you to put the assert at the beginning, and it is up to the regex designer to know that the first few fields the pattern examines are scannable (in the sense of the expectation that they are text), or, failing that expectation from the schema, at least that their actual contents, interpreted as character code points, will not cause decode errors.

In other words, the writer of a test pattern has to know either that the data is scannable (the schema says so), or that the actual data will in fact be scan-able, in that decode errors will not occur when it is decoded.
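For concreteness, a sketch of such a leading pattern assert (the element name and pattern are hypothetical); the pattern examines only the leading fields, which the schema author must know will decode cleanly:

    <xs:element name="record" dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <!-- scans from the start of the record; no scannability guarantee applies -->
          <dfdl:assert testKind="pattern" testPattern="HDR[0-9]{4}"/>
        </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
        <xs:sequence>
          <!-- leading text fields examined by the pattern, then the rest -->
        </xs:sequence>
      </xs:complexType>
    </xs:element>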
This also provides a backdoor by which one can use regex technology to match against binary data. But then you really have to understand the data, so as to know exactly what code units will be found in it and therefore what the regex technology will do. One can even play games with encoding="iso-8859-1", where no decoding errors are possible, so as to guarantee no decode errors when scanning binary data that really shouldn't be thought of as text.

The part of the data that a test pattern of an assert actually examines needs to be scannable by schema expectation, or scannable in its actual contents. If you write a test pattern expecting to match things in what will turn out to be the first and second elements of a complex type, but the encoding is different for the second element, then your test pattern isn't going to work (you may get false matches, false non-matches, or decode errors), because the assumption that the data is scannable across those first and second elements is violated.

Summary: can we just say

scan - verb

scannable - a schema component is scannable if ... current definition.

Data is scannable with respect to a specific character set encoding if, when it is interpreted as a sequence of code units, they always decode without error.

There are two ways to make "any data" scannable for certain. One: set encodingErrorPolicy to 'replace'. Now all data will decode without error, because you get substitution characters instead. Two: change the encoding to iso-8859-1 or another supported encoding in which every byte is a legal code unit.

If you are in neither situation One nor Two, then you have to say the data is scannable only with respect to a specific encoding. Saying the data is scannable as ASCII means it will not contain any bytes that are illegal when interpreted as ASCII code units - for this example, that means all the bytes have values from 0 to 127 (high bit not set). If you know that will be true of your data, then even if it is binary data you can say it is scannable as ASCII. The representations of packed-decimal numbers are not scannable as ASCII, for example, because any high-nibble digit of 8 or greater gives the byte containing it a value of 128 or higher, for which there is no corresponding ASCII code unit. Packed-decimal numbers are also not scannable as UTF-8, because many packed-decimal bytes are not legal UTF-8 code unit values, nor will they occur in the ordered arrangements UTF-8 requires.

--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
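A minimal sketch of the two "make any data scannable" settings from the summary above (element names and terminators are hypothetical):

    <!-- One: decode errors become Unicode replacement characters -->
    <xs:element name="anyData1" type="xs:string"
        dfdl:encoding="US-ASCII"
        dfdl:encodingErrorPolicy="replace"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF;"/>

    <!-- Two: an encoding in which every byte is a legal code unit -->
    <xs:element name="anyData2" type="xs:string"
        dfdl:encoding="ISO-8859-1"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF;"/>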