Re: [DFDL-WG] clarification needed for scan and scannable

Action 217 raised to decide new terminology for the regex scanning encoding requirement.

Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/07/2013 01:01
Subject: Re: clarification needed for scan and scannable

I reviewed the spec. I have to reverse my prior statement: what you describe is definitely allowed by the current spec. The 'scannable' restriction is reserved for lengthKind='pattern'.

In my view this is a terrible idea, but it is where it is. That is, lengthKind='delimited' does not require the 'scannable' characteristic throughout everything contained within it. This will not allow us to really assist users in identifying what are very likely to be mistakes in their schemas, because what we call 'delimited' is much too permissive.

I am not going to argue to change this at the current time, simply because of the need to converge on the standard. My preference would be that lengthKind='delimited' require scannability (as uniform text), and that some new keyword be used to mean the algorithm as currently specified.

I do suggest that we rename the characteristic 'scannability' to 'scannable-as-text', to make the requirement clearer, and that we clarify that it is a requirement on the schema component.

I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be deprecated and/or augmented by other, more specific keywords, such as "delimitedText", meaning scanning for delimiters over text in a single encoding only (the 'scannable' restriction), and others like "delimitedBinary" or "delimitedMixed", meaning formats that admit the more powerful and complex behaviors.

My argument is simple: if something like delimiters in distinct encodings appears in a schema, it is most likely due to human error (the author missed one place when editing the encoding property), rather than to something as obscure as delimiters-in-a-different-encoding being a real characteristic of the data. An SDE here will help the author find the error.

Furthermore, if you truly want a string to be terminated by either an EBCDIC comma (x6B) or an ASCII linefeed (x0A), then you have a few alternatives, all consistent with scannability, before you have to resort to the now-optional 'rawbyte' feature.

First, use encoding ascii, and specify terminators of %LF; and 'k' (or %#x6B;, since 'k' is x6B in ASCII), in which case the string is assumed to contain only ASCII code units. In that case, if EBCDIC 'a' (x81, illegal in ASCII) is encountered in the data, you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only ASCII-legal code units, then this is a good solution.
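A minimal sketch of that first alternative (the element name is hypothetical):

    <!-- 'k' is x6B in ASCII, i.e. the byte an EBCDIC comma would occupy -->
    <xs:element name="str1" type="xs:string"
        dfdl:encoding="US-ASCII"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF; k"/>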
I would note that your example (earlier in the thread) did not specify an encoding for the innermost string elements. Did you intend for those to be EBCDIC or ASCII?

Alternatively, you can specify encoding ebcdic-cp-us, with terminators of comma (code point x6B) and %#x0A;, which is the RPT control code in EBCDIC but not an illegal character. In that case the string can contain any legal EBCDIC code point. However, if code unit x70 is encountered (it corresponds to an ASCII 'p', but is unmapped in ebcdic-cp-us), you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only legal EBCDIC code units, then this is a good solution.

Finally, you can specify encoding iso-8859-1, and terminators %#x6B; (or 'k', which is x6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then any code units at all will be legal, because every byte has a corresponding character code point in iso-8859-1. If you have no idea what the data is, but just want some sort of string out of it, and know only that the terminators are these bytes, then this is a good solution. If your data contains, for example, packed-decimal numbers, then this is the only way to safely scan past them as a string, because both EBCDIC and ASCII leave unmapped some of the code units that can appear in packed-decimal bytes.
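Hedged sketches of those other two alternatives (element names are hypothetical; the terminator lists simply transcribe the prose above):

    <!-- EBCDIC: ',' encodes as x6B; any legal EBCDIC code point may appear in the data -->
    <xs:element name="str2" type="xs:string"
        dfdl:encoding="ebcdic-cp-us"
        dfdl:lengthKind="delimited"
        dfdl:terminator=", %#x0A;"/>

    <!-- ISO-8859-1: every byte maps to a character, so no decode error is possible -->
    <xs:element name="str3" type="xs:string"
        dfdl:encoding="ISO-8859-1"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%#x6B; %#x0A;"/>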
All of the above are consistent with implementations that convert data from code units (bytes) to Unicode code points using a character set decoder, then apply any scanning/searching only in the Unicode realm, and work in one pass over the data. I currently think that implementing both rawbytes and encodingErrorPolicy will require two passes over the data. I expect this overhead to be negligible, so I'm not really worried about it.

...mike

On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT@uk.ibm.com> wrote:

Hi Mike,

Good - that was almost exactly the reply that I was expecting. I now understand exactly where you are coming from, and how we arrived at this position.

First, a few statements that I think are true. I want to establish some basic ground rules before we decide how to go forward:

a) It is desirable for DFDL (more accurately, a valid subset of DFDL) to be implementable using well-known parsing techniques. That pretty much means regular expressions and BNF-style grammars, and it implies that it might be possible to implement a DFDL parser using one of the well-known parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm not claiming that it *is* possible, but I think it would be a good thing if it was.

b) It is technically feasible to implement a DFDL parser that handles the mixed-encoding example using regex technology. However, it would not be a very efficient implementation, because the scanner would have to scan the data once per encoding, and it would have to do that for every character after the end of field1.

c) It is possible to produce an efficient scanner that handles mixed encodings. Such a scanner cannot use regex technology for scanning - in fact, I think the only efficient implementation is to convert all terminating markup into byte sequences and then perform all scanning in the byte domain. This is what the IBM implementation does.

The scenario in my example is not entirely far-fetched - it is conceivable that the encoding might change mid-way through a document, and I think Steve came up with a real-world format that required this (I have a hazy memory of discussing it a couple of years ago).

The requirement for the raw byte entity is not for this case - it is for the case where the delimiters are genuinely not characters (e.g. a UTF-16 data stream terminated by a single null byte). However, it is not easy to come up with realistic examples where the raw byte entity could not be translated into a character before the processor uses it.

I think that's where some of the confusion has arisen. We have already agreed to make the raw byte entity an optional feature. We should consider disallowing mixed encodings when lengthKind is 'delimited'. If we cannot do that, then I agree with Mike that the descriptions and definitions in the spec need to be made a bit clearer.

regards,

Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 26/07/2013 22:49
Subject: Re: clarification needed for scan and scannable

Clearly we have to rework the description and definitions, because my understanding of the current spec would say your schema is not valid: it is not scannable. Everything in the scope of the enclosing delimited element must be scannable, and that means a uniform encoding. It is exactly to rule out this mixed-encoding ambiguity that we have the restriction. I have no idea how to implement delimiters without it. I think the case you have here is what raw bytes are for - though I am no longer clear on how to implement rawbytes either.

On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT@uk.ibm.com> wrote:

Suppose I have this DFDL XSD:

    <xs:element name="documentRoot" dfdl:encoding="US-ASCII"
        dfdl:lengthKind="delimited" dfdl:terminator="]]]">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="delimitedRecord" maxOccurs="unbounded"
              dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited"
              dfdl:terminator="%LF;">
            <xs:complexType>
              <xs:sequence dfdl:separator=","
                  dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
                  dfdl:encoding="EBCDIC-US">
                <xs:element name="field1" type="xs:string"
                    dfdl:lengthKind="delimited"/>
                <xs:element name="field2" type="xs:string"
                    dfdl:lengthKind="delimited" minOccurs="0"/>
                <xs:element name="field3" type="xs:string"
                    dfdl:lengthKind="delimited" minOccurs="0"/>
              </xs:sequence>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>

...which will parse some data like this:

    field1Value,field2Value,field3Value
    field1Value,
    field1Value,field2Value]]]

...except that the commas that delimit the fields are in EBCDIC and the linefeeds that delimit the records are in ASCII.

I have purposely constructed this example with separatorSuppressionPolicy set to 'suppressedAtEndLax'. This means that field1 and field2 could each be terminated by either a) an EBCDIC comma or b) an ASCII line feed.

This is a very artificial example, but it is valid DFDL and should not produce a schema definition error. If we use the term 'scan' when discussing lengthKind='delimited' then we need to be careful how we define the term 'scannable' - otherwise we might appear to prohibit things that are actually valid. I think this is the same point that Steve was making.

Most implementations will do the same as Daffodil and use a regex engine to implement all types of scanning, including lengthKind='delimited'. It's the most natural solution, and it can be made to work as long as a) the implementation takes extra care if/when the encoding changes within a component, and b) the implementation either does not support the raw byte entity, or supports it only when it can be translated to a valid character in the component's encoding.

There is huge scope for confusion in this area - most readers of the DFDL specification will assume that delimited scanning can be implemented using regex technology plus an EBNF grammar.
It may be possible (I'm not claiming that it is, btw), but the grammar would be decidedly non-trivial, so the DFDL specification should take extra care to be unambiguous when discussing how to scan for terminating markup.

regards,

Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742

From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 25/07/2013 19:04
Subject: clarification needed for scan and scannable

If a user has read the Glossary and noted the definition of 'scannable', then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may think this implies that data being read by that lengthKind must be 'scannable', and it is not so. That's how I read the spec, hence my original comment. I find it confusing to define 'scan' and then not define 'scannable' as 'able to be scanned'.

OK, so I am rethinking how we express what we mean by scan.

scan - verb. The action of attempting to find something in characters. It implies decoding the data from code units to character codes. A scan can succeed on real data, or fail (no match, not found), but both are normal behaviors for a scan. However, this is predicated on the data at least being meaningfully decodable into characters of a single character set encoding, because our regex technology is not expected to be able to shift encodings mid-match/scan, nor to jump over or around things that are not textual so as not to misinterpret them as code units of characters.

So saying a schema component is scannable means that the schema expresses the expectation that the data is all textual - characters in a uniform encoding - so that it is meaningful to talk about scanning it with the kind of regex technology available today. That is, there is no need to consider encoding changes mid-match, nor what data to jump over or around. This is the sense of "scan"-able in which we use the term. The expectation expressed in the schema is that, so long as the data decodes without error, the scan will either succeed or fail normally. Actual data is scannable if it meets this expectation.

Contrast this with what happens for an assert with testKind='pattern'. In that case we scan, but DFDL processors are not required to provide any guarantee that the data is scannable, which means the designer of that regex pattern must really know the data, and know what they are doing, so as to avoid decode errors.

Why do we do this? Because we don't want to force a complex type element to be refactored into two sequences, one being the stuff at the front that an assert with a test pattern examines, and the other being the stuff after it. In some situations it may be very awkward or impossible to separate out the part at the front that we need to scrutinize. Instead we just allow you to put the assert at the beginning, and it is up to the regex designer to know that the first few fields the pattern examines are scannable (in the sense of the expectation that they are text), or, failing that expectation from the schema, at least that their actual contents, interpreted as character code points, will not cause decode errors.

In other words, the writer of a test pattern has to know either that the data is scannable (the schema says so), or that the actual data will in fact be scan-able, in that decode errors will not occur when it is decoded.
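For concreteness, a sketch of such a leading pattern assert (the element name and pattern are hypothetical); the pattern examines only the leading fields, which the schema author must know will decode cleanly:

    <xs:element name="record" dfdl:encoding="US-ASCII" dfdl:lengthKind="delimited">
      <xs:annotation>
        <xs:appinfo source="http://www.ogf.org/dfdl/">
          <!-- scans from the start of the record; no scannability guarantee applies -->
          <dfdl:assert testKind="pattern" testPattern="HDR[0-9]{4}"/>
        </xs:appinfo>
      </xs:annotation>
      <xs:complexType>
        <xs:sequence>
          <!-- leading text fields examined by the pattern, then the rest -->
        </xs:sequence>
      </xs:complexType>
    </xs:element>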
This also provides a backdoor by which one can use regex technology to match against binary data. But then you really have to understand the data, so as to know exactly what code units will be found in it and therefore what the regex technology will do. One can even play games with encoding="iso-8859-1", where no decoding errors are possible, so as to guarantee no decode errors when scanning binary data that really shouldn't be thought of as text.

The part of the data that a test pattern of an assert actually examines needs to be scannable by schema expectation, or scannable in its actual contents. If you write a test pattern expecting to match things in what will turn out to be the first and second elements of a complex type, but the encoding is different for the second element, then your test pattern isn't going to work (you may get false matches, false non-matches, or decode errors), because the assumption that the data is scannable across those first and second elements is violated.

Summary: can we just say

scan - verb

scannable - a schema component is scannable if ... current definition.

Data is scannable with respect to a specific character set encoding if, when it is interpreted as a sequence of code units, they always decode without error.

There are two ways to make "any data" scannable for certain. One: set encodingErrorPolicy to 'replace'. Now all data will decode without error, because you get substitution characters instead. Two: change the encoding to iso-8859-1 or another supported encoding in which every byte is a legal code unit.

If you are in neither situation One nor Two, then you have to say the data is scannable only with respect to a specific encoding. Saying the data is scannable as ASCII means it will not contain any bytes that are illegal when interpreted as ASCII code units - for this example, that means all the bytes have values from 0 to 127 (high bit not set). If you know that will be true of your data, then even if it is binary data you can say it is scannable as ASCII. The representations of packed-decimal numbers are not scannable as ASCII, for example, because any high-nibble digit of 8 or greater gives the byte containing it a value of 128 or higher, for which there is no corresponding ASCII code unit. Packed-decimal numbers are also not scannable as UTF-8, because many packed-decimal bytes are not legal UTF-8 code unit values, nor will they occur in the ordered arrangements UTF-8 requires.

--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
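A minimal sketch of the two "make any data scannable" settings from the summary above (element names and terminators are hypothetical):

    <!-- One: decode errors become Unicode replacement characters -->
    <xs:element name="anyData1" type="xs:string"
        dfdl:encoding="US-ASCII"
        dfdl:encodingErrorPolicy="replace"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF;"/>

    <!-- Two: an encoding in which every byte is a legal code unit -->
    <xs:element name="anyData2" type="xs:string"
        dfdl:encoding="ISO-8859-1"
        dfdl:lengthKind="delimited"
        dfdl:terminator="%LF;"/>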