Mike,
I think this is best handled by updating existing erratum 2.119, and updating 3.9 (not 2.9). I've attached an updated errata document that corrects some typos etc., but does not address the above.
Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel: +44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 24/08/2013 18:34
Subject: Action 217 - scannable - Re: [DFDL-WG] clarification needed for scan and scannable
I propose this new erratum, and corresponding edits to erratum 2.9.
In the draft r14.3 (to be circulated soon), I have OPEN comment bubbles to review the impact of this change, but I have edited this stuff in, as you really have to see it in action to see if it "works for you".
Erratum 2.155: Sections 3, 7.3.1, 7.3.2, 12.3.5. Scan, scannable, scannable-as-text.
These terms are all added to or changed in the glossary. Definitions are removed from the prose. Scannable now means able to scan, which is natural. The more specific term scannable-as-text is used when we want the recursive requirement of uniform encoding.
Erratum 2.9 is updated to use the term scannable-as-text.
On Wed, Aug 14, 2013 at 5:45 AM, Steve Hanson <smh@uk.ibm.com>
wrote:
Action 217 raised to decide new terminology
for the regex scanning encoding requirement.
Regards

Steve Hanson
Architect, IBM Data Format Description Language (DFDL)
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel: +44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 30/07/2013 01:01
Subject: Re: clarification needed for scan and scannable
I reviewed the spec. I have to reverse my prior statement. What you describe
is definitely currently allowed by the spec. The 'scannable' restriction
is reserved for lengthKind='pattern'.
In my view this is a terrible idea, but it is where it is: lengthKind 'delimited' doesn't require the 'scannable' characteristic throughout everything contained within it. It will not allow us to really assist users in identifying what are very likely to be mistakes in their schemas, because what we call 'delimited' is much too permissive.
I am not going to argue to change this at the current time. This is just
due to the need to converge on the standard. My preference would be that
lengthKind 'delimited' requires scannability (as uniform text), and that
some new keyword be used to mean the current algorithm as specified.
I do suggest that we rename the characteristic scannability to scannable-as-text,
in order to make it clearer what the requirement is, and clarify that this
is a requirement on the schema component.
I suspect that lengthKind='delimited' will perhaps someday (DFDL v2.0?) be deprecated and/or augmented by more specific keywords, such as "delimitedText", meaning scanning for delimiters over text only, in one encoding (the 'scannable' restriction), and other keywords like "delimitedBinary" or "delimitedMixed", meaning formats that admit the more powerful and complex things.
My argument is simple: if people have something like delimiters in distinct encodings appearing in their schema, it is most likely due to a human error (they missed one place when they were editing the encoding property), rather than something as obscure as delimiters-in-a-different-encoding being a real characteristic of the data. An SDE here will help them find this error.
Furthermore, if you truly want a string to be terminated by either an EBCDIC comma (x6B) or an ASCII linefeed (x0A), then you have a few alternatives, all consistent with scannability, before you have to resort to the now-optional 'rawbyte' feature.
First, use encoding ascii, and specify terminators of %LF; and 'k' (or %#x6B;, since 'k' is x6B in ascii), in which case the string is assumed to contain only ascii code units. In that case, if an EBCDIC 'a' (x81, illegal in ascii) is encountered in the data, you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only ascii-legal code units, then this is a good solution.
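As a rough sketch (the element name here is hypothetical, and the usual DFDL namespace declarations are assumed), this first alternative might look like:

  <!-- Terminator is either a linefeed or 'k' (x6B in ASCII, the same
       byte value as the EBCDIC comma). Content is assumed ASCII-only. -->
  <xs:element name="myField" type="xs:string"
              dfdl:encoding="US-ASCII"
              dfdl:lengthKind="delimited"
              dfdl:terminator="%LF; %#x6B;"/>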
I would note that your example (previously in the thread) did not specify
an encoding for the innermost string elements. Did you intend for those
to be ebcdic or ascii?
Alternatively, you can specify encoding ebcdic-cp-us, and specify terminators of comma (codepoint x6B in EBCDIC) and %#x0A;, which is the RPT control code in EBCDIC but not an illegal character. In that case the string can contain any legal EBCDIC code point. However, if code unit x70 is encountered (it corresponds to an ascii 'p', but is unmapped in ebcdic-cp-us), you will get either an error or a Unicode replacement character, depending on encodingErrorPolicy. If you know your data will contain only legal EBCDIC code units, then this is a good solution.
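A sketch of this second alternative, under the same assumptions as before:

  <!-- Terminator is either the EBCDIC comma (byte x6B) or %#x0A;
       (the RPT control code in EBCDIC, per the discussion above). -->
  <xs:element name="myField" type="xs:string"
              dfdl:encoding="ebcdic-cp-us"
              dfdl:lengthKind="delimited"
              dfdl:terminator=", %#x0A;"/>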
Finally, you can specify encoding iso-8859-1, and terminators %#x6B; (or 'k', which is x6B in iso-8859-1) and %#x0A; (linefeed in iso-8859-1). Then any code units at all will be legal, as every byte has a corresponding character codepoint in iso-8859-1. If you have no idea what the data is, but just want some sort of string out of it, and know only that the terminators are these bytes, then this is a good solution. If your data contains, for example, packed-decimal numbers, then this is the only way to safely scan past them as a string, because both EBCDIC and ascii leave unmapped some of the code units that can appear in packed-decimal bytes.
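And a sketch of the third alternative, again with a hypothetical element name:

  <!-- iso-8859-1 maps every byte value to a character, so no decode
       error is possible regardless of the data's contents. -->
  <xs:element name="myField" type="xs:string"
              dfdl:encoding="iso-8859-1"
              dfdl:lengthKind="delimited"
              dfdl:terminator="%#x6B; %#x0A;"/>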
All the above are consistent with implementations that attempt to convert
data from code-units (bytes) to Unicode codepoints using a character set
decoder, and then apply any scanning/searching only in the Unicode realm,
and work in one pass over the data.
I currently think that implementing both rawbytes and encodingErrorPolicy will require two passes over the data. I expect this overhead to be negligible, so I'm not really worried about it.
...mike
On Mon, Jul 29, 2013 at 7:18 AM, Tim Kimber <KIMBERT@uk.ibm.com>
wrote:
Hi Mike,
Good - that was almost exactly the reply that I was expecting. I now understand
exactly where you are coming from, and how we arrived at this position.
First, a few statements that I think are true. I want to establish some
basic ground rules before we decide how to go forward:
a) it is desirable for DFDL (more accurately, a valid subset of DFDL) to be implementable using well-known parsing techniques. I think that pretty much means regular expressions and BNF-style grammars. That implies that it might be possible to implement a DFDL parser using one of the well-known parser-generator technologies like Yacc/Bison/JavaCC/Antlr. I'm not claiming that it *is* possible, but I think it would be a good thing if it were.
b) It is technically feasible to implement a DFDL parser that can handle
the mixed-encoding example using regex technology. However, it would not
be a very efficient implementation because the scanner would have to scan
the data once per encoding, and it would have to do that for every character
after the end of field1.
c) It is possible to produce an efficient scanner that handles mixed encodings.
Such a scanner cannot use regex technology for scanning - in fact, I think
the only efficient implementation is to convert all terminating markup
into byte sequences and then perform all scanning in the byte domain. This
is what the IBM implementation does.
The scenario in my example is not entirely far-fetched - it is conceivable that the encoding might change mid-way through a document, and I think Steve came up with a real-world format that required this (I have a hazy memory of discussing this a couple of years ago). The requirement for the raw byte entity is not for this case - it is for the case where the delimiters are genuinely not characters (e.g. a UTF-16 data stream terminated by a single null byte). However, it is not easy to come up with realistic examples where the raw byte entity could not be translated into a character before the processor uses it. I think that's where some of the confusion has arisen.
We have already agreed to make the raw byte entity an optional feature.
We should consider disallowing mixed encodings when lengthKind is delimited.
If we cannot do that then I agree with Mike that the descriptions and definitions
in the spec need to be made a bit clearer.
regards,
Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Tim Kimber/UK/IBM@IBMGB
Cc: Steve Hanson/UK/IBM@IBMGB
Date: 26/07/2013 22:49
Subject: Re: clarification needed for scan and scannable
Clearly we have to rework the description and definitions because my understanding
of the current spec would say your schema is not valid because it is not
scannable. Everything in the scope of the enclosing element which is delimited
must be scannable and that means uniform encoding.
It is exactly to rule out this mixed encoding ambiguity that we have the
restriction.
I have no idea how to implement delimiters without this
restriction.
I think the case you have here is what raw bytes are for.
Though I am no longer clear on how to implement rawbytes either.
On Jul 25, 2013 6:40 PM, "Tim Kimber" <KIMBERT@uk.ibm.com>
wrote:
Suppose I have this DFDL XSD:
  <xs:element name="documentRoot"
              dfdl:encoding="US-ASCII"
              dfdl:lengthKind="delimited"
              dfdl:terminator="]]]">
    <xs:complexType>
      <xs:sequence>
        <xs:element name="delimitedRecord" maxOccurs="unbounded"
                    dfdl:encoding="US-ASCII"
                    dfdl:lengthKind="delimited"
                    dfdl:terminator="%LF;">
          <xs:complexType>
            <xs:sequence dfdl:separator=","
                         dfdl:separatorSuppressionPolicy="suppressedAtEndLax"
                         dfdl:encoding="EBCDIC-US">
              <xs:element name="field1" type="xs:string"
                          dfdl:lengthKind="delimited"/>
              <xs:element name="field2" type="xs:string"
                          dfdl:lengthKind="delimited" minOccurs="0"/>
              <xs:element name="field3" type="xs:string"
                          dfdl:lengthKind="delimited" minOccurs="0"/>
            </xs:sequence>
          </xs:complexType>
        </xs:element>
      </xs:sequence>
    </xs:complexType>
  </xs:element>
...which will parse some data like this:
field1Value,field2Value,field3Value
field1Value,
field1Value,field2Value]]]
...except that the commas that delimit the fields are in EBCDIC and the linefeeds that delimit the records are in ASCII.
I have purposely constructed this example with separatorSuppressionPolicy set to 'suppressedAtEndLax'. This means that field1 and field2 could each be terminated by either of
a) an EBCDIC comma or
b) an ASCII line feed
This is a very artificial example, but it is valid DFDL and should not produce a schema definition error. If we use the term 'scan' when discussing lengthKind='delimited' then we need to be careful how we define the term 'scannable' - otherwise we might appear to prohibit things that are actually valid. I think this is the same point that Steve was making.
Most implementations will do the same as Daffodil and will use a regex
engine to implement all types of scanning, including lengthKind='delimited'.
It's the most natural solution, and it can be made to work as long as
a) the implementation takes extra care if/when the encoding changes within
a component and
b) the implementation either does not support the raw byte entity, or only
supports it when it can be translated to a valid character in the component's
encoding.
There is huge scope for confusion in this area - most readers of the DFDL specification will assume that delimited scanning can be implemented using regex technology plus an EBNF grammar. It may be possible (I'm not claiming that it is, btw), but the grammar would be decidedly non-trivial, so the DFDL specification should take extra care to be unambiguous when discussing how to scan for terminating markup.
regards,
Tim Kimber, DFDL Team, Hursley, UK
Internet: kimbert@uk.ibm.com
Tel. 01962-816742
Internal tel. 37246742
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: Tim Kimber/UK/IBM@IBMGB
Date: 25/07/2013 19:04
Subject: clarification needed for scan and scannable
If a user has read the Glossary and noted the definition of 'scannable', then when he sees the terms 'scanning' and 'scanned' in 12.3.2 he may think this implies that data being read by this lengthKind must be 'scannable', which is not so. That's how I read the spec, hence my original comment.
I find it confusing to define 'scan' and then not define 'scannable' as 'able to be scanned'.
Ok, so i am rethinking how we express what we mean by scan.
scan - verb. The action of attempting to find something in characters.
Implies decoding the characters from code units to character codes.
A scan can succeed on real data, or fail (no match, not found), but both
are normal behaviors for a scan.
However, this is predicated on the data at least being meaningfully decoded
into characters of a single character set encoding. This is because our
regex technology isn't expected to be able to shift encodings mid match/scan,
nor is it expected to be able to jump over/around things that aren't textual
so as not to misinterpret them as code units of characters.
So saying a schema component is scannable means the schema expresses the assumption that the data is expected to be all textual/characters in a uniform encoding, so that it is meaningful to talk about scanning it with the kind of regex technology we have available today. That is, there's no need to consider encoding changes mid-match, nor what data to jump over/around. This is the sense of "scan"-ability that we intend the term to convey.
Our expectation, as expressed in the schema, is that so long as the data decodes without error, the scan will either succeed or fail normally. Actual data is scannable if it meets this expectation.
Contrast this with what happens for an assert with testKind='pattern'. In that case we scan, but DFDL processors aren't required to provide any guarantee that the data is scannable, which means the designer of that regex pattern must really know the data, and know what they are doing, so as to avoid the issues of decode errors.
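For illustration only (the element names, lengths, and pattern here are hypothetical, and the usual DFDL namespace declarations are assumed), such an assert might look like:

  <!-- The processor scans from this point using the regex; it gives no
       guarantee that the underlying data is scannable, so the pattern
       author must know the data well enough to avoid decode errors. -->
  <xs:element name="record" dfdl:lengthKind="implicit">
    <xs:annotation>
      <xs:appinfo source="http://www.ogf.org/dfdl/">
        <dfdl:assert testKind="pattern" testPattern="HDR[0-9]{4}"/>
      </xs:appinfo>
    </xs:annotation>
    <xs:complexType>
      <xs:sequence>
        <xs:element name="header" type="xs:string"
                    dfdl:lengthKind="explicit" dfdl:length="7"
                    dfdl:lengthUnits="characters"/>
        <!-- further fields, possibly binary, follow -->
      </xs:sequence>
    </xs:complexType>
  </xs:element>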
Why do we do this? Because we don't want to force refactoring of a complex type element into two sequences, one of which is the stuff at the front that an assert with a test pattern examines, and the other of which is the stuff after that. In some situations it may be very awkward or impossible to separate out the part at the front that we need to scrutinize. Instead we just allow you to put the assert at the beginning, and it is up to the regex designer to know that the first few fields the pattern examines are scannable (in the sense of the expectation that they are text), or, if not scannable by that schema expectation, at least that their actual contents will not cause decode errors when interpreted as character code points. In other words, the writer of a test pattern has to know either that the data is scannable (the schema says so), or that the actual data will in fact be scan-able, in that decode errors won't occur when it is decoded.
This also provides a backdoor by which one can use regex technology to match against binary data. But then you really have to understand the data, and exactly what code units will be found in it, in order to understand what the regex technology will do. One can even play games with encoding="iso-8859-1", where no decoding errors are possible, so as to guarantee no decode errors when scanning binary data that really shouldn't be thought of as text.
The part of the data that a test pattern of an assert actually looks at
needs to be scannable by schema expectation, or scannable in actual data
contents.
If you write a test pattern, expecting to match things in what will turn
out to be the first and second elements of a complex type, but the encoding
is different for the second element, then your test pattern isn't going
to work (you may get false matches, false non-matches, or may get decode
errors) because the assumption that the data is scannable for that first
and second element, is violated.
Summary: can we just say
scan - verb
scannable - a schema component is scannable if ... current definition.
Data is scannable with respect to a specific character set encoding if,
when it is interpreted as a sequence of code units, they always decode
without error.
There are two ways to make "any data" scannable for certain; both are sketched below.
One - set encodingErrorPolicy to 'replace'. Now all data will decode without error because you get substitution characters instead.
Two - change encoding to iso-8859-1 or another supported encoding where every byte is a legal code unit.
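A sketch of both ways, with hypothetical element names:

  <!-- Way One: undecodable bytes become Unicode replacement characters,
       so decoding never raises an error. -->
  <xs:element name="anyData1" type="xs:string"
              dfdl:encoding="US-ASCII"
              dfdl:encodingErrorPolicy="replace"
              dfdl:lengthKind="delimited"
              dfdl:terminator="%LF;"/>

  <!-- Way Two: iso-8859-1 assigns a character to every byte value,
       so every byte decodes legally. -->
  <xs:element name="anyData2" type="xs:string"
              dfdl:encoding="iso-8859-1"
              dfdl:lengthKind="delimited"
              dfdl:terminator="%LF;"/>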
If you are in neither situation One nor Two above, then you can only say the data is scannable with respect to a specific encoding. So saying the data is scannable as ASCII means that it will not contain any bytes that are illegal when interpreted as ASCII code units. For this example, that means all the bytes have values from 0 to 127 (high bit not set). If you know that will be true of your data, then even if it is binary data you can say it is scannable as ASCII.
The representation of packed-decimal numbers is not scannable as ascii, for example, because any high-nibble digit of 8 or greater causes the byte containing it to have a value of 128 or higher, for which there is no corresponding ascii code unit.
Packed-decimal numbers are also not scannable as UTF-8, because many packed-decimal bytes will not be legal UTF-8 code unit values, nor will they be in the ordered arrangements UTF-8 requires.
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
--
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com