Re: [DFDL-WG] Minutes from 2007-08-08 Call
I forgot to clarify Simon's question on sp165.
This was the 'finalTerminatorCanBeMissing' property.
We considered the comment that this might be unnecessary.
Use case: file of text format. Each "record" in the file is terminated by
a CRLF so sez the user. At the top level this file contains an array of
these records.
The file might or might not have a CRLF at the end of the file because
human beings might have edited the file with a text editor, and either
inserted or neglected to insert this final CRLF.
We want the file format to be legal with or without the final CRLF;
however, all prior CRLFs in the file must be present.
So how to express this:
1) CRLF is a terminator of the record
2) CRLF is an occursSeparator of the enclosing array; records have no
terminator. We enclose the array in a sequence group where the array is
followed by a hidden "optional" (minOccurs=0, maxOccurs=1) element of
fixed="CRLF" string value.
Choice (1) requires that we have finalTerminatorCanBeMissing.
Choice (2) just models the required behavior directly via hidden
elements. This is tantamount to saying that this keyword is not worth
having because there is already a way to model it. This is true of many
keywords. If we deem this one too obscure, then we need to revisit many
others. (Leading/Trailing Skip Bytes is a good example: it is trivially
represented by a hidden element.) What are our criteria for inclusion? Up
until now our criterion has been to include things that existing systems
have already found a need for. However, existing systems don't have a
hidden field capability.
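To make the two choices concrete, here is a minimal Python sketch, not DFDL and not any agreed semantics. The function names parse_terminated and parse_separated are hypothetical, and the flag final_terminator_can_be_missing only stands in for the proposed property. It assumes the whole file is available as one string.

```python
CRLF = "\r\n"

def parse_terminated(data: str, final_terminator_can_be_missing: bool = True) -> list:
    """Choice (1): every record owns a CRLF terminator (hypothetical sketch)."""
    records = []
    pos = 0
    while pos < len(data):
        end = data.find(CRLF, pos)
        if end == -1:
            # End of data reached before the last record's terminator.
            if not final_terminator_can_be_missing:
                raise ValueError("last record lacks its CRLF terminator")
            records.append(data[pos:])
            break
        records.append(data[pos:end])
        pos = end + len(CRLF)
    return records

def parse_separated(data: str) -> list:
    """Choice (2): CRLF separates records; an optional hidden CRLF follows the array."""
    if not data:
        return []
    # The hidden optional element consumes a single trailing CRLF, if present.
    body = data[: -len(CRLF)] if data.endswith(CRLF) else data
    return body.split(CRLF)

for sample in ("a\r\nb\r\n", "a\r\nb"):
    print(parse_terminated(sample), parse_separated(sample))
```

On these two samples both functions return the same records; the difference is only in which construct is considered to own the final CRLF.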
Note that this same missing final terminator issue can come up not only
with End-of-data, but with any bounded size structure.
E.g., suppose we say that an array has occursUnits="bytes" and
occursPath="874". Then it is 874 bytes long. The array elements can be
terminated by a particular delimiter, e.g., a semicolon. For the same reasons as
the CRLF example above, we want to be able to tolerate a missing final
semicolon before the end of the 874 bytes. In effect the
byte-length-limit creates an implicit "end-of-data" for a sub-stream
consisting of just those bytes.
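A rough Python sketch of this bounded case, under stated assumptions: the helper name parse_bounded_array is hypothetical, the length is passed in directly rather than derived from any property, and the end of the byte window is treated exactly like end-of-data.

```python
def parse_bounded_array(stream: bytes, length: int, terminator: bytes = b";"):
    """Split the first `length` bytes into terminated elements.

    The end of the window acts as an implicit end-of-data, so the last
    element may lack its terminator. Returns (elements, remaining bytes).
    """
    if len(stream) < length:
        raise ValueError("not enough data for the bounded array")
    window, rest = stream[:length], stream[length:]
    elements = []
    pos = 0
    while pos < len(window):
        end = window.find(terminator, pos)
        if end == -1:
            # Final terminator missing: the window boundary ends the element.
            elements.append(window[pos:])
            break
        elements.append(window[pos:end])
        pos = end + len(terminator)
    return elements, rest

# An 8-byte window instead of 874, to keep the example readable.
print(parse_bounded_array(b"ab;cd;efXYZ", 8))
print(parse_bounded_array(b"ab;cd;e;XYZ", 8))
```

In both calls the bytes after the window are left untouched for whatever follows the array.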
Conclusion: finalTerminatorCanBeMissing seems to be useful enough and
comes up often enough that I think the keyword is worthwhile.
Implication: we should create a list of keywords or enumerated values for
properties that we think are in the grey area where perhaps we want to
drop them. Here are some candidates: byteOrderMarkPolicy,
leading/trailingSkipBytes. Both of these can be modeled readily as hidden
elements. There are probably others.
Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan
priordan@us.ibm.com
508-599-7046
Mike Beckerle/Worcester/IBM
08/14/2007 08:40 AM
To
"Simon Parker"
I've spent today catching up with the recent DFDL spec discussions around
Simon's comments to v0.19. Some comments of my own on the content of these
and previous call minutes.
- General principle: The eventual consumers of DFDL will be users, the
majority of whom will not be data modelling experts; that's certainly the
experience at IBM. Most see data modelling as a black art and find it
difficult. I think that an over-reliance on hidden elements is not going
to go down well. I would err on the side of caution here, and only if we
are convinced a property will be very rarely used should we remove it and
replace by a hidden element.
- Leading/Trailing Skip Bytes is a property intended to handle the byte
skipping added by compilers, over and above simple byte alignment rules.
The formulae for setting the values are beyond the ken of users to apply
manually; this would invariably be done using an automated COBOL -> DFDL
translator, etc. I would not be too troubled if that went 'hidden'.
- 'finalTerminatorCanBeMissing' property. The rules for interpreting what
trailing markup actually means are complex and properties like this will
almost certainly be needed. (Aside: For Mike's second example, though,
where data of max length n is terminated by markup only if actual length <
n, wouldn't that be better expressed using a regular expression?
finalTerminatorCanBeMissing is too general, and could lead the parser to
validly parse data where the terminator was accidentally omitted).
- Infix/prefix/postfix separators. I believe this should be retained. It's
in IBM WTX (Mercator) and I frequently have to apologise for the absence
of postfix in IBM MRM. When a user sees (e.g.) x,y,z it's easier for him to
comprehend that the comma after z is a postfix separator rather than the
terminator of the parent group.
- Simon had a comment on the removal of 'applies' which I haven't seen
discussed ("I find this cumbersome. I suggest this alternative: drop
'applies' and 'dfdl:format', insist on 'dfdl:sequence' and friends
instead, and add local variants like 'dfdl:sequenceLocal'. For attribute
shorthand, add boolean attributes with the same name: sequenceLocal="true"
(optional, default false)."). I don't follow; the use of 'applies' is
orthogonal to whether you use dfdl:format or one of the specific elements
such as dfdl:sequence.
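As an illustration of the regular-expression aside in the finalTerminatorCanBeMissing point, a field of maximum length n that carries its terminator only when shorter than n might be sketched in Python as follows. This is only a sketch: the names bounded_field_pattern and read_field are hypothetical, and a single-character terminator is assumed.

```python
import re

def bounded_field_pattern(n: int, terminator: str = ";") -> "re.Pattern":
    """Fewer than n characters followed by the terminator, or exactly n without it."""
    t = re.escape(terminator)  # assumes a single-character terminator
    return re.compile(rf"(?:([^{t}]{{0,{n - 1}}}){t}|([^{t}]{{{n}}}))")

def read_field(data: str, n: int):
    """Return (value, remaining data), or None if the field is malformed."""
    m = bounded_field_pattern(n).match(data)
    if m is None:
        return None
    value = m.group(1) if m.group(1) is not None else m.group(2)
    return value, data[m.end():]

print(read_field("ab;rest", 4))   # short value, terminator present
print(read_field("abcdrest", 4))  # full-length value, no terminator
print(read_field("abc", 4))       # short value, terminator omitted
```

The third call returns None, so an accidentally omitted terminator on a short value is rejected rather than silently accepted.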
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
Mike Beckerle
Responses embedded below
Simon
________________________________
From: Steve Hanson [mailto:smh@uk.ibm.com]
Sent: 15 August 2007 12:23
To: Mike Beckerle
Cc: dfdl-wg@ogf.org; dfdl-wg-bounces@ogf.org; Simon Parker
Subject: [DFDL-WG] Minutes from 2007-08-08 Call - comments from
Steve
I've spent today catching up with the recent DFDL spec
discussions around Simon's comments to v0.19. Some comments of my own on
the content of these and previous call minutes.
- General principle: The eventual consumers of DFDL will be
users, the majority of whom will not be data modelling experts; that's
certainly the experience at IBM. Most see data modelling as a black art
and find it difficult. I think that an over-reliance on hidden elements
is not going to go down well. I would err on the side of caution here,
and only if we are convinced a property will be very rarely used should
we remove it and replace by a hidden element.
[Simon] Accepted, providing we can specify everything. Ideally
we'll publish a rigorous, orthogonal language and a convenient,
intuitive library with controlled redundancy.
- Leading/Trailing Skip Bytes is a property intended to handle
the byte skipping added by compilers, over and above simple byte
alignment rules. The formulae for setting the values are beyond the ken
of users to apply manually; this would invariably be done using an automated
COBOL -> DFDL translator, etc. I would not be too troubled if that went
'hidden'.
- 'finalTerminatorCanBeMissing' property. The rules for
interpreting what trailing markup actually means are complex and
properties like this will almost certainly be needed. (Aside: For Mike's
second example, though, where data of max length n is terminated by
markup only if actual length < n, wouldn't that be better expressed
using a regular expression? finalTerminatorCanBeMissing is too general,
and could lead the parser to validly parse data where the terminator was
accidentally omitted).
- Infix/prefix/postfix separators. I believe this should be
retained. It's in IBM WTX (Mercator) and I frequently have to apologise
for the absence of postfix in IBM MRM. When a user sees (e.g.) x,y,z it's
easier for him to comprehend that the comma after z is a postfix
separator rather than the terminator of the parent group.
- Simon had a comment on the removal of 'applies' which I
haven't seen discussed ("I find this cumbersome. I suggest this
alternative: drop 'applies' and 'dfdl:format', insist on 'dfdl:sequence'
and friends instead, and add local variants like 'dfdl:sequenceLocal'.
For attribute shorthand, add boolean attributes with the same name:
sequenceLocal="true" (optional, default false)."). I don't follow; the
use of 'applies' is orthogonal to whether you use dfdl:format or one of
the specific elements such as dfdl:sequence.
[Simon] You're right, the ideas should be discussed separately.
My hasty comment throws it all in together.
1 Replace this:
Thanks for the explanation, Mike.
This helpful principle is expressed on page 105:
"a parser for any construct (simple or complex) consumes its own
delimiters and only its own delimiters"
Separators belong to the file, terminators belong to the record.
The lenient record-per-line text file can be viewed in several ways,
such as:
file with prefix separators and optional terminator = [record,
{newline, record}], [newline];
file with suffix separators and optional terminator = [{record,
newline}, record], [newline];
file with terminated records last optional = {terminated record},
[record];
terminated record = record, newline;
record = characters except newline;
newline = CR, LF;
Is there anything to choose between these interpretations? Perhaps it's
not our business to worry about it anyway.
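One way to compare the interpretations is to write each as a regular expression and test sample files against all three. The Python sketch below does that under two assumptions that go beyond the grammar as written: a record is non-empty, and it contains neither CR nor LF. The names PREFIX_SEP, SUFFIX_SEP, LAST_OPTIONAL and accepts are hypothetical.

```python
import re

R = r"[^\r\n]+"   # record: assumed non-empty, no CR or LF inside
NL = r"\r\n"      # newline = CR, LF

# [record, {newline, record}], [newline]
PREFIX_SEP = re.compile(rf"(?:{R}(?:{NL}{R})*)?(?:{NL})?")
# [{record, newline}, record], [newline]
SUFFIX_SEP = re.compile(rf"(?:(?:{R}{NL})*{R})?(?:{NL})?")
# {terminated record}, [record]
LAST_OPTIONAL = re.compile(rf"(?:{R}{NL})*(?:{R})?")

def accepts(pattern, text: str) -> bool:
    """True when the whole text matches the given file grammar."""
    return pattern.fullmatch(text) is not None

samples = ["a\r\nb\r\n", "a\r\nb", "", "\r\n"]
for text in samples:
    print(repr(text), [accepts(p, text) for p in (PREFIX_SEP, SUFFIX_SEP, LAST_OPTIONAL)])
```

Under these assumptions the three agree on ordinary files with or without the final CRLF. They differ on a file consisting of a lone CRLF: the first two accept it as an empty array plus the optional terminator, while the third rejects it. If empty records are allowed, the picture changes, so this is only a prompt for discussion.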
I did indeed mean that the property is redundant, and was advocating the
smallest possible language. I still favour a small language, but I now
accept that it needs to be supported by a rich library of convenient
secondary properties in which controlled redundancy is acceptable. We
don't need to talk of eliminating such constructs providing we can find
a good way to express this language/library division.
Simon
________________________________
From: Mike Beckerle [mailto:beckerle@us.ibm.com]
Sent: 14 August 2007 14:24
To: dfdl-wg@ogf.org; Simon Parker
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
I forgot to clarify Simon's question on sp165.
This was the 'finalTerminatorCanBeMissing' property.
We considered the comment that this might be unnecessary.
Use case: file of text format. Each "record" in the file is
terminated by a CRLF so sez the user. At the top level this file
contains an array of these records.
The file might or might not have a CRLF at the end of the file
because human beings might have edited the file with a text editor, and
either inserted or neglected to insert this final CRLF.
We want the file format to be legal with or without the final
CRLF; however, all prior CRLFs in the file must be present.
So how to express this:
1) CRLF is a terminator of the record
2) CRLF is an occursSeparator of the enclosing array; records
have no terminator. We enclose the array in a sequence group where the
array is followed by a hidden "optional" (minOccurs=0, maxOccurs=1)
element of fixed="CRLF" string value.
Choice (1) requires that we have finalTerminatorCanBeMissing.
Choice (2) just models the required behavior directly via hidden
elements. This is tantamount to saying that this keyword is not worth
having because there is already a way to model it. This is true of many
keywords. If we deem this one too obscure, then we need to revisit many
others. (Leading/Trailing Skip Bytes is a good example: it is trivially
represented by a hidden element.) What are our criteria for inclusion? Up
until now our criterion has been to include things that existing systems
have already found a need for. However, existing systems don't have a
hidden field capability.
Note that this same missing final terminator issue can come up
not only with End-of-data, but with any bounded size structure.
E.g., suppose we say that an array has occursUnits="bytes" and
occursPath="874". Then it is 874 bytes long. The array elements can be
terminated by a particular delimiter, e.g., a semicolon. For the same reasons
as the CRLF example above, we want to be able to tolerate a missing
final semicolon before the end of the 874 bytes. In effect the
byte-length-limit creates an implicit "end-of-data" for a sub-stream
consisting of just those bytes.
Conclusion: finalTerminatorCanBeMissing seems to be useful
enough and comes up often enough that I think the keyword is worthwhile.
Implication: we should create a list of keywords or enumerated
values for properties that we think are in the grey area where perhaps
we want to drop them. Here are some candidates: byteOrderMarkPolicy,
leading/trailingSkipBytes. Both of these can be modeled readily as hidden
elements. There are probably others.