Thanks for the explanation, Mike.
This helpful principle is expressed on page 105:
"a parser for any construct (simple or complex) consumes its own
delimiters and only its own delimiters"
Separators belong to the file, terminators belong to the record.
The lenient record-per-line text file can be viewed in several ways,
such as:
file with prefix separators and optional terminator =
    [record, {newline, record}], [newline];
file with suffix separators and optional terminator =
    [{record, newline}, record], [newline];
file with terminated records last optional =
    {terminated record}, [record];
terminated record = record, newline;
record = characters except newline;
newline = CR, LF;
Is there anything to choose between these interpretations? Perhaps it's
not our business to worry about it anyway.
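One way to compare the three readings is to turn each into a recognizer and run the same inputs through all of them. The following is a minimal sketch, not part of the original discussion: it assumes that "characters except newline" means one or more characters (the grammar above does not say whether a record may be empty), and the names R, N, GRAMMARS, accepts and verdicts are made up for illustration.

```python
import re

R = r"[^\r\n]+"  # assumption: a record is one or more non-newline characters
N = r"\r\n"      # newline = CR, LF

# One regular expression per reading of the grammar.
GRAMMARS = {
    "prefix separators": rf"(?:{R}(?:{N}{R})*)?(?:{N})?",
    "suffix separators": rf"(?:(?:{R}{N})*{R})?(?:{N})?",
    "terminated records": rf"(?:{R}{N})*(?:{R})?",
}

def accepts(reading: str, text: str) -> bool:
    """Return True if the whole text matches the given reading."""
    return re.fullmatch(GRAMMARS[reading], text) is not None

def verdicts(text: str) -> dict:
    """Map each reading to its accept/reject verdict for text."""
    return {reading: accepts(reading, text) for reading in GRAMMARS}

if __name__ == "__main__":
    for sample in ["", "a", "a\r\n", "a\r\nb", "a\r\nb\r\n", "\r\n"]:
        print(repr(sample), verdicts(sample))
```

Under the non-empty-record assumption the three readings agree on ordinary files, with or without the final CRLF. They differ on a file that consists of a single CRLF: the two separator readings accept it (no records, then the optional newline), while the terminated-record reading rejects it. If records may be empty, the terminated-record reading accepts that file too, as one empty record.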
I did indeed mean that the property is redundant, and was advocating the
smallest possible language. I still favour a small language, but I now
accept that it needs to be supported by a rich library of convenient
secondary properties in which controlled redundancy is acceptable. We
don't need to talk of eliminating such constructs provided we can find a
good way to express this language/library division.
Simon
I forgot to clarify Simon's question on sp165.
This was the 'finalTerminatorCanBeMissing' property.
We considered the comment that this might be unnecessary.
Use case: file of text format.
Each "record" in the file is terminated by a CRLF, so sez the user. At
the top level this file contains an array of these records.
The file might or might not have a CRLF at the end of the file because
human beings might have edited the file with a text editor, and either
inserted or neglected to insert this final CRLF.
We want the file format to be legal with or without the final CRLF;
however, all prior CRLFs in the file must be present.
So how to express this:
1) CRLF is a terminator of the record.
2) CRLF is an occursSeparator of the enclosing array; records have no
terminator. We enclose the array in a sequence group where the array is
followed by a hidden "optional" (minOccurs=0, maxOccurs=1) element of
fixed="CRLF" string value.
Choice (1) requires that we have finalTerminatorCanBeMissing.
Choice (2) is just modeling the behavior that is required directly via
hidden elements. This is tantamount to saying that this keyword is not
worth having because there is a way to model it already. This is true of
many keywords. If we deem this one too obscure, then we need to revisit
many others. (Leading/Trailing Skip Bytes is a good example. Trivially
represented by a hidden element.) What are our criteria for inclusion?
Up until now our criteria have been to include things that existing
systems already have found a need for. However, existing systems don't
have hidden field capability.
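The two choices can be compared side by side with a small sketch. This is only an illustration under stated assumptions, not DFDL processor behavior: the function names and the final_terminator_can_be_missing parameter are hypothetical, and choice (2) is simplified by consuming the hidden optional CRLF before splitting on the separator rather than after parsing the array.

```python
CRLF = "\r\n"

def parse_terminated(data: str, final_terminator_can_be_missing: bool) -> list:
    """Choice (1): every record is terminated by CRLF."""
    records = []
    pos = 0
    while pos < len(data):
        end = data.find(CRLF, pos)
        if end == -1:
            # End of data reached without a terminator for the last record.
            if not final_terminator_can_be_missing:
                raise ValueError("missing final terminator")
            records.append(data[pos:])
            break
        records.append(data[pos:end])
        pos = end + len(CRLF)
    return records

def parse_separated_with_hidden(data: str) -> list:
    """Choice (2): CRLF separates records; a hidden optional CRLF follows."""
    if not data:
        return []
    # Hidden optional element (simplification): drop one trailing CRLF.
    body = data[:-len(CRLF)] if data.endswith(CRLF) else data
    return body.split(CRLF)

if __name__ == "__main__":
    for text in ["a\r\nb\r\n", "a\r\nb"]:
        print(parse_terminated(text, True), parse_separated_with_hidden(text))
```

With the flag set, both functions return the same records whether or not the final CRLF is present; with the flag cleared, choice (1) rejects the file whose last record is unterminated.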
Note that this same missing final terminator issue can come up not only
with End-of-data, but with any bounded size structure.
E.g., suppose we say that an array has occursUnits="bytes" and
occursPath="874". Then it is 874 bytes long. The array elements can be
terminated by a particular piece of data, e.g., a semicolon. For the
same reasons as the CRLF example above, we want to be able to tolerate a
missing final semicolon before the end of the 874 bytes. In effect the
byte-length-limit creates an implicit "end-of-data" for a sub-stream
consisting of just those bytes.
Conclusion:
finalTerminatorCanBeMissing seems to be useful enough and comes up often
enough that I think the keyword is worthwhile.
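The bounded-size case described above can be sketched the same way. This is an illustrative sketch only: parse_bounded_array and its parameters are hypothetical names, and a short length stands in for the 874 bytes of the example. The byte-length limit cuts out a sub-stream, and the end of that sub-stream is treated as end-of-data for the purpose of the missing final terminator.

```python
def parse_bounded_array(stream: bytes, offset: int, length: int,
                        terminator: bytes = b";"):
    """Parse terminated elements from exactly `length` bytes of `stream`.

    Returns (elements, next_offset). The last element may lack its
    terminator; the end of the window acts as an implicit end-of-data.
    """
    window = stream[offset:offset + length]
    if len(window) < length:
        raise ValueError("stream shorter than the declared length")
    elements = []
    pos = 0
    while pos < len(window):
        end = window.find(terminator, pos)
        if end == -1:
            # No terminator before the window ends: tolerate it.
            elements.append(window[pos:])
            break
        elements.append(window[pos:end])
        pos = end + len(terminator)
    return elements, offset + length

if __name__ == "__main__":
    # The semicolon at offset 8 lies outside the 8-byte window and is
    # left for whatever parses the rest of the stream.
    print(parse_bounded_array(b"aa;bb;cc;rest", 0, 8))
    print(parse_bounded_array(b"aa;bb;c;rest", 0, 8))
```

Because the search for the terminator never looks past the window, a semicolon that happens to follow the bounded structure is not mistaken for the missing final terminator.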
Implication: we should create a list of keywords or enumerated values
for properties that we think are in the grey area where perhaps we want
to drop them. Here are some candidates: byteOrderMarkPolicy,
leading/trailingSkipBytes. Both of these can be modeled readily as
hidden elements. There are probably others.
Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan priordan@us.ibm.com 508-599-7046
From: Mike Beckerle/Worcester/IBM
Sent: 08/14/2007 08:40 AM
To: "Simon Parker" <simon.parker@polarlake.com>
cc: dfdl-wg@ogf.org
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
In conjunction with the annotated document these notes are clear, except
for 'sp165'. Perhaps someone will recapitulate the discussion briefly at
Wednesday's conference. I think only three annotations remain:
sp167 Absent and missing (expanded discussion on the wiki already)
This will be a major topic on a call.
sp172 separatorType="infix"
I'm happy to drop this strange stuff about separatorType=prefix or
postfix and just say separator means infix. However, I would note that
at least two major integration products (IBM WebSphere Transformation
Extender, formerly Mercator, and Microsoft BizTalk) have this concept,
so we may end up putting it back in. Presumably MS copied the earlier
Mercator style, or both got it from common requirements in some EDI
standard.
sp173 defaultWhenMissing (expanded discussion on the wiki already)
Same topic as sp167 above. Will have a call topic to discuss.
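For readers unfamiliar with the three separator positions mentioned under sp172, here is a minimal sketch of what each would mean for a list of items. It is an illustration only: split_items and its separator_type parameter are hypothetical names, and the behavior shown is an assumption about the intended meaning, not taken from any of the products named above.

```python
def split_items(text: str, sep: str, separator_type: str = "infix") -> list:
    """Split text into items according to where the separator sits.

    infix:   a,b,c    (separator between items)
    prefix:  ,a,b,c   (separator before every item)
    postfix: a,b,c,   (separator after every item)
    """
    if separator_type == "prefix":
        if not text.startswith(sep):
            raise ValueError("expected a leading separator")
        text = text[len(sep):]
    elif separator_type == "postfix":
        if not text.endswith(sep):
            raise ValueError("expected a trailing separator")
        text = text[:-len(sep)]
    elif separator_type != "infix":
        raise ValueError("unknown separator type: " + separator_type)
    return text.split(sep)

if __name__ == "__main__":
    print(split_items("a,b,c", ","))
    print(split_items(",a,b,c", ",", "prefix"))
    print(split_items("a,b,c,", ",", "postfix"))
```

Seen this way, the postfix form is the same shape as a terminator on each item, which is one reason a single infix meaning may be enough.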
I've added another contribution to the wiki discussion on 'require'.
I think this has reached resolution, which is that we can express this
using assertions. The general style of using DFDL to describe what
fixed-data syntactic constructs look like is a good one.
However, I've amended the Wiki thread on this with a further issue for
group consideration. See bottom of page:
https://forge.gridforum.org/sf/wiki/do/viewPage/projects.dfdl-wg/wiki/Require?_message=1187096164776
The 'length and occurs' proposal is an improvement, though I still have
reservations to discuss; likewise the 'opaque data' proposal.
These are for a call, this week or soon. I will send out an agenda.
Mike Beckerle
From: "Simon Parker" <simon.parker@polarlake.com>
Sent by: dfdl-wg-bounces@ogf.org
Sent: 08/13/2007 10:56 AM
To: <dfdl-wg@ogf.org>
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
In conjunction with the annotated document these notes are clear, except
for 'sp165'. Perhaps someone will recapitulate the discussion briefly at
Wednesday's conference. I think only three annotations remain:
sp167 Absent and missing (expanded discussion on the wiki already)
sp172 separatorType="infix"
sp173 defaultWhenMissing (expanded discussion on the wiki already)
I've added another contribution to the wiki discussion on 'require'.
The 'length and occurs' proposal is an improvement, though I still have
reservations to discuss; likewise the 'opaque data' proposal.
Regards,
Simon
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of Mike Beckerle
Sent: 08 August 2007 18:00
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] Minutes from 2007-08-08 Call
MikeB, Geoff Judd, Alan Powell attended.
Continued through SP's comments.
sp37 - got it.
sp45 - agree. This whole part to be rewritten.
sp115 - ok. strict and "lax" as enums. No built-in default - we never
use defaults in the processor itself. Only in the predefined formats.
sp118 - ok
sp123 - Proposal needed to simplify length, lengthKind, lengthUnits, and
also occursKind, occursPath, occursPathUnits (along the lines of
byteCount, itemCount, length='delimited' enum, etc.).
sp154 - Need specific proposal to eliminate hexBinary and say what to
use for opaque data (consider also string with encoding='bytes'), or
introduce a dfdl:byteString type or dfdl:opaque type (derived type -
just a standard name).
sp158 - see sp123
sp165 - needed to have composition property for enclosing groups and/or
end-of-data. Regexp doesn't fix this.
Mike Beckerle
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg