I've spent today catching up with the recent DFDL spec discussions around
Simon's comments on v0.19. Here are some comments of my own on the content of
these and of previous call minutes.
- General principle: The eventual consumers of DFDL will be users, the
majority of whom will not be data modelling experts; that is certainly the
experience at IBM. Most see data modelling as a black art and find it
difficult. I think that an over-reliance on hidden elements is not going to go
down well. I would err on the side of caution here: only if we are convinced
that a property will be very rarely used should we remove it and replace it
with a hidden element.
[Simon] Accepted, providing we can specify everything. Ideally
we'll publish a rigorous, orthogonal language and a convenient, intuitive
library with controlled redundancy.
- Leading/Trailing Skip Bytes is a property intended to handle the byte
skipping added by compilers, over and above simple byte alignment rules.
Working out the values by hand is beyond the ken of most users; they would
invariably be set by an automated COBOL -> DFDL translator or similar. I would
not be too troubled if that property went 'hidden'.
- 'finalTerminatorCanBeMissing' property. The rules for interpreting what
trailing markup actually means are complex, and properties like this will
almost certainly be needed. (Aside: for Mike's second example, though, where
data of max length n is terminated by markup only if actual length < n,
wouldn't that be better expressed using a regular expression?
finalTerminatorCanBeMissing is too general, and could lead the parser to
accept as valid data where the terminator was accidentally omitted.)
- Infix/prefix/postfix separators. I believe these should be retained. They
are in IBM WTX (Mercator), and I frequently have to apologise for the absence
of postfix in IBM MRM. When a user sees (e.g.) x,y,z, it's easier for them to
comprehend that the comma after z is a postfix separator rather than the
terminator of the parent group.
- Simon had a comment on the removal of 'applies' which I haven't seen
discussed ("I find this cumbersome. I suggest this alternative: drop
‘applies’ and ‘dfdl:format’, insist on ‘dfdl:sequence’ and friends instead,
and add local variants like ‘dfdl:sequenceLocal’. For attribute shorthand, add
boolean attributes with the same name: sequenceLocal="true" (optional, default
false)."). I don't follow: the use of 'applies' is orthogonal to whether you
use dfdl:format or one of the specific elements such as dfdl:sequence.
[Simon] You're right, the ideas should be discussed separately. My hasty
comment throws it all in together.
1. Replace this:
   <dfdl:format applies="hereOnly">
   with this:
   <dfdl:formatLocal>
   Why? Because 'applies' is a metaproperty that doesn't describe the
   representation, and should be prominent. Also, for brevity.
2. Replace this:
   <dfdl:format>
   with one of these:
   <dfdl:element> <dfdl:sequence> <dfdl:complexType>...
   Why? For ease of validation and interpretation, to make mistakes more
   obvious to human readers, and to support more rigorous specification of the
   relationship between properties and xsd constructs.
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
From: Mike Beckerle <beckerle@us.ibm.com>
Sent by: dfdl-wg-bounces@ogf.org
Sent: 14/08/2007 14:23
To: dfdl-wg@ogf.org, "Simon Parker" <simon.parker@polarlake.com>
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
I forgot to clarify Simon's question on sp165. This was the
'finalTerminatorCanBeMissing' property. We considered the comment that this
might be unnecessary.
Use case: a text-format file. Each "record" in the file is terminated by a
CRLF, so says the user. At the top level this file contains an array of these
records. The file might or might not have a CRLF at the end of the file,
because human beings might have edited the file with a text editor and either
inserted or neglected to insert this final CRLF. We want the file format to be
legal with or without the final CRLF; however, all prior CRLFs in the file
must be present.
So how to express this? There are two choices:
1) CRLF is a terminator of the record.
2) CRLF is an occursSeparator of the enclosing array, and records have no
terminator. We enclose the array in a sequence group where the array is
followed by a hidden "optional" (minOccurs=0, maxOccurs=1) element with a
fixed="CRLF" string value.
Choice (1) requires that we have finalTerminatorCanBeMissing.
Choice (2) just models the required behavior directly via hidden elements.
This is tantamount to saying that this keyword is not worth having because
there is already a way to model it. This is true of many keywords. If we deem
this one too obscure, then we need to revisit many others. (Leading/Trailing
Skip Bytes is a good example: it is trivially represented by a hidden
element.) What are our criteria for inclusion? Up until now our criterion has
been to include things that existing systems have already found a need for.
However, existing systems don't have hidden field capability.
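To make the two choices concrete, here is a minimal Python sketch of how a parser might treat the same data under each one. It is only an illustration under my own assumptions: the function names and the boolean parameter are hypothetical, and nothing here is DFDL syntax or taken from the draft.

```python
CRLF = "\r\n"


def parse_terminated(data: str, final_terminator_can_be_missing: bool = False) -> list[str]:
    """Choice (1): CRLF terminates each record (hypothetical helper)."""
    records = []
    pos = 0
    while pos < len(data):
        end = data.find(CRLF, pos)
        if end == -1:
            # End of data reached before a terminator was found.
            if not final_terminator_can_be_missing:
                raise ValueError("last record is missing its CRLF terminator")
            records.append(data[pos:])
            break
        records.append(data[pos:end])
        pos = end + len(CRLF)
    return records


def parse_separated(data: str) -> list[str]:
    """Choice (2): CRLF separates records; a hidden optional CRLF may follow."""
    # Consume the hidden, optional trailing element if it is present.
    if data.endswith(CRLF):
        data = data[: -len(CRLF)]
    return data.split(CRLF) if data else []
```

Both functions accept a file with or without the final CRLF and return the same records; the difference is only in where the tolerance is stated, as a property on the terminator or as an extra hidden element in the model.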
Note that this same missing-final-terminator issue can come up not only at
end-of-data, but with any bounded-size structure. E.g., suppose we say that an
array has occursUnits="bytes" and occursPath="874". Then it is 874 bytes long.
The array elements can be terminated by a particular piece of data, e.g. a
semicolon. For the same reasons as in the CRLF example above, we want to be
able to tolerate a missing final semicolon before the end of the 874 bytes.
In effect the byte-length limit creates an implicit "end-of-data" for a
sub-stream consisting of just those bytes.
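A rough Python sketch of that bounded case, again purely illustrative: parse_bounded_array and its parameters are hypothetical names, and the default length of 874 simply echoes the example above.

```python
def parse_bounded_array(
    stream: bytes, length: int = 874, terminator: bytes = b";"
) -> tuple[list[bytes], bytes]:
    """Parse terminated items from the first `length` bytes of `stream`.

    Returns (items, rest), where `rest` is whatever follows the array.
    """
    if len(stream) < length:
        raise ValueError("stream is shorter than the declared array length")
    window, rest = stream[:length], stream[length:]
    items = []
    pos = 0
    while pos < len(window):
        end = window.find(terminator, pos)
        if end == -1:
            # The window boundary acts as an implicit end-of-data, so the
            # final item is accepted without its terminator.
            items.append(window[pos:])
            break
        items.append(window[pos:end])
        pos = end + len(terminator)
    return items, rest
```

Because the search is confined to the window, a semicolon that happens to sit just past the byte limit is left in `rest` and is never mistaken for the missing final terminator.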
Conclusion: finalTerminatorCanBeMissing seems to be useful enough, and comes
up often enough, that I think the keyword is worthwhile.
Implication: we should create a list of keywords or enumerated property values
that we think are in the grey area where perhaps we want to drop them. Here
are some candidates: byteOrderMarkPolicy, leading/trailingSkipBytes. Both of
these can be modeled readily as hidden elements. There are probably others.
Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan
priordan@us.ibm.com
508-599-7046
From: Mike Beckerle/Worcester/IBM
Sent: 08/14/2007 08:40 AM
To: "Simon Parker" <simon.parker@polarlake.com>
cc: dfdl-wg@ogf.org
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
[Simon] In conjunction with the annotated document these notes are clear,
except for 'sp165'. Perhaps someone will recapitulate the discussion briefly
at Wednesday's conference. I think only three annotations remain:
[Simon] sp167 Absent and missing (expanded discussion on the wiki already)
[Mike] This will be a major topic on a call.
[Simon] sp172 separatorType="infix"
[Mike] I'm happy to drop this strange stuff about separatorType=prefix or
postfix and just say separator means infix. However, I would note that at
least two major integration products (IBM WebSphere Transformation Extender,
formerly Mercator, and Microsoft BizTalk) have this concept, so we may end up
putting it back in. Presumably MS copied the earlier Mercator style, or both
got it from common requirements in some EDI standard.
[Simon] sp173 defaultWhenMissing (expanded discussion on the wiki already)
[Mike] Same topic as sp167 above. We will have a call topic to discuss it.
[Simon] I've added another contribution to the wiki discussion on 'require'.
[Mike] I think this has reached a resolution, which is that we can express
this using assertions. The general style of using DFDL to describe what
fixed-data syntactic constructs look like is a good one. However, I've amended
the wiki thread on this with a further issue for group consideration. See the
bottom of the page:
https://forge.gridforum.org/sf/wiki/do/viewPage/projects.dfdl-wg/wiki/Require?_message=1187096164776
[Simon] The 'length and occurs' proposal is an improvement, though I still
have reservations to discuss; likewise the 'opaque data' proposal.
[Mike] For a call, this week or soon. I will send out an agenda.
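On sp172, the three separator positions differ only in where the separator sits relative to the items. A rough Python sketch, purely for illustration (split_items and its position parameter are hypothetical names of mine, not DFDL properties, and this is not how WTX or BizTalk implement it):

```python
def split_items(text: str, sep: str = ",", position: str = "infix") -> list[str]:
    """Split `text` into items given where the separator is placed.

    infix:   x,y,z    (separator between items)
    prefix:  ,x,y,z   (separator before every item)
    postfix: x,y,z,   (separator after every item)
    """
    if position == "prefix":
        if not text.startswith(sep):
            raise ValueError("expected a separator before the first item")
        text = text[len(sep):]
    elif position == "postfix":
        if not text.endswith(sep):
            raise ValueError("expected a separator after the last item")
        text = text[: -len(sep)]
    elif position != "infix":
        raise ValueError(f"unknown separator position: {position!r}")
    return text.split(sep)
```

With the postfix reading, the comma after z in "x,y,z," belongs to the sequence of items itself, so it does not have to be explained as a terminator of the parent group.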
Mike Beckerle
From: "Simon Parker" <simon.parker@polarlake.com>
Sent by: dfdl-wg-bounces@ogf.org
Sent: 08/13/2007 10:56 AM
To: <dfdl-wg@ogf.org>
Subject: Re: [DFDL-WG] Minutes from 2007-08-08 Call
In conjunction with the annotated document these notes are clear, except for
'sp165'. Perhaps someone will recapitulate the discussion briefly at
Wednesday's conference. I think only three annotations remain:
sp167 Absent and missing (expanded discussion on the wiki already)
sp172 separatorType="infix"
sp173 defaultWhenMissing (expanded discussion on the wiki already)
I've added another contribution to the wiki discussion on 'require'.
The 'length and occurs' proposal is an improvement, though I still have
reservations to discuss; likewise the 'opaque data' proposal.
Regards,
Simon
From: dfdl-wg-bounces@ogf.org [mailto:dfdl-wg-bounces@ogf.org] On Behalf Of Mike Beckerle
Sent: 08 August 2007 18:00
To: dfdl-wg@ogf.org
Subject: [DFDL-WG] Minutes from 2007-08-08 Call
MikeB, Geoff Judd, Alan Powell attended.
Continued through SP's comments.
sp37 - got it.
sp45 - agree. This whole part to be rewritten.
sp115 - ok. "strict" and "lax" as enums. No built-in default - we never use
defaults in the processor itself, only in the predefined formats.
sp118 - ok
sp123 - Proposal needed to simplify length, lengthKind, lengthUnits, and also
occursKind, occursPath, occursPathUnits (along the lines of byteCount,
itemCount, length='delimited' enum, etc.).
sp154 - Need a specific proposal to eliminate hexBinary, and for what to use
for opaque data instead (consider also string with encoding='bytes'), or
introduce a dfdl:byteString or dfdl:opaque type (a derived type - just a
standard name).
sp158 - see sp123
sp165 - needed to have a composition property for enclosing groups and/or
end-of-data. Regexp doesn't fix this.
Mike Beckerle
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg