Comments on lengths <awp>below</awp>
Alan Powell
MP 211, IBM UK Labs, Hursley, Winchester, SO21 2JN, England
Notes Id: Alan Powell/UK/IBM email: alan_powell@uk.ibm.com
Tel: +44 (0)1962 815073
Fax: +44 (0)1962 816898
Steve Hanson/UK/IBM@IBMGB
Sent by: dfdl-wg-bounces@ogf.org
19/09/2007 14:30
To: dfdl-wg@ogf.org
cc:
Subject: [DFDL-WG] Fw: Notes from 2007-09-12 call
More on expressions, <smh>below</smh>
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
----- Forwarded by Steve Hanson/UK/IBM on 19/09/2007 14:14 -----
Mike Beckerle <beckerle@us.ibm.com>
19/09/2007 13:43
To: Steve Hanson/UK/IBM@IBMGB
cc: dfdl-wg@ogf.org, dfdl-wg-bounces@ogf.org
Subject: Re: [DFDL-WG] Notes from 2007-09-12 call
Comments below in BLUE
Steve Hanson <smh@uk.ibm.com>
Sent by: dfdl-wg-bounces@ogf.org
09/19/2007 06:04 AM
To: dfdl-wg@ogf.org
cc:
Subject: Re: [DFDL-WG] Notes from 2007-09-12 call
Some thoughts since last week's call:
1) Expression language
We've not thought much about how expressions will work on output. It's
fine to say something like dfdl:length="../count+1" when parsing,
but what happens on output? I think we should not try to reverse engineer
expressions, and rely on the user to set output fields correctly. So, taking
my example, on output we would assume count had been set by the user, apply
the expression to calculate the intended length of data, then apply padding
etc rules as needed. Can we generalise that philosophy across all
our uses of expressions? If we can't then perhaps that places a bound on
the actual uses of expressions that we permit.
Inverting will generally not be possible. Just make the example
dfdl:length="{ ../count * ../scale + 1 }". How do you split up the length
into count and scale?
<smh>Agree</smh>
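A small sketch of why the inversion fails, assuming the length formula count * scale + 1 from the example above and small positive integer fields. The function names and the search limit are illustrative only and are not part of DFDL.

```python
from __future__ import annotations


def length_of(count: int, scale: int) -> int:
    """Forward direction, as used when parsing: evaluate the length expression."""
    return count * scale + 1


def candidate_inputs(length: int, limit: int = 20) -> list[tuple[int, int]]:
    """Every (count, scale) pair in 1..limit that produces the given length."""
    return [
        (count, scale)
        for count in range(1, limit + 1)
        for scale in range(1, limit + 1)
        if length_of(count, scale) == length
    ]


# A length of 13 is produced by several different (count, scale) pairs,
# so the serialiser cannot recover the two fields from the length alone.
print(candidate_inputs(13))
```

Because more than one pair maps to the same length, the user (or an explicit output-side calculation) has to supply count and scale.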
In your example, I would expect the count field to have an outputValueCalc="{
../x.length() - 1 }" (I'm assuming the field with the length calculation
formula is named "x".)
<smh>Doesn't outputValueCalc mean that we are deriving the count
from the length of the data supplied for x? That forces the user
to pad x to the correct value, in order to derive count. Which is not how
we want things to work. We want count to define the length of x, so the
DFDL serialiser can pad x according to other DFDL properties.
Maybe I'm missing something about input/outputValueCalc?</smh>
<awp> There are multiple
cases to consider. There are some formats that require the length field
to be the physical length of a structure, i.e. including padding, code page
considerations, etc., which it is impossible for the user to know. For example,
an IMS transaction header has LLbbHeaderData. DFDL should fill in these lengths.
I would assume that in
most cases a field with its length in another field is variable length,
but again it may need to be the physical length.
I tend to agree with Mike
that outputValueCalc can be used to set the length field, but how do we get
it to ignore the dfdl:length specification on the data field?
I hope this doesn't mean
that we need to distinguish between logical and physical lengths in the
expression language.
</awp>
In general, when something uses something else in its calculation (length
or just the value - inputValueCalc), then the inverse is outputValueCalc
on the contributing parts.
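To make the physical-length case concrete, here is a rough sketch of a serialiser that fills in count from the bytes actually written for x, mirroring outputValueCalc="{ ../x.length() - 1 }". The 2-byte big-endian count field, the UTF-8 encoding and the pad-to-a-multiple-of-4 rule are assumptions chosen for illustration; none of them comes from the DFDL drafts.

```python
def pad_to_multiple(data: bytes, multiple: int, pad: bytes = b" ") -> bytes:
    """Right-pad data so that its length is a multiple of `multiple`."""
    remainder = len(data) % multiple
    if remainder == 0:
        return data
    return data + pad * (multiple - remainder)


def serialize_record(x_value: str, encoding: str = "utf-8", multiple: int = 4) -> bytes:
    """Write a hypothetical record: a 2-byte count field followed by x."""
    # Step 1: produce the physical bytes of x (encoding and padding applied).
    x_bytes = pad_to_multiple(x_value.encode(encoding), multiple)
    # Step 2: the analogue of outputValueCalc="{ ../x.length() - 1 }".
    count = len(x_bytes) - 1
    # Step 3: emit count as a 2-byte big-endian integer, then x itself.
    return count.to_bytes(2, "big") + x_bytes


record = serialize_record("héllo")
print(record)
```

The user supplies only the logical value of x. The encoded, padded length (8 bytes here, although the string has 5 characters) is something only the serialiser knows, which is the situation Alan describes.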
2) dfdl:length for sequences
We have three cases here:
a) Empty sequence - we agreed to disallow this
b) Non-empty normal sequence - what does the length mean here?
It means the box is potentially larger than the contents. If it isn't at
least as big it's an error. If these lengths are data dependent it could
be a processing error. Otherwise a schema definition error.
Draft 025 discusses this in the part on sequences with length.
We also could disallow this case if we want for now, knowing we could add
it back if we want.
One can always convert this case into the one below by wrapping the sequence's
child elements in an array element, with array occurrences determined by
"fillAvailableSpace" policy. If we allow this at all, I think
this should be the way we explain the semantics of it. (Though with the
inserted array the paths would all change which is undesirable. - so we
would say it works like this, but without the paths being changed...)
c) Non-empty sequence used as box array - the motivating scenario
I think we should also disallow b). If we are disallowing a) on the grounds
of not using sequence with a length to model opaque data then we should
also disallow b).
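For case b), the "box" reading of a sequence length can be sketched as below. This is only an illustration of the rule that the box must be at least as big as its contents; the ProcessingError class, the fit_into_box name and the zero fill byte are assumptions, not anything defined in Draft 025.

```python
class ProcessingError(Exception):
    """Raised when data-dependent lengths do not fit (hypothetical name)."""


def fit_into_box(content: bytes, box_length: int, fill: bytes = b"\x00") -> bytes:
    """Place already-serialised sequence content into a box of fixed length."""
    if len(content) > box_length:
        # The box is smaller than its contents: an error at run time.
        raise ProcessingError(
            f"content is {len(content)} bytes but the box is only {box_length}"
        )
    # The box may be larger than the contents; the remainder is filled.
    return content + fill * (box_length - len(content))


print(fit_into_box(b"abc", 5))
```

If both lengths were known statically, the same check could be made when the schema is compiled and reported as a schema definition error instead.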
3) Handling of comments in data
Mentioning this as it was discussed on the call but not minuted.
Example: A record is allowed to be followed by multiple free form text
lines where the first such line contained //ADDINFOSTART, and the last
such line contained //ADDINFOEND.
This was put forward as a use case for regular expressions, but it was noted
that an explicit dfdl:commentScheme based on, or maybe an extension of,
the existing dfdl:escapeScheme property would be a more natural solution
for users.
Whoops. Forgot this. We need a concrete proposal here.
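Until there is a concrete proposal, the intended behaviour of the example can at least be sketched. The code below separates record lines from comment lines delimited by //ADDINFOSTART and //ADDINFOEND. The function name and the decision to return the comment lines rather than discard them are assumptions made for illustration.

```python
from __future__ import annotations

START_MARKER = "//ADDINFOSTART"
END_MARKER = "//ADDINFOEND"


def split_comments(lines: list[str]) -> tuple[list[str], list[str]]:
    """Return (record_lines, comment_lines) for a list of text lines."""
    records: list[str] = []
    comments: list[str] = []
    in_comment = False
    for line in lines:
        # A line containing the start marker opens a comment block.
        if not in_comment and START_MARKER in line:
            in_comment = True
        if in_comment:
            comments.append(line)
            # A line containing the end marker closes the block.
            if END_MARKER in line:
                in_comment = False
        else:
            records.append(line)
    return records, comments


sample = ["REC1", "//ADDINFOSTART note", "free text", "more //ADDINFOEND", "REC2"]
print(split_comments(sample))
```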
Regards, Steve
Steve Hanson
WebSphere Message Brokers
Hursley, UK
Internet: smh@uk.ibm.com
Phone (+44)/(0) 1962-815848
Mike Beckerle <beckerle@us.ibm.com>
Sent by: dfdl-wg-bounces@ogf.org
12/09/2007 21:20
To: dfdl-wg@ogf.org
cc:
Subject: [DFDL-WG] Notes from 2007-09-12 call
Mike Beckerle, Alan Powell, Steve Hanson, Suman Kalia attended.
Discussed these questions from Alan about expression language.
1. Accessing hidden values - it seems inconsistent to allow
access to hidden values when xpath is used within the DFDL domain but not
when used outside.
2. Where xpath is allowed in the schema - It is currently
allowed in an arbitrary set of properties (initiator, terminator, separator,
occurseparator, null, etc ). Why not allow it everywhere?
W.r.t. (1) we decided this is correct. Path expressions for dfdl properties
can see hidden elements; path expressions in other places (e.g., schematron
assertions) cannot.
W.r.t. (2) we decided that expressions should be allowed in principle everywhere
for the value of any property; however, there may be exceptions for certain
properties. Particularly, it seems some enum-valued properties are unlikely
to ever want to be expressions. Example: dfdl:representation.
However, it was also pointed out that once we put selectors back into the
language you can interleave multiple formats in the same schema, and for
any enumerated property you could just have one selector-chosen format
for each possible value of the enumerated property.
The reason we don't want a blanket statement that you can have expressions
anywhere you need a property value is that there is some potential that
this makes implementations unnecessarily complex due to the excess flexibility.
Digression: (This added by MikeB - was not part of the call today.)
Consider
dfdl:byteOrder="{ if (../../x = 'B') then 'bigEndian'
else if (../../x = 'L') then 'littleEndian' else 'I don't know' }"
DFDL implementations must be prepared to cope with receiving "I don't
know" as the proposed value for the byteOrder. This is a schema definition
error, but it is happening at run time so becomes a processing error. The
only way to rule this out is to treat enumerated property values not as
strings but as an enum type and force the expressions that compute them
to return an enum type, not a string.
This is a kind of type inference I had hoped implementations would not
need.
Selectors have the advantage of being statically verifiable, i.e., each
selected format is known to use a value of the enum that is valid, or a
diagnostic could be issued by the DFDL processor. If we allow an arbitrary
expression to return the value of an enumerated property then it presumably
could also return a nonsense value, as in the byteOrder example above.
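A sketch of the run-time check this implies, assuming the expression result arrives as a plain string and must be converted to an enum before use. ByteOrder, ProcessingError and both function names are hypothetical; the expression is a hand-translated stand-in for the byteOrder example.

```python
from enum import Enum


class ByteOrder(Enum):
    BIG_ENDIAN = "bigEndian"
    LITTLE_ENDIAN = "littleEndian"


class ProcessingError(Exception):
    """Raised when an expression yields an invalid property value."""


def byte_order_expression(x: str) -> str:
    """Stand-in for the byteOrder expression: returns an unchecked string."""
    if x == "B":
        return "bigEndian"
    if x == "L":
        return "littleEndian"
    return "I don't know"


def resolve_byte_order(x: str) -> ByteOrder:
    """Evaluate the expression and validate the result against the enum."""
    raw = byte_order_expression(x)
    try:
        return ByteOrder(raw)
    except ValueError as exc:
        # Conceptually a schema definition error, but only detectable now.
        raise ProcessingError(f"invalid dfdl:byteOrder value: {raw!r}") from exc


print(resolve_byte_order("B"))
```

With selectors, the equivalent of the ByteOrder(raw) conversion could be done once per selected format when the schema is loaded, instead of on every evaluation.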
We discussed proposals circulated by MikeB:
Here's an update to the first one. We decided sequences shouldn't be another
way to carry opaque data. Easy and conservative way to fix this is to require
the length of an empty sequence to be zero.
Second proposal to eliminate hexBinary and base64Binary was discussed lightly.
It was suggested that one could have both, and that would make it easy
to explain what the hexBinary type is, because it is a shorthand for a
string with encoding="hex", and similarly for base64Binary. We
did not resolve this issue on the call.
Finally, we discussed regular expression features for DFDL.
There does appear to be a need for regexp features to support parsing data
which is delimited by changing data content. E.g. consider the data "12345Mike
Beckerle" and a two-element sequence. One is a number which continues
until the first non-digit character. The other is a string which begins
with a non-digit character. Regexp length appears to be a good way to handle
this kind of thing.
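The example can be sketched with an ordinary regular expression, here using Python's re module as a stand-in for whatever regexp dialect DFDL might adopt. The group names and the function name are illustrative assumptions.

```python
from __future__ import annotations

import re

# A run of digits, followed by a string that starts with a non-digit.
PATTERN = re.compile(r"(?P<number>\d+)(?P<name>\D.*)", re.DOTALL)


def split_number_and_name(data: str) -> tuple[int, str]:
    """Split data into the leading number and the string that follows it."""
    match = PATTERN.fullmatch(data)
    if match is None:
        raise ValueError(f"data does not match number-then-string: {data!r}")
    return int(match.group("number")), match.group("name")


print(split_number_and_name("12345Mike Beckerle"))
```

The length of the first element is whatever the digit run matches, which is the sense in which a regexp supplies the length.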
Alan Powell has the action item to talk with the IBM internal TX product
group. They have a speculative parser and so have fewer regular-expression
features in their language. We want to understand how they deal with the
header, body[], trailer use case. This case is where the data is lines
of text, the header is the first line, the trailer is the last line, the
body records are everything in between and there's no content that can
be used to distinguish the record types. This is handled in some format-description
systems with regexp features. In TX this is handled by speculative parsing
and we want to understand how this comes out and if it is preferable to
adding regexp features.
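For reference, the header, body[], trailer case can be handled with one line of lookahead, as in the sketch below. This is not a description of how TX does it; it only illustrates that a line cannot be classified as body or trailer until the parser knows whether another line follows. The function name and the labels are assumptions.

```python
from __future__ import annotations

from typing import Iterable, Iterator


def classify_lines(lines: Iterable[str]) -> Iterator[tuple[str, str]]:
    """Yield (kind, line) pairs, where kind is 'header', 'body' or 'trailer'."""
    iterator = iter(lines)
    try:
        held = next(iterator)  # Hold one line back until we see what follows.
    except StopIteration:
        return  # No lines at all.
    kind = "header"
    seen_more = False
    for current in iterator:
        # Another line exists, so the held line is not the trailer.
        yield kind, held
        kind, held, seen_more = "body", current, True
    # The last held line is the trailer, unless it was the only line.
    yield ("trailer" if seen_more else "header"), held


for kind, line in classify_lines(["H", "b1", "b2", "T"]):
    print(kind, line)
```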
Mike Beckerle
STSM, Architect, Scalable Computing
IBM Software Group
Information Platform and Solutions
Westborough, MA 01581
direct: voice and FAX 508-599-7148
assistant: Pam Riordan
priordan@us.ibm.com
508-599-7046
--
dfdl-wg mailing list
dfdl-wg@ogf.org
http://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU
[attachment "proposal-to-simplify-opaque-types-v4.doc" deleted
by Alan Powell/UK/IBM]