Mike, some further responses in-line.
Regards
Steve Hanson
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
IBM SWG, Hursley, UK
smh@uk.ibm.com
tel:+44-1962-815848
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: Steve Hanson/UK/IBM@IBMGB
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 11/07/2014 18:24
Subject: Re: [DFDL-WG] Fw: Action 233 (deferred) - "byte order not sufficient..." - draft document on experience with binary format MIL-STD-2045
Thanks for this additional input.
Some further thoughts from IBM on your recommendations, after more internal
discussion here.
- Preferable to have dfdl:bitOrder as a separate property rather than to
handle it via new dfdl:byteOrder enums. Although new properties pose
validation issues for existing schemas, this should not compromise the
language design. DFDL can choose which bitOrder/byteOrder combinations
are supported.
- OK with new dfdl:byteOrder enum for littleEndianAtomic16Bit, though can
we improve the name?
I am absolutely open to suggestions on the name. I adapted this name from
the Wikipedia article terminology.
SMH: I would just drop the "Atomic", giving littleEndian16Bit.
- dfdl:encoding has an architected system for extra encodings so
US-ASCII-7-Bit-Packed should be x-US-ASCII-7-Bit-Packed, and the spec
updated to remove specific mention of US-ASCII-7-Bit-Packed.
Thoughts:
if there is no support for this 7-bit packed ascii flavor, then there is
no point in having dfdl:bitOrder support. The two go together.
SMH: bitOrder has nothing to do with encoding. I could create a format
with no strings in it and my integers etc could have LSBF bitOrder. So
while in MIL-STD-2045 they might always appear together, that is
not generally true.
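To make the point about integers concrete, here is a minimal Python sketch of how two small unsigned integer fields might land in a byte under each bit order, with no strings or encodings involved. The helper name pack_fields and the 3-bit/5-bit field layout are hypothetical and chosen for illustration; they are not taken from the DFDL spec or from MIL-STD-2045, and the sketch assumes that under leastSignificantBitFirst the first field occupies the least significant bits of the first byte.

```python
def pack_fields(fields, bit_order):
    """Pack (value, width_in_bits) pairs into bytes.

    Hypothetical helper for illustration only; not part of any DFDL API.
    """
    if bit_order not in ("mostSignificantBitFirst", "leastSignificantBitFirst"):
        raise ValueError(f"unknown bit order: {bit_order}")
    lsbf = bit_order == "leastSignificantBitFirst"

    # Build the bit stream in the order the bits are consumed by a parser.
    stream = []
    for value, width in fields:
        positions = range(width) if lsbf else reversed(range(width))
        stream.extend((value >> i) & 1 for i in positions)

    # Pad with zero bits to a whole number of bytes.
    stream.extend([0] * (-len(stream) % 8))

    out = bytearray()
    for start in range(0, len(stream), 8):
        byte = 0
        for offset, bit in enumerate(stream[start:start + 8]):
            # LSBF: the first stream bit is the byte's least significant bit.
            # MSBF: the first stream bit is the byte's most significant bit.
            shift = offset if lsbf else 7 - offset
            byte |= bit << shift
        out.append(byte)
    return bytes(out)


# A 3-bit field holding 5 followed by a 5-bit field holding 1.
fields = [(0b101, 3), (0b00001, 5)]
msbf = pack_fields(fields, "mostSignificantBitFirst")
lsbf = pack_fields(fields, "leastSignificantBitFirst")
print(msbf.hex(), lsbf.hex())  # prints: a1 0d
```

The same two integers produce different bytes depending only on the bit order, which is why the property can be described without reference to any text encoding.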
So in the section on optional DFDL features would we say
this is the optional feature:
dfdl:bitOrder="leastSignificantBitFirst" and dfdl:encoding="x-dfdl-us-ascii-7-bit-packed"
Or is there no mention of the encoding?
SMH: They are separate things so there should
be no mention of the encoding.
I raise this because the two really go together. There
is no point in having one without the other, and there needs to be an agreed-upon
standard meaning for x-dfdl-us-ascii-7-bit-packed encoding. So this
x-dfdl-us-ascii-7-bit-packed is a DFDL standard, not an implementation-defined
standard.
SMH: I agree that there needs to be a standard
definition for x-dfdl-us-ascii-7-bit-packed. Its definition is certainly
not implementation-defined, though whether it is supported is. The question
is whether it is defined as part of DFDL 1.0 spec, or whether it is defined
externally. Given that we devolve encoding definitions externally to IANA
and CCSID, it would be more consistent to point at an external definition.
We discussed proposed new dfdl:lengthKind
'fixedLengthOrTerminated'. A new enum implies that it can be used
in any scenario, so the following need to be specified.
- dfdl:terminator must be set and cannot be the empty string or contain
%ES; on its own
- If xs:string or xs:hexBinary, can maxLength
facet be used instead of dfdl:length? (Suggest no - this is variable length
data so min/maxLength are for validation only).
- Can dfdl:length be an expression? (Suggest
no unless specific use case identified)
My use case
needs only constants as the maximum, hence enum name contains "fixed"
prefix, not "explicit".
- Any special rules for emptyValueDelimiterPolicy and
nilValueDelimiterPolicy?
Since a terminator must be set, these cannot be "none" or "initiator".
SMH: Doesn't follow. Today, if I specify a terminator,
it must be present, modulo EVDP/NVDP. So why is the same not true for the
new enum? If we add a new enum, it has to work in a way that is consistent
with other lengthKinds and not just for MIL-STD-2045 use cases.
- Use on complex element. Presumably dfdl:length
is first used to extract a 'box' but within that box does parser immediately
scan for the dfdl:terminator or does it descend into the complex type and
parse the children, expecting to either consume all the box or to find
the terminator at the end? (Suggest the latter).
I have no use case that requires this for complex types at all.
Perhaps we can dodge this by having it be simpleFixedLengthOrTerminated,
and restricting it to simple types only?
SMH: Perhaps, but that makes this lengthKind
enum different from all the others, and that doesn't seem right.
- Use on complex element. Last child cannot be dfdl:lengthKind
'endOfParent'.
- Scanning rules: Use of this new dfdl:lengthKind
switches off any in-scope stack of terminating markup in force at that
point. Put another way, when we are scanning for the dfdl:terminator, we
are not looking for any markup from an outer scope.
So there's plenty to think about with this new dfdl:lengthKind. A good
rule for deciding whether a new dfdl:lengthKind or dfdl:occursCountKind
enum should be added is whether it bends some other part of the spec out
of shape. The new dfdl:lengthKind looks OK so far.
However, we *think* we have come up with an alternative model which is
simpler than the one you state in the document. Example for field 'varstr'
with max length 100:
<xs:sequence dfdl:terminator="{if (fn:string-length(varstr) eq 100) then
    '%ES;' else '%DEL;'}" ...>
  <xs:element name="varstr" type="xs:string"
      dfdl:lengthKind="pattern" dfdl:lengthPattern="([^\x7F].\x7F)|(.{100})"
      ... />
</xs:sequence>
Can't put dfdl:terminator with a self-referencing expression on the element.
Might need fn:exists in the dfdl:terminator expression to handle optionality.
Does that work?
I don't think this will work as %ES isn't allowed in terminators.
There is a proposal to allow it, but only when length kind is such that
one is not scanning for delimiters (same restriction as for WSP*). Let's
assume that we allow %ES for now.
SMH: This has been incorporated as an update to erratum 2.148 and is in
the latest spec draft.
One beauty of your idea here is that unparsing will "just
work", so that's nice.
But I think your pattern has a bug: I think it should be
dfdl:lengthPattern="[^\x7F]{0,99}(?=\x7F)|.{100}"
This will not capture more than 99 characters prior to
the DEL, and will not include the DEL as part of the string in the case
where a DEL is found (uses lookahead in regex). Hence, the DEL will be
available to be picked off as the terminator. Without this you end up with
the DEL in the payload.
With that I think your approach would work. So thanks for that idea.
SMH: Yes my pattern was wrong, thanks for correcting.
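For readers who want to check the corrected pattern, the following is a small Python sketch that applies it with the standard re module and mimics the conditional terminator from the sequence above. It assumes Python's regex dialect is close enough to the one DFDL uses for this pattern; the function names parse_varstr and unparse_varstr are hypothetical and only model the behaviour discussed here, not any DFDL processor.

```python
import re

DEL = "\x7f"
MAX_LEN = 100

# The corrected length pattern from the discussion: up to 99 non-DEL
# characters followed by (but not including) a DEL, or exactly 100 characters.
LENGTH_PATTERN = re.compile(r"[^\x7F]{0,99}(?=\x7F)|.{100}", re.DOTALL)


def parse_varstr(data, pos=0):
    """Return (value, next_position). Hypothetical model, not a DFDL API."""
    match = LENGTH_PATTERN.match(data, pos)
    if match is None:
        raise ValueError("no DEL within 100 characters and fewer than 100 available")
    value = match.group(0)
    end = match.end()
    if len(value) < MAX_LEN:
        # Short value: the lookahead guarantees a DEL here, consumed as terminator.
        end += 1
    # Full-length value: the terminator is the empty string, nothing to consume.
    return value, end


def unparse_varstr(value):
    """Write the value, adding DEL only when it is shorter than the maximum."""
    if len(value) > MAX_LEN or DEL in value:
        raise ValueError("value too long or contains DEL")
    return value if len(value) == MAX_LEN else value + DEL


print(parse_varstr("abc\x7fxyz"))  # prints: ('abc', 4)
```

Because the lookahead leaves the DEL out of the match, the payload never contains it, and a value of exactly 100 characters round-trips with no terminator at all.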
Perhaps there is an even simpler way to model this, which will work today:
put the conditional logic in a choice.
<choice>
  <!-- length kind pattern is needed to bound length to max of 99 -->
  <element name="raw1" type="xs:string"
      dfdl:lengthKind="pattern"
      dfdl:lengthPattern="[^\x7F]{0,99}"
      dfdl:terminator="%DEL;"/>
  <element name="raw2" type="xs:string"
      dfdl:lengthKind="explicit"
      dfdl:length="100"/>
</choice>
<element name="value" type="xs:string"
    dfdl:inputValueCalc="{ if (fn:exists(../raw1)) then ../raw1 else ../raw2 }"/>
We still have to play the hidden group game though to
hide raw1 and raw2.
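The parse-side behaviour of this choice can be sketched in Python as two branch functions tried in order, with a failure in the first causing a fall-through to the second. This is only a model of the intended logic under the assumption that a missing terminator on raw1 is a processing error that makes the parser try raw2; ParseError, parse_raw1, parse_raw2 and parse_value are hypothetical names, not part of any DFDL implementation.

```python
DEL = 0x7F
MAX_LEN = 100


class ParseError(Exception):
    """Stands in for a DFDL processing error that causes backtracking."""


def parse_raw1(data, pos):
    """Branch 1: up to 99 non-DEL bytes, then a required DEL terminator."""
    end = pos
    while end < len(data) and end - pos < MAX_LEN - 1 and data[end] != DEL:
        end += 1
    if end >= len(data) or data[end] != DEL:
        raise ParseError("terminator not found within 99 bytes")
    return data[pos:end], end + 1  # skip over the terminator


def parse_raw2(data, pos):
    """Branch 2: exactly 100 bytes, no terminator."""
    if len(data) - pos < MAX_LEN:
        raise ParseError("fewer than 100 bytes available")
    return data[pos:pos + MAX_LEN], pos + MAX_LEN


def parse_value(data, pos=0):
    """Try each choice branch in schema order; the first success wins."""
    for branch in (parse_raw1, parse_raw2):
        try:
            return branch(data, pos)
        except ParseError:
            continue
    raise ParseError("no choice branch matched")


print(parse_value(b"abc\x7fxyz"))  # prints: (b'abc', 4)
```

The sketch covers parsing only; it says nothing about how a branch would be selected when unparsing.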
I have to think hard about how to handle a choice like this on unparsing
though. I'm uncertain about how a dfdl:outputValueCalc on raw1 would conditionally
fail, so that raw2 would be the selected output representation. We can't
use an assertion as those aren't evaluated for unparsing.
SMH: There is no way to make a choice branch fail when unparsing. The only
'backtracking' when unparsing a choice occurs when the infoset contains no
branch at all: the spec then states that each branch is examined in turn
until one is found that successfully applies defaults. But that's not
really backtracking, as you can statically deduce the branch from the
schema alone, so the 'default' branch to use can be computed up front.
----- Forwarded by Steve Hanson/UK/IBM on 11/07/2014 13:09 -----
From: Steve Hanson/UK/IBM
To: Mike Beckerle <mbeckerle.dfdl@gmail.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 08/07/2014 13:31
Subject: Re: [DFDL-WG] Action 233 (deferred) - "byte order not sufficient..." - draft document on experience with binary format MIL-STD-2045
Mike,
Please find attached IBM's initial comments on your experience document,
as Word comments. We only got as far as the three required extensions and
have not yet looked at the optional usability items in detail.
We think we have our collective heads around the least significant bit
ordering concept, but we think the explanation could be clearer and show
the bits on the wire. There was some debate as to whether this could be
considered a variation of byteOrder, but you've obviously thought this
through and concluded a separate property is best. Also, should bit order
apply to text reps, given that byteOrder is binary rep only and any byte
ordering variations in encodings are handled as separate encodings (e.g.,
UTF-16LE and UTF-16BE)?
Regarding the US-ASCII-7-Bit-Packed encoding enum, this was added via
erratum previously using the idea of a DFDL-specific named encoding. But
we are thinking that this could have been handled as an x- encoding,
rather than specifically adding it to the spec. And thinking further along
the same lines, if byteOrder were made to work like encoding and allow x-
enums, then the new byteOrder value would become an x- enum. The Wikipedia
article you cite on Endianness mentions other byte orders (e.g.,
Middle-Endian, PDP-Endian).
From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
To: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 24/06/2014 20:27
Subject: [DFDL-WG] Action 233 (deferred) - "byte order not sufficient..." - draft document on experience with binary format MIL-STD-2045
Sent by: dfdl-wg-bounces@ogf.org
I have created an experience document about the "bit order" issue,
which was a deferred action 233, and the subject of a public comment.
The document is here: http://redmine.ogf.org/dmsf_files/13268.
The public comment item is http://redmine.ogf.org/boards/15/topics/43.
It recommends a new dfdl:bitOrder property, and a new dfdl:byteOrder enum
value, without which it is impossible to model these data formats. It also
recommends several other improvements to DFDL to facilitate handling
these data formats.
The formats in question are a variety of MIL-STD formats which are all
densely packed binary data. These formats are in broad use. MIL-STD-2045
is one part of this family and this particular format specification is
generally available without any restrictions from a US DoD web site (http://assistdocs.com)
so I made this specific format the subject of the document as it illustrates
all the problematic issues.
We have implemented the dfdl:bitOrder property in Daffodil, and it works
with some useful tests now passing.
We have also enhanced our TDML implementation to enable creation of tests
for this feature (and in the process actually found two bugs in the MIL-STD-2045
spec!).
Both the property and this TDML enhancement are described in the document.
The sponsors of the Daffodil project are extremely keen to get this needed
binary support into the DFDL v1.0 standard so as to have multiple DFDL
implementations support it.
...mikeb
Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology | www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
--
dfdl-wg mailing list
dfdl-wg@ogf.org
https://www.ogf.org/mailman/listinfo/dfdl-wg
Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6
3AU