I prefer the use of xs:anyURI; it gives a clear indication that this is a
reference to the data and not the data itself.
I prefer dfdl:objectKind as the property name: the object is not necessarily
large, and the author might want a reference for other reasons.
Regards
Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday
From: "Lawrence, Stephen" <slawrence@tresys.com>
To: "mbeckerle.dfdl@gmail.com" <mbeckerle.dfdl@gmail.com>
Cc: "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date: 12/08/2019 12:16
Subject: Re: [DFDL-WG] BLOB - binary large object proposal - updated
Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
dfdl:largeObjectKind definitely keeps things simple, but does lose
flexibility (e.g. maxLength). But that may not be needed. I'm in favor
of this.
However, one drawback of using xs:string instead of xs:anyURI relates to
our TDML test rig. If we make the type of blobs/clobs an xs:string and
have it be an opaque identifier then it makes it difficult for our TDML
runner to know how to compare actual vs expected blobs. For example:
<data xsi:type="xs:string">some unique identifier</data>
In this case, the TDML test rig must be schema aware to know that data
is not a string and is actually an opaque identifier. And it must know
how to use that unique identifier to look up the bytes to do
expected/actual comparisons.
By making the type an xs:anyURI and requiring that the identifier is a
URI, a TDML runner does not need any knowledge of the schema. Since the
xsi:type is an anyURI, it can infer that this must be a blob/clob, and
then it can open the URI to determine the bytes and easily compare
expected vs actual blobs.
And this applies to anyone accessing the infoset as well--not just our
TDML runner. Using a type of xs:anyURI provides a hint to infoset users
that an element shouldn't be treated like a string, but as a blob handle.
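
As a rough illustration of the comparison described above, here is a minimal
Python sketch of a schema-unaware infoset comparison. It is not how Daffodil's
TDML runner is implemented. The function names (is_blob_handle,
element_payload, infosets_match) are hypothetical, the xsi:type check looks
only at the local name "anyURI" without resolving the prefix, and every URI is
assumed to be something urllib can open (for example a file: URI).

```python
import urllib.request
import xml.etree.ElementTree as ET

# Clark-notation name of the xsi:type attribute as ElementTree exposes it.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def is_blob_handle(elem):
    """Assumption: an xsi:type whose local name is anyURI marks a blob/clob handle."""
    xsi_type = elem.get(XSI_TYPE, "")
    return xsi_type.split(":")[-1] == "anyURI"


def element_payload(elem):
    """Bytes to compare: the referenced content for a handle, else the element text."""
    text = (elem.text or "").strip()
    if is_blob_handle(elem):
        # Dereference the handle; no schema knowledge is needed for this step.
        with urllib.request.urlopen(text) as stream:
            return stream.read()
    return text.encode("utf-8")


def infosets_match(expected_xml, actual_xml):
    """Compare two infosets element by element, in document order."""
    expected = list(ET.fromstring(expected_xml).iter())
    actual = list(ET.fromstring(actual_xml).iter())
    if len(expected) != len(actual):
        return False
    return all(
        e.tag == a.tag and element_payload(e) == element_payload(a)
        for e, a in zip(expected, actual)
    )
```

With an opaque xs:string identifier, element_payload could only compare the
identifier text itself, which would normally differ between the expected and
actual infosets even when the referenced bytes are identical.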
- Steve
On 8/9/19 10:25 AM, Mike Beckerle wrote:
>
> My suggestions based on this thread are:
>
> I think the dfdlx:blob type is problematic, and we should avoid it in favor of
> an xs:string with a dfdlx:largeObjectKind property.
>
> I think this should not be a "Type" as in string or hexBinary, because hexBinary
> is such a misleading term, suggesting textualization, etc. There is nothing
> "hex" about a BLOB, ever.
>
> I think dfdlx:largeObjectKind="bytes/chars/none" with none the "default" for
> now, and "chars" as a future capability for character large objects if they
> prove important.
> I could be convinced other enums are better than bytes or chars for this. E.g.,
> BLOB, CLOB might be better. Or perhaps this is
> dfdl:largeObjectRep="binary/text/none" analogous to the dfdl:representation
> property?
>
> The use of xs:anyURI is unnecessary, and is not a type we have in DFDL as yet.
> People should treat this string as opaque. The fact that it is potentially a
> meaningful URI is not relevant, and can be an implementation detail.
>
> I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a nice
> idea to save for the future. We may find that numerous other parameters are
> required, so I'd prefer not to predefine this one in advance of clearer
> direction or whether there are others.
>
> The other thing observed on yesterday's DFDL WG call, was that this has some
> overlap with the offset/pointer stuff. Unparsing from a blob file is an awful
> lot like data-source indirection where the source of unparsing is coming from a
> scattered data structure that is being gathered. There is some conceptual
> similarity anyway. Not sure how deep this goes or if it is just a superficial
> observation. And I would not suggest waiting for that to be figured out before
> proceeding with this experimental BLOB feature.
>
> Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> www.tresys.com <http://www.tresys.com>
>
> Please note: Contributions to the DFDL Workgroup's email discussions are subject
> to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
>
>
>
> On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence@tresys.com> wrote:
>
> The intention was that this new type would be an internal built-in type
> and so no extra properties could be placed on the new simple type. One
> drawback that I'm realizing as I implement this feature in Daffodil is
> that in order to use non-DFDL-aware XML validation tools to validate the
> XML infoset, you need to provide and xs:import this new DFDL schema that
> defines the dfdlx:blob type, which feels a little awkward to me for
> something that's considered a built-in for DFDL processors.
>
> Maybe an alternative would be to not have a dfdlx:blob type, and allow the
> use of the xs:anyURI type for simple elements, with the implication that
> we treat the element as if it were xs:hexBinary except for the
> infoset/blob output. This doesn't easily support CLOBs, but a new DFDL
> property could determine how an xs:anyURI should be interpreted, e.g.:
>
>   <xs:element name="myBlobData" type="xs:anyURI"
>     dfdl:largeObjectType="xs:hexBinary" ... />
>
>   <xs:element name="myClobData" type="xs:anyURI"
>     dfdl:largeObjectType="xs:string" ... />
>
> So a type of xs:anyURI implies this is going to be some kind of large
> object representation, and it requires the dfdl:largeObjectType property
> that must reference a simple type that defines how the content should be
> turned into a large object. This might also help to support
> restrictions on the blob data, as well as implicit lengths, e.g.:
>
>   <xs:simpleType name="blob10">
>     <xs:restriction base="xs:hexBinary">
>       <xs:maxLength value="20" />
>     </xs:restriction>
>   </xs:simpleType>
>
>   <xs:element name="data" type="xs:anyURI" dfdl:objectType="blob10"
>     dfdl:lengthKind="implicit" />
>
> DFDL properties could be placed on either the element or the objectType
> simpleType, with the base type of dfdl:largeObjectType determining which
> properties are valid/interpreted, rather than the element type (which
> must be anyURI).
>
> But maybe this all adds unnecessary complexity?
>
>
> Regarding specifying the filename via a DFDL property rather than API,
> we have use cases where each parse would need to output to a different
> directory, so a property might cause problems with this. But perhaps this
> could be handled by a variable, e.g.:
>
>   <xs:element name="data" type="dfdlx:blob"
>     dfdl:blobDirectory="{ $blobDir }" ... />
>
> That said, we had additional use cases where a DFDL blobDirectory
> property would be too restrictive. For example, maybe the blobs should
> be put into a database, or pushed to a data store in the cloud, stored
> in local memory, or not stored anywhere at all but with a special URI
> with offset+length to the original data. We chose to ignore these
> use cases for simplicity, but these different options would probably
> require a flexible API to support. Going with an API to specify the
> output directory makes it a bit easier to support these different
> blob outputs in the future if they are needed.
>
>
> On 8/8/19 5:09 AM, Steve Hanson wrote:
> > Mike
> >
> > Am I allowed to put DFDL properties on the new simple type, or is the new type
> > considered to be a built-in type? I think the latter is clearer and simpler to
> > implement. Support for 'clob' would then just add a new simple type restriction
> > 'dfdlx:clob'.
> >
> > Assuming that the feature makes it into a future DFDL 2.0, the schema containing
> > the 'blob' simple type would then be in the standard DFDL namespace. That's the
> > first example of such a schema, as this is the first time we are extending base
> > XML Schema as opposed to defining annotations. If the new type is considered a
> > built-in type, then this schema should be part of the DFDL 2.0 standard and
> > read-only.
> >
> > Any thoughts on allowing the specification of the filename via DFDL property
> > rather than API call?
> >
> > Presumably I could create a local restriction of 'dfdlx:blob'? One motivation
> > for so doing would be to validate the length or content of my binary data.
> > There's a problem with that though - validation works against the infoset, so
> > the allowable facets are those applicable to xs:anyURI and would be applied to
> > the file name, not the binary data. It also means that dfdl:lengthKind
> > 'implicit' can't be used. I don't see a way round this.
> >
> > Regards
> >
> > Steve Hanson
> >
> > IBM Hybrid Integration, Hursley, UK
> > Architect, IBM DFDL <http://www.ibm.com/developerworks/library/se-dfdl/index.html>
> > Co-Chair, OGF DFDL Working Group <http://www.ogf.org/dfdl/>
> > smh@uk.ibm.com
> > tel:+44-1962-815848
> > mob:+44-7717-378890
> > Note: I work Tuesday to Friday
> >
> >
> >
> > From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
> > To: DFDL-WG <dfdl-wg@ogf.org>
> > Date: 12/07/2019 18:14
> > Subject: [DFDL-WG] BLOB - binary large object proposal - updated
> > Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
> >
> >
> --------------------------------------------------------------------------------
> >
> >
> >
> > This concept, which has been discussed before, is in high demand in the
> > Daffodil user community to enable DFDL to be used to parse image file formats.
> > The use case is to provide uniform image-metadata access without getting bogged
> > down in the large byte-array that makes up most of the file and would be very
> > large (and pointless) if rendered into XML or JSON.
> >
> > So our proposal, (which will get turned into an official Experimental feature
> > document), has been simplified and revised and is described here:
> >
> >
> _https://urldefense.proofpoint.com/v2/url?u=https-3A__cwiki.apache.org_confluence_display_DAFFODIL_Proposal-253A-2BBinary-2BLarge-2BObjects-5F&d=DwICAg&c=jf_iaSHvJObTbx-siA1ZOg&r=AJa9ThEymJXYnOqu84mJuw&m=RHrC943K_Ebv1XG4NHnze7AdBgWDS_Vfjb_pYsDIQ5U&s=JxLz3sp40T1X-UzhjSiHRPmWRqwL3GVgkgzT2hwgiGM&e=
>
> >
> >
> > Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
> > www.tresys.com <http://www.tresys.com>
> > Please note: Contributions to the DFDL Workgroup's email discussions are subject
> > to the OGF Intellectual Property Policy <http://www.ogf.org/About/abt_policies.php>
> > --
> > dfdl-wg mailing list
> > dfdl-wg@ogf.org <mailto:dfdl-wg@ogf.org>
> > https://www.ogf.org/mailman/listinfo/dfdl-wg
> >
> > Unless stated otherwise above:
> > IBM United Kingdom Limited - Registered in England and Wales with number 741598.
> > Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU