Re: [DFDL-WG] BLOB - binary large object proposal - updated

29 Aug 2019

I prefer use of xs:anyURI; it gives a clear indication that this is a
reference to the data and not the data itself.

I prefer dfdl:objectKind - the object is not necessarily large, the author 
might want a reference for other reasons. 

Regards

Steve Hanson
IBM Hybrid Integration, Hursley, UK
Architect, IBM DFDL
Co-Chair, OGF DFDL Working Group
smh@uk.ibm.com
tel:+44-1962-815848
mob:+44-7717-378890
Note: I work Tuesday to Friday 

From:   "Lawrence, Stephen" <slawrence@tresys.com>
To:     "mbeckerle.dfdl@gmail.com" <mbeckerle.dfdl@gmail.com>
Cc:     "dfdl-wg@ogf.org" <dfdl-wg@ogf.org>
Date:   12/08/2019 12:16
Subject:        Re: [DFDL-WG] BLOB - binary large object proposal - updated
Sent by:        "dfdl-wg" <dfdl-wg-bounces@ogf.org>

dfdl:largeObjectKind definitely keeps things simple, but does lose
flexibility (e.g. maxLength). But that may not be needed. I'm in favor
of this.

However, one drawback of using xs:string instead of xs:anyURI relates to
our TDML test rig. If we make the type of blobs/clobs an xs:string and
have it be an opaque identifier then it makes it difficult for our TDML
runner to know how to compare actual vs expected blobs. For example:

  <data xsi:type="xs:string">some unique identifier</data>

In this case, the TDML test rig must be schema aware to know that data
is not a string and is actually an opaque identifier. And it must know
how to use that unique identifier to look up the bytes to do
expected/actual comparisons.

By making the type an xs:anyURI and requiring that the identifier is a
URI, a TDML runner does not need any knowledge of the schema. Since the
xsi:type is an anyURI, it can infer that this must be a blob/clob, and
then it can open the URI to determine the bytes and easily compare
expected vs actual blobs.

And this applies to anyone accessing the infoset as well--not just our
TDML runner. Using a type of xs:anyURI provides a hint to infoset users
that an element shouldn't be treated like a string, but as a blob handle.
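
As a rough illustration of the comparison described above, here is a
minimal Python sketch. It is not Daffodil's actual TDML runner; the
function names (is_blob_handle, read_uri, infosets_match) are made up for
this example, and it assumes blob handles are file: URIs that the runner
can open directly. The point is only that xsi:type="xs:anyURI" is enough
to tell the runner to compare the referenced bytes rather than the
handle text.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Clark-notation name of the xsi:type attribute as ElementTree exposes it.
XSI_TYPE = "{http://www.w3.org/2001/XMLSchema-instance}type"


def is_blob_handle(elem):
    """True if the element carries xsi:type="...:anyURI" (prefix ignored)."""
    xsi_type = elem.get(XSI_TYPE, "")
    return xsi_type.split(":")[-1] == "anyURI"


def read_uri(uri):
    """Fetch the bytes a blob handle points at (assumes an openable URI)."""
    with urllib.request.urlopen(uri) as stream:
        return stream.read()


def infosets_match(expected, actual):
    """Compare two infoset elements recursively.

    Elements typed xs:anyURI are compared by the bytes they reference,
    so two different handles to identical content count as equal.
    """
    if expected.tag != actual.tag or len(expected) != len(actual):
        return False
    if len(expected) == 0:  # simple element, no children
        exp_text = (expected.text or "").strip()
        act_text = (actual.text or "").strip()
        if is_blob_handle(expected) and is_blob_handle(actual):
            return read_uri(exp_text) == read_uri(act_text)
        return exp_text == act_text
    return all(infosets_match(e, a) for e, a in zip(expected, actual))
```

With an opaque xs:string identifier, the is_blob_handle check would have
to be replaced by a schema lookup, which is the drawback noted above.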

- Steve

On 8/9/19 10:25 AM, Mike Beckerle wrote:
...
My suggestions based on this thread are:

I think the dfdlx:blob type is problematic, and we should avoid it in
favor of an xs:string with a dfdlx:largeObjectKind property.

I think this should not be a "Type" as in string or hexBinary, because
hexBinary is such a misleading term, suggesting textualization, etc.
There is nothing "hex" about a BLOB, ever.

I think dfdlx:largeObjectKind="bytes/chars/none" with none the "default"
for now, and "chars" as a future capability for character large objects
if they prove important.

I could be convinced other enums are better than bytes or chars for
this. E.g., BLOB, CLOB might be better. Or perhaps this is
dfdl:largeObjectRep="binary/text/none" analogous to the
dfdl:representation property?

The use of xs:anyURI is unnecessary, and is not a type we have in DFDL
as yet. People should treat this string as opaque. The fact that it is
potentially a meaningful URI is not relevant, and can be an
implementation detail.

I think dfdl:largeObjectDirectory="{ $dfdlx:largeObjectDirectory }" is a
nice idea to save for the future. We may find that numerous other
parameters are required, so I'd prefer not to predefine this one in
advance of clearer direction or whether there are others.

The other thing observed on yesterday's DFDL WG call, was that this has
some overlap with the offset/pointer stuff. Unparsing from a blob file
is an awful lot like data-source indirection where the source of
unparsing is coming from a scattered data structure that is being
gathered. There is some conceptual similarity anyway. Not sure how deep
this goes or if it is just a superficial observation. And I would not
suggest waiting for that to be figured out before proceeding with this
experimental BLOB feature.

Mike Beckerle | OGF DFDL Workgroup Co-Chair | Tresys Technology |
www.tresys.com
Please note: Contributions to the DFDL Workgroup's email discussions are
subject to the OGF Intellectual Property Policy
<http://www.ogf.org/About/abt_policies.php>

On Thu, Aug 8, 2019 at 8:40 AM Lawrence, Stephen <slawrence@tresys.com> wrote:

The intention was that this new type would be an internal built-in type
and so no extra properties could be placed on the new simple type. One
drawback that I'm realizing as I implement this feature in Daffodil, is
that in order to use non DFDL aware XML Validation tools to validate the
XML infoset, you need to provide and xs:import this new DFDL schema that
defines the dfdlx:blob type, which feels a little awkward to me for
something that's considered a built-in for DFDL processors.

Maybe an alternative would be to not have a dfdlx:blob type, and allow
the use of the xs:anyURI type for simple elements, with the implication
that we treat the element as if it were xs:hexBinary except for the
infoset/blob output. This doesn't easily support CLOBs, but a new DFDL
property could determine how an xs:anyURI should be interpreted, e.g.:

    <xs:element name="myBlobData" type="xs:anyURI"
                dfdl:largeObjectType="xs:hexBinary" ... />
    <xs:element name="myClobData" type="xs:anyURI"
                dfdl:largeObjectType="xs:string" ... />

So a type of xs:anyURI implies this is going to be some kind of large
object representation, and it requires the dfdl:largeObjectType property
that must reference a simple type that defines how the content should be
turned into a large object. This might also help to support restrictions
on the blob data, as well as implicit lengths, e.g.:

    <xs:simpleType name="blob10">
      <xs:restriction base="xs:hexBinary">
        <xs:maxLength value="20" />
      </xs:restriction>
    </xs:simpleType>
    <xs:element name="data" type="xs:anyURI" dfdl:objectType="blob10"
                dfdl:lengthKind="implicit" />

DFDL properties could be placed on either the element or the objectType
simpleType, with the base type of dfdl:largeObjectType determining which
properties are valid/interpreted, rather than the element type (which
must be anyURI).

But maybe this all adds unnecessary complexity?

Regarding specifying the filename via a DFDL property rather than API,
we have use cases where each parse would need to output to a different
directory so a property might cause problems with this. But perhaps this
could be handled by a variable, e.g.:

    <xs:element name="data" type="dfdlx:blob"
                dfdl:blobDirectory="{ $blobDir }" ... />

That said, we had additional use cases where a DFDL blobDirectory
property would be too restrictive. For example, maybe the blobs should
be put into a database, or pushed to a data store in the cloud, stored
in local memory, or not stored anywhere at all but with a special URI
with offset+length to the original data. We chose to ignore these
use-cases for simplicity, but these different options would probably
require a flexible API to support. By going with an API to specify the
output directory, it makes it a bit easier to support these different
blob outputs in the future if it was needed.

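To make the "flexible API" idea a little more concrete, here is a purely
hypothetical Python sketch. None of these names (BlobStore,
DirectoryBlobStore, OffsetBlobStore, put) exist in Daffodil or DFDL, and
the offset/length URI syntax is invented for illustration. It only shows
how a pluggable sink could cover both the output-directory case and the
"not stored anywhere" case while always handing back a URI for the
infoset.

```python
import abc
import pathlib
import uuid


class BlobStore(abc.ABC):
    """Hypothetical sink for blob bytes; returns the URI for the infoset."""

    @abc.abstractmethod
    def put(self, data: bytes, offset: int) -> str:
        """Store `data`, found at byte `offset` of the input, and return a URI."""
        ...


class DirectoryBlobStore(BlobStore):
    """Writes each blob to its own file under a caller-chosen directory."""

    def __init__(self, directory):
        # resolve() makes the path absolute, which as_uri() requires.
        self.directory = pathlib.Path(directory).resolve()
        self.directory.mkdir(parents=True, exist_ok=True)

    def put(self, data, offset):
        path = self.directory / f"{uuid.uuid4().hex}.bin"
        path.write_bytes(data)
        return path.as_uri()


class OffsetBlobStore(BlobStore):
    """Stores nothing; the URI only records where the bytes sit in the input."""

    def __init__(self, source_uri):
        self.source_uri = source_uri

    def put(self, data, offset):
        # Invented fragment syntax, for illustration only.
        return f"{self.source_uri}#offset={offset}&length={len(data)}"
```

A caller that needs a different directory per parse would construct a new
DirectoryBlobStore for each parse, which is the case a schema property
handles poorly.
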
On 8/8/19 5:09 AM, Steve Hanson wrote:
> Mike
>
> Am I allowed to put DFDL properties on the new simple type, or is the
> new type considered to be a built-in type?  I think the latter is
> clearer and simpler to implement.  Support for 'clob' would then just
> add a new simple type restriction 'dfdlx:clob'.
>
> Assuming that the feature makes it into a future DFDL 2.0, the schema
> containing the 'blob' simple type would then be in the standard DFDL
> namespace. That's the first example of such a schema, as this is the
> first time we are extending base XML Schema as opposed to defining
> annotations. If the new type is considered a built-in type, then this
> schema should be part of the DFDL 2.0 standard and read-only.
>
> Any thoughts on allowing the specification of the filename via DFDL
> property rather than API call?
>
> Presumably I could create a local restriction of 'dfdlx:blob'? One
> motivation for so doing would be to validate the length or content of
> my binary data. There's a problem with that though - validation works
> against the infoset, so the allowable facets are those applicable to
> xs:anyURI and would be applied to the file name, not the binary data.
> It also means that dfdl:lengthKind 'implicit' can't be used.  I don't
> see a way round this.
>
> Regards
>
> Steve Hanson
>
> From: Mike Beckerle <mbeckerle.dfdl@gmail.com>
> To: DFDL-WG <dfdl-wg@ogf.org>
> Date: 12/07/2019 18:14
> Subject: [DFDL-WG] BLOB - binary large object proposal - updated
> Sent by: "dfdl-wg" <dfdl-wg-bounces@ogf.org>
>
> This concept, which has been discussed before, is in high demand in
> the Daffodil user community to enable DFDL to be used to parse image
> file formats. The use case is to provide uniform image-metadata access
> without getting bogged down in the large byte-array that makes up most
> of the file and would be very large (and pointless) if rendered into
> XML or JSON.
>
> So our proposal (which will get turned into an official Experimental
> feature document) has been simplified and revised and is described
> here:
>
> https://cwiki.apache.org/confluence/display/DAFFODIL/Proposal%3A+Binary+Large+Objects
>
--
  dfdl-wg mailing list
  dfdl-wg@ogf.org

https://www.ogf.org/mailman/listinfo/dfdl-wg

Unless stated otherwise above:
IBM United Kingdom Limited - Registered in England and Wales with number
741598.
Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU