Re: [SAGA-RG] SAGA and advert URIs

8 Sep 2009

Hi Andre,

Andre Merzky wrote:
...
Quoting [Bruno Harbulot] (Sep 07 2009):
...
...
While this can work at a small scale, there are a number of issues with 
this approach.
Firstly, if another adapter exists one day for another DBMS (for example 
MySQL or Oracle), which one will be used? It's not uncommon to have 
hosts that run both PostgreSQL and MySQL for example.
It's a problem similar to letting 'any://' guess the protocol. Although 
by luck 'ssh://host/file' and 'ftp://host/file' are likely to be the 
same because the underlying file system structure is the same, a 
PostgreSQL server and a MySQL server running on the same machine won't 
have the same data at all.
While this is true, this is considered to be a feature, not
a bug.  Along the same lines one could argue that the 'ftp'
scheme for file access does not uniquely specify the
adaptor to be used.  In fact, 'ftp://' could be accepted by
the gridftp adaptor, by the curl adaptor, and by a
(hypothetical) plain ftp adaptor.  Yes, one or the other may
fail to run the command - then the next in line will be
used.  Adaptor selection can be optimized, by configuration,
by heuristics, or otherwise - but that is an implementation
detail hidden from the application.
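
The "next in line" fallback described above can be sketched as follows. This is only an illustration, not SAGA's actual engine: the three adaptor functions, the ADAPTORS table and the AdaptorError class are hypothetical names made up for the example.

```python
from urllib.parse import urlsplit


class AdaptorError(Exception):
    """Raised by a (hypothetical) adaptor that cannot serve a request."""


def gridftp_adaptor(url):
    # Assumption for the example: this adaptor always fails.
    raise AdaptorError("gridftp: no credentials")


def curl_adaptor(url):
    return f"curl fetched {url}"


def plain_ftp_adaptor(url):
    return f"plain ftp fetched {url}"


# Hypothetical registry: adaptors per scheme, in the order they are tried.
ADAPTORS = {"ftp": [gridftp_adaptor, curl_adaptor, plain_ftp_adaptor]}


def fetch(url):
    """Try each adaptor registered for the URL scheme until one succeeds."""
    scheme = urlsplit(url).scheme
    errors = []
    for adaptor in ADAPTORS.get(scheme, []):
        try:
            return adaptor(url)
        except AdaptorError as exc:
            errors.append(str(exc))
    raise AdaptorError(f"no adaptor could handle {url}: {errors}")


print(fetch("ftp://host/file"))  # curl fetched ftp://host/file
```

The application only ever calls fetch(); which adaptor answered is not visible to it, which is the point being made about adaptor selection being an implementation detail.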
My concern (both for advert:// and any://) is more about the notion of 
identifier, which the URI is. It makes perfect sense to be able to use a 
number of adaptors for the same URI: this is indeed an implementation 
detail that ought to be hidden from the application.
However, letting the application developers and users use identifiers 
that are ambiguous is certainly going to lead to some trouble further 
down the line, more so if one day they have to talk to some other 
application, which wouldn't be surprising in the grid world.
...
...
This is in fact already an issue with respect to the PostgreSQL and the 
SQLite implementations. If a client is configured for using SQLite and 
another one is configured for using PostgreSQL, they will get mixed up 
if they try to read from and write to the same advert URI.
The complete URL is unique:
advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3
Yes, the short form
any://host/path
is *not* unique - but it is up to the user whether to use the
convenient short form or the full form.
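
The difference between the two forms can be made concrete by pulling out the parts that select a backend. The sketch below uses only Python's standard urllib.parse; the backend_key helper is a hypothetical name, and treating host, dbtype and dbname as the disambiguating parts is an assumption based on the example URL above.

```python
from urllib.parse import urlsplit, parse_qs


def backend_key(url):
    """Return (host, dbtype, dbname); None marks a part the URL leaves open."""
    parts = urlsplit(url)
    query = parse_qs(parts.query)
    dbtype = query.get("dbtype", [None])[0]
    dbname = query.get("dbname", [None])[0]
    return parts.hostname, dbtype, dbname


full = "advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3"
short = "any://host/path"

print(backend_key(full))   # ('host', 'sqlite3', 'mydb')
print(backend_key(short))  # ('host', None, None)
```

Every None in the result is something the client has to fill in from its own configuration, so two differently configured clients can resolve the same short form to different databases.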
...
...
Finally, SAGA is an API, but this makes SAGA enter the territory of 
network protocols. If you addressed the issues above by specifying the 
database structure and how to query it, you'd end up defining another 
protocol, which would certainly duplicate the job of protocols that 
already exist (there are a number of pub/sub protocols, for example one 
could be using Atom).
No, we do *not* define a protocol. We simply don't.  We have
nowhere in our code a protocol definition.  Nor do we
actually talk on byte level on the connection.  We simply
use existing protocols like ftp, the postgres protocol, etc.
Well, you do hide the protocol, but it's there, and it's defined in a 
fuzzy way. If you do a retrieve_object on 
"advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3", you imply a 
mechanism for dereferencing that URI. The API will have to find what to 
do with this URI and will have to make the connection to the appropriate 
database, with the appropriate structure. That's where you're blurring 
the line with network protocols.
...
...
In the case where identifiers are ambiguous and can point to 
several distinct things, this sounds like a fundamental architectural 
flaw (once it's released, as is the case for gsiftp URIs, it's almost 
impossible to fix [*]).
I can give you simpler examples.
http://host//etc/passwd
  ftp://host//etc/passwd
will usually not refer to the same physical file, but, for
example, to
file://host//var/http_root/etc/passwd
  file://host//var/pub/etc/passwd
and neither refers to the canonical
file://host//etc/passwd
Yes, users need to be aware of that.
Well, that's not quite the same problem as gsiftp URIs. 
"http://host/path/something" and "ftp://host/path/something" are 
fundamentally distinct URIs and therefore identify different resources 
(which may or may not be files). Whether these resources may be aliases 
for one another (e.g. via a redirection mechanism) is a different matter.

The problem with gsiftp:// URIs is that "gsiftp://host/path/something" 
will refer to two distinct resources depending on whether you use 
globus-url-copy or the CoG kit. This really is a pain when you want to 
track data and simply refer to something independently of whether you're 
using C or Java (or any other language for that matter). In most cases, 
you have no way of knowing which implementation was talking about which 
API, even if you try.

That's the trap I'd like SAGA not to fall into, although at least SAGA 
lets you specify a given protocol (disambiguating any:// can indeed be 
done by being more specific), whereas gsiftp:// cannot be more specific.
...
...
[*] http://blog.distributedmatter.net/post/2006/12/08/gsiftp-URI-madness
As a side-note, I've been using Restlet <http://www.restlet.org/> for a 
while, and there's a couple of points that I had in mind and that may be 
of interest.

Firstly, like SAGA, Restlet tries to provide a uniform API for a 
number of protocols, and provides a number of "connectors" that 
implement those protocols (similar to SAGA adapters). The API is 
modelled around the HTTP semantics 
<http://wiki.restlet.org/docs_1.1/13-restlet/27-restlet/130-restlet.html>. 
I think comparing the way the mappings have been done would be an 
interesting exercise (and perhaps looking into whether the changes 
between Restlet 1.0 and 2.0 correspond to similar steps in the 
evolution of SAGA).

Secondly, I can't help noticing the similarities between what SAGA aims 
for and the mechanisms designed into HTTP, along with the way they've 
been implemented in Restlet. For example, at an architectural level the 
issues of guessing the protocol based on the any:// or advert:// 
identifiers could be addressed by a proxy layer (not necessarily actual 
network proxies, but a proxy layer in the API). The advertising system 
could be done using PUT/GET and perhaps the Atom Publishing Protocol in 
the back.
From what I've seen from the SAGA Shell, it looks like it's trying to 
provide a uniform interface, even for the sub-groups of classes in SAGA 
(e.g. advert, file, job). It looks like there could be a further layer 
of abstraction, providing a common interface between those types (and 
you'd probably end up with something very similar to the HTTP verbs).
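
As a rough illustration of such a common layer, here is a minimal sketch in which advert entries, files and jobs all sit behind the same verb-like methods. None of this is SAGA or Restlet API: the Resource and InMemoryResource classes, the in-memory store and the job:// URL are all hypothetical and exist only for the example.

```python
from abc import ABC, abstractmethod


class Resource(ABC):
    """Hypothetical common interface, loosely modelled on HTTP verbs."""

    @abstractmethod
    def get(self): ...

    @abstractmethod
    def put(self, value): ...

    @abstractmethod
    def delete(self): ...


class InMemoryResource(Resource):
    """Toy stand-in for an advert entry, a file, or a job description."""

    def __init__(self, store, url):
        self.store = store
        self.url = url

    def get(self):
        return self.store[self.url]

    def put(self, value):
        self.store[self.url] = value

    def delete(self):
        del self.store[self.url]


store = {}
# The same three verbs are used regardless of what the URL identifies.
for url in ("advert://host/path", "file://host/tmp/data", "job://host/42"):
    InMemoryResource(store, url).put(f"payload for {url}")

print(InMemoryResource(store, "file://host/tmp/data").get())
# payload for file://host/tmp/data
```

A real layer would dispatch on the URL to different backends instead of a shared dictionary; the sketch only shows that the calling code does not need to change between resource types.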

I'm not saying HTTP is ideal for what SAGA is trying to achieve, but it 
looks like a number of mechanisms provided by the web architecture are 
similar to what SAGA provides as an API.

Best wishes,

Bruno.