Quoting [Bruno Harbulot] (Sep 07 2009):
Hello,
Thank you for organising this workshop, it was very interesting and I've enjoyed it. I wasn't sure to which address or mailing list (perhaps SAGA on OGF or SAGA-Users) I should send this e-mail, please feel free to CC it as appropriate.
Happy you liked the workshop! :-) FWIW, I Cc'ed the ogf mailing list for the URL discussion - it seems you are subscribed.
I'd like to come back on the point I was trying to make about 'advert://' URIs. My understanding of how it works is that using "advert://advert-host.example/some/thing" from the API implies that:
1. The API will try to find a suitable adapter for this prefix.
2. Currently, this adapter is a PostgreSQL client that will try to connect to the PostgreSQL server on this host.
3. This PostgreSQL client needs to know the name of the database on the server and its schema. The server needs to be set up accordingly.
Almost. More precisely:
1. Our SAGA implementation will forward the URL to any adaptor which registered for the advert API (i.e. which implements the advert API). The adaptors can accept the URL and act on the API call, or refuse to do so.
2. Currently, this adapter is a PostgreSQL client that will try to connect to the PostgreSQL server on *the host specified in the URL*.
3. This PostgreSQL client needs to know the name of the database on the server and its schema. The server needs to be set up accordingly. The db name can, however, be specified in the URL (advert://host/path?dbname=foo), or via adaptor config files.
While this can work at a small scale, there are a number of issues with this approach.
Firstly, if another adapter exists one day for another DBMS (for example MySQL or Oracle), which one will be used? It's not uncommon to have hosts that run both PostgreSQL and MySQL for example. It's a problem similar to letting 'any://' guess the protocol. Although by luck 'ssh://host/file' and 'ftp://host/file' are likely to be the same because the underlying file system structure is the same, a PostgreSQL server and a MySQL server running on the same machine won't have the same data at all.
While this is true, this is considered to be a feature, not a bug. Along the same lines one could argue that the 'ftp' scheme for file access is not uniquely specifying the adaptor to be used. In fact, 'ftp://' could be accepted by the gridftp adaptor, by the curl adaptor, and by a (hypothetical) plain ftp adaptor. Yes, one or the other may fail to run the command - then the next in line will be used. Adaptor selection can be optimized, by configuration, by heuristics, or otherwise - but that is an implementation detail hidden from the application.
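The fall-through behaviour described here (each adaptor may accept or refuse a URL, and the next in line is tried on failure) can be sketched roughly as follows. This is a minimal, standalone C++ illustration and not SAGA code: `Adaptor`, `dispatch`, `scheme_of` and the two example adaptors are hypothetical names invented for the sketch, and the real engine's selection logic may well differ.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for an adaptor: a name plus a handler that returns
// true (and fills 'result') if it accepts and serves the URL, false otherwise.
struct Adaptor {
    std::string name;
    std::function<bool(const std::string& url, std::string& result)> handle;
};

// Returns the scheme of a URL ("ftp" for "ftp://host/file"), or "" if none.
std::string scheme_of(const std::string& url) {
    const std::string::size_type pos = url.find("://");
    return pos == std::string::npos ? std::string() : url.substr(0, pos);
}

// Tries the adaptors in registration order; the first one that succeeds wins.
// The caller never needs to know which adaptors refused or failed.
bool dispatch(const std::vector<Adaptor>& adaptors, const std::string& url,
              std::string& served_by, std::string& result) {
    for (const Adaptor& a : adaptors) {
        if (a.handle(url, result)) {
            served_by = a.name;
            return true;
        }
    }
    return false;  // no adaptor could serve the URL
}

// Example registry (assumption): "gridftp" always fails, e.g. because the
// backend is unreachable; "curl" serves ftp and http URLs.
std::vector<Adaptor> example_adaptors() {
    std::vector<Adaptor> adaptors;
    adaptors.push_back(Adaptor{"gridftp",
        [](const std::string&, std::string&) -> bool { return false; }});
    adaptors.push_back(Adaptor{"curl",
        [](const std::string& url, std::string& result) -> bool {
            const std::string s = scheme_of(url);
            if (s != "ftp" && s != "http") return false;
            result = "fetched " + url;
            return true;
        }});
    return adaptors;
}

// Usage sketch: the ftp URL falls through gridftp and is served by curl.
void check_examples() {
    std::string by, result;
    assert(dispatch(example_adaptors(), "ftp://host/file", by, result));
    assert(by == "curl");
    assert(!dispatch(example_adaptors(), "ssh://host/file", by, result));
}
```

In this sketch the order of registration is the only selection policy; configuration or heuristics, as mentioned above, would amount to reordering or filtering the vector before `dispatch` runs.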
This is in fact already an issue with respect to the PostgreSQL and the SQLite implementations. If a client is configured for using SQLite and another one is configured for using PostgreSQL, they will get mixed up if they try to read from and write to the same advert URI.
The complete url is unique:
advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3
Yes, the short form
any://host/path
is *not* unique - but it is up to the user to use the convenient short form, or the full form.
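To make the difference between the full and the short form concrete, here is a small standalone C++ sketch that extracts the `dbname` and `dbtype` query parameters from such a URL. It is only an illustration under simplifying assumptions (no percent-decoding, no fragment handling); `query_params` and `names_backend` are hypothetical helpers, not part of the SAGA API.

```cpp
#include <cassert>
#include <map>
#include <string>

// Splits the query part of a URL ("?k1=v1&k2=v2") into a key/value map.
// Pairs without '=' or with an empty key are ignored.
std::map<std::string, std::string> query_params(const std::string& url) {
    std::map<std::string, std::string> params;
    const std::string::size_type q = url.find('?');
    if (q == std::string::npos) return params;

    std::string::size_type start = q + 1;
    while (start < url.size()) {
        std::string::size_type end = url.find('&', start);
        if (end == std::string::npos) end = url.size();
        const std::string pair = url.substr(start, end - start);
        const std::string::size_type eq = pair.find('=');
        if (eq != std::string::npos && eq > 0) {
            params[pair.substr(0, eq)] = pair.substr(eq + 1);
        }
        start = end + 1;
    }
    return params;
}

// For this sketch, a URL "names its backend" if it carries both the
// database name and the database type.
bool names_backend(const std::string& url) {
    const std::map<std::string, std::string> p = query_params(url);
    return p.count("dbname") == 1 && p.count("dbtype") == 1;
}

// Usage sketch: the full form pins down the backend, the short form does not.
void check_examples() {
    assert(names_backend(
        "advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3"));
    assert(!names_backend("any://host/path"));
}
```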
Secondly, I'm not sure how security is configured, but it seems that all the clients are configured to use the same schema name, user name and password. I've just been able to connect to the PostgreSQL database we were using during the tutorial and make a select query, simply using the username and password that are in the README file, in the SAGA source code. This relies on everyone playing nice. Even without malicious intent, accidents happen.
Sure, we are aware of this. But this is a database we use for tutorials etc - *real* applications would of course use a different and more secure setup. Security credentials would be either specified in the full URL, as shown above, or specified via a saga::context which needs to be added to the saga::session the operation is supposed to run in:

  saga::context c ("my_postgres_context");
  c.set_attribute ("UserID",   "...");
  c.set_attribute ("UserPass", "...");

  saga::session s;
  s.add_context (c);

  saga::advert::entry ad (s, url);  // This code MUST use the context specified above.

And of course not all adaptors would need to accept that specific context; those that do not would simply not try to do anything at all.
Finally, SAGA is an API, but this makes SAGA enter the territory of network protocols. If you addressed the issues above by specifying the database structure and how to query it, you'd end up defining another protocol, which would certainly duplicate the job of protocols that already exist (there are a number of pub/sub protocols, for example one could be using Atom).
No, we do *not* define a protocol. We simply don't. Nowhere in our code do we have a protocol definition. Nor do we actually talk at the byte level on the connection. We simply use existing protocols like ftp, the postgres protocol, etc. We *specify* a protocol to be used, in the URL scheme, or specify a wildcard (any) to leave the choice of protocol to the implementation.
Having a uniform API for a number of protocols is a good idea, but letting the API guess the protocol will undoubtedly lead to some trouble.
Yes, it may - we are aware of that. That is explicitly mentioned in the API specification. In the cases where that may lead to trouble, users SHOULD explicitly specify protocols. However, the SAGA 80:20 rule applies: using the wildcards seems ok in the vast majority of cases. We have not yet had any serious trouble with it. And if it does cause trouble: just don't use the feature...
In the case where identifiers are ambiguous and can point to several distinct things, this sounds like a fundamental architectural flaw (once it's released as it's the case for gsiftp URIs, it's almost impossible to fix [*]).
I can give you simpler examples.
http://host//etc/passwd
ftp://host//etc/passwd
will usually not refer to the same physical file, but, for example, to
file://host//var/http_root/etc/passwd
file://host//var/pub/etc/passwd
and neither refers to the canonical
file://host//etc/passwd
Yes, users need to be aware of that.
Best, Andre.
Best wishes,
Bruno.
[*] http://blog.distributedmatter.net/post/2006/12/08/gsiftp-URI-madness
-- Nothing is ever easy.
Hi Andre, Andre Merzky wrote:
Quoting [Bruno Harbulot] (Sep 07 2009):
While this can work at a small scale, there are a number of issues with this approach.
Firstly, if another adapter exists one day for another DBMS (for example MySQL or Oracle), which one will be used? It's not uncommon to have hosts that run both PostgreSQL and MySQL for example. It's a problem similar to letting 'any://' guess the protocol. Although by luck 'ssh://host/file' and 'ftp://host/file' are likely to be the same because the underlying file system structure is the same, a PostgreSQL server and a MySQL server running on the same machine won't have the same data at all.
While this is true, this is considered to be a feature, not a bug. Along the same lines one could argue that the 'ftp' scheme for file access is not uniquely specifying the adaptor to be used. In fact, 'ftp://' could be accepted by the gridftp adaptor, by the curl adaptor, and by a (hypothetical) plain ftp adaptor. Yes, one or the other may fail to run the command - then the next in line will be used. Adaptor selection can be optimized, by configuration, by heuristics, or otherwise - but that is an implementation detail hidden from the application.
My concern (both for advert:// and any://) is more about the notion of identifier, which the URI is. It makes perfect sense to be able to use a number of adaptors for the same URI: this is indeed an implementation detail that ought to be hidden from the application. However, letting the application developers and users use identifiers that are ambiguous is certainly going to lead to some trouble further down the line, more so if one day they have to talk to some other application, which wouldn't be surprising in the grid world.
This is in fact already an issue with respect to the PostgreSQL and the SQLite implementations. If a client is configured for using SQLite and another one is configured for using PostgreSQL, they will get mixed up if they try to read from and write to the same advert URI.
The complete url is unique:
advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3
Yes, the short form
any://host/path
is *not* unique - but that is up to the user to use the convenient short form, or the full form.
Finally, SAGA is an API, but this makes SAGA enter the territory of network protocols. If you addressed the issues above by specifying the database structure and how to query it, you'd end up defining another protocol, which would certainly duplicate the job of protocols that already exist (there are a number of pub/sub protocols, for example one could be using Atom).
No, we do *not* define a protocol. We simply don't. Nowhere in our code do we have a protocol definition. Nor do we actually talk at the byte level on the connection. We simply use existing protocols like ftp, the postgres protocol, etc.
Well, you do hide the protocol, but it's there, and it's defined in a fuzzy way. If you do a retrieve_object on "advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3", you imply a mechanism for dereferencing that URI. The API will have to find what to do with this URI and will have to make the connection to the appropriate database, with the appropriate structure. That's where you're blurring the line with network protocols.
In the case where identifiers are ambiguous and can point to several distinct things, this sounds like a fundamental architectural flaw (once it's released as it's the case for gsiftp URIs, it's almost impossible to fix [*]).
I can give you simpler examples.
http://host//etc/passwd
ftp://host//etc/passwd
will usually not refer to the same physical file, but, for example, to
file://host//var/http_root/etc/passwd
file://host//var/pub/etc/passwd
and neither refers to the canonical
file://host//etc/passwd
Yes, users need to be aware of that.
Well, that's not quite the same problem as gsiftp URIs. "http://host/path/something" and "ftp://host/path/something" are fundamentally distinct URIs and therefore identify different resources (which may or may not be files). Whether these resources may be aliases for one another (e.g. via a redirection mechanism) is a different matter.
The problem with gsiftp:// URIs is that "gsiftp://host/path/something" will refer to two distinct resources depending on whether you use globus-url-copy or the CoG kit. This really is a pain when you want to track data and simply refer to something independently of whether you're using C or Java (or any other language for that matter). In most cases, you have no way of knowing which implementation was talking about which API, even if you try.
That's the trap I'd like SAGA not to fall into, although at least SAGA lets you specify a given protocol (disambiguating any:// can indeed be done by being more specific), whereas gsiftp:// cannot be more specific.
[*] http://blog.distributedmatter.net/post/2006/12/08/gsiftp-URI-madness
As a side-note, I've been using Restlet http://www.restlet.org/ for a while, and there are a couple of points that I had in mind and that may be of interest.
Firstly, like SAGA, Restlet tries to provide a uniform API for a number of protocols, and provides a number of "connectors" that implement those protocols (similar to SAGA adapters). The API is modelled around the HTTP semantics http://wiki.restlet.org/docs_1.1/13-restlet/27-restlet/130-restlet.html. I think comparing the way the mappings have been done would be an interesting exercise (and perhaps looking into whether the changes from Restlet 1.0 to 2.0 correspond to similar steps in the evolution of SAGA).
Secondly, I can't help noticing the similarities between what SAGA aims for and the mechanisms designed into HTTP, along with the way they've been implemented in Restlet. For example, at an architectural level the issues of guessing the protocol based on the any:// or advert:// identifiers could be addressed by a proxy layer (not necessarily actual network proxies, but a proxy layer in the API). The advertising system could be done using PUT/GET and perhaps the Atom Publishing Protocol in the back.
From what I've seen from the SAGA Shell, it looks like it's trying to provide a uniform interface, even for the sub-groups of classes in SAGA (e.g. advert, file, job). It looks like there could be a further layer of abstraction, providing a common interface between those types (and you'd probably end up with something very similar to the HTTP verbs).
I'm not saying HTTP is ideal for what SAGA is trying to achieve, but it looks like a number of mechanisms provided by the web architecture are similar to what SAGA provides as an API.
Best wishes,
Bruno.
Hi again, Quoting [Bruno Harbulot] (Sep 08 2009):
Hi Andre,
Andre Merzky wrote:
Quoting [Bruno Harbulot] (Sep 07 2009):
While this can work at a small scale, there are a number of issues with this approach.
Firstly, if another adapter exists one day for another DBMS (for example MySQL or Oracle), which one will be used? It's not uncommon to have hosts that run both PostgreSQL and MySQL for example. It's a problem similar to letting 'any://' guess the protocol. Although by luck 'ssh://host/file' and 'ftp://host/file' are likely to be the same because the underlying file system structure is the same, a PostgreSQL server and a MySQL server running on the same machine won't have the same data at all.
While this is true, this is considered to be a feature, not a bug. Along the same lines one could argue that the 'ftp' scheme for file access is not uniquely specifying the adaptor to be used. In fact, 'ftp://' could be accepted by the gridftp adaptor, by the curl adaptor, and by a (hypothetical) plain ftp adaptor. Yes, one or the other may fail to run the command - then the next in line will be used. Adaptor selection can be optimized, by configuration, by heuristics, or otherwise - but that is an implementation detail hidden from the application.
My concern (both for advert:// and any://) is more about the notion of identifier, which the URI is. It makes perfect sense to be able to use a number of adaptors for the same URI: this is indeed an implementation detail that ought to be hidden from the application. However, letting the application developers and users use identifiers that are ambiguous is certainly going to lead to some trouble further down the line, more so if one day they have to talk to some other application, which wouldn't be surprising in the grid world.
Yes, exchange of these URLs with other applications is not straightforward (the other way around will usually work, though). The API spec, however, provides something like 'url.translate (stub)' for exactly that reason. The intended use case is (using my earlier example):

  1: saga::url ftp  ("ftp://host/etc/passwd");
  2: saga::url http = ftp.translate ("http://host");
  3: saga::url file = ftp.translate ("file://host");

Line 1 will do what you expect. Line 2 will *throw* as it is (in our example) not possible to find an http based URL pointing to the same backend entity. Line 3 will return the url "file://host/var/pub/etc/passwd". The same should work for "any://".

Now, having that call in the API is nice and well, but at the moment, I do not know if any implementation provides a sensible mechanism to really translate URLs between different backends. Anyway: should the problem arise for some user groups, it can be solved this way.
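One way such a translation could work is sketched below in standalone C++. This is not how any SAGA implementation is known to do it: it assumes a hypothetical per-host table (`RootTable`) that records which directory each scheme exposes as its root, using the /var/pub and /var/http_root locations from the earlier example. `Url`, `parse`, `translate` and `example_roots` are names invented for the sketch.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical configuration: the directory each scheme exposes as "/".
typedef std::map<std::string, std::string> RootTable;

RootTable example_roots() {
    RootTable roots;
    roots["file"] = "";                // file:// sees the whole file system
    roots["ftp"]  = "/var/pub";        // assumption taken from the example
    roots["http"] = "/var/http_root";  // assumption taken from the example
    return roots;
}

struct Url { std::string scheme, host, path; };

// Minimal parser for "scheme://host/path"; no ports, users or queries.
Url parse(const std::string& s) {
    const std::string::size_type npos = std::string::npos;
    const std::string::size_type sep = s.find("://");
    if (sep == npos) throw std::invalid_argument("no scheme in: " + s);
    const std::string::size_type slash = s.find('/', sep + 3);
    Url u;
    u.scheme = s.substr(0, sep);
    u.host   = s.substr(sep + 3, slash == npos ? npos : slash - (sep + 3));
    u.path   = slash == npos ? std::string() : s.substr(slash);
    return u;
}

// Maps 'url' onto the scheme and host given by 'stub'. Throws if the target
// scheme does not expose the backend entity the source URL points to.
std::string translate(const RootTable& roots, const std::string& url,
                      const std::string& stub) {
    const Url from = parse(url);
    const Url to   = parse(stub);
    const RootTable::const_iterator src = roots.find(from.scheme);
    const RootTable::const_iterator dst = roots.find(to.scheme);
    if (src == roots.end() || dst == roots.end())
        throw std::runtime_error("unknown scheme");

    const std::string physical = src->second + from.path;  // path on the host
    const std::string& root = dst->second;
    const bool exposed =
        physical.compare(0, root.size(), root) == 0 &&
        (physical.size() == root.size() || physical[root.size()] == '/');
    if (!exposed)
        throw std::runtime_error("no " + to.scheme + " URL for " + url);
    return to.scheme + "://" + to.host + physical.substr(root.size());
}

// Usage sketch mirroring lines 2 and 3 of the example above.
void check_examples() {
    const RootTable roots = example_roots();
    assert(translate(roots, "ftp://host/etc/passwd", "file://host") ==
           "file://host/var/pub/etc/passwd");
    bool threw = false;
    try { translate(roots, "ftp://host/etc/passwd", "http://host"); }
    catch (const std::runtime_error&) { threw = true; }
    assert(threw);
}
```

The sketch only works when somebody maintains the root table; without that knowledge of the backends, a translation call has nothing to go on, which is the practical difficulty noted above.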
Finally, SAGA is an API, but this makes SAGA enter the territory of network protocols. If you addressed the issues above by specifying the database structure and how to query it, you'd end up defining another protocol, which would certainly duplicate the job of protocols that already exist (there are a number of pub/sub protocols, for example one could be using Atom).
No, we do *not* define a protocol. We simply don't. Nowhere in our code do we have a protocol definition. Nor do we actually talk at the byte level on the connection. We simply use existing protocols like ftp, the postgres protocol, etc.
Well, you do hide the protocol, but it's there, and it's defined in a fuzzy way. If you do a retrieve_object on "advert://user:pass@host/path?dbname=mydb&dbtype=sqlite3", you imply a mechanism for dereferencing that URI. The API will have to find what to do with this URI and will have to make the connection to the appropriate database, with the appropriate structure. That's where you're blurring the line with network protocols.
Yes, we hide the protocol, on purpose. But many applications do that, really. The URL scheme is *not* semantically bound to the network protocol. For example, there is an official 'file' scheme listed in the first URL RFC - 'file' however is not a protocol, but defines a name space to be used for entity resolution.

Wikipedia says on the topic (http://en.wikipedia.org/wiki/URI_scheme):

  URI schemes are sometimes erroneously referred to as "protocols", or specifically as URI protocols or URL protocols, since most were originally designed to be used with a particular protocol, and often have the same name. The http scheme, for instance, is generally used for interacting with Web resources using HyperText Transfer Protocol. Today, URIs with that scheme are also used for other purposes, such as RDF resource identifiers and XML namespaces, that are not related to the protocol. Furthermore, some URI schemes are not associated with any specific protocol (e.g. "file") and many others do not use the name of a protocol as their prefix (e.g. "news").

Further, the URL RFC says:

  "[...] a new URL scheme must include a definition of an *algorithm for accessing of resources* [...]" (emphasis mine).

We certainly do have such an algorithm (even if it is ill specified).
In the case where identifiers are ambiguous and can point to several distinct things, this sounds like a fundamental architectural flaw (once it's released as it's the case for gsiftp URIs, it's almost impossible to fix [*]).
I can give you simpler examples.
http://host//etc/passwd
ftp://host//etc/passwd
will usually not refer to the same physical file, but, for example, to
file://host//var/http_root/etc/passwd
file://host//var/pub/etc/passwd
and neither refers to the canonical
file://host//etc/passwd
Yes, users need to be aware of that.
Well, that's not quite the same problem as gsiftp URIs. "http://host/path/something" and "ftp://host/path/something" are fundamentally distinct URIs and therefore identify different resources (which may or may not be files). Whether these resources may be aliases for one another (e.g. via a redirection mechanism) is a different matter.
The problem with gsiftp:// URIs is that "gsiftp://host/path/something" will refer to two distinct resources depending on whether you use globus-url-copy or the CoG kit. This really is a pain when you want to track data and simply refer to something independently of whether you're using C or Java (or any other language for that matter). In most cases, you have no way of knowing which implementation was talking about which API, even if you try.
We will not attempt to fix specific GridFTP implementations. That is left to either the adaptor, the backend, or the application.
That's the trap I'd like SAGA not to fall into, although at least SAGA lets you specify a given protocol (disambiguating any:// can indeed be done by being more specific), whereas gsiftp:// cannot be more specific.
[*] http://blog.distributedmatter.net/post/2006/12/08/gsiftp-URI-madness
As a side-note, I've been using Restlet http://www.restlet.org/ for a while, and there are a couple of points that I had in mind and that may be of interest.
Firstly, like SAGA, Restlet tries to provide a uniform API for a number of protocols, and provides a number of "connectors" that implement those protocols (similar to SAGA adapters). The API is modelled around the HTTP semantics http://wiki.restlet.org/docs_1.1/13-restlet/27-restlet/130-restlet.html. I think comparing the way the mappings have been done would be an interesting exercise (and perhaps looking into whether the changes from Restlet 1.0 to 2.0 correspond to similar steps in the evolution of SAGA).
Secondly, I can't help noticing the similarities between what SAGA aims for and the mechanisms designed into HTTP, along with the way they've been implemented in Restlet. For example, at an architectural level the issues of guessing the protocol based on the any:// or advert:// identifiers could be addressed by a proxy layer (not necessarily actual network proxies, but a proxy layer in the API). The advertising system could be done using PUT/GET and perhaps the Atom Publishing Protocol in the back.
From what I've seen from the SAGA Shell, it looks like it's trying to provide a uniform interface, even for the sub-groups of classes in SAGA (e.g. advert, file, job). It looks like there could be a further layer of abstraction, providing a common interface between those types (and you'd probably end up with something very similar to the HTTP verbs).
The goal of SAGA is not abstraction per se, i.e. it is not to find the smallest common semantic denominator for expressing various backend operations. The aim of SAGA is rather to provide exactly those abstractions which seem to best match the application level operations performed in distributed applications. One can discuss on what level those abstractions live - but REST is certainly not amongst them. URL management in itself is really just a very small part of what SAGA tries to achieve, and was originally left out, due to the can of worms it is bound to open.
I'm not saying HTTP is ideal for what SAGA is trying to achieve, but it looks like a number of mechanisms provided by the web architecture are similar to what SAGA provides as an API.
Understood, and to some extent agreed. REST etc certainly try to come up with a syntactic and semantic framework for remote service interaction. I think it's great for implementing service interfaces/protocols, in most cases. SAGA, however, targets a very different set of use cases, and users.
Thanks, Andre.
-- Nothing is ever easy.
participants (2): Andre Merzky, Bruno Harbulot