
Hi all,

Ceriel came upon an ugly problem with the current spec: the wildcards used in the namespace package collide with the introduction of URLs, as several characters used for wildcards lead to not well-formed URLs. That problem was not present back when we used strings instead of the saga::url class.

Below is an email exchange describing the problem with examples. Opinions on how to solve that _nicely_ are very welcome.

Thanks, Andre.

----- Forwarded message from Andre Merzky <andre@merzky.net> -----
Quoting [Ceriel Jacobs] (Nov 20 2007):
Ceriel Jacobs wrote:
Ceriel Jacobs wrote:
Hi,
I am now looking at wildcard expansion, and am totally confused as to where/when that should take place. The ns_directory methods copy(), move(), link() and remove() seem reasonable targets, but they all take an URL parameter. This is sort of OK with the '*' wildcard, but using any of ?, [, ], {, } results in an invalid URL. For instance, ftp://ftp.cs.vu.nl/pub/ceriel/LLgen.?ar.gz is not a valid URL.
Ah! Well, the wildcard spec still assumes that the parameters are strings, not URLs :-(
Ahum, this IS a valid URL, with a query part. Anyway, not what is intended.
Right.
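For illustration, here is how a standard URL parser reads that string - a minimal sketch in plain Java using java.net.URI (not SAGA code): the '?' is a legal character, but it starts the query part instead of acting as a wildcard.

    import java.net.URI;

    public class WildcardAsQuery {
        public static void main(String[] args) throws Exception {
            URI u = new URI("ftp://ftp.cs.vu.nl/pub/ceriel/LLgen.?ar.gz");
            System.out.println(u.getPath());   // "/pub/ceriel/LLgen."
            System.out.println(u.getQuery());  // "ar.gz" -- parsed as a query, not matched as a wildcard
        }
    }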
OK, it can be done with %-escapes: I now get a match for
ftp://ftp.cs.vu.nl/pub/ceriel/LLgen.%7Btar%2Cnoot%7D.gz
Yes, I guess that works, but that leaves the effort of writing escaped characters to the end user -- probably not what we want.
So, the problem really is to distinguish between characters which the user added to describe wildcards, and characters the user added to describe legitimate URL parts. That's impossible, I'm afraid :-(
Ugh.
Just dumping random thoughts from here:
One option would be to forbid query parts etc on URLs. But that would be a severe limitation.
Another option would be to 'mark' wildcard characters, e.g. to escape them with '\':
ftp://ftp.cs.vu.nl/pub/ceriel/LLgen.\?ar.gz
and to transform it internally into a/multiple valid URL(s). That would imply that the saga::url class would need to be aware of the escaping (so you cannot use a native URL class), and the user still has to do some work.
Another option is to revert to strings. Which removes your parsing error, but does not solve the problem semantically - at some point, the string needs to be converted into URLs.
And yet another option is not to use wildcards in the spec - which is not really an option at this stage I guess, and would be a pity as well.
-----
So, I do not have a good answer at the moment, and will ponder some more on this. Do you mind if I forward the question to Hartmut/Ole and to the list?
Cheers, Andre. -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Andre Merzky wrote:
Hi all,
Ceriel came upon an ugly problem with the current spec: the wildcards used in the namespace package collide with the introduction of URLs, as several characters used for wildcards lead to not well-formed URLs. That problem was not present back when we used strings instead of the saga::url class.
Below is an email exchange describing the problem with examples. Opinions on how to solve that _nicely_ are very welcome.
One "solution" is to explicitly add methods where you are allowed to use wildcards, for instance in namespace.ns_directory:

  copy (in string source, in saga::url target, in int flags = None);

This, in addition to the methods that are already there (and which then don't allow for wildcard expansion anymore). These methods could throw NotImplemented when wildcards are not supported. An advantage of this approach is that it is made explicit where wildcard expansion is to be expected. This is left in the dark now.

A problem is that the characters used in wildcards are now part of the "path" part of the source "url", which, among other things, means that query parts are not possible here, and implementations will have to provide their own URL decoder. Or, the source could be specified as being only a "path" part, which is to be resolved with respect to the directory on which the method operates. This was a problem with the earlier saga specs as well. How does the current C++ implementation deal with this? (This one is based on the earlier saga specs, is it not?)

Another disadvantage is that this adds bulk: four extra methods (copy, link, move, remove) plus their async versions. I don't think it makes sense to allow wildcards in the other methods (except maybe the permissions methods).

Another approach would be to have an explicit method to do wildcard expansion. For instance, in namespace.ns_directory:

  expand (in string pattern, out array<saga::url> urls);

Here, the pattern only specifies the "path" part, but with wildcards (the directory implicitly specifies the rest of the url). I am not sure whether the resulting urls should be resolved with respect to the directory or not. I think not.

Anyway, just a couple of thoughts.

Ceriel
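For illustration, the two alternatives might look roughly like this from user code; this is a hypothetical sketch in the Java-like style used later in this thread (NSDirectory, expand(), and the string-taking copy() are assumptions here, not spec text; session setup and flags are omitted):

    NSDirectory dir = new NSDirectory(session, new URL("ftp://ftp.cs.vu.nl/pub/ceriel"));

    // (a) additional wildcard-aware copy, taking a string source pattern:
    dir.copy("LLgen.?ar.gz", new URL("ftp://ftp.cs.vu.nl/pub/backup/"));

    // (b) explicit expansion, then a loop over the existing single-URL copy:
    URL[] matches = dir.expand("LLgen.?ar.gz");
    for (URL u : matches) {
        dir.copy(u, new URL("ftp://ftp.cs.vu.nl/pub/backup/"));
    }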

All,

I did a global search for "wildcard" in the SAGA core spec. The result is that there are three places using wildcards:

1. attributes
2. logical directory (using both attribute and path wildcards)
3. namespace.directory, using path wildcards.

Attribute wildcards don't pose a problem (at least to me, or until Ceriel will find one ;-)

The path wildcards from namespace.directory, however, do bring a problem, in combination with URLs. If I remember correctly, we switched from strings to URLs for a good reason. URLs, however, do not allow for wildcards, according to RFC1738. And the query parts of URLs mentioned here are for http only, not for files as we would need them here. If we define some "URL with wildcards", those would no longer be URLs, so this is no way to go.

What do we want/need wildcards for? The core spec writes about "shell wildcards", so we want to apply a single operation to several namespace entries at a time (e.g.: move, copy, find, ...). This reminds me of bulk operations with SAGA tasks. But this also feels like "overkill" for the use case of file wildcards.

My suggestion is thus to follow Ceriel (version 2):

On Thu, Nov 22, 2007 at 11:12:58AM +0100, Ceriel Jacobs wrote:
Another approach would be to have an explicit method to do wildcard expansion. For instance, in namespace.ns_directory:
expand (in string pattern, out array<saga::url> urls);
Here, the pattern only specifies the "path" part, but with wildcards (the directory implicitly specifies the rest of the url). I am not sure whether the resulting urls should be resolved with respect to the directory or not. I think not.
I think we need to spend some good thought on getting the parameters to this call right (do we need a pattern to compose the URLs from the expanded patterns???)

Besides this "expand" method, we would have to change the relevant namespace.directory methods to accept arrays of URLs instead of individual URLs.

The other radical approach could be: remove file name wildcards altogether...

More thoughts?

Thilo -- Thilo Kielmann http://www.cs.vu.nl/~kielmann/

Thilo Kielmann wrote:
All,
I did a global search for "wildcard" in the SAGA core spec. The result is that we are having three places using wildcards:
1. attributes
2. logical directory (using both attribute and path wildcards)
3. namespace.directory, using path wildcards.
Attribute wildcards don't pose a problem (at least to me, or until Ceriel will find one ;-)
So far, I have really let you down :-) No problems with the attribute wildcards.
The path wildcards from namespace.directory, however, do bring a problem, in combination with URLs.
If I remember correctly, we switched from strings to URLs for a good reason.
URLs, however, do not allow for wildcards, according to RFC1738. And the here mentioned query parts of URLs are for http only, and not for files as we would need them here. If we define some "URL with wildcards" that would no longer be URLs, so this is no way to go.
What do we want/need wildcards for? The core spec writes about "shell wildcards", so we want to apply a single operation to several namespace entries at a time. (e.g.: move, copy, find,...)
This reminds me of bulk operations with SAGA tasks. But this also feels like "overkill" for the use case of file wildcards.
My suggestion is thus to follow Ceriel (version 2):
On Thu, Nov 22, 2007 at 11:12:58AM +0100, Ceriel Jacobs wrote:
Another approach would be to have an explicit method to do wildcard expansion. For instance, in namespace.ns_directory:
expand (in string pattern, out array<saga::url> urls);
Here, the pattern only specifies the "path" part, but with wildcards (the directory implicitly specifies the rest of the url). I am not sure whether the resulting urls should be resolved with respect to the directory or not. I think not.
I think we need to spend some good thoughts on getting the parameters to this call right (do we need a pattern to compose the URLs from the expanded patterns???)
Good question. I don't have an immediate answer.
Besides this "expand" method, we would have to change the relevant namespace.directory methods to accept arrays of URLs instead of individual URLs.
... or in addition to. Adds more bulk, but makes methods easier to call for the user.
The other radical approach could be: remove file name wildcards altogether...
But the SAGA specs are quite specific that wildcards MUST be implemented, so I assume that they are required by several use-cases? Ceriel

Quoting [Ceriel Jacobs] (Nov 26 2007):
The other radical approach could be: remove file name wildcards altogether...
But the SAGA specs are quite specific that wildcards MUST be implemented, so I assume that they are required by several use-cases?
We do not have a hard requirement for wildcards (no use case explicitly mentions them). Several users/use cases did, however, express the request for shell-like file handling semantics - that is where we took the wildcard requirement from. So, yes, dropping would actually be an option we should consider, but it would be a pity IMHO. Andre.
Ceriel
-- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Ceriel and I have been chatting about this issue, producing a proposal for a solution.

Two observations (Thilo only) up front:

1. wildcards are ONLY applicable to the methods copy, link, move, and remove in class ns_directory, and to nothing else in the whole name space package.

2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.

Further thoughts about URLs and wildcards:

3. In ns_directory, list and find with their "pattern" parameter actually refer to pathnames, relative to the current working directory (CWD). We should say that explicitly in the spec.

4. URLs, according to the RFC, do NOT provide wildcards for files. (Non-)options:

a) add specific wildcards (like '*') to the URLs we use. This would not be conformant to the RFC, so it would no longer be URLs.
b) "Use" the query mechanism for http to express wildcards for files. While possible "in theory" this would be far from obvious, so this would NOT be anything "simple" to use. (remember the "S" in SAGA)
c) Wildcard characters could be brought into URLs by %-escape sequences. Argument as with query: non-intuitive, not simple for the user.

Summary: we MUST NOT introduce file wildcards to URLs.

This leaves us (IOHO - Ceriel and me) with two possible options for wildcards for namespace entries (as expressed for operations on ns_directories):

A. Have an additional method expand that takes a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards. expand() has an output parameter, an array of URLs, the expansion. In addition to expand(), we add versions of the methods copy, link, move, and remove from ns_directory that accept arrays of URLs instead of single URLs. (If we do not add these versions, we force the users to resort to bulk execution of tasks for a simple thing like "remove *.doc".)

B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.

Comparing both options, Ceriel and myself are in favour of B. It comes with fewer methods and a simpler and more obvious-to-use interface. A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy. With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.

Any opinions on the proposal of implementing solution B ???

Thilo -- Thilo Kielmann http://www.cs.vu.nl/~kielmann/
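For illustration, the difference for the caller might look roughly like this (hypothetical Java-like sketch, names assumed, not spec text):

    // Option A: expand the pattern, then pass an array of URLs to a new array-taking remove():
    URL[] docs = dir.expand("*.doc");
    dir.remove(docs);

    // Option B: pass the wildcard string (relative to the directory's CWD) directly:
    dir.remove("*.doc");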

All,

I did not receive any reply on our proposal for resolving the wildcards issue. May I safely interpret the silence as agreement?

I hereby make this the "final call" for comments on this issue. Please speak up now, or hold your peace forever! :-)

Thilo

On Thu, Nov 29, 2007 at 03:36:41PM +0100, Thilo Kielmann wrote:
From: Thilo Kielmann <kielmann@cs.vu.nl>
To: Andre Merzky <andre@merzky.net>
Cc: Ceriel Jacobs <ceriel@cs.vu.nl>, Thilo Kielmann <kielmann@cs.vu.nl>, SAGA RG <saga-rg@ogf.org>, Hartmut Kaiser <hartmut.kaiser@gmail.com>
Subject: URLs and wildcards (was: More confusion)
Ceriel and I have been chatting about this issue, producing a proposal for a solution.
Two observations (Thilo only) up front:
1. wildcards are ONLY applicable to the methods copy, link, move, and remove in class ns_directory, and to nothing else in the whole name space package.
2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.
Further thoughts about URLs and wildcards.
3. In ns_directory, list and find with their "pattern" parameter actually refer to pathnames, relative to the current working directory (CWD). We should say that explicitly in the spec.
4. URLs, according to the RFC do NOT provide wildcards for files. (Non-)options:
a) add specific wildcards (like '*') to the URLs we use. This would not be conformant to the RFC, so it would no longer be URLs.
b) "Use" the query mechanism for http to express wildcards for files. While possible "in theory" this would be far from obvious, so this would NOT be anything "simple" to use. (remember the "S" in SAGA)
c) Wildcard characters could be brought into URLs by %-escape sequences. Argument as with query: non-intuitive, not simple for the user.
Summary: we MUST NOT introduce file wildcards to URLs.
This leaves us (IOHO - Ceriel and me) with two possible options for wildcards for namespace entries (as expressed for operations on ns_directories):
A. Have an additional method expand that takes a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards. expand() has an output parameter, an array of URLs, the expansion.
In addition to expand(), we add versions of the methods copy, link, move, and remove from ns_directory that accept arrays of URLs instead of single URLs. (If we do not add these versions, we force the users to resort to bulk execution of tasks for a simple thing like "remove *.doc")
B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.
Comparing both options, Ceriel and myself are in favour of B. It comes with fewer methods and a simpler and more obvious-to-use interface.
A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy. With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
Any opinions on the proposal of implementing solution B ???
Thilo -- Thilo Kielmann http://www.cs.vu.nl/~kielmann/
-- Thilo Kielmann http://www.cs.vu.nl/~kielmann/

Thilo,
I did not receive any reply on our proposal for resolving the wildcards issue.
May I safely interpret the silence as agreement?
I hereby make this the "final call" for comments on this issue.
Please speak up now, or hold your peace forever!
B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.
Comparing both options, Ceriel and myself are in favour of B. It comes with fewer methods and a simpler and more obvious-to-use interface.
A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy. With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
I agree B is the better way of handling things. But what's the rationale for the 'relative to the CWD' clause? Do you want to ensure the call can be completely handled by a single middleware (adaptor)? Wouldn't it be sufficient to require that wildcard characters are allowed only in the filename part of an otherwise well-formed (perhaps partial) URL? Regards, Hartmut

Quoting [Thilo Kielmann] (Dec 02 2007):
All,
I did not receive any reply on our proposal for resolving the wildcards issue.
May I safely interpret the silence as agreement?
Sorry, I was slow in answering - but please see my other mail. Cheers, Andre.
I hereby make this the "final call" for comments on this issue.
Please speak up now, or hold your peace forever!
:-)
Thilo
On Thu, Nov 29, 2007 at 03:36:41PM +0100, Thilo Kielmann wrote:
From: Thilo Kielmann <kielmann@cs.vu.nl>
To: Andre Merzky <andre@merzky.net>
Cc: Ceriel Jacobs <ceriel@cs.vu.nl>, Thilo Kielmann <kielmann@cs.vu.nl>, SAGA RG <saga-rg@ogf.org>, Hartmut Kaiser <hartmut.kaiser@gmail.com>
Subject: URLs and wildcards (was: More confusion)
Ceriel and I have been chatting about this issue, producing a proposal for a solution.
Two observations (Thilo only) up front:
1. wildcards are ONLY applicable to the methods copy, link, move, and remove in class ns_directory, and to nothing else in the whole name space package.
2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.
Further thoughts about URLs and wildcards.
3. In ns_directory, list and find with their "pattern" parameter actually refer to pathnames, relative to the current working directory (CWD). We should say that explicitly in the spec.
4. URLs, according to the RFC do NOT provide wildcards for files. (Non-)options:
a) add specific wildcards (like '*') to the URLs we use. This would not be conformant to the RFC, so it would no longer be URLs.
b) "Use" the query mechanism for http to express wildcards for files. While possible "in theory" this would be far from obvious, so this would NOT be anything "simple" to use. (remember the "S" in SAGA)
c) Wildcard characters could be brought into URLs by %-escape sequences. Argument as with query: non-intuitive, not simple for the user.
Summary: we MUST NOT introduce file wildcards to URLs.
This leaves us (IOHO - Ceriel and me) with two possible options for wildcards for namespace entries (as expressed for operations on ns_directories):
A. Have an additional method expand that takes a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards. expand() has an output parameter, an array of URLs, the expansion.
In addition to expand(), we add versions of the methods copy, link, move, and remove from ns_directory that accept arrays of URLs instead of single URLs. (If we do not add these versions, we force the users to resort to bulk execution of tasks for a simple thing like "remove *.doc")
B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.
Comparing both options, Ceriel and myself are in favour of B. It comes with fewer methods and a simpler and more obvious-to-use interface.
A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy. With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
Any opinions on the proposal of implementing solution B ???
Thilo -- Thilo Kielmann http://www.cs.vu.nl/~kielmann/ -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Hi Thilo, all, Quoting [Thilo Kielmann] (Nov 29 2007):
Ceriel and I have been chatting about this issue, producing a proposal for a solution.
Two observations (Thilo only) up front:
1. wildcards are ONLY applicable to the methods copy, link, move, and remove in class ns_directory, and to nothing else in the whole name space package.
And to permissions_allow / permissions_deny. And in list and find of course, but those take strings, not URLs, no problem here. Here we can (and should) leave the full wildcards IMHO.
2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.
Right. The parameter for find is called name_pattern to distinguish it from the additional attrib_pattern parameter in the overloaded find method in the replica package... But yes, they are the same thing. So, if you want to have the same parameter name, it should be name_pattern I guess?
Further thoughts about URLs and wildcards.
3. In ns_directory, list and find with their "pattern" parameter actually refer to pathnames, relative to the current working directory (CWD). We should say that explicitly in the spec.
4. URLs, according to the RFC do NOT provide wildcards for files.
Hmm, a mail from me seems to have gone astray? A while ago in this thread I wrote:

| Quoting [Thilo Kielmann] (Nov 26 2007):
| || URLs, however, do not allow for wildcards, according to RFC1738.
|
| Well, RFC1738 actually refers to wildcards explicitly, e.g. in
| Section 3.6. NEWS:
|
|   If <newsgroup-name> is "*" (as in <URL:news:*>), it is
|   used to refer to "all available news groups".

So, unless my interpretation is wrong, I'd say that '*' is explicitly allowed as a wildcard.
(Non-)options:
a) add specific wildcards (like '*') to the URLs we use. This would not be conformant to the RFC, so it would no longer be URLs.
See above.
b) "Use" the query mechanism for http to express wildcards for files. While possible "in theory" this would be far from obvious, so this would NOT be anything "simple" to use. (remember the "S" in SAGA)
Yep, I agree.
c) Wildcard characters could be brought into URLs by %-escape sequences. Argument as with query: non-intuitive, not simple for the user.
I agree. Another option would be (also from my previous mail):

| And here are two other options actually for dealing with
| wildcards:
|
| - allow only *, not the full-blown shell wildcards
|
| - or use different characters for wildcards, e.g.
|
|   data_[a-z].bin -> data_((a-z)).bin
|   image.?pg      -> image01.#pg
|
| I would find the second one slightly confusing, but an
| option it is.
Summary: we MUST NOT introduce file wildcards to URLs.
Hhmmmm... ;-)
This leaves us (IOHO - Ceriel and me) with two possible options for wildcards for namespace entries (as expressed for operations on ns_directories):
A. Have an additional method expand that takes a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards. expand() has an output parameter, an array of URLs, the expansion.
In addition to expand(), we add versions of the methods copy, link, move, and remove from ns_directory that accept arrays of URLs instead of single URLs. (If we do not add these versions, we force the users to resort to bulk execution of tasks for a simple thing like "remove *.doc")
B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.
C. - allow * as wildcard in URLs (in the path element part)
   - allow normal wildcards for the string pattern in list and find
   - for all other wildcards ([a-z], ?, {one,two,three}) use expand(), and require user-level loops over the result.

A and B both have the problem of bloat -- not too badly though (6 calls).

B: why the limitation to relative path names?
Comparing both options, Ceriel and myself are in favour of B. It comes with less methods and a simpler and more obvious-to-use interface.
I vote for C *blush*.
A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy.
Agree.
With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
In favour of C: '*' is probably the most commonly used wildcard - so supporting it in the standard URL calls would help a lot. As for the other wildcards, a detour via expand does not sound too bad anymore... Cheers, Andre.
Any opinions on the proposal of implementing solution B ???
Thilo -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Hi all, Andre Merzky wrote:
Hi Thilo, all,
Quoting [Thilo Kielmann] (Nov 29 2007):
Ceriel and I have been chatting about this issue, producing a proposal for a solution.
Two observations (Thilo only) up front:
1. wildcards are ONLY applicable to the methods copy, link, move, and remove in class ns_directory, and to nothing else in the whole name space package.
And to permissions_allow / permissions_deny.
Agreed.
And in list and find of course, but those take strings, not URLs, no problem here. Here we can (and should) leave the full wildcards IMHO.
Agreed.
2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.
Right. The parameter for find is called name_pattern to distinguish it from the additional attrib_pattern parameter in the overloaded find method in the replica package... But yes, they are the same thing. So, if you want to have the same parameter name, it should be name_pattern I guess?
Further thoughts about URLs and wildcards.
3. In ns_directory, list and find with their "pattern" parameter actually refer to pathnames, relative to the current working directory (CWD). We should say that explicitly in the spec.
4. URLs, according to the RFC do NOT provide wildcards for files.
Hmm, a mail from me seems to have gone astray? A while ago in this thread I wrote:
| Quoting [Thilo Kielmann] (Nov 26 2007):
| || URLs, however, do not allow for wildcards, according to RFC1738.
|
| Well, RFC1738 actually refers to wildcards explicitly, e.g. in
| Section 3.6. NEWS:
|
|   If <newsgroup-name> is "*" (as in <URL:news:*>), it is
|   used to refer to "all available news groups".
So, unless my interpretation is wrong, I'd say that '*' is explicitly allowed as a wildcard.
True, I think I mentioned in my original mail that '*' was OK, but the other wildcard characters are not.
(Non-)options:
a) add specific wildcards (like '*') to the URLs we use. This would not be conformant to the RFC, so it would no longer be URLs.
See above.
b) "Use" the query mechanism for http to express wildcards for files. While possible "in theory" this would be far from obvious, so this would NOT be anything "simple" to use. (remember the "S" in SAGA)
Yep, I agree.
c) Wildcard characters could be brought into URLs by %-escape sequences. Argument as with query: non-intuitive, not simple for the user.
I agree. Another option would be (also from my previous mail):
| And here are two other options actually for dealing with
| wildcards:
|
| - allow only *, not the full-blown shell wildcards
|
| - or use different characters for wildcards, e.g.
|
|   data_[a-z].bin -> data_((a-z)).bin
|   image.?pg      -> image01.#pg
|
| I would find the second one slightly confusing, but an
| option it is.
Summary: we MUST NOT introduce file wildcards to URLs.
Hhmmmm... ;-)
This leaves us (IOHO - Ceriel and me) with two possible options for wildcards for namespace entries (as expressed for operations on ns_directories):
A. Have an additional method expand that takes a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards. expand() has an output parameter, an array of URLs, the expansion.
In addition to expand(), we add versions of the methods copy, link, move, and remove from ns_directory that accept arrays of URLs instead of single URLs. (If we do not add these versions, we force the users to resort to bulk execution of tasks for a simple thing like "remove *.doc")
B. Add versions of the methods copy, link, move, and remove from ns_directory that accept a string parameter describing a pathname, relative to the CWD, (possibly) containing POSIX-style shell wildcards.
C. - allow * as wildcard in URLs (in the path element part)
   - allow normal wildcards for the string pattern in list and find
   - for all other wildcards ([a-z], ?, {one,two,three}) use expand(), and require user-level loops over the result.
A and B both have the problem of bloat -- not too badly though (6 calls).
B: why the limitation to relative path names?
Not really needed, indeed, but conceptually, wildcard expansion operates on a directory, and we are talking about methods on directories here.
Comparing both options, Ceriel and myself are in favour of B. It comes with fewer methods and a simpler and more obvious-to-use interface.
I vote for C *blush*.
A is a very indirect solution where a user first has to build a list of URLs from a wildcard string, and then has to pass this list of URLs to, e.g., copy.
Agree.
With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
In favour of C: '*' is probably the most commonly used wildcard - so supporting it in the standard URL calls would help a lot. As for the other wildcards, a detour via expand does not sound too bad anymore...
I can live with C :-) although it is a bit of an ad-hoc solution. I like B a bit better, because it is more explicit about which methods accept wildcards. Ceriel

Quoting [Ceriel Jacobs] (Dec 03 2007):
B: why the limitation to relative path names?
Not really needed, indeed, but conceptually, wildcard expansion operates on a directory, and we are talking about methods on directories here.
Sorry for being thick: yes, they operate on a directory, but how does that imply relative paths? E.g., the following calls expand on the contents of a single directory, but would be impossible with relative paths:

  rm /tmp/*
  cp /tmp/*/*.jpg /home/user/images/tmp/
With B, the user can directly pass the wildcard string to, e.g., copy. The "trick" is that the string is restricted in its expressiveness, namely to pathnames relative to the CWD.
In favour of C: '*' is probably the most commonly used wildcard - so supporting it in the standard URL calls would help a lot. As for the other wildcards, a detour via expand does not sound too bad anymore...
I can live with C :-) although it is a bit of an ad-hoc solution. I like B a bit better, because it is more explicit about which methods accept wildcards.
Good points. Thanks, Andre.
Ceriel
-- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Andre Merzky wrote:
Quoting [Ceriel Jacobs] (Dec 03 2007):
B: why the limitation to relative path names? Not really needed, indeed, but conceptually, wildcard expansion operates on a directory, and we are talking about methods on directories here.
Sorry for being thick: yes, they operate on a directory, but how does that imply relative paths? E.g., the following calls expand on the contents of a single directory, but would be impossible with relative paths:
rm /tmp/*
cp /tmp/*/*.jpg /home/user/images/tmp/
Not impossible. In Java-speak:

  NSDirectory tmp = new NSDirectory(session, new URL("/tmp"), ...);
  tmp.remove("*");
  tmp.copy("*/*.jpg", new URL("/home/user/images/tmp/"), ...);

Cheers, Ceriel

Quoting [Ceriel Jacobs] (Dec 04 2007):
Andre Merzky wrote:
Quoting [Ceriel Jacobs] (Dec 03 2007):
B: why the limitation to relative path names? Not really needed, indeed, but conceptually, wildcard expansion operates on a directory, and we are talking about methods on directories here.
Sorry for being thick: yes, they operate on a directory, but how does that imply relative paths? E.g., the following calls expand on the contents of a single directory, but would be impossible with relative paths:
rm /tmp/*
cp /tmp/*/*.jpg /home/user/images/tmp/
Not impossible. In Java-speak:
NSDirectory tmp = new NSDirectory(session, new URL("/tmp"), ...);
tmp.remove("*");
tmp.copy("*/*.jpg", new URL("/home/user/images/tmp/"), ...);
Right, of course: you can always create an extra dir object that the wildcard path elements are relative to, or 'cd()' to that dir. I did not think of that *blush*. I wonder if others do ;-) Andre. -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

And to permissions_allow / permissions_deny.
Yep.
And in list and find of course, but those take strings, not URLs, no problem here. Here we can (and should) leave the full wildcards IMHO.
Yes, but that is an unrelated story.
2. In ns_directory, method list has a parameter pattern, while method find has name_pattern. This should both be "pattern". It refers to the same kind of thing.
Right. The parameter for find is called name_pattern to distinguish it from the additional attrib_pattern pattern in the overloaded find method in the replica package... But yes, they are the same thing. So, if you want to have the same parameter name, it should be name_pattern I guess?
OK.
Further thoughts about URLs and wildcards.
Another take on "why NOT having wildcards in URLs denoting files and directories":

1. The reason for having wildcards in the first place is to have something with the "look and feel" of POSIX shell wildcards in SAGA calls.
   ==> everything that contradicts this look-and-feel is to be ruled OUT
   1.a. This means that all character sequences requiring octet-encoding of wildcard characters are OUT.
   1.b. This further means that everything that cannot be used in a straightforward way is OUT (meaning: everything that is NOT simple to use).

2. When using URLs we MUST conform to RFC1738.

Let's look into RFC1738 (http://www.ietf.org/rfc/rfc1738.txt), Section 2.2, "URL Character Encoding Issues":

  Unsafe: Other characters are unsafe ... These characters are "{", "}", "|", "\", "^", "~", "[", "]", and "`". All unsafe characters must always be encoded within a URL.

  Reserved: The characters ";", "/", "?", ":", "@", "=" and "&" are the characters which may be reserved for special meaning within a scheme.

  Thus, only alphanumerics, the special characters "$-_.+!*'(),", and reserved characters used for their reserved purposes may be used unencoded within a URL.

Let's look into reserved characters per protocol:

  FTP:  Within a name or CWD component, the characters "/" and ";" are reserved and must be encoded.
  HTTP: Within the <path> and <searchpart> components, "/", ";", "?" are reserved.
  file: (no reserved characters mentioned)

Aside: the use of '*' in the NEWS scheme is irrelevant here, because it only applies to NNTP news, NOT to files or directories.

Summary: for POSIX shell-like wildcards in URLs,
- some characters like [ ] must be encoded,
- depending on the protocol, other characters MUST be encoded or not.

This means we can NOT provide wildcards in URLs in an intuitive, obvious-to-use (e.g., protocol-independent) way without violating RFC1738. We could, however, restrict ourselves to the '*' wildcard only, but this is a very limited form of wildcards - although frequently used, it is not really worth being called "POSIX shell wildcards".
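To make the character rules concrete, here is a minimal sketch in plain Java using java.net.URI (which follows the generic URI RFCs rather than RFC1738, but the relevant character classes agree): '*' is accepted unencoded, while '[' and ']' are rejected unless the user percent-encodes them.

    import java.net.URI;
    import java.net.URISyntaxException;

    public class UrlWildcardChars {
        public static void main(String[] args) {
            tryParse("ftp://ftp.cs.vu.nl/pub/ceriel/LLgen.*ar.gz");       // OK: '*' may appear unencoded
            tryParse("ftp://ftp.cs.vu.nl/pub/ceriel/data_[a-z].bin");     // rejected: '[' and ']' are unsafe
            tryParse("ftp://ftp.cs.vu.nl/pub/ceriel/data_%5Ba-z%5D.bin"); // OK, but the user had to escape
        }

        static void tryParse(String s) {
            try {
                System.out.println("OK:    " + new URI(s).getPath());
            } catch (URISyntaxException e) {
                System.out.println("ERROR: " + e.getMessage());
            }
        }
    }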
Hmm, a mail from me seem to have gone astray? A while ago in this thread I wrote:
So, unless my interpretation is wrong, I'd say that '*' is explicitely allowed as wildcards.
Your interpretation IS wrong (see above, this is ONLY applicable to NNTP)
| And here are two other options actually for dealing with
| wildcards:
|
| - allow only *, not the full-blown shell wildcards
Too limited (see above).
| - or use different characters for wildcards, e.g.
|
|   data_[a-z].bin -> data_((a-z)).bin
|   image.?pg      -> image01.#pg
Not just slightly but strongly confusing, no no.
B: why the limitation to relative path names?
Idea: keep URLs for absolute, global identifiers. Have strings with POSIX shell wildcards as local names, relative to the directory the operation is working on.
In favour of C: '*' is probably the most commonly used wildcard - so supporting it in the standard URL calls would help a lot. As for the other wildcards, a detour via expand does not sound too bad anymore...
It leaves us with the feeling of a "hack", while we could also have a clean solution: URLs without wildcards, and relative strings with them. Thilo -- Thilo Kielmann http://www.cs.vu.nl/~kielmann/

Hi Thilo, all, Quoting [Thilo Kielmann] (Dec 03 2007):
Further thoughts about URLs and wildcards.
Another take on "why NOT having wildcards in URLs denoting files and directories":
1. the reason for having wildcards in the first place is to have something with the "look and feel" of POSIX shell wildcards in SAGA calls.
IMHO, the reason was to simplify calls and usage, and we went for posix shell wildcards because they are known and simple...
==> everything that contradicts this look-and-feel is to be ruled OUT
Hmm, I am not sure about that. Why? You know, I am the first one to argue about staying close to posix, right? :-) But staying simple in the API is more important IMHO.
[...]
2. when using URLs we MUST conform to RFC1738
Right, we all agree by now that, apart from '*', other wildcard chars are invalid characters.
aside: the use of the '*' in the NEWS scheme is irrelevant here because this only applies to NNTP news, NOT to files or directories
NNTP is just an example. The point is that you can legally use '*' in URLs, and we are free to interpret it as a wildcard.
We could, however, restrict ourselves to the '*' wildcard only, but this is a very limited form of wildcards, although freqeuntly used, not really worth being called "POSIX shell wildcards".
Right, this would not be posix wildcard, but just the '*' wildcard.
Hmm, a mail from me seem to have gone astray? A while ago in this thread I wrote:
So, unless my interpretation is wrong, I'd say that '*' is explicitely allowed as wildcards.
Your interpretation IS wrong (see above, this is ONLY applicable to NNTP)
Hmm, I may very well be wrong, but can you please explain why? If '*' is a valid character, and is, for example, used by NNTP as a wildcard, why can't '*' be used as a wildcard elsewhere?
B: why the limitation to relative path names?
Idea: keep URLs for absolute, global identifiers. Have strings with POSIX shell wildcards as local names, relative to the directory the operation is working on.
Yes, got that - but that would disable rm /tmp/* for no good reason, wouldn't it? I would understand if you'd say that the string can only contain path elements of an URL or something like that. Or, better IMHO, that only the path element can contain wildcards?
In favour of C: '*' is probably the most commonly used wildcard - so supporting it in the standard URL calls would help a lot. As for the other wildcards, a detour via expand does not sound too bad anymore...
It leaves us with the feeling of a "hack", while we could also have a clean solution: URLs without wildcards, and relative strings with them.
It's a matter of taste as usual, I guess. At the moment, I would very much prefer to change (simplify!) the definition of wildcards in the spec - it just seems simpler(!) to me. I would also be happy by now with just using '*', and not having full posix wildcards (thus no need for expand()). You/Ceriel seem to prefer the other way around, for basically the same reason: simplicity ;-) Andre. -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.

Hi, Quoting [Thilo Kielmann] (Nov 26 2007):
All,
I did a global search for "wildcard" in the SAGA core spec. The result is that we are having three places using wildcards:
1. attributes
2. logical directory (using both attribute and path wildcards)
3. namespace.directory, using path wildcards.
Attribute wildcards don't pose a problem (at least to me, or until Ceriel will find one ;-)
Attributes will stay strings for the time being (i.e. until we introduce properly typed attributes). So, wildcards should not be a problem for the moment.
The path wildcards from namespace.directory, however, do bring a problem, in combination with URLs.
If I remember correctly, we switched from strings to URLs for a good reason.
Yes, one being to enforce parsing on the strings - which is exactly where it bites us now :-P
URLs, however, do not allow for wildcards, according to RFC1738.
Well, RFC1738 actually refers to wildcards explicitly, e.g. in 3.6. NEWS:

  If <newsgroup-name> is "*" (as in <URL:news:*>), it is
  used to refer to "all available news groups".

And here are two other options actually for dealing with wildcards:

- allow only *, not the full-blown shell wildcards

- or use different characters for wildcards, e.g.

    data_[a-z].bin -> data_((a-z)).bin
    image.?pg      -> image01.#pg

I would find the second one slightly confusing, but an option it is.
And the here mentioned query parts of URLs are for http only, and not for files as we would need them here.
Well, http URLs can refer to files...
If we define some "URL with wildcards" that would no longer be URLs, so this is no way to go.
What do we want/need wildcards for? The core spec writes about "shell wildcards", so we want to apply a single operation to several namespace entries at a time. (e.g.: move, copy, find,...)
Right.
This reminds me of bulk operations with SAGA tasks. But this also feels like "overkill" for the use case of file wildcards.
Well, it seemed sensible and easy back then when we had strings. Actually, wildcards are just an API optimization, right? It can always be done in user space... (ls + loop + filter). So we thought that wildcards can, in the worst case (e.g. if not supported by the backend), be provided by the implementation, with no penalty compared to application-level code. Yes, we can use the bulk mechanism, but that puts the burden of wildcard expansion back into user code. Unless one provides the expand method of course ;-)
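As a rough sketch of that "ls + loop + filter" fallback, in plain Java against a local directory (a real SAGA implementation would of course go through its own list() call; the pattern and path are just examples):

    import java.nio.file.*;

    public class UserSpaceWildcards {
        public static void main(String[] args) throws Exception {
            Path dir = Paths.get("/tmp");
            // "filter": shell-style glob matching on the file name
            PathMatcher matcher = FileSystems.getDefault().getPathMatcher("glob:*.doc");
            // "ls" + "loop": list the directory and apply the filter to each entry
            try (DirectoryStream<Path> ls = Files.newDirectoryStream(dir)) {
                for (Path entry : ls) {
                    if (matcher.matches(entry.getFileName())) {
                        System.out.println("match: " + entry); // here copy/move/remove would be applied
                    }
                }
            }
        }
    }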
My suggestion is thus to follow Ceriel (version 2):
On Thu, Nov 22, 2007 at 11:12:58AM +0100, Ceriel Jacobs wrote:
Another approach would be to have an explicit method to do wildcard expansion. For instance, in namespace.ns_directory:
expand (in string pattern, out array<saga::url> urls);
Here, the pattern only specifies the "path" part, but with wildcards (the directory implicitly specifies the rest of the url). I am not sure whether the resulting urls should be resolved with respect to the directory or not. I think not.
I think we need to spend some good thoughts on getting the parameters to this call right (do we need a pattern to compose the URLs from the expanded patterns???)
I am not sure what the last sentence means :-( the returned items _are_ URLs, so what other URLs do you want to compose? Sorry for being thick...
Besides this "expand" method, we would have to change the relevant namespace.directory methods to accept arrays of URLs instead of individual URLs.
That makes coding slightly awkward if you want to copy a single file, as you'd need to create an array for that single URL. So we would need two calls (one with array, one without), which would again bloat the API. So I'd rather vote for requiring the code to loop over the entries and to use the normal (singular) calls...
The other radical approach could be: remove file name wildcards alltogether...
More thoughts?
Thilo
My favourite at the moment:

- allow * as wildcard in URLs
- for all other wildcards ([a-z], ?, {one,two,three}) use expand(), and require user-level loops over the result.

Cheers, Andre. -- No trees were destroyed in the sending of this message, however, a significant number of electrons were terribly inconvenienced.
participants (4)
- Andre Merzky
- Ceriel Jacobs
- Hartmut Kaiser
- Thilo Kielmann