
Hi Mathijs,

I'll answer on the list if you don't mind, as several items relate to earlier discussions we had...

Quoting [Thilo Kielmann] (Oct 15 2009):
food for thought...
----- Forwarded message from Mathijs den Burger <mathijs@cs.vu.nl> -----
Subject: Re: speeding up (???) Xterior
From: Mathijs den Burger <mathijs@cs.vu.nl>
To: Thilo Kielmann <kielmann@cs.vu.nl>
Cc: Tudor Zaharia <tudor.zaharia@gmail.com>
On Wed, 2009-10-14 at 22:38 +0200, Thilo Kielmann wrote:
we just discussed performance problems with getting file size and permissions for all entries of a directory.
It seems like there might be a fast solution we may have overlooked so far: the directory (defined in namespace??) has methods for getting file size and permissions with a URL parameter denoting a file you get via d.list().
Andre suggested that d.list may even cache all the info about its entries, such that d.list() would be the only call talking to the backend.
We did not overlook that. The problem is that 'd.getSize(entryURL)' has to be called for EACH entry you get from d.list(). An adaptor basically has two options for implementing that:
1. each getSize(URL) call performs a separate remote operation. That is what all adaptors in Java SAGA currently do (e.g. via an FTP or SSH command). With large remote directories, this results in a DoS attack on the remote server, which then shuts you out or simply becomes unresponsive (we see that with our FTP and SSH adaptors).
2. the adaptor caches all directory information, including file sizes, modification dates, etc. It would then have to perform only one remote operation, and the retrieved info can be reused for subsequent getSize(), list(), getLastModificationDate() etc. calls. However, there is no mechanism in SAGA to invalidate or update such a cache, and it may fill up your memory rather quickly. Also, each adaptor has to reimplement caching itself. This can be circumvented by letting the engine perform the caching, but again, there is no general mechanism to invalidate or bypass such a cache.
But well, there are standard ways to invalidate/refresh caches, most commonly via a time-to-live (TTL) for the cache. Even if you set that TTL to only a couple of seconds, you should see exactly the speedup you are looking for. I don't think that this is too complicated, really (pseudo code; note the check compares the cache age against the TTL):

  file.get_size (url u)
  {
    if ( cache.data.empty || time.now () - cache.created () > 5 )
    {
      cache.data    = dir.get_sizes (u.get_pwd ());
      cache.created = time.now ();
    }
    return cache.data [u.get_name ()];
  }

As for where to cache (engine, adaptor, or some external lib): that is a tradeoff which is best decided in your implementation, IMHO.
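For the Java SAGA case, a minimal sketch of that TTL idea might look like the following. All class and method names here are illustrative assumptions, not part of the SAGA API:

  import java.net.URI;
  import java.util.Map;

  // Hypothetical TTL cache for per-directory metadata (not part of the
  // SAGA API): one remote listing fills the cache, and subsequent
  // getSize() calls for entries of the same directory are served locally.
  class DirMetadataCache {
      private static final long TTL_MILLIS = 5_000;

      private final RemoteDirectory dir;  // hypothetical backend handle
      private Map<String, Long> sizes;    // entry name -> size in bytes
      private long created;               // time the cache was filled

      DirMetadataCache(RemoteDirectory dir) {
          this.dir = dir;
      }

      synchronized long getSize(URI entry) {
          long now = System.currentTimeMillis();
          if (sizes == null || now - created > TTL_MILLIS) {
              sizes = dir.listSizes();  // the single remote operation
              created = now;
          }
          // sketch assumes the entry exists in the listing
          return sizes.get(fileName(entry));
      }

      private static String fileName(URI u) {
          String path = u.getPath();
          return path.substring(path.lastIndexOf('/') + 1);
      }
  }

  // Hypothetical backend interface: one call retrieves all entry sizes.
  interface RemoteDirectory {
      Map<String, Long> listSizes();
  }

The same cache could of course also hold permissions and modification dates, so that getLastModificationDate() etc. are served from the same single listing.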
The solution would be to have method calls like d.getSize(List<URL> entries). The adaptor can then retrieve the file sizes of all entries as efficiently as possible. The current way of specifying such bulk operations is via TaskContainers, which are very tedious to analyse.
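For illustration, such a bulk variant might look roughly like this. This is a hypothetical extension, not part of the current SAGA interface:

  import java.net.URL;
  import java.util.List;

  // Hypothetical bulk extension of the directory interface: the adaptor
  // is free to satisfy the whole request with a single remote listing.
  interface BulkDirectory {
      // Returns the size of each entry, in the same order as 'entries'.
      List<Long> getSize(List<URL> entries);
  }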
That is an implementation problem. I don't think we should expose all kinds of calls which are easier/faster to implement on the application level. Nobody ever claimed SAGA is easy to implement! ;-)

Cheers, Andre.
Java SAGA does not do that at all (by default, it simply starts a Thread for each Task: an even more effective DoS attack for many remote directory entries).
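One common mitigation, offered here only as a sketch (not something Java SAGA currently does), is to run a container's tasks on a bounded thread pool, which caps the number of concurrent remote operations:

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  // Sketch: execute the tasks of a container on a fixed-size pool
  // instead of one thread per task, so at most POOL_SIZE remote
  // operations are in flight at any time. 'tasks' stands in for the
  // container's contents.
  class BoundedTaskRunner {
      private static final int POOL_SIZE = 8;

      static void runAll(List<Runnable> tasks) throws InterruptedException {
          ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
          for (Runnable t : tasks) {
              pool.submit(t);
          }
          pool.shutdown();                          // accept no new tasks
          pool.awaitTermination(1, TimeUnit.HOURS); // wait for completion
      }
  }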
-- Nothing is ever easy.