
Hi Mathijs,

I'll answer on the list if you don't mind, as several items relate to earlier discussions we had...

Quoting [Thilo Kielmann] (Oct 15 2009):
food for thought...
----- Forwarded message from Mathijs den Burger <mathijs@cs.vu.nl> -----
Subject: Re: speeding up (???) Xterior
From: Mathijs den Burger <mathijs@cs.vu.nl>
To: Thilo Kielmann <kielmann@cs.vu.nl>
Cc: Tudor Zaharia <tudor.zaharia@gmail.com>
On Wed, 2009-10-14 at 22:38 +0200, Thilo Kielmann wrote:
we just discussed performance problems with getting file size and permissions for all entries of a directory.
It seems like there might be a fast solution we may have overlooked so far: the directory (defined in namespace??) has methods for getting file size and permissions with a URL parameter denoting a file you get via d.list().
Andre suggested that d.list may even cache all the info about its entries, such that d.list() would be the only call talking to the backend.
We did not overlook that. The problem is that 'd.getSize(entryURL)' has to be called for EACH entry you get from d.list(). An adaptor basically has two options for implementing that:
1. each getSize(URL) call performs a separate remote operation. That is what all adaptors in Java SAGA currently do (e.g. via an FTP or SSH command). With large remote directories, this results in a DoS attack on the remote server, which then shuts you out or simply becomes unresponsive (we see that with our FTP and SSH adaptors).
2. the adaptor caches all directory information, including file sizes, modification dates, etc. It would then have to perform only one remote operation, and the retrieved info can be reused for subsequent getSize(), list(), getLastModificationDate() etc. calls. However, there is no mechanism in SAGA to invalidate or update such a cache, and it may fill up your memory rather quickly. Also, each adaptor has to reimplement caching itself. This can be circumvented by letting the engine perform the caching, but again, there is no general mechanism to invalidate or bypass such a cache.
But well, there are standard ways to invalidate/refresh caches, most commonly via a time-to-live (TTL) for the cache. Even if you set that TTL to only a couple of seconds, you should see exactly the speedup you are looking for. I don't think that this is too complicated, really (pseudo code; note the check compares the cache age against the TTL):

  file.get_size (url u)
  {
    if ( cache.data.empty || time.now () - cache.created () > 5 )
    {
      cache.data    = dir.get_sizes (u.get_pwd ());
      cache.created = time.now ();
    }
    return cache.data [u.get_name ()];
  }

As for where to cache (engine, adaptor, or some external lib): that is a tradeoff which is best decided in your implementation, IMHO.
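For the Java SAGA case, a minimal sketch of that TTL idea might look like the following. All class and method names here are illustrative assumptions, not part of the SAGA API:

  import java.net.URI;
  import java.util.Map;

  // Hypothetical TTL cache for per-directory metadata (not part of the
  // SAGA API): one remote listing fills the cache, and subsequent
  // getSize() calls for entries of the same directory are served locally.
  class DirMetadataCache {
      private static final long TTL_MILLIS = 5_000;

      private final RemoteDirectory dir;  // hypothetical backend handle
      private Map<String, Long> sizes;    // entry name -> size in bytes
      private long created;               // time the cache was filled

      DirMetadataCache(RemoteDirectory dir) {
          this.dir = dir;
      }

      synchronized long getSize(URI entry) {
          long now = System.currentTimeMillis();
          if (sizes == null || now - created > TTL_MILLIS) {
              sizes = dir.listSizes();  // the single remote operation
              created = now;
          }
          // sketch assumes the entry exists in the listing
          return sizes.get(fileName(entry));
      }

      private static String fileName(URI u) {
          String path = u.getPath();
          return path.substring(path.lastIndexOf('/') + 1);
      }
  }

  // Hypothetical backend interface: one call retrieves all entry sizes.
  interface RemoteDirectory {
      Map<String, Long> listSizes();
  }

The same cache could of course also hold permissions and modification dates, so that getLastModificationDate() etc. are served from the same single listing.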
The solution would be to have method calls like d.getSize(List<URL> entries). The adaptor can then retrieve the file sizes of all entries as efficiently as possible. The current way of specifying such bulk operations is via TaskContainers, which are very tedious to analyse.
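For illustration, such a bulk variant might look roughly like this. This is a hypothetical extension, not part of the current SAGA interface:

  import java.net.URL;
  import java.util.List;

  // Hypothetical bulk extension of the directory interface: the adaptor
  // is free to satisfy the whole request with a single remote listing.
  interface BulkDirectory {
      // Returns the size of each entry, in the same order as 'entries'.
      List<Long> getSize(List<URL> entries);
  }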
That is an implementation problem. I don't think we should expose all kinds of calls which are easier/faster to implement on the application level. Nobody ever claimed SAGA is easy to implement! ;-)

Cheers, Andre.
Java SAGA does not do that at all (by default, it simply starts a Thread for each Task: an even more effective DoS attack for many remote directory entries).
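One common mitigation, offered here only as a sketch (not something Java SAGA currently does), is to run a container's tasks on a bounded thread pool, which caps the number of concurrent remote operations:

  import java.util.List;
  import java.util.concurrent.ExecutorService;
  import java.util.concurrent.Executors;
  import java.util.concurrent.TimeUnit;

  // Sketch: execute the tasks of a container on a fixed-size pool
  // instead of one thread per task, so at most POOL_SIZE remote
  // operations are in flight at any time. 'tasks' stands in for the
  // container's contents.
  class BoundedTaskRunner {
      private static final int POOL_SIZE = 8;

      static void runAll(List<Runnable> tasks) throws InterruptedException {
          ExecutorService pool = Executors.newFixedThreadPool(POOL_SIZE);
          for (Runnable t : tasks) {
              pool.submit(t);
          }
          pool.shutdown();                          // accept no new tasks
          pool.awaitTermination(1, TimeUnit.HOURS); // wait for completion
      }
  }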
-- Nothing is ever easy.