Ah, am I seeing someone volunteering here? Great! :-D

A.

Quoting [Thorsten Schuett] (Jun 20 2005):
Of course, I like the idea of adding pattern reads to SAGA. ;-)
At the same time I have the feeling that there must be a second document. Something like "The Annotated SAGA Reference Manual", a tutorial, or sample apps written in SAGA. On the one hand you should document the ideas behind the API (why did you include readE, ...) and on the other hand you should show how to solve common problems ("see how easy it is to create a module for server-side data processing in SAGA").
Thorsten
On Friday 17 June 2005 21:34, Andre Merzky wrote:
Hi List,
I went through the IO thread again, and also had a chat with John Shalf, and I'd like to summarize the outcome of the discussion. Please consider that as a joint proposal of John and me for inclusion in the file IO methods.
Observations:
- normal read/write has severe drawbacks on remote IO, if used extensively, both sync and async
- external preprocessing of data for read can be accomplished by spawning preprocessing jobs
- async is well covered by the task model
- there exist various approaches to improve throughput for IO intensive apps, amongst them:
  - (A) gather/scatter (see readv (2))
  - (B) FALLS (regular patterns on binary data)
  - (C) eRead (see ERET/ESTO in gridftp)
Remarks:
- the options A, B and C are increasingly expressive, but also require increasing coordination between client and server side.
- A is, being POSIX, well known
- B maps to hyperslabs pretty well, a seemingly common access pattern
- C maps to GridFTP, a commonly used protocol, very well
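To make the remark about B and hyperslabs concrete, here is a minimal Python sketch of how one regular pattern expands into byte ranges. The four-parameter pattern (start, block, stride, count) is a simplification assumed for illustration only; it is neither the FALLS definition nor part of any SAGA draft, and `expand_pattern` is a hypothetical helper.

```python
from typing import List, Tuple


def expand_pattern(start: int, block: int, stride: int, count: int) -> List[Tuple[int, int]]:
    """Expand a regular pattern into (offset, length) byte ranges.

    Assumed semantics (illustrative only): `count` blocks of `block` bytes,
    the first at `start`, each following one beginning `stride` bytes later.
    """
    if block < 0 or count < 0 or stride < block:
        raise ValueError("need block >= 0, count >= 0 and stride >= block")
    return [(start + i * stride, block) for i in range(count)]


# A sub-block of 4 columns by 3 rows, located at column 2, row 1 of a
# row-major field that is 10 one-byte elements wide: a single pattern
# describes the whole hyperslab.
field_width = 10
segments = expand_pattern(start=1 * field_width + 2, block=4, stride=field_width, count=3)
print(segments)  # [(12, 4), (22, 4), (32, 4)]
```

The point of the sketch is that the client sends four numbers instead of one request per row; the server side has to understand the pattern, which is the extra coordination mentioned above.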
Proposal:
- There seem to be advantages to each of A, B and C. Also, the need for more than a simple read seems obvious. Hence we propose to include A, B and C in the SAGA API.
  void readV    (in array<ivec> ivec, out array<string> buffers );
  void writeV   (in array<ivec> ivec, in array<string> buffers );

  void readP    (in pattern pattern, out string buffer, out long len_out );
  void writeP   (in pattern pattern, in string buffer, out long len_out );

  void lsEModes (out array<string> emodes );
  void readE    (in string emode, in string spec, out string buffer, out long len_out );
  void writeE   (in string emode, in string spec, in string buffer, out long len_out );

We think that adding the 7 calls does not bloat the API (although it increases the number of file methods significantly), but will make the API much more usable for the targeted use cases.
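As a rough illustration of what readV could look like from the caller's side, the Python sketch below emulates a gather read over a local file object. The proposal does not define `ivec`; treating each entry as an (offset, length) pair is an assumption made here, and `read_v` is a hypothetical name. Note that POSIX readv(2) itself fills several buffers from one contiguous file position, so this is a variation on that idea rather than a copy of it.

```python
import io
from typing import BinaryIO, List, Tuple


def read_v(f: BinaryIO, ivec: List[Tuple[int, int]]) -> List[bytes]:
    """Return one buffer per (offset, length) entry of ivec.

    Local emulation only: a remote implementation would ship the whole
    vector in a single request instead of seeking once per entry.
    """
    buffers = []
    for offset, length in ivec:
        f.seek(offset)
        buffers.append(f.read(length))
    return buffers


# In-memory stand-in for a remote file holding the bytes 0..39.
remote_file = io.BytesIO(bytes(range(40)))
header, row = read_v(remote_file, [(0, 4), (12, 4)])
print(list(header), list(row))  # [0, 1, 2, 3] [12, 13, 14, 15]
```

One call returns both the header and a data row, which is the saving the proposal is after: the number of round trips no longer grows with the number of segments.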
Please comment :-)
Cheers, Andre.
Quoting [Andre Merzky] (Jun 12 2005):
Hi again,
consider the following use case for remote IO. Given a large binary 2D field on a remote host, the client wants to access a 2D sub portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read (offset, length) is agnostic to the 2D layout.
For more complex operations (subsampling, get a piece of a jpg file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote IO.
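A back-of-the-envelope sketch in Python of the read count for the 2D use case above. The field size, block size and 50 ms round-trip time are made-up numbers, and `reads_for_block` is a hypothetical helper that assumes a row-major layout without a header.

```python
def reads_for_block(field_width: int, block_width: int, block_height: int) -> int:
    """Number of plain read(offset, length) calls needed for a 2D sub-block
    of a row-major field: one per row, unless the rows are contiguous."""
    if not 0 < block_width <= field_width or block_height <= 0:
        raise ValueError("block must be non-empty and fit into the field width")
    if block_width == field_width:
        return 1  # full-width rows are adjacent in the file, one read suffices
    return block_height


round_trip_ms = 50  # assumed network round trip, in milliseconds
n_reads = reads_for_block(field_width=10_000, block_width=500, block_height=2_000)
latency_s = n_reads * round_trip_ms / 1000
print(n_reads, "reads, about", latency_s, "s spent on latency alone")
```

With these assumed numbers the sub-block costs 2000 synchronous round trips, so the waiting time is set by the number of rows, not by the amount of data transferred.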
For that reason, I think that remote file IO as currently specified in the SAGA Strawman will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about file structure
        - easily implementable on application level
   Con: - getting the header info of a 1GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to application
        - efficient
   Con: - need to know about dependencies of reads (a header read needed to determine size of field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) data specific remote ops: send a high level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
The last approach (C) is what I have had the best experience with. Also, that is what GridFTP as a common file access protocol supports via ERET/ESTO operations.
I want to propose including an extension in the spirit of approach C in the File API of the Strawman, which basically maps well to GridFTP, but should also map to other implementations of C.
That extension would look like:
  void lsEModes (out array<string> emodes );
  void eWrite   (in string emode, in string spec, in string buffer, out long len_out );
  void eRead    (in string emode, in string spec, out string buffer, out long len_out );

  - hooks for gridftp-like opaque ERET/ESTO features
  - spec: string for pattern as in GridFTP's ESTO/ERET
  - emode: string for ident. as in GridFTP's ESTO/ERET
  EMode: a specific remote I/O command supported
  lsEModes: list the EModes available in this implementation
  eRead/eWrite: read/write data according to the emode spec
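The interplay of lsEModes and eRead can be mocked up in a few lines of Python. Everything below is hypothetical: the class `EModeFile`, the emode name `block`, and the `WxH+X+Y` spec syntax are assumptions made for this sketch, and the data lives in memory rather than behind GridFTP.

```python
import re
from typing import Callable, Dict, List, Tuple


class EModeFile:
    """In-memory stand-in for a remote file that offers extended read modes."""

    def __init__(self, data: bytes, row_width: int) -> None:
        self._data = data
        self._row_width = row_width
        # Each emode maps to a handler that interprets the opaque spec string.
        self._emodes: Dict[str, Callable[[str], bytes]] = {"block": self._read_block}

    def ls_emodes(self) -> List[str]:
        """List the emodes this (mock) implementation supports."""
        return sorted(self._emodes)

    def e_read(self, emode: str, spec: str) -> Tuple[bytes, int]:
        """Run one high-level read and return (buffer, len_out)."""
        if emode not in self._emodes:
            raise ValueError(f"unsupported emode: {emode}")
        buffer = self._emodes[emode](spec)
        return buffer, len(buffer)

    def _read_block(self, spec: str) -> bytes:
        # Assumed spec syntax: WIDTHxHEIGHT+XOFFSET+YOFFSET, in bytes and rows.
        match = re.fullmatch(r"(\d+)x(\d+)\+(\d+)\+(\d+)", spec)
        if match is None:
            raise ValueError(f"bad block spec: {spec}")
        w, h, x, y = map(int, match.groups())
        if x + w > self._row_width:
            raise ValueError("block exceeds the row width")
        rows = []
        for r in range(y, y + h):
            start = r * self._row_width + x
            rows.append(self._data[start:start + w])
        return b"".join(rows)


f = EModeFile(bytes(range(100)), row_width=10)
if "block" in f.ls_emodes():
    buffer, length = f.e_read("block", "3x2+4+5")
    print(list(buffer), length)  # [54, 55, 56, 64, 65, 66] 6
```

The spec string stays opaque to the API itself; only the handler registered for the emode interprets it, which is what would let the same two calls front GridFTP's ERET/ESTO or any other backend.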
Example (in perl for brevity):
  my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
  my @emodes = $file->lsEModes ();

  # only use the extended read if the implementation offers that emode
  if ( grep (/^jpeg_block$/, @emodes) )
  {
    my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
  }
I would discourage support for B, since I do not know of any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support on API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
--
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky@cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+