Hi John, Quoting [John Shalf] (Jun 13 2005):
Hi Andre, I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of the GridFTP.
I disagree. You can hide latency to some extent, but your throughput suffers utterly. Imagine Jon's use case (it's a worst-case scenario, really): you have a remote HDF5 file and want a hyperslab. The really worst case is that you want every second data item. Now, if you rely on read() as is, you have to send one read request for every single data item you want. If you interleave them asynchronously, you get reasonable latency, but your throughput is, well, close to zero. If you want to optimize your buffer size, you have to read more than one data item CONSECUTIVELY. Since the use case says you are interested in every second data item, you effectively have to read ALL the data. The same holds if you want every 10th data item -- only the ratio gets even worse. So, interleaving works efficiently only for sufficiently _large_ independent read requests (then it's perfect, of course).
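To put numbers on that worst case, here is a small illustrative cost sketch (plain Python, nothing to do with any SAGA API; the function names and the 50 ms latency figure are assumptions for the example):

```python
# Hypothetical cost model: compare per-item reads against one bulk read
# when a client wants every second item of a remote 1D array.

def per_item_cost(n_items, item_size, latency_ms):
    """One read(offset, length) round trip per wanted item."""
    wanted = n_items // 2                    # every second item
    return wanted, wanted * item_size, wanted * latency_ms

def bulk_cost(n_items, item_size, latency_ms):
    """One read over the whole range; half the payload is thrown away."""
    return 1, n_items * item_size, latency_ms

# 1M items of 8 bytes over an assumed 50 ms WAN link:
print(per_item_cost(1_000_000, 8, 50))   # (500000, 4000000, 25000000) -> ~7 h of pure latency
print(bulk_cost(1_000_000, 8, 50))       # (1, 8000000, 50) -> one RTT, but 2x the payload
```

Even with perfect pipelining, the per-item variant moves half the bytes at the price of half a million requests; the bulk variant pays one round trip but reads everything.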
The advantage of simply embedding that functionality into the higher-level concept of asynchronous calls is that if the underlying library does *not* support the async operations (or some subset of the operations cannot be performed asynchronously), you can always perform the operations synchronously and still be able to present the same interface.
I do not like Plan A or B for the reasons you state. I do not like Plan C because it is too tightly tied to a specific data transfer system implementation. I would propose a Plan D that simply augments the Task interface of SAGA. For example, you could allow the user to fire off a number of async read operations:

  Task handle1 = channel.read();
  Task handle2 = channel.read();
  container.addTask(handle1);
  container.addTask(handle2);
  container.waitAll();
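A minimal sketch of this task-container pattern, using Python futures (Channel and TaskContainer are illustrative stand-ins, not the actual SAGA interfaces):

```python
# Sketch of the proposed "Plan D" pattern with Python futures.
from concurrent.futures import ThreadPoolExecutor, wait

class Channel:
    """Toy stateful channel: each read() consumes the next chunk."""
    def __init__(self, data):
        self._data, self._pos = data, 0
    def read(self, length):
        chunk = self._data[self._pos:self._pos + length]
        self._pos += length
        return chunk

class TaskContainer:
    def __init__(self):
        self._tasks = []
    def add_task(self, task):
        self._tasks.append(task)
    def wait_all(self):
        wait(self._tasks)
        return [t.result() for t in self._tasks]

pool = ThreadPoolExecutor(max_workers=1)           # one worker keeps reads ordered
channel = Channel(b"abcdef")
container = TaskContainer()
container.add_task(pool.submit(channel.read, 2))   # like: Task handle1 = channel.read()
container.add_task(pool.submit(channel.read, 2))   # like: Task handle2 = channel.read()
print(container.wait_all())                        # [b'ab', b'cd']
pool.shutdown()
```

The single worker thread here mirrors the "execute at waitAll()" latitude: submission only queues the work, and an implementation is free to run it lazily or synchronously.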
The read operations in this example can be submitted as an eRead
No, it can't be efficiently expressed as an eRead anymore. The implementation sees only a number of reads, but won't be able to recognize any usable pattern (or at least that is REALLY tough). So the implementation cannot send the request filename hyperslab=([2,3,4][3,4,5]), but has to send each read command: filename read(2,1), read(4,1), read(6,1) ... Worst case: you send more than one byte of command for every single byte you request as data (ugh!). (BTW: think of the HDF5 file driver: that is exactly the problem there: the file driver does not know about semantics anymore, but sees only read, write and seek. Hence it is difficult to efficiently implement a remote HDF5 file driver. Andrei did that, but we had to smuggle semantic information down to the I/O level...) I think there is NO data-model-agnostic way which provides good efficiency for remote data access. I'd be happily convinced otherwise, but that's what I see right now. If (only if; again: please convince me otherwise), so IF one accepts that data model information has to be part of a remote read request, an eRead a la GridFTP is the best (most generic and most simple (!)) solution I know of.
operation or they can be in separate threads, or they can simply be executed synchronously when you call waitAll() (this is in fact how some of the async MPI I/O was done on the first SGI origin machines... it was meant to look asynchronous, but in fact the calls did not initiate until you did a "Wait" for them).
Anyway, using the task interface provides more degrees of freedom for implementing async I/O than simply supporting the GridFTP way of doing things, and it meshes gracefully with I/O implementations that do *not* offer an underlying async execution model.
I think the task model and the proposed eRead model are orthogonal. The task model gives you asynchronicity; the eRead gives you efficiency (throughput). Also, as a side note: I know about some of the discussions the GridFTP folks had about efficient remote file I/O. They were similar to this one, and the ERET/ESTO model was what they finally agreed on. Cheers, Andre.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()" which allows you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
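The point about statefulness can be made concrete with POSIX pread, which is exactly a readFrom()-style call (a sketch on POSIX systems; the read_from name is made up here):

```python
# Why a stateful read() breaks the task model: concurrent reads race on the
# shared file offset. A readFrom(offset, length) style call carries its own
# offset (like pread(2)), so completion order no longer matters.
import os
import tempfile

def read_from(fd, offset, length):
    """Stateless read: the offset travels with the request (os.pread)."""
    return os.pread(fd, length, offset)

fd, path = tempfile.mkstemp()
os.write(fd, b"0123456789")
# Issue the reads "out of order"; each names its own offset, so the result
# does not depend on any shared file pointer.
print(read_from(fd, 6, 4), read_from(fd, 0, 4))   # b'6789' b'0123'
os.close(fd)
os.remove(path)
```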
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
consider the following use case for remote I/O. Given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read(offset, length) is agnostic of the 2D layout.
For more complex operations (subsampling, getting a piece of a JPEG file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote I/O.
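A back-of-envelope sketch for the 2D case (assuming a row-major layout and 8-byte items; the function name is made up for illustration):

```python
# Reading an sx-by-sy subblock of an NX-wide row-major 2D field with plain
# read(offset, length) costs one read per subblock row, since only the rows
# of the subblock are contiguous in the file.
ITEM = 8   # bytes per data item (assumption for this example)

def subblock_reads(nx, x0, y0, sx, sy):
    """(offset, length) of each per-row read for subblock (x0, y0, sx, sy)."""
    return [(((y0 + r) * nx + x0) * ITEM, sx * ITEM) for r in range(sy)]

reads = subblock_reads(nx=4096, x0=100, y0=200, sx=512, sy=512)
print(len(reads))   # 512 separate remote reads for one modest subblock
print(reads[0])     # (6554400, 4096)
```

Each of those 512 reads pays a full round trip if issued naively, which is the latency problem described above.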
For that reason, I think that remote file I/O as specified by SAGA's Strawman will, as is, only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) Get the whole thing, and do the ops locally.
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about the file structure
        - easily implementable on application level
   Con: - getting the header info of a 1GB data file comes with, well,
          some overhead ;-)
B) Clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to the application
        - efficient
   Con: - need to know about dependencies between reads (a header read
          may be needed to determine the size of a field), or include
          explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) Data-specific remote ops: send a high-level command, and get exactly
   what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
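For approach B, the core of the clustering idea is coalescing queued (offset, length) reads into fewer wire requests. A minimal illustrative sketch (the function name and API are made up for this example):

```python
# Approach B in miniature: buffer read requests, then merge adjacent or
# overlapping (offset, length) ranges before sending them to the remote side.

def coalesce(requests):
    """Merge (offset, length) ranges that touch or overlap, after sorting."""
    merged = []
    for off, length in sorted(requests):
        if merged and off <= merged[-1][0] + merged[-1][1]:
            last_off, last_len = merged[-1]
            end = max(last_off + last_len, off + length)
            merged[-1] = (last_off, end - last_off)
        else:
            merged.append((off, length))
    return merged

# A header read plus three field reads collapse into two wire requests:
print(coalesce([(0, 64), (64, 16), (200, 8), (80, 8)]))   # [(0, 88), (200, 8)]
```

Note this only handles independent reads; the dependency problem named in the Con list (a header read determining later offsets) is exactly what a pure coalescing layer cannot solve.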
The last approach (C) is what I have best experiences with. Also, that is what GridFTP as a common file access protocol supports via ERET/ESTO operations.
I want to propose to include a C-like extension to the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of C.
That extension would look like:
  void lsEModes (out array   emodes  );
  void eWrite   (in  string  emode,
                 in  string  spec,
                 in  string  buffer,
                 out long    len_out );
  void eRead    (in  string  emode,
                 in  string  spec,
                 out string  buffer,
                 out long    len_out );

  - hooks for GridFTP-like opaque ERET/ESTO features
  - spec:  string for pattern as in GridFTP's ESTO/ERET
  - emode: string for identifier as in GridFTP's ESTO/ERET
  EMode:        a specific remote I/O command supported
  lsEModes:     list the EModes available in this implementation
  eRead/eWrite: read/write data according to the emode spec
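A hypothetical Python rendering of the same extension may make the shape clearer (method names mirror the IDL sketch; nothing here is official SAGA API, and the handler dict stands in for whatever the remote side actually implements):

```python
# Illustrative eRead/lsEModes interface: the client discovers which high-level
# commands (EModes) the remote side supports, then asks for exactly the data
# it wants in one request.

class EFile:
    def __init__(self, url, emodes):
        self.url = url
        self._emodes = emodes          # emode name -> handler(spec)
    def ls_emodes(self):
        return sorted(self._emodes)
    def e_read(self, emode, spec):
        if emode not in self._emodes:
            raise ValueError("emode %r not supported by remote side" % emode)
        buf = self._emodes[emode](spec)
        return buf, len(buf)           # (buffer, len_out), as in the IDL

# Toy emode handler; a real one would run remotely (e.g. an ERET module
# extracting a block from a JPEG).
f = EFile("any://host/logo.gif",
          {"jpeg_block": lambda spec: ("<block %s>" % spec).encode()})
if "jpeg_block" in f.ls_emodes():
    buf, n = f.e_read("jpeg_block", "22x4+7+8")
    print(buf, n)   # b'<block 22x4+7+8>' 16
```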
Example (in perl for brevity):
  my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
  my @emodes = $file->lsEModes ();

  if ( grep (/^jpeg_block$/, @emodes) )
  {
    my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
  }
I would discourage support for B, since I do not know of any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on the application level, or within any SAGA implementation, there is no need to support it at the API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
--
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky@cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+