
On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:
Hi John,
Quoting [John Shalf] (Jun 13 2005):
Hi Andre, I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of GridFTP.
I disagree. You can hide latency to some extent, but your throughput suffers utterly.
If you do a full gather-scatter I/O, then this is true (the length of the request equals the size of the data item returned). Even in such a case, as long as the number of outstanding requests matches the bandwidth-delay product of the network channel (as per Little's Law), you still achieve full throughput. However, the e-modes approach is equally bad, because it simply pushes an enormous amount of complexity down a layer into the implementation. I'm not sure which is worse. My concerns are as follows:

1) The negotiation to find out the available eModes seems to require complex modules to be installed on both the client and the server side of a system. One would hope that you could implement the capabilities you need using a smaller set of elemental operations, for instance the POSIX readv() and pread() functionality to describe gather/scatter type operations (a minimal sketch follows below).

2) The implementation looks far too close to one particular data transport implementation. I'm not convinced it is the best thing out there for gather-scatter I/O over a high-latency interface. Again, I'd be interested in seeing the advantages/disadvantages of something related to the POSIX/XPG gather/scatter I/O implementation. That would cover Jon's case.

3) Are the EModes guaranteed to be stateless? In the JPEG_Block example you provide, it's not clear what the side effects are with regard to the file pointer. If some EModes have side effects on the file-pointer state whereas others do not, it's going to be impossibly messy.

So my example wasn't very well thought out, but the higher-level point I was trying to make is that there are more general ways to encode data layout descriptions for remote patterned or gather-scatter I/O operations than e-modes. The arbitrariness of the modes and their associated string parsing adds a kind of complexity that is a bit daunting at first blush.
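To make that concrete, here is a minimal sketch of the POSIX scatter-read primitive I have in mind: one readv() call filling several buffers from a single contiguous request (the file name and record layout are made up purely for illustration):

#include <fcntl.h>
#include <sys/uio.h>
#include <unistd.h>

int main(void)
{
    char header[64];
    char payload[4096];

    /* one request, two destination buffers */
    struct iovec iov[2] = {
        { .iov_base = header,  .iov_len = sizeof header  },
        { .iov_base = payload, .iov_len = sizeof payload },
    };

    int fd = open("field.dat", O_RDONLY);   /* hypothetical file */
    if (fd < 0)
        return 1;

    ssize_t n = readv(fd, iov, 2);          /* gather in one call */
    (void)n;

    close(fd);
    return 0;
}

pread() is the complementary primitive: it takes the file offset in the call itself, so it leaves no file-pointer state behind.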
Imagine Jon's use case (it's a worst-case scenario, really): you have a remote HDF5 file and want a hyperslab. The real worst case is that you want every second data item.
Now, if you rely on read as is, you have to send one read request for every single data item you want to read. If you interleave them asynchronously, you get reasonable latency, but your throughput is, well, close to zero.
If the number of outstanding requests (in terms of bytes) is equal to the bandwidth-delay product of the connection, then you will reach peak throughput. Sadly, the way I posed the solution, it would die from the excessive overhead of launching the threads.
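As a rough back-of-the-envelope illustration (the link numbers below are made up, not taken from any measurement):

#include <stdio.h>

int main(void)
{
    /* Illustrative figures only: a 1 Gbit/s link with 50 ms RTT. */
    double bandwidth = 125.0e6;   /* bytes per second         */
    double rtt       = 0.050;     /* round-trip time, seconds */
    double item      = 8192.0;    /* bytes per data item      */

    /* Little's Law: the data in flight must equal
       bandwidth * delay to keep the link saturated. */
    double in_flight = bandwidth * rtt;    /* 6.25 MB        */
    double requests  = in_flight / item;   /* ~763 requests  */

    printf("bytes in flight needed:    %.0f\n", in_flight);
    printf("outstanding item requests: %.0f\n", requests);
    return 0;
}

So a few hundred outstanding 8 KB requests would suffice here; the real problem is only the cost of how those requests are issued.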
If you want to optimize your buffer size, you have to read more than one data item CONSECUTIVELY. Since the use case says you are interested in every second data item, you effectively have to read ALL the data.
You would definitely not want to read them consecutively -- you'd want to read all of the data items you need concurrently (thereby necessitating that the file offset be encoded in each request). I do agree with you that my off-the-cuff proposal for launching one async task per data item is not practical, due to the excessive software overhead. However, I don't see why you cannot launch as many concurrent requests as you need to satisfy Little's Law.
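Here is a minimal sketch of what I mean, using POSIX AIO purely as a local stand-in for "many independent requests in flight" -- it says nothing about the remote protocol, and the file name, item size, and window size are assumptions:

#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define NREQ 64        /* outstanding requests, sized from Little's Law */
#define ITEM 8192      /* bytes per data item (assumed)                 */

int main(void)
{
    int fd = open("field.dat", O_RDONLY);   /* hypothetical file */
    struct aiocb cbs[NREQ];
    char (*buf)[ITEM] = malloc(sizeof(char[NREQ][ITEM]));

    if (fd < 0 || buf == NULL)
        return 1;

    /* submit all requests without waiting; each one carries its
       own offset ("every second item"), so they are independent */
    for (int i = 0; i < NREQ; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes = fd;
        cbs[i].aio_buf    = buf[i];
        cbs[i].aio_nbytes = ITEM;
        cbs[i].aio_offset = (off_t)(2 * i) * ITEM;
        aio_read(&cbs[i]);
    }

    /* collect completions (any order) */
    for (int i = 0; i < NREQ; i++) {
        const struct aiocb *const one[1] = { &cbs[i] };
        while (aio_error(&cbs[i]) == EINPROGRESS)
            aio_suspend(one, 1, NULL);
        aio_return(&cbs[i]);
    }

    free(buf);
    close(fd);
    return 0;
}

Because every request carries its own offset, they may complete in any order; real code would refill the window as requests finish rather than waiting for the whole batch.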
The same holds if you want every 10th data item -- only the ratio gets even worse. So interleaving works efficiently only for sufficiently _large_ independent read requests (then it's perfect, of course).
That is curious... Interleaving on vector machines is used for precisely the opposite purpose (hundreds of very small independent read requests). Latency hiding and throughput are intimately connected. I would expect that all of the read requests for a hyperslab are independent, provided the file-pointer state is encoded in the request. This is precisely what readv()/pread() does. Should we find some case that causes problems for a readv()/pread() model? Hyperslabbing is clearly not one of those cases.
I think the task model and the proposed eRead model are orthogonal. The task model gives you asynchrony; the eRead gives you efficiency (throughput).
Pipelining is used to achieve throughput, and pipelining is achieved via concurrent async operations. I agree that launching one task per byte is going to be inefficient, but it is inefficient because of the software overhead of launching a new task (not because async request/response is inefficient). SCSI disk interfaces and DDR DRAMs depend on submitting async requests for data that get fulfilled later (sometimes out of order). They achieve throughput using a far simpler model than ERET/ESTO. It's worth looking at simpler models for defining deeply pipelined remote gather/scatter operations.
Also, as a side note: I know about some of the discussions the GridFTP folks had about efficient remote file I/O. They were similar to this one, and the ERET/ESTO model was what was finally agreed on.
I'm not sure if the ERET/ESTO solves the problem at hand. The complexity has been pushed to a different layer of the software stack.
Cheers, Andre.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()", which would allow you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
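A hypothetical sketch of what those calls might look like (the names and the saga_file_t handle type are made up here; the model is simply POSIX pread()/pwrite(), which carry the offset with the request and therefore leave no file-pointer state behind):

/* Hypothetical only -- not part of the strawman API. */
ssize_t readFrom (saga_file_t *f, off_t offset, void       *buf, size_t len);
ssize_t writeTo  (saga_file_t *f, off_t offset, const void *buf, size_t len);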
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
Consider the following use case for remote I/O: given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, this usually requires more than one read operation, since the standard read (offset, length) is agnostic of the 2D layout.
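To illustrate the access pattern (assuming a row-major layout and a fixed element size; the file name and dimensions are made up), reading an n_rows x n_cols sub-block takes one read per row -- done remotely, every pread() below becomes a round trip:

#include <fcntl.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

static void read_subblock(int fd, size_t W, size_t elem,
                          size_t row0, size_t col0,
                          size_t n_rows, size_t n_cols, char *out)
{
    for (size_t r = 0; r < n_rows; r++) {
        off_t off = (off_t)(((row0 + r) * W + col0) * elem);
        /* one request per row: n_rows round trips if done remotely */
        pread(fd, out + r * n_cols * elem, n_cols * elem, off);
    }
}

int main(void)
{
    int fd = open("field.dat", O_RDONLY);       /* hypothetical file */
    if (fd < 0)
        return 1;

    /* a 256x256 block of doubles out of a field 4096 columns wide */
    char *block = malloc(256 * 256 * sizeof(double));
    read_subblock(fd, 4096, sizeof(double), 100, 200, 256, 256, block);

    free(block);
    close(fd);
    return 0;
}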
For more complex operations (subsampling, getting a piece of a JPEG file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote I/O.
For that reason, I think that remote file I/O as specified by SAGA's Strawman as-is will be usable only for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) Get the whole thing, and do the operations locally.
   Pro: - one remote op
        - simple logic
        - the remote side doesn't need to know about the file structure
        - easily implementable on the application level
   Con: - getting the header info of a 1 GB data file comes with, well, some overhead ;-)
B) Clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to the application
        - efficient
   Con: - need to know about dependencies between the reads (a header read needed to determine the size of a field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
   (A local analogue of this clustering is sketched after this list.)
C) Data-specific remote ops: send a high-level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
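As a point of comparison for B: the batching itself is easy to express locally, e.g. with POSIX lio_listio(), which hands a whole list of reads over in one submission. This is only a local analogue -- it does not solve the protocol problem listed under Con above, and the file name and sizes are made up:

#include <aio.h>
#include <fcntl.h>
#include <string.h>
#include <unistd.h>

#define NREAD 16
#define CHUNK 4096

int main(void)
{
    int fd = open("field.dat", O_RDONLY);   /* hypothetical file */
    static char buf[NREAD][CHUNK];
    struct aiocb  cbs[NREAD];
    struct aiocb *list[NREAD];

    if (fd < 0)
        return 1;

    for (int i = 0; i < NREAD; i++) {
        memset(&cbs[i], 0, sizeof cbs[i]);
        cbs[i].aio_fildes     = fd;
        cbs[i].aio_buf        = buf[i];
        cbs[i].aio_nbytes     = CHUNK;
        cbs[i].aio_offset     = (off_t)i * 2 * CHUNK;  /* strided pattern */
        cbs[i].aio_lio_opcode = LIO_READ;
        list[i] = &cbs[i];
    }

    /* one call submits the whole batch and waits for completion */
    lio_listio(LIO_WAIT, list, NREAD, NULL);

    close(fd);
    return 0;
}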
The last approach (C) is the one I have the best experience with. It is also what GridFTP, as a common file access protocol, supports via the ERET/ESTO operations.
I want to propose including a C-like extension to the File API of the Strawman, which basically maps well to GridFTP, but should also map to other implementations of approach C.
That extension would look like:
void lsEModes (out array<string,1> emodes  );

void eWrite   (in  string  emode,
               in  string  spec,
               in  string  buffer,
               out long    len_out );

void eRead    (in  string  emode,
               in  string  spec,
               out string  buffer,
               out long    len_out );
- hooks for GridFTP-like opaque ERET/ESTO features
- spec:  string for the pattern, as in GridFTP's ESTO/ERET
- emode: string for the identifier, as in GridFTP's ESTO/ERET
EMode:        a specific remote I/O command supported
lsEModes:     list the EModes available in this implementation
eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
my @emodes = $file->lsEModes ();

if ( grep (/^jpeg_block$/, @emodes) ) {
  my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
}
I would discourage support for B, since I do not know of any protocol that supports that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on the application level, or within any SAGA implementation, there is no need to support it on the API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+