Hi again,
consider the following use case for remote I/O: given a
large binary 2D field on a remote host, the client wants to
access a 2D sub-portion of that field. Depending on the
remote file layout, that usually requires more than one read
operation, since the standard read (offset, length) is
agnostic to the 2D layout.
For more complex operations (subsampling, getting a piece of
a JPEG file), the number of remote operations grows very
fast. Latency then strongly discourages that type of remote
I/O.
For that reason, I think that the remote file I/O as
specified by SAGA's strawman will, as is, only be usable
for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
Pro: - one remote op,
- simple logic
- remote side doesn't need to know about file
structure
- easily implementable on application level
Con: - getting the header info of a 1GB data file comes
with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a
single request.
Pro: - transparent to application
- efficient
Con: - need to know about dependencies of reads
       (a header read needed to determine the size
       of the field), or include explicit 'flushes'
- need a protocol to support that
- the remote side needs to support that
C) data specific remote ops: send a high level command,
and get exactly what you want.
Pro: - most efficient
Con: - need a protocol to support that
- the remote side needs to support that _specific_
command
The last approach (C) is the one I have the best experience with.
Also, that is what GridFTP as a common file access protocol
supports via ERET/ESTO operations.
I want to propose to include an extension of type C to the
File API of the strawman, which basically maps well to
GridFTP, but should also map to other implementations of
approach C.
That extension would look like:
  void lsEModes (out array<string> emodes );

  void eWrite   (in  string emode,
                 in  string spec,
                 in  string buffer,
                 out long   len_out );

  void eRead    (in  string emode,
                 in  string spec,
                 out string buffer,
                 out long   len_out );

  - hooks for GridFTP-like opaque ERET/ESTO features
  - spec:  string for the pattern as in GridFTP's ESTO/ERET
  - emode: string for the identifier as in GridFTP's ESTO/ERET
Andre Merzky wrote:
I want to propose to include a C-like extension to the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of C.
Agreed here.
That extension would look like:
  void lsEModes (out array<string> emodes );

  void eWrite   (in  string emode,
                 in  string spec,
                 in  string buffer,
                 out long   len_out );

  void eRead    (in  string emode,
                 in  string spec,
                 out string buffer,
                 out long   len_out );

  - hooks for GridFTP-like opaque ERET/ESTO features
  - spec:  string for the pattern as in GridFTP's ESTO/ERET
  - emode: string for the identifier as in GridFTP's ESTO/ERET
  EMode:        a specific remote I/O command supported
  lsEModes:     list the EModes available in this implementation
  eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
  my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
  my @emodes = $file->lsEModes ();

  if ( grep (/^jpeg_block$/, @emodes) ) {
    my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
  }
I would discourage support for B, since I do not know of any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support on API level -- however, A is insufficient for all but some trivial cases.
This approach is very generic on the API level (that's good), but requires exact agreement on the command syntax used by the client and the server, which may get problematic. If we go this route we will definitely end up specifying at least a minimal command subset to be supported by the eRead/eWrite commands.

I simply fear we'll have the same problems we have with the GAT today. The GAT API is in principle usable in a broad range of use cases based on a generic API. The genericity is ensured by using key/value tables in the API itself, allowing quick adaptation to any concrete need. The problem is the missing specification of these key/value pairs, which makes it difficult to achieve reusability.

Regards Hartmut
Hallo Hartmut, Quoting [Hartmut Kaiser] (Jun 13 2005):
This approach is very generic on the API level (that's good) but requires exact agreement on the used command syntax for the client and the server, which may get problematic. If we go this route we will definitely end up specifying at least a minimal command subset to be supported by the eRead/eWrite commands.
You are right: complexity does not go away magically, but gets moved to the specification of the eModes.

As for a minimal set: I do not think that this is necessary - the eMode is SUPPOSED to be application specific. OTOH, an intuitive example usable for some cases may be helpful. GridFTP's standard ERET example is partial file access (IIRC: filename, offset, length). That is not very useful for SAGA, since that is already covered by the normal read/write operations.
I simply fear we'll have the same problems we have with the GAT today. The GAT API is in principle usable in a broad range of use cases based on a generic API. The genericity is ensured by using key/value tables in the API itself, allowing quick adaptation to any concrete need. The problem is the missing specification of these key/value pairs which makes it difficult to achieve reusability.
I absolutely agree that the problem lies right there: semantic overloading of strings. The situation is somewhat better than in GAT though:

- the preferences in GAT are really generic, and can be used for anything; the eModes have a very limited scope, and are hence much easier to agree on between different implementations

- the mapping to GridFTP is 1:1, and GridFTP is quite commonly used, so there is at least some other instance to be used for agreement on the modes; hence, every implementation of an eMode can be expected to do the same thing - at least there is a good chance for that

However, again: you are right. Semantic overloading of strings is not a nice thing to do, and is here only justified by a lack of obvious alternatives.

Thanks, Andre.
--
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky@cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+
Hi Andre,

Coincidentally, I'm looking at a very similar thing right now. I'm trying to extend an archive which I've been building here at CCT. In the archive currently, we have netCDF files, for the coastal modelers, which support this kind of subsetting. We also plan to roll out the archive to the physicists, who will want to put their huge HDF5 files in the archive, and then do hyperslabbing on these (essentially some kind of subset, but with a cool name).

I had imagined passing some specification to the archive, represented by attribute/value pairs, along with the LogicalFileName. The service on the other end prepares the data for me, places it in a temporary store, and returns me the URLs to the prepared file. I would then access the file in the normal way.

When your original dataset is 1TB, you have problems. You can't simply prepare the data in the time that it takes to do a call and reply. You need to go asynchronous. With the solution I've gone for, I can simply say "this isn't ready yet, but I'm working on it" rather than returning the URLs. The user can check back later (polling), or I can tell them when it's ready (notification). Then they access the data.

How do you make your proposed eRead operation "go asynchronous" if things would take a long time? Or would the first read just hang until the data was prepared?

Jon.
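The prepare-then-fetch pattern with polling that Jon describes can be sketched in a few lines. This is a hedged illustration only: the class, state names, and URL scheme are invented, not part of any real archive or SAGA API; a background thread stands in for the remote subsetting job.

```python
import threading
import time

class ArchiveStub:
    """Stand-in for the remote archive service (invented API)."""
    def __init__(self):
        self._state = "PREPARING"
        self._url = None

    def prepare(self, logical_name, spec):
        # Pretend the subsetting takes a while; run it in the background.
        def work():
            time.sleep(0.05)                  # hours, for a 1 TB dataset
            self._url = "tmp://store/" + logical_name
            self._state = "READY"
        threading.Thread(target=work).start()

    def poll(self):
        # Either "this isn't ready yet, but I'm working on it",
        # or the URL of the prepared file.
        return (self._state, self._url)

archive = ArchiveStub()
archive.prepare("coastal.nc", {"var": "elevation", "bbox": "22x4+7+8"})

state, url = archive.poll()
while state != "READY":                       # client checks back later
    time.sleep(0.01)
    state, url = archive.poll()
print(url)                                    # then access the data normally
```

Notification instead of polling would replace the loop with a callback, but the service-side state machine stays the same.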
Quoting [Jon MacLaren] (Jun 13 2005):
Hi Andre,
Coincidentally, I'm looking at a very similar thing right now. I'm trying to extend an archive which I've been building here at CCT. In the archive currently, we have netCDF files, for the coastal modelers, which support this kind of subsetting. We also plan to roll out the archive to the physicists, who will want to put their huge HDF5 files in the archive, and then do hyperslabbing on these (essentially some kind of subset, but with a cool name).
Ha, you should talk to Andrei about this! We did HDF5 hyperslabbing via GridFTP ERET once - Andrei implemented that. That is exactly what I was aiming at :-)
I had imagined passing some specification to the archive, represented by attribute/value pairs, along with the LogicalFileName. The service on the other end prepares the data for me, places it in a temporary store, and returns me the URLs to the prepared file. I would then access the file in the normal way.
When your original dataset is 1TB, you have problems. You can't simply prepare the data in the time that it takes to do a call and reply. You need to go asynchronous. With the solution I've gone for, I can simply say "this isn't ready yet, but I'm working on it" rather than returning the URLs. The user can check back later (polling), or I can tell them when it's ready (notification). Then they access the data.
How do you make your proposed eRead operation "go asynchronous" if things would take a long time? Or would the first read just hang until the data was prepared?
Asynchronicity (is that English?) would be provided via the task interface, as before (pseudocode):

  sync:
    File file (url);
    file.read (len, buff, &ret_len);

  async:
    File file (url);
    FileTaskFactory ftf  = file.createTaskFactory ();
    Task            task = ftf.read (len, buff);

    task.run ();

    // do some other stuff here

    task.wait (&ret_len);

There are more methods on the Task interface, for non-blocking checks etc. The task model holds for basically all SAGA objects, so it would also cover the eRead and eWrite calls.

Cheers, Andre.
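The task pattern above can be mimicked in a few lines of Python. This is a hedged sketch with assumed names (Task, run, wait, the state strings), not the actual SAGA task interface; a thread plays the role of the asynchronous operation.

```python
import threading

class Task:
    """Toy version of the strawman's task concept (invented names)."""
    def __init__(self, fn, *args):
        self.state = "Pending"        # Pending / Running / Finished
        self.result = None
        self._fn, self._args = fn, args
        self._thread = None

    def run(self):
        def body():
            self.result = self._fn(*self._args)
            self.state = "Finished"
        self.state = "Running"
        self._thread = threading.Thread(target=body)
        self._thread.start()

    def wait(self):
        self._thread.join()
        return self.result

def read(length):                     # stand-in for file.read
    return b"x" * length

task = Task(read, 4)
task.run()
# ... do some other stuff here ...
data = task.wait()
print(task.state, len(data))
```

A non-blocking check would simply inspect `task.state` instead of calling `wait()`.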
On Jun 13, 2005, at 10:00 AM, Andre Merzky wrote:
<snip> Asynchronicity (is that English?) would be provided via the task interface, as before (pseudocode):
sync: File file (url); file.read (len, buff, &ret_len);
async: File file (url); FileTaskFactory ftf = file.createTaskFactory (); Task task = ftf.read (len, buff);
task.run ();
// do some other stuff here
task.wait (&ret_len);
There are more methods on the Task interface, for non-blocking checks etc. The task model holds for basically all SAGA objects, so it would also cover the eRead and eWrite calls.
But this could take a *long* time, e.g. hours (you have to sort through 1TB of data, which is on a disk). How would a client be able to tell what was going on? Can I distinguish between:

a) the remote service is preparing the data for me
b) the network connection to the service has suddenly slowed down or broken, and the data can't get through?

I think if your API looks like:

1. PrepareData
2. GetData

then people are more likely to expect that the data preparation is going to take a while. I'm not sure that just allowing the first read to take an hour is going to encourage people to build clients that can cope well with this. I'd hit <CTRL-C> if I didn't have a better idea of what was going on.

Jon.
Hi Jon, Quoting [Jon MacLaren] (Jun 13 2005):
From:    Jon MacLaren
To:      Andre Merzky
Cc:      Hartmut Kaiser, 'Simple API for Grid Applications WG'
Subject: Re: [saga-rg] proposal for extended file IO
Date:    Mon, 13 Jun 2005 10:11:57 -0500

On Jun 13, 2005, at 10:00 AM, Andre Merzky wrote:
Asynchronicity (is that English?) would be provided via the task interface, as before (pseudocode): sync: File file (url); file.read (len, buff, &ret_len);
async: File file (url); FileTaskFactory ftf = file.createTaskFactory (); Task task = ftf.read (len, buff);
task.run (); // do some other stuff here task.wait (&ret_len);
There are more methods on the Task interface, for non-blocking checks etc. The task model holds for basically all SAGA objects, so it would also cover the eRead and eWrite calls.
But this could take a *long* time, e.g. hours (you have to sort through 1TB of data, which is on a disk). How would a client be able to tell what was going on?
Yes, that can take a long time. However, the tasks have a state attached; they are either:

  Pending
  Running
  Finished
  Cancelled

That state can be queried, so you know at least if the task is still alive. I could imagine specific tasks giving more detailed state or progress information, but that's not specified in the strawman currently. For example, we have been discussing progress of file transfer: it would be nice if the task told you how much of the file is transferred, or even with what throughput. But that falls more into the domain of monitoring, which was left out of the strawman intentionally, for now.

Is that what you would expect in terms of feedback? If not, can you give an example?
Can I distinguish between:
a) The remote service is preparing the data for me
b) The network connection to the service has suddenly slowed down or broken, and the data can't get through.
I think if your API looks like:
1. PrepareData
2. GetData
I am not sure if that would make much difference: if PrepareData takes some hours, you are back to the original problem, aren't you? Or do I misunderstand something? Also, if your prepared data is large, or the network is slow, the read can still take a long time - same situation again...

Also, you would semantically tie two calls together. For example:

  file.prepare ("hyperslab", "([2,3,4][5,6,7])");
  file.read    (20, buffer, &out);

What does 20 mean? It is specific to the hyperslab; the user has to put the data together into a convenient structure. Alternative:

  file.prepare ("hyperslab", "([2,3,4][5,6,7])");
  file.read    ("hyperslab", "([1,3,4][5,6,7])");
  file.read    ("hyperslab", "([2,3,4][5,6,7])");

  // I know the hs spec is wrong, but YOU know what I mean,
  // right ;-)

Hmm, again, maybe I totally misunderstand you...
then people are more likely to expect that the data preparation is going to take a while.
I'm not sure that just allowing the first read to take an hour is going to encourage people to build clients that can cope well with this. I'd hit <CTRL-C> if I didn't have a better idea of what was going on.
If the first preparation takes an hour...?

Then again, middleware like DataCutter can benefit from preprocessed data (do the indexing beforehand, or create the octree structure beforehand) - that could be done by creating a task beforehand which prepares the data, and then doing the read afterwards. Would that do what you need?

  // warning: Pseudo Pseudo Code...

  Job job ("host_A",
           "/bin/subsample /data/huge_file_A /tmp/small_file_B");

  // wait for job completion

  // read prepared data
  File file ("gridftp://host_A//tmp/small_file_B");
  file.read (100, buffer, &out);

Cheers, Andre.
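The job-then-read decoupling in the pseudocode above can be rendered as a runnable local sketch. Everything here is a stand-in for illustration: the file names are invented, and taking every 16th byte plays the role of the remote subsample job.

```python
import pathlib
import tempfile

tmp = pathlib.Path(tempfile.mkdtemp())

# The large input file (a stand-in for the huge remote dataset).
big = tmp / "huge_file_A"
big.write_bytes(bytes(range(256)) * 64)

# "Job": prepare a small derived file (here, every 16th byte).
small = tmp / "small_file_B"
small.write_bytes(big.read_bytes()[::16])

# ... wait for job completion, then read the prepared data
# with a plain read, as in the pseudocode:
buffer = small.read_bytes()[:100]
print(len(buffer))
```

The point of the pattern is that the expensive preparation is an explicit, separately monitorable step, and the subsequent read is cheap and ordinary.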
But this could take a *long* time, e.g. hours (you have to sort through 1TB of data, which is on a disk). How would a client be able to tell what was going on?
Yes, that can take a long time. However, the tasks have a state attached, they are either:
Pending Running Finished Cancelled
That state can be queried, so you know at least if the task is still alive. I could imagine specific tasks giving more detailed state or progress information, but that's not specified in the strawman currently. For example, we have been discussing progress of file transfer: it would be nice if the task told you how much of the file is transferred, or even with what throughput. But that falls more into the domain of monitoring, which was left out of the strawman intentionally, for now.
Is that what you would expect in terms of feedback? If not, can you give an example?
It's not a question about functionality; more a comment about language design and semantics. You are potentially hiding a large amount of processing behind a file read. I don't find that intuitive. Should I put code around all eReads to allow for this?

With the explicit prepare, I might send a message to a service to do the prepare, then start/queue a batch job once the processing was complete. If I am sitting on a file read for an hour on a supercomputer, it's expensive. That's why I think the decoupling is better.

But I suppose that I could implement the decoupled prepare/read outside of the SAGA API, which is maybe where it belongs. And the API you have is certainly fine for smaller files. Perhaps that is what you are suggesting at the end of your reply....
<snip> If the first preparation takes an hour...?
The again, middleware like data cutter can benefit from preprocessed data (do indexing before, or create octree structure before) - that could be done by creating a task beforehand, which prepares the data, and then do the read afterwards. Would that do what you need?
// warning: Pseudo Pseudo Code... Job job ("host_A", "/bin/subsample /data/huge_file_A /tmp/small_file_B");
// wait for job completion // read prepared data File file ("gridftp://host_A//tmp/small_file_B"); file.read (100, buffer, &out);
I guess we are agreeing... Jon.
Quoting [Jon MacLaren] (Jun 13 2005):
It's not a question about functionality. More a comment about language design, and semantics. You are potentially hiding a large amount of processing behind a file read.
Ok, I see - that is true. And it's intentional. If you do a job submit, you are hiding a lot of stuff as well - even more: the information service gets queried for resources, a broker makes intelligent (ahem) decisions, files get staged, the job gets queued, runs, gets migrated, dies, files get staged back. All you see on the API level is an (admittedly complex) submit, and a simple job status.
I don't find that intuitive. Should I put code around all eReads to allow for this?
With the explicit prepare, I might send a message to a service to so the prepare, then start/queue a batch job once the processing was complete. If I am sitting on a file read for an hour on a supercomputer, it's expensive. That's why I think the decoupling is better.
But I suppose that I could implement the decoupled prepare/read outside of the SAGA API, which is maybe where it belongs.
Dunno really ;-) Well, the design constraints of SAGA are:

- simple, simple, simple: make simple things easy, make difficult things possible, leave out the rest
- only put into SAGA what comes up in GOOD use cases
And the API you have is certainly fine for smaller files.
Some of our use cases include large data access, for example the remote viz ones. So, handling small files only is not good enough :-(

Cheers, Andre.
On Jun 13, 2005, at 7:49 AM, Jon MacLaren wrote:
Hi Andre, Coincidentally, I'm looking at a very similar thing right now. I'm trying to extend an archive which I've been building here at CCT. In the archive currently, we have netCDF files, for the coastal modelers, which support this kind of subsetting. We also plan to roll out the archive to the physicists, who will want to put their huge HDF5 files in the archive, and then do hyperslabbing on these (essentially some kind of subset, but with a cool name).
This is very similar to the SRB-HDF5 archive system that they are developing at SDSC:

  http://hdf.ncsa.uiuc.edu/RFC/hdf5srb/Integrating_HDF5_with_SRB_ag_talk.ppt

It's very interesting that so many groups are converging on this sort of file archiving strategy.
I had imagined passing some specification to the archive, represented by attribute/value pairs, along with the LogicalFileName. The service on the end to prepare the data for me, and places it in a temporary store, and return me the URLs to the prepared file. I would then access the file in the normal way.
When your original dataset is 1TB, you have problems. You can't simply prepare the data in the time that it takes to do a call and reply. You need to go asynchronous. With the solution I've gone for, I can simply say "this isn't ready yet, but I'm working on it" rather than returning the URLs. The user can check back later (polling), or I can tell them when it's ready (notification). Then they access the data.
How do you make your proposed eRead operation "go asynchronous" if things would take a long time? Or would the first read just hang until the data was prepared?
Jon.
On Jun 13, 2005, at 5:38 AM, Andre Merzky wrote:
Hallo Hartmut,
Quoting [Hartmut Kaiser] (Jun 13 2005):
Agreed here.
That extension would look like:
    void lsEModes (out array emodes);

    void eWrite   (in  string emode,
                   in  string spec,
                   in  string buffer,
                   out long   len_out);

    void eRead    (in  string emode,
                   in  string spec,
                   out string buffer,
                   out long   len_out);

- hooks for GridFTP-like opaque ERET/ESTO features
- spec:  string for pattern as in GridFTP's ESTO/ERET
- emode: string for identifier as in GridFTP's ESTO/ERET

EMode:        a specific remote I/O command supported
lsEModes:     list the EModes available in this implementation
eRead/eWrite: read/write data according to the emode spec
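To make the intended semantics concrete, here is a minimal sketch of the proposed extension rendered in Python. This is purely illustrative: the strawman defines no Python binding, and all class and function names here are hypothetical. The toy "partial" emode mirrors GridFTP's standard partial-file-access ERET example (offset, length).

```python
# Hypothetical sketch of the proposed eRead/lsEModes extension.
# All names are illustrative, not part of the strawman.

class EModeFile:
    """A file handle offering GridFTP-ERET/ESTO-style extended reads."""

    def __init__(self, data, emodes):
        self._data = data        # file contents (bytes)
        self._emodes = emodes    # emode name -> handler function

    def ls_emodes(self):
        # lsEModes: list the EModes available in this implementation
        return sorted(self._emodes)

    def e_read(self, emode, spec):
        # eRead: the (emode, spec) strings are opaque to the API and are
        # interpreted entirely by the implementation-side handler.
        if emode not in self._emodes:
            raise ValueError("unsupported emode: %s" % emode)
        buf = self._emodes[emode](self._data, spec)
        return buf, len(buf)


# A toy "partial" emode: spec is "offset,length", as in GridFTP's
# standard partial-file-access ERET example.
def partial(data, spec):
    offset, length = (int(x) for x in spec.split(","))
    return data[offset:offset + length]


f = EModeFile(b"0123456789", {"partial": partial})
assert f.ls_emodes() == ["partial"]
buf, n = f.e_read("partial", "2,4")
assert (buf, n) == (b"2345", 4)
```

The point of the sketch is only that the API surface stays tiny (two calls plus discovery) while all data-model knowledge lives in the emode handler on the remote side.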
Example (in perl for brevity):
    my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
    my @emodes = $file->lsEModes ();

    if ( grep (/^jpeg_block$/, @emodes) ) {
      my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
    }
I would discourage support for B, since I do not know of any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on the application level, or within any SAGA implementation, there is no need for support at the API level -- however, A is insufficient for all but some trivial cases.
This approach is very generic on the API level (that's good) but requires exact agreement on the used command syntax for the client and the server, which may get problematic. If we go this route we will definitely end up specifying at least a minimal command subset to be supported by the eRead/eWrite commands.
You are right: complexity does not go away magically, but gets moved to the specification of the eModes.
As for a minimal set: I do not think that this is necessary - the eMode is SUPPOSED to be application specific. OTOH, an intuitive example usable for some cases may be helpful. GridFTP's standard ERET example is partial file access (IIRC: filename, offset, length). That is not very useful for SAGA, since that is already covered by the normal read/write operations.
I simply fear we'll have the same problems we have with the GAT today. The GAT API is in principle usable in a broad range of use cases based on a generic API. The genericity is ensured by using key/value tables in the API itself, allowing quick adaptation to any concrete need. The problem is the missing specification of these key/value pairs which makes it difficult to achieve reusability.
I absolutely agree that the problem lies right there: semantic overloading of strings. The situation is somewhat better than in GAT though:
- the preferences in GAT are really generic, and can be used for anything. The eModes have a very limited scope, and are hence much easier to agree on between different implementations
- as the mapping to GridFTP is 1:1, and GridFTP is quite commonly used, there is at least some other instance to be used for agreement on the modes. Hence, every implementation of an eMode can be expected to do the same thing. At least there is a good chance for that.
However, again: you are right. Semantic overloading of strings is not a nice thing to do, and is here only justified by a lack of obvious alternatives.
Thanks, Andre.
Regards Hartmut
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
This is an interesting discussion. I don't oppose specifying a mechanism to support high-level operations in SAGA .. but isn't any mechanism you can agree on going to be limited and limiting (Jon M. had a good example, and I can come up with a couple of examples myself), and isn't the preferred way of doing this going to be to use complementary data selection mechanisms (perhaps grid services) + SAGA binary pipes? Isn't data selection too application specific to be included in this? I am personally going to use different mechanisms. Andrei
Quoting [Andrei Hutanu] (Jun 13 2005):
This is an interesting discussion. I don't oppose of specifying a mechanism to support any high-level operations in SAGA .. but isn't any mechanism you can agree on going to be limited and limiting (Jon M. had a good example and I can come up with a couple examples myself)
I would welcome more examples! :-)
and the preferred way of doing this is going to be to use complementary data selection mechanisms (perhaps grid services) + SAGA binary pipes ?
How would that work? Outside of SAGA I see what you mean, but in SAGA we have no (and intend no) generic mechanism to access a Grid Service and to perform generic custom operations.
Isn't data selection too application specific to be included in this?
Ah, but the application specific part is NOT part of the eRead proposal - that merely provides a placeholder for doing that in a clean way! Think URLs: if you open a remote file, the URL is basically a string, and allows you to encapsulate some semantics which are transparent to the API. This is NOT a file, but can be opened as a file: http://www.google.com/search?q=SAGA&btnG=Search+the+Web So a URL is a placeholder for additional semantic information. eRead is similar: the emode and the spec strings are placeholders for the semantic information necessary to perform the read.
Andrei
I am personally going to use different mechanisms
Which ones? ;-) Cheers, Andre.
How would that work? Outside of SAGA I see what you mean, but in SAGA we have no (and intent no) generic mechanism to access a Grid Service, and to perform generic custom operations.
Outside SAGA, of course :)
Isn't data selection too application specific to be included in this?
Ah, but the application specific part is NOT part of the eRead proposal - that merely provides a placeholder for doing that in a clean way!
I was thinking that the data selection operation as described in the proposal is a very specific operation, and that there might be many remote data operations that cannot be covered by this. For example, when I think of remote data selection, I think more in terms of starting a remote job (using SAGA), communicating with the job using specific protocols (perhaps grid services implemented on top of the SAGA streams), and transferring data from the remote job locally using SAGA streams. The job itself is using SAGA to access the file. There's a lot of SAGA involved here, but there is no eRead, and using eRead would be a limiting factor. eRead basically means (start job, send command, receive response, end job). That's perhaps a limiting model for remote data access. My 2 cents, Andrei
Hi Andrei, Quoting [Andrei Hutanu] (Jun 14 2005):
Isn't data selection too application specific to be included in this?
Ah, but the application specific part is NOT part of the eRead proposal - that merely provides a placeholder for doing that in a clean way!
I was thinking that the data selection operation as described in the proposal is a very specific operation and that there might be many remote data operations that cannot be covered by this. For example when I think of remote data selection I think more in terms of starting a remote job (using SAGA), communicating with the job using specific protocols (perhaps grid services implemented on top of the SAGA streams) and transferring data from the remote job locally using SAGA streams. The job itself is using SAGA to access the file.
There's a lot of SAGA involved here but there is no eread and using eread would be a limiting factor. Eread basically means (start job, send command, receive response, end job). That's perhaps a limiting model for remote data access.
I think I understand the scenario you describe, and you are right: eRead is not a good model to implement that; neither is any of the other file I/O operations we have, or that have been proposed. It's a client-server scenario, and streams will do fine. So, if eRead does not fit, and read does not fit, don't use them. So, how is their existence limiting your scenario? I think I am missing the point (seem to do that a lot lately)... Cheers, Andre.
My 2 cents, Andrei
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
You're right .. _I_ will not use it. It will not give me the performance I need. My only concern is that if it gets included, other people might believe that it is a good model and use it :) only to find out later what the problems are. Andrei
Hi
Quoting [Andrei Hutanu] (Jun 14 2005):
You're right .. _I_ will not use it. It will not give me the performance I need. My only concern is that if it gets included, other people might believe that it is a good model and use it :)
:-)
only to find out later what the problems are.
Andrei
Hi Andre,

I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of GridFTP. The advantage of simply embedding that functionality into the higher-level concept of asynchronous calls is that if the underlying library does *not* support the async operations (or some subset of the operations cannot be performed asynchronously), you can always perform the operations synchronously and still be able to present the same interface.

I do not like plan A or B for the reasons you state. I do not like plan C because it is too tightly tied to a specific data transfer system implementation. I would propose a plan D that simply augments the Task interface of SAGA. For example, you could allow the user to fire off a number of async read operations:

    Task handle1 = channel.read ();
    Task handle2 = channel.read ();
    container.addTask (handle1);
    container.addTask (handle2);
    container.waitAll ();

The read operations in this example can be submitted as an eRead operation, or they can be in separate threads, or they can simply be executed synchronously when you call waitAll() (this is in fact how some of the async MPI I/O was done on the first SGI Origin machines... it was meant to look asynchronous, but in fact the calls did not initiate until you did a "Wait" for them).

Anyway, using the task interface provides more degrees of freedom for implementing async I/O than simply supporting the GridFTP way of doing things, and it meshes gracefully with I/O implementations that do *not* offer an underlying async execution model.

The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()", which allows you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.

-john

On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi John, Quoting [John Shalf] (Jun 13 2005):
Hi Andre, I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of the GridFTP.
I disagree. You can hide latency to some extent, but your throughput suffers utterly. Imagine Jon's use case (it's a worst case scenario, really): you have a remote HDF5 file, and want a hyperslab. The really worst case is that you want every second data item. Now, if you rely on read as is, you have to send one read request for every single data item you want to read. If you interleave them asynchronously, you get reasonable latency, but your throughput is, well, close to zero. If you want to optimize your buffer size, you have to read more than one data item CONSECUTIVELY. Since the use case says you are interested in every second data item, you effectively have to read ALL data. The same holds if you want every 10th data item - only the ratio gets even worse. So, interleaving works efficiently only for sufficiently _large_ independent read requests (then it's perfect, of course).
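A back-of-envelope calculation illustrates the per-item-read problem Andre describes. All numbers below are made up for illustration (the per-request overhead and link parameters are assumptions, not measurements):

```python
# Illustrative cost model for reading every 2nd item of a remote field.
# All numbers are invented for illustration only.

item_size = 8            # bytes per data item
n_items   = 1_000_000    # items wanted (every 2nd item of the field)
latency   = 0.050        # seconds round trip (WAN-ish)
bandwidth = 10e6         # bytes/second

# One read request per wanted item, fully pipelined: latency is hidden,
# but each request still moves only 8 bytes, so per-request overhead
# (headers, syscalls, seeks) dominates -- modeled here as 64 bytes/request.
overhead  = 64
pipelined = n_items * (item_size + overhead) / bandwidth

# Reading everything consecutively instead: twice the payload, but one
# large streaming transfer with a single round trip.
bulk = (2 * n_items * item_size) / bandwidth + latency

print("per-item pipelined reads: %.1f s" % pipelined)   # 7.2 s
print("read-all-and-discard:     %.1f s" % bulk)        # 1.7 s
```

Under these assumed numbers, even perfectly pipelined per-item reads lose to "read everything and discard half" - which is exactly the point: without data-model knowledge on the remote side, the strided access degenerates into a full transfer anyway.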
The advantage of simply embedding that functionality into the higher-level concept of asynchronous calls is that if the underlying library does *not* support the async operations (or some subset of the operations cannot be performed asynchronously) , you can always perform the operations synchronously and still be able present
I do not like plan A or B for the reasons you state. I do not like Plan C because it is too tightly tied to a specific data transfer system implementation. I would propose a Plan D that simply augments the Task interface of SAGA. For example, if you allowed the user to fire off a number of async read operations Task handle1= channel.read(); Task handle2=channel.read(); container.addTask(handle1); container.addTask(handle2); container.waitAll();
The read operations in this example can be submitted as an eRead
No, it can't efficiently be expressed as an eRead anymore. The implementation sees only a number of reads, but won't be able to recognize any usable pattern (or at least that is REALLY tough). So, the implementation cannot send the request

    filename hyperslab=([2,3,4][3,4,5])

but has to send each read command:

    filename read(2,1), read(4,1), read(6,1) ...

Worst case: you send more than one byte of command for every single byte you request as data (ugh!). (BTW: think HDF5 file driver: that is exactly the problem there: the file driver does not know about semantics anymore, but sees only read, write and seek. Hence it is difficult to efficiently implement a remote HDF5 file driver. Andrei did that, but we had to smuggle semantic information down to the I/O level...) I think there is NO data-model-agnostic way which provides good efficiency for remote data access. I'd be happily convinced otherwise, but that's what I see right now. If (only if; again: please convince me otherwise), so IF one accepts that data model info has to be part of a remote read request, then an eRead thing a la GridFTP is the best (most generic and most simple (!)) solution I know of.
operation or they can be in separate threads, or they can simply be executed synchronously when you call waitAll() (this is in fact how some of the async MPI I/O was done on the first SGI origin machines... it was meant to look asynchronous, but in fact the calls did not initiate until you did a "Wait" for them).
Anyways, using the task interface provide more degrees of freedom for implementing async I/O than simply supporting the GridFTP way of doing things and it meshes gracefully with I/O implementations that do *not* offer an underlying async execution model.
I think the task model and the proposed eRead model are orthogonal. The task model provides you asynchronicity; eRead provides you efficiency (throughput). Also, as a side note: I know about some of the discussions the GridFTP folks had about efficient remote file I/O. They were similar to this one, and the ERET/ESTO model was what they finally agreed on. Cheers, Andre.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()" which allows you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
-john
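John's "plan D" can be sketched in a few lines of Python: offset-carrying, stateless reads (his proposed readFrom()) fired off concurrently through a task container, with concurrent.futures standing in for the SAGA task interface. This is a sketch under assumed names, not a SAGA binding; os.pread supplies the stateless offset+length read (POSIX only).

```python
# Sketch of the "plan D" task-based model: concurrent offset reads.
# concurrent.futures plays the role of the SAGA task container;
# read_from() plays the role of the proposed readFrom() call.

import os
import tempfile
from concurrent.futures import ThreadPoolExecutor

def read_from(fd, offset, length):
    # pread is stateless: no shared file pointer, so concurrent
    # "tasks" cannot interfere with each other.
    return os.pread(fd, length, offset)

with tempfile.TemporaryFile() as f:
    f.write(bytes(range(100)))
    f.flush()
    fd = f.fileno()

    # fire off a number of async read operations, then wait for all
    offsets = [0, 10, 20, 30]
    with ThreadPoolExecutor(max_workers=4) as container:
        tasks = [container.submit(read_from, fd, off, 5) for off in offsets]
        chunks = [t.result() for t in tasks]   # waitAll()

assert chunks == [bytes(range(o, o + 5)) for o in offsets]
```

This shows why the statelessness John asks for matters: with a shared file pointer and plain read(), the four tasks above would race on the seek position and the result would be undefined.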
On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:
Hi John,
Quoting [John Shalf] (Jun 13 2005):
Hi Andre, I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of the GridFTP.
I disagree. You can hide latency to some extent, but your throughput suffers utterly.
If you do full gather-scatter I/O, then this is true (the length of the request equals the size of the data item returned). Even in such a case, as long as the number of outstanding requests matches the bandwidth-delay product of the network channel (as per Little's Law), you still achieve full throughput. However, the e-modes approach is equally bad, because it simply pushes an enormous amount of complexity down a layer into the implementation. I'm not sure which is worse. So the concerns I have are as follows:

1) The negotiation to find out the available eModes seems to require some complex modules to be installed on both the client and the server side of a system. One would hope that you could implement the capabilities you need using a smaller subset of elemental operations - for instance, the stdio readv() and pread() functionality to describe gather/scatter type operations.

2) The implementation looks way too close to one particular data transport implementation. I'm not convinced it is the best thing out there for gather-scatter I/O over a high-latency interface. Again, I'd be interested in seeing the advantages/disadvantages of something related to the POSIX/XPG gather/scatter I/O implementation. They would cover Jon's case.

3) Are the EModes() guaranteed to be stateless? In the JPEG_Block example you provide, it's not clear what the side effects are with regard to the file pointer. If some EModes() have side effects on the file pointer state, whereas others do not, it's going to be impossibly messy.

So my example wasn't very well thought out, but the higher-level point I was trying to make is that I think there are more general ways to encode data layout descriptions for remote patterned or gather-scatter I/O operations than e-modes. The arbitrariness of the modes and their associated string parsing adds a sort of complexity that is a bit daunting at first blush.
Imagine Jons use case (its a worst case scenario really): You have a remote HDF5 file, and want a hyperslab. Really worst case is you want every second data item.
Now, if you rely on read as is, you have to send one read request for every single data item you want to read. If you interleave them asynchronously, you get reasonable latency, but your throughput is, well, close to zero.
If the number of outstanding requests (in terms of bytes) is equal to the bandwidth-delay product of the connection, then you will reach peak. Sadly, the way I posed the solution would die from the excessive overhead of launching the threads.
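The Little's Law sizing John refers to is easy to make concrete. With illustrative link numbers (assumptions, not measurements), the bandwidth-delay product tells you how many bytes, and hence how many small requests, must be outstanding at once to keep the pipe full:

```python
# Little's Law sizing of a request pipeline (illustrative numbers).

bandwidth = 125e6   # bytes/s (~1 Gbit/s link)
rtt       = 0.050   # seconds round trip

# bandwidth-delay product: bytes that must be in flight at any moment
# to keep the channel full.
bdp = bandwidth * rtt          # 6.25 MB

# If each request returns only an 8-byte data item, the number of
# concurrently outstanding requests becomes enormous:
per_request = 8
outstanding = bdp / per_request
print(int(outstanding))        # 781250
```

This cuts both ways in the debate: John is right that full throughput is reachable in principle with enough outstanding requests, and Andre is right that ~780k concurrent 8-byte requests is not a practical request pattern for a client to generate.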
If you want to optimize your buffersize, you have to read more than one data item CONSECUTIVELY. Since the use case says you are interested in every second data item, you effectively have to read ALL data.
You would definitely not want to read them consecutively -- you'd want to read all of the data items you need concurrently (thereby necessitating that the file pointer offset be encoded in each request). I do agree with you that my off-the-cuff proposal for launching one async task per data item is not practical due to the excessive software overhead. However, I don't see why you cannot launch as many concurrent requests as you need to satisfy Little's Law.
Same holds if you want every 10th data item - only the ratio gets even worse. So, interleaving works only efficently for sufficiently _large_ independent read request (then its perfect of course).
That is curious... Interleaving on vector machines is used for precisely the opposite purpose (for hundreds of very small independent read requests). Latency hiding and throughput are intimitely connected. I would expect that all of the read requests for a hyperslab are independent provided the file pointer state is encoded in the request. This is precisely what the readv()/pread() does. Should we find some case that causes problems for a readv/pread model? The hyperslabbing is clearly not one of those cases.
I think the task model and the proposed eRead model are orthogonal. The task model provides you asynchroneousity, the eRead provides you efficiency (throughput).
Pipelining is used to achieve throughput. Pipelining is achieved via concurrent async operations. I agree that launching one task per byte is going to be inefficient, but it is inefficient because of the software overhead of launching a new task (not because async request/response is inefficient). SCSI disk interfaces and DDR DRAMs depend on submitting async requests for data that get fulfilled later (sometimes out-of-order). They are achieving this goal of throughput using a far simpler model than ERET/ESTO. Its worth looking at simpler models for defining deeply pipelined remote gather/scatter operations.
Also, as a side note: I know about some of the dicussions the GridFTP folx had about efficient remote file IO. They have been similar to this one, and the ERET/ESTO model was the finally agreed on.
I'm not sure if the ERET/ESTO solves the problem at hand. The complexity has been pushed to a different layer of the software stack.
Cheers, Andre.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()" which allows you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
consider following use case for remote IO. Given a large binary 2D field on a remote host, the client wans to access a 2D sub portion of that field. Dependend on the remote file layout, that requires usually more than one read operation, since the standard read (offset, length) is agnostic to the 2D layout.
For more complex operations (subsampling, get a piece of a jpg file), the number of remote operations grow very fast. Latency then stringly discourages that type of remote IO.
For that reason, I think that the remote file IO as specified by SAGA's Strawman as is will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally Pro: - one remote op, - simple logic - remote side doesn't need to know about file structure - easily implementable on application level Con: - getting the header info of a 1GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request. Pro: - transparent to application - efficient Con: - need to know about dependencies of reads (a header read needed to determine size of field), or included explicite 'flushes' - need a protocol to support that - the remote side needs to support that
C) data specific remote ops: send a high level command, and get exactly what you want. Pro: - most efficient Con: - need a protocol to support that - the remote side needs to support that _specific_ command
The last approach (C) is what I have best experiences with. Also, that is what GridFTP as a common file access protocol supports via ERET/ESTO operations.
I want to propose to include a C-like extension to the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of C.
That extension would look like:
void lsEModes (out array
emodes ); void eWrite (in string emode, in string spec, in string buffer out long len_out ); void eRead (in string emode, in string spec, out string buffer, out long len_out ); - hooks for gridftp-like opaque ERET/ESTO features - spec: string for pattern as in GridFTP's ESTO/ERET - emode: string for ident. as in GridFTP's ESTO/ERET
EMode: a specific remote I/O command supported lsEModes: list the EModes available in this implementation eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
my $file = SAGA::File new ("http://www.google.com/intl/en/images/logo.gif"); my @emodes = $file->lsEModes ();
if ( grep (/^jpeg_block$/, @emodes) ) { my ($buff, $len) = file.eRead ("jpeg_block", "22x4+7+8"); }
I would discourage support for B, since I do not know any protocoll supporting that approach efficiently, and also it needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support on API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
Quoting [John Shalf] (Jun 14 2005):
On Jun 13, 2005, at 11:40 AM, Andre Merzky wrote:
Hi John,
Quoting [John Shalf] (Jun 13 2005):
Hi Andre, I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of the GridFTP.
I disagree. You can hide latency to some extent, but your throughput suffers utterly.
If you do a full gather-scatter I/O, then this is true (the length of the request equals the size of the data item returned). Even in such a case, as long as the number of outstanding requests matches the bandwidth-delay product of the network channel (as per Little's Law), you still achieve full throughput. However, the e-modes approach is equally bad because it simply pushes an enormous amount of complexity down a layer into the implementation. I'm not sure which is worse.
:-)
So the concerns I have are as follows: 1) The negotiation to find out the available eModes seems to require some complex modules to be installed on both the client and the server side of a system.
Only potentially complex, but yes, that's right.
One would hope that you could implement the capabilities you need using a smaller subset of elemental operations. For instance the stdio readv() and pread() functionality to describe gather/scatter type operations.
It's just not always possible. For example, eRead would allow you to request a subset of a JPEG image. That cannot be expressed as read operations at all. OTOH, one can argue that such operations allow for even more semantic uncertainty... From the application's point of view it's very useful.
2) The implementation looks way too close to one particular data transport implementation. I'm not convinced it is the best thing out there for gather-scatter I/O over a high-latency interface. Again, I'd be interested in seeing the advantages/disadvantages of something related to the POSIX/XPG gather/scatter I/O implementation. They would cover Jon's case.
It looks close to GridFTP, granted, but the idea _is_ generic. Basically it says: describe your request as a string (== opaque), and you get what you want. I can't see how throughput would really reach peak otherwise - see below.
3) Are the EModes() guaranteed to be stateless? In the JPEG_Block example you provide, its not clear what the side-effects are with regard to the file pointer. If some EModes() have side-effects on the file pointer state, whereas others do not, its going to be impossibly messy.
Yes, emodes are supposed to be stateless. They don't respect and don't move the file pointer. That would be messy indeed (jpeg).
So my example wasn't very well thought out, but the higher-level point I was trying to make is that I think there are more general ways to encode data layout descriptions for remote patterned or gather-scatter I/O operations than e-modes. The arbitraryiness of the modes and their associated string parsing adds a sort of complexity that is a bit daunting at first blush.
Ha :-) When I first learned of ERET, I was afraid that people would start to send large XML-formatted data requests, and found that idea terrible - it sounds like abusing a data access facility as a multi-purpose protocol. OTOH, its ease of use for HDF5 hyperslabs is utterly convincing, I think. The _request_ is a string describing a hyperslab. Whatever intelligence you have in gather-scatter I/O, the request size for a hyperslab can easily match the size of the hyperslab itself, or exceed it (readv needs one offset and one length to read a single byte - if your data are scattered bytewise...).
Imagine Jon's use case (it's a worst-case scenario, really): You have a remote HDF5 file, and want a hyperslab. The really worst case is that you want every second data item.
Now, if you rely on read as is, you have to send one read request for every single data item you want to read. If you interleave them asynchronously, you get reasonable latency, but your throughput is, well, close to zero.
If the number of outstanding requests (in terms of bytes) is equal to the bandwidth-delay product of the connection, then you will reach peak. Sadly, the way I posed the solution would die from the excessive overhead of launching the threads.
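John's point can be illustrated with a rough bandwidth-delay calculation (the link numbers and item size below are made up purely for illustration):

```python
# Illustrative only: link parameters and item size are assumed, not measured.
bandwidth = 100e6 / 8   # 100 Mbit/s link, expressed in bytes per second
rtt = 0.05              # 50 ms round-trip time
bdp = bandwidth * rtt   # bandwidth-delay product: bytes that must be
                        # "in flight" to keep the pipe full (Little's Law)

item_size = 8           # one small data item per request
outstanding = bdp / item_size

print(int(bdp))          # 625000 bytes in flight
print(int(outstanding))  # 78125 concurrent requests needed
```

So reaching peak throughput with tiny items means keeping tens of thousands of requests outstanding at once, which is exactly where the per-request software overhead starts to matter.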
I am sure that does not scale. If a hyperslab describes one megabyte of scattered data (at byte granularity, say: subsample a 3D scalar field for lowres volrendering), then I have 1 million read/write requests on the wire, each one with its protocol overhead, processing overhead, etc. Mathematically you might be right, but in practice that won't do any good, I think. eRead has one small request and one large response. If an implementation thinks that the large response is better split up (UDP blast), then fine - that's possible. The other way around is not possible (or much harder).
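A back-of-the-envelope sketch of Andre's objection (the per-request overhead figure is an assumption, not a measured protocol cost):

```python
# Assumed numbers: each 1-byte read carries a fixed framing/protocol
# overhead on the wire.
payload = 1                # useful bytes per request
overhead = 64              # assumed per-request protocol cost in bytes
n_requests = 1_000_000     # one request per scattered byte

wire_bytes = n_requests * (payload + overhead)
efficiency = (n_requests * payload) / wire_bytes

print(wire_bytes)   # 65000000 bytes moved for 1000000 useful bytes
print(efficiency)   # roughly 0.015: wire efficiency below 2 percent
```

Whatever the exact overhead constant, the useful fraction of the traffic shrinks toward zero as the granularity of the requests approaches a byte.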
If you want to optimize your buffersize, you have to read more than one data item CONSECUTIVELY. Since the use case says you are interested in every second data item, you effectively have to read ALL data.
You would definitely not want to read them consecutively -- you'd want to read all of the data items you need concurrently (thereby necessitating that the file pointer offset be encoded in each request). I do agree with you that my off-the-cuff proposal for launching one async task per data item is not practical due to the excessive software overhead. However, I don't see why you cannot launch as many concurrent requests as you need to satisfy Little's Law.
Little's Law does not really apply, I think. It assumes that the items on the wire are identical, and require the same time. If you read bytewise, that just doesn't apply anymore: the overhead gets larger than the payload. So, the law of course holds, but it's applied to different entities...
Same holds if you want every 10th data item - only the ratio gets even worse. So, interleaving works efficiently only for sufficiently _large_ independent read requests (then it's perfect, of course).
That is curious... Interleaving on vector machines is used for precisely the opposite purpose (for hundreds of very small independent read requests). Latency hiding and throughput are intimately connected.
I would expect that all of the read requests for a hyperslab are independent provided the file pointer state is encoded in the request. This is precisely what the readv()/pread() does.
Should we find some case that causes problems for a readv/pread model? The hyperslabbing is clearly not one of those cases.
It IS! Again, for a hyperslab requesting every other byte from a file, you send two values per requested byte: one offset and one length. Additionally, you have some overhead for the protocol. Additionally, you force the remote side to process the request like this: you cannot use HDF5 for efficient hyperslab I/O, but have to use read/seek. Additionally, the response is equally bloated, because you need to separate the individual response blocks again (that can be avoided by matching the response to the original request, I guess). If you want a more obvious example: a JPEG subset. It's impossible to express in gather-scatter I/O.
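For a concrete sense of the request/payload ratio Andre describes: on a typical 64-bit host a struct iovec is 16 bytes (a pointer plus a length; both sizes are platform assumptions here), so the imbalance is even worse than two-to-one:

```python
# Platform assumption: struct iovec = 8-byte pointer + 8-byte length.
iovec_size = 16
n_items = 500_000            # every other byte of a 1 MB region
request_bytes = n_items * iovec_size   # size of the iovec list itself
data_bytes = n_items * 1               # one byte returned per entry

print(request_bytes // data_bytes)     # 16: the request description is
                                       # 16x larger than the data returned
```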
I think the task model and the proposed eRead model are orthogonal. The task model provides you asynchronicity, the eRead provides you efficiency (throughput).
Pipelining is used to achieve throughput. Pipelining is achieved via concurrent async operations. I agree that launching one task per byte is going to be inefficient, but it is inefficient because of the software overhead of launching a new task (not because async request/response is inefficient). SCSI disk interfaces and DDR DRAMs depend on submitting async requests for data that get fulfilled later (sometimes out-of-order). They achieve this goal of throughput using a far simpler model than ERET/ESTO. It's worth looking at simpler models for defining deeply pipelined remote gather/scatter operations.
Also, as a side note: I know about some of the discussions the GridFTP folks had about efficient remote file I/O. They were similar to this one, and the ERET/ESTO model was what they finally agreed on.
I'm not sure if the ERET/ESTO solves the problem at hand. The complexity has been pushed to a different layer of the software stack.
Yes, right! That's the point: it allows semantic information to be pushed to a level where it can be used efficiently. All other approaches I know strip the semantic information, and boil the request down to generic small ops (such as readv). Really, I do not know _any_ implementation which can do subsampling on remote data efficiently with small ops as requests, instead of a _semantic_ description of the subsampling. Cheers, Andre :-))
Cheers, Andre.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()" which allows you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
consider the following use case for remote I/O. Given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read (offset, length) is agnostic to the 2D layout.
For more complex operations (subsampling, getting a piece of a JPEG file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote I/O.
For that reason, I think that the remote file I/O as specified by SAGA's Strawman as is will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about file structure
        - easily implementable on application level
   Con: - getting the header info of a 1GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to application
        - efficient
   Con: - need to know about dependencies of reads (a header read needed to determine the size of a field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) data specific remote ops: send a high level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
The last approach (C) is the one I have the best experience with. Also, that is what GridFTP as a common file access protocol supports via ERET/ESTO operations.
I want to propose including an extension following approach C in the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of approach C.
That extension would look like:
void lsEModes (out array<string> emodes  );
void eWrite   (in  string emode,
               in  string spec,
               in  string buffer,
               out long   len_out        );
void eRead    (in  string emode,
               in  string spec,
               out string buffer,
               out long   len_out        );

- hooks for GridFTP-like opaque ERET/ESTO features
- spec:  string for pattern as in GridFTP's ESTO/ERET
- emode: string for identifier as in GridFTP's ESTO/ERET
EMode:        a specific remote I/O command supported
lsEModes:     list the EModes available in this implementation
eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
my @emodes = $file->lsEModes ();

if ( grep { /^jpeg_block$/ } @emodes ) {
  my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
}
I would discourage support for B, since I do not know any protocol supporting that approach efficiently, and also it needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support on API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
Quoting [John Shalf] (Jun 14 2005):
Should we find some case that causes problems for a readv/pread model? The hyperslabbing is clearly not one of those cases.
Actually, how would you do an HDF5 hyperslab via readv? The only way I see is instrumenting the HDF5 library and writing a readv file driver - but then you would not use SAGA anyway; that is not application level anymore.

If you want to read hyperslabs of an HDF5 file on application level with readv, you would need to mimic the HDF5 lib in order to find the offset of the data set, and would need to know details about the HDF5 file structure and data layout. Compared to that, eRead really is simpler for the application. Here is an example we used for hyperslabbing a 3D scalar field:

  snprintf (pattern1, 255, "(%d, %d, %d, %d)"    , start1, stop1, stride1, reps1);
  snprintf (pattern2, 255, "(%d, %d, %d, %d, %s)", start2, stop2, stride2, reps2, pattern1);
  snprintf (pattern3, 255, "(%d, %d, %d, %d, %s)", start3, stop3, stride3, reps3, pattern2);

  res = file.eRead (pattern3, (char*) buf, buffer_size);

start, stop, stride, reps correspond directly to the HDF5 semantics. So, the semantic info is indeed maintained on application level, and, as you said before, its interpretation is pushed to lower levels. How would that look for readv?

Cheers, Andre.
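For what it's worth, the nested pattern construction sketched above can be reproduced outside C as well; the start/stop/stride/reps values below are invented sample numbers, just to show the concrete string that goes over the wire:

```python
# Build the nested hyperslab pattern string from Andre's example.
# One (start, stop, stride, reps) tuple per dimension; inner patterns
# are embedded in the outer ones.
def dim(start, stop, stride, reps, inner=None):
    if inner is None:
        return "(%d, %d, %d, %d)" % (start, stop, stride, reps)
    return "(%d, %d, %d, %d, %s)" % (start, stop, stride, reps, inner)

pattern1 = dim(0, 100, 2, 1)            # innermost dimension
pattern2 = dim(0, 100, 2, 1, pattern1)  # wraps pattern1
pattern3 = dim(0, 100, 2, 1, pattern2)  # full 3-D request string

print(pattern3)
# (0, 100, 2, 1, (0, 100, 2, 1, (0, 100, 2, 1)))
```

The whole 3-D subsampling request fits in one short string, which is exactly the point of the eRead argument.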
On Jun 14, 2005, at 1:24 AM, Andre Merzky wrote:
Quoting [John Shalf] (Jun 14 2005):
Should we find some case that causes problems for a readv/pread model? The hyperslabbing is clearly not one of those cases. Actually, how would you do an HDF5 hyperslab via readv? The only way I see is instrumenting the HDF5 library and writing a readv file driver - but then you would not use SAGA anyway; that is not application level anymore.
The same problem exists with any of the proposed solutions, including eRead. So I'm not sure if I see the point here.
If you want to read hyperslabs on an HDF5 file on application level with readv, you would need to mimic the HDF5 lib in order to find the offset for the data set, and would need to know details about HDF5 file structure and data layout.
Someone will need to solve the very same problem in order to implement an HDF5-specific eRead interface.
Compared to that, eRead really is simpler for the application. Here is an example we used for hyperslabbing a 3D scalar field:
snprintf (pattern1, 255, "(%d, %d, %d, %d)"    , start1, stop1, stride1, reps1);
snprintf (pattern2, 255, "(%d, %d, %d, %d, %s)", start2, stop2, stride2, reps2, pattern1);
snprintf (pattern3, 255, "(%d, %d, %d, %d, %s)", start3, stop3, stride3, reps3, pattern2);

res = file.eRead (pattern3, (char*) buf, buffer_size);
So you would actually need to embed this in-situ with your HDF5 code? Or would you go through the HDF5 libraries so that you can push that information string down to the driver layer? It's not clear where exactly you place these calls. And when you *do* insert these calls, it requires some understanding of the HDF5 internal file layout. Or are we going to ditch the HDF5 API and use eRead instead? How then do we use eRead to manage all of the other HDF5 features like compression, groups, iteration, etc.? What is the string spec for an HDF5 group iterator using eRead strings? This is why I fail to see the benefits of the eRead interface (it didn't prevent us from mucking with the guts of HDF5 if you want to preserve the HDF5 API, but it also didn't reduce complexity for the user if you are going to replace the HDF5 APIs with these stringy pattern requests).
start, stop, stride, reps correspond directly to the HDF5 semantics. So, the semantic info is indeed maintained on application level, and, as you said before, its interpretation is pushed to lower levels.
It looks like you will end up encoding the entire HDF5 API as eRead pattern strings and pushing it to the other end of a client-server connection. Again, I'm not sure if we made life easier for the remote HDF5 people.
How would that look for readv?
What I was thinking is that developers of HDF5 may have an interest in defining vector or patterned read operations at the VFD layer of their interface. This would enable them to propagate the kind of information you are attempting to encode in eRead strings down to the driver, where vector-read interfaces can take advantage of it for deeper pipelining of high-latency operations. (They could, for instance, use some of the methods that Thorsten was referring to, or they could use vread/vwrite type operations.) So the issue is that 1) if you use eRead to replace the HDF5 API, then we are talking about an enormously complex string-encoding interface; 2) if you use eRead in the VFD, then you have to instrument HDF5 to propagate information about patterned reads down to the driver layer. That is of course the same thing you need if you use vread()/readp() (or any of the interfaces that Thorsten described). So I don't see much of a difference in capability there, except that vread/readp already has the information in a form that you can do I/O with. With eRead, you still have to go through and parse some strings to gain access to the same information about the pattern of reads/writes. So it's not merely that eRead is pushing complexity to a different layer... I don't see where it is reducing complexity.
Cheers, Andre.
Quoting [John Shalf] (Jun 14 2005):
On Jun 14, 2005, at 1:24 AM, Andre Merzky wrote:
Quoting [John Shalf] (Jun 14 2005):
Should we find some case that causes problems for a readv/pread model? The hyperslabbing is clearly not one of those cases. Actually, how would you do an HDF5 hyperslab via readv? The only way I see is instrumenting the HDF5 library and writing a readv file driver - but then you would not use SAGA anyway; that is not application level anymore.
The same problem exists with any of the proposed solutions, including eRead. So I'm not sure if I see the point here.
Hm, sorry that I communicate so badly: but that CAN be solved with eRead - and that's exactly the advantage. We implemented that once in a different lib, and it worked like a charm. The code I included below is from a real client (the call was named iowrap_pread instead of file.eRead, though ;-)
If you want to read hyperslabs on an HDF5 file on application level with readv, you would need to mimic the HDF5 lib in order to find the offset for the data set, and would need to know details about HDF5 file structure and data layout.
Someone will need to solve the very same problem in order to implement an HDF5-specific eRead interface.
Compared to that, eRead really is simpler for the application. Here is an example we used for hyperslabbing a 3D scalar field:
snprintf (pattern1, 255, "(%d, %d, %d, %d)"    , start1, stop1, stride1, reps1);
snprintf (pattern2, 255, "(%d, %d, %d, %d, %s)", start2, stop2, stride2, reps2, pattern1);
snprintf (pattern3, 255, "(%d, %d, %d, %d, %s)", start3, stop3, stride3, reps3, pattern2);

res = file.eRead (pattern3, (char*) buf, buffer_size);
So you would actually need to embed this in-situ with your HDF5 code? Or would you go through the HDF5 libraries so that you can push that information string down to the driver layer? It's not clear where exactly you place these calls.
This call goes into the application! That is supposed to be the SAGA level. The HDF5 lib does not come into play on the local host at all, but only on the remote host - where the eRead request is received, translated into a native HDF5 hyperslab read (the translation is simple), and the resulting data are returned. That is why I think SAGA is a good place for eRead - it IS application level...
And when you *do* insert these calls, it requires some understanding of the HDF5 internal file layout. Or are we going to ditch the HDF5 API and use eRead instead? How then do we use eRead to manage all of the other HDF5 features like compression, groups, iteration etc.??? What is the string spec for an HDF5 group iterator using eRead strings?
Ah, right, now I see why we are running in circles :-) Imagine a remote web service providing access to HDF5 files. A simple version would provide read and write calls only; a more sophisticated version would provide group iterations etc. However, the service would come up with some interface which resembles HDF5 somewhat, but is probably more tailored toward the specific use case. eRead is nothing but a medium to communicate with such a service, and with similar services. It cannot replace HDF5, but can help in _application specific_ usage of a service providing access to an HDF5 file. As you said before: semantics gets pushed down the pipe. That is right: it gets pushed over the wire, to the remote side, and interpreted there. HOW you specify your semantics in an eRead string is up to the service definition and your use case.

  app -> eRead -> wire -> service -> HDF5 -> local VFD -> file
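A minimal sketch of what the service end of that pipeline might look like; the emode name, the spec grammar, and the handler logic are all invented here for illustration, not part of any proposed spec:

```python
# Hypothetical eRead service side: (emode, spec) strings arrive over
# the wire and are dispatched to a handler that owns their semantics.
def handle_hyperslab(spec, data):
    # toy spec grammar: "(start,stop,stride,reps)" for one dimension
    start, stop, stride, reps = map(int, spec.strip("()").split(","))
    return bytes(data[start:stop:stride]) * reps

HANDLERS = {"hdf5_hyperslab": handle_hyperslab}

def eread_service(emode, spec, data):
    # reject emodes this service does not advertise via lsEModes
    if emode not in HANDLERS:
        raise ValueError("unsupported emode: " + emode)
    return HANDLERS[emode](spec, data)

# every second byte of a 10-byte "file":
print(eread_service("hdf5_hyperslab", "(0,10,2,1)", bytes(range(10))))
```

The dispatch table is the service definition: adding a new emode means registering a new handler, exactly the extensibility (and the deployment burden) being debated in this thread.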
This is why I fail to see the benefits of the eRead interface (it didn't prevent us from mucking with the guts of HDF5 if you want to preserve the HDF5 API, but it also didn't reduce complexity for the user if you are going to replace the HDF5 APIs with these stringy pattern requests).
Nope, it's not supposed to replace HDF5. It's also not supposed to replace libjpeg, libtiff, ... - you name it. It does not solve world problems. All it does is provide the ability to push application-specific semantics to the remote side, where they can be efficiently interpreted. The other solutions don't provide that. If you need the HDF5 API, you use the HDF5 API, not SAGA.
start, stop, stride, reps correspond directly to the HDF5 semantics. So, the semantic info is indeed maintained on application level, and, as you said before, its interpretation is pushed to lower levels.
It looks like you will end up encoding the entire HDF5 API as eRead pattern strings and pushing it to the other end of a client-server connection. Again, I'm not sure if we made life easier for the remote HDF5 people.
How would that look for readv?
What I was thinking is that developers of HDF5 may have an interest in defining vector or patterned read operations at the VFD layer of their interface. This would enable them to propagate the kind of information you are attempting to encode in eRead strings down to the driver where vector-read interfaces can take advantage of them for deeper pipelining of high-latency operations. (they could, for instance, use some of the methods that Thorsten was referring to, or they could use vread/vwrite type operations).
So the issue is that 1) if you use eRead to replace the HDF5 API, then we are talking about an enormously complex string-encoding interface; 2) if you use eRead in the VFD, then you have to instrument HDF5 to propagate information about patterned reads down to the driver layer. That is of course the same thing you need if you use vread()/readp() (or any of the interfaces that Thorsten described). So I don't see much of a difference in capability there, except that vread/readp already has the information in a form that you can do I/O with. With eRead, you still have to go through and parse some strings to gain access to the same information about the pattern of reads/writes.
I did not assume that SAGA would be the right thing to use to implement an HDF5 VFD. That is not exactly the application community SAGA is targeting, I think:

  application -> HDF5 -> sagaVFD -> saga -> gridftp (or so) -> file

But I see now, and agree: on the VFD level, eRead does not buy you much compared to its pitfalls (I'm still unsure about vread, but that won't help this discussion ;-)
So its not merely that eRead is pushing complexity to a different layer... I don't see where it is reducing complexity.
Maybe we should move away from HDF5. Assume an application-specific binary file. You want subsampling. Locally you do (having seeked before):

  for ( int x = 0; x < X_MAX / 2; x++ ) {
    for ( int y = 0; y < Y_MAX / 2; y++ ) {
      for ( int z = 0; z < Z_MAX / 2; z++ ) {
        data[x][y][z] = my_file_read (x*2, y*2, z*2);
      }
    }
  }

In SAGA now, that is the same: it would call read and seek so and so often.

SAGA with readv would allow you to do:

  for ( int x = 0; x < X_MAX / 2; x++ ) {
    for ( int y = 0; y < Y_MAX / 2; y++ ) {
      for ( int z = 0; z < Z_MAX / 2; z++ ) {
        iovecs[n].iov_base = ...
        iovecs[n].iov_len  = 1;
        n++;
      }
    }
  }

  file.readv (iovecs, data, n);

SAGA with eRead would allow you to do:

  snprintf (request, 255, "downsample %d %d %d %d", offset, 2, 2, 2);
  file.eRead (request, data, n);

Shorter, but it requires an infrastructure which understands the request (well, for readv you also need a remote counterpart, but that one can be agnostic to semantics...).

readv is more POSIX-like, and more generic. It always works if you are on read level (e.g. the HDF5 VFD layer ;-). eRead is more powerful: it allows application-specific optimization which is not achievable with readv (the size of the iovecs in the read request is double the size of the data returned!).

Cheers, Andre.
On Jun 14, 2005, at 10:45 AM, Andre Merzky wrote:
readv is more POSIX-like, and more generic. It always works if you are on the read level (e.g. the HDF5 VFD layer ;-).
eRead is more powerful: it allows application specific optimizations which are not achievable with readv (the list of iovecs in the read request is twice the size of the data returned!).
OK, I think we have both arrived at the same overall conclusion.

I think eRead would be useful as a way to package an underlying complex service for implementing remote data requests. One must be able to extend the services on both the client and the server side to provide new e-modes to the user that implement these services. The vector read ops (not necessarily readv/pread, but perhaps something similar that describes patterned reads in a compact form) would be useful for other application use cases where we are not permitted (or have no desire) to touch or extend the remote service. I think readv/pread is a bit *too* restrictive, but we should have some similarly compact set of read ops that allow for gather/scatter type remote operations that do not require the service to be installed on both ends (e.g. just client side), in addition to an eRead() interface for access to two-sided services.

So I guess we have a need to do both.

For a suitable vread alternative, it would be useful to have something like

  read_pattern (descriptor, buffer, int nlogicaldims, int logicaldims[], offset[], block[], stride[]);

to specify a patterned operation. The list of iovecs[] can be used for gather operations that cannot be encoded as a regular pattern.

-john
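To make John's read_pattern() sketch concrete: the following is a rough sketch (my names, not a proposed API) of how the 1-D case of such a pattern - start offset, block length, stride, repeat count - expands into the flat (offset, length) requests that a client-side-only implementation would have to issue; the multi-dimensional case nests this per entry of logicaldims[].

```c
#include <assert.h>

/* Hypothetical expansion of a 1-D read pattern into flat (offset,
 * length) requests: starting at 'offset', read 'block' bytes every
 * 'stride' bytes, 'count' times.  Returns the number of requests
 * written into offs_out[]/lens_out[].  Illustrative only - not part
 * of the strawman API. */
int expand_pattern (long offset, long block, long stride, int count,
                    long offs_out[], long lens_out[])
{
    for (int i = 0; i < count; i++) {
        offs_out[i] = offset + (long) i * stride;  /* i-th block start */
        lens_out[i] = block;                       /* fixed block size */
    }
    return count;
}
```

Note how the pattern itself stays compact (four scalars per dimension) while the expanded request list grows with the data - which is exactly why shipping the pattern, not the list, pays off over a high-latency link.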
Quoting [John Shalf] (Jun 14 2005):
OK, I think we have both arrived at the same overall conclusion.
Yes, fortunately :-)
So I guess we have a need to do both.
For a suitable vread alternative, it would be useful to have something like read_pattern(descriptor,buffer,int nlogicaldims,int logicaldims[],offset[],block[],stride[]); to specify a patterned operation. The list of iovecs[] can be used for gather operations that cannot be encoded as a regular pattern.
I agree, see the other mail. Cheers, Andre.
On Monday 13 June 2005 19:28, John Shalf wrote:
Hi Andre, and here are numbers B.2 and E.
B.2:
You could use pitfalls to describe the clustering of reads (google for
"Remote Partial File Access Using Compact Pattern Descriptions"). It is a
compact language for describing regular subsets of files and it should at
least address one of your cons: need a protocol...
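As a minimal sketch of the clustering idea itself (this is not the pitfalls pattern language - just a flat list of (offset, length) regions served in a single round trip; all names are illustrative), the remote side of such a batched request could look like:

```c
#include <assert.h>
#include <string.h>

/* One region of the remote file requested by the client. */
struct region { long offset; long len; };

/* Serve a clustered read request against an in-memory "file" (which
 * stands in for the remote file); gathers all regions back-to-back
 * into 'out'.  Returns total bytes gathered, or -1 if a region falls
 * outside the file.  Illustrative sketch, not a real protocol. */
long list_read (const char *file, long file_len,
                const struct region *regs, int nregs, char *out)
{
    long n = 0;
    for (int i = 0; i < nregs; i++) {
        if (regs[i].offset < 0 || regs[i].offset + regs[i].len > file_len)
            return -1;                       /* region out of bounds */
        memcpy (out + n, file + regs[i].offset, regs[i].len);
        n += regs[i].len;
    }
    return n;
}
```

The whole list crosses the wire once, so the per-operation latency cost is paid a single time - but note that for fine-grained access (e.g. one element per region) the request list itself can outweigh the returned data, which is the con Andre raises against readv-style interfaces.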
E (I already wrote that on the gat-devel-list):
You could submit a process to your archive which extracts the data for you and
registers the result as a new logical file. On the client side you could wrap
it in a nice library hiding the job submission stuff and on the
server/archive side you would prepare some executables for your tasks:
extract_hyperslab_from_hdf5
I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of GridFTP. The advantage of simply embedding that functionality into the higher-level concept of asynchronous calls is that if the underlying library does *not* support the async operations (or some subset of the operations cannot be performed asynchronously), you can always perform the operations synchronously and still be able to present the same interface.
I do not like plan A or B for the reasons you state. I do not like plan C because it is too tightly tied to a specific data transfer system implementation. I would propose a plan D that simply augments the Task interface of SAGA. For example, you could allow the user to fire off a number of async read operations:

  Task handle1 = channel.read();
  Task handle2 = channel.read();
  container.addTask(handle1);
  container.addTask(handle2);
  container.waitAll();
The read operations in this example can be submitted as an eRead operation, or they can run in separate threads, or they can simply be executed synchronously when you call waitAll() (this is in fact how some of the async MPI I/O was done on the first SGI Origin machines... it was meant to look asynchronous, but in fact the calls did not initiate until you did a "Wait" on them).
Anyways, using the task interface provides more degrees of freedom for implementing async I/O than simply supporting the GridFTP way of doing things, and it meshes gracefully with I/O implementations that do *not* offer an underlying async execution model.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()", which allow you to specify the file offset together with the read or write. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
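A sketch of why a stateless readFrom() matters for the task interface: with POSIX pread() the offset travels with the call, so several queued read tasks can share one file handle without racing on the shared file pointer. The rf_task / readFrom / run_two_reads names below are illustrative, not SAGA API; waitAll() here degenerates to joining the threads, and a synchronous fallback would simply call readFrom() in-line instead.

```c
#include <assert.h>
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <unistd.h>

/* One pending "task": which handle, where in the file, where to put it. */
struct rf_task { int fd; long off; char *buf; size_t len; };

/* The task body: a stateless positioned read - no lseek(), no shared
 * file-pointer state, hence safe to run concurrently on one fd. */
static void *readFrom (void *arg)
{
    struct rf_task *t = (struct rf_task *) arg;
    ssize_t n = pread (t->fd, t->buf, t->len, t->off);
    return (void *)(long) n;
}

/* Fire two read "tasks", then waitAll() == join both threads.
 * Returns the total number of bytes read, or -1 on open failure. */
long run_two_reads (const char *path, char *a, char *b, size_t len)
{
    int fd = open (path, O_RDONLY);
    if (fd < 0) return -1;
    struct rf_task t1 = { fd, 0,          a, len };
    struct rf_task t2 = { fd, (long) len, b, len };
    pthread_t p1, p2;
    pthread_create (&p1, NULL, readFrom, &t1);
    pthread_create (&p2, NULL, readFrom, &t2);
    void *r1, *r2;
    pthread_join (p1, &r1);
    pthread_join (p2, &r2);
    close (fd);
    return (long) r1 + (long) r2;
}
```

With a stateful read()/seek() pair the two tasks above would corrupt each other's position; that is exactly the problem John's readFrom()/writeTo() proposal avoids.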
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
consider the following use case for remote IO. Given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read (offset, length) is agnostic to the 2D layout.
For more complex operations (subsampling, getting a piece of a jpg file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote IO (for example, fetching a 512-row subfield one row at a time over a 50 ms round trip already costs over 25 seconds in latency alone).
For that reason, I think that the remote file IO as currently specified by SAGA's Strawman will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about file structure
        - easily implementable on application level
   Con: - getting the header info of a 1GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to the application
        - efficient
   Con: - need to know about dependencies between reads (a header read needed to determine the size of a field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) data specific remote ops: send a high level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
The last approach (C) is the one I have the best experiences with. Also, that is what GridFTP as a common file access protocol supports via its ERET/ESTO operations.
I want to propose including an extension along the lines of C in the File API of the strawman; it basically maps well to GridFTP, but should also map to other implementations of approach C.
That extension would look like:
  void lsEModes (out array<string> emodes );

  void eWrite   (in  string emode,
                 in  string spec,
                 in  string buffer,
                 out long   len_out );

  void eRead    (in  string emode,
                 in  string spec,
                 out string buffer,
                 out long   len_out );

  - hooks for gridftp-like opaque ERET/ESTO features
  - spec:  string for pattern as in GridFTP's ESTO/ERET
  - emode: string for ident.  as in GridFTP's ESTO/ERET
  EMode:        a specific remote I/O command supported
  lsEModes:     list the EModes available in this implementation
  eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
  my $file   = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
  my @emodes = $file->lsEModes ();

  if ( grep (/^jpeg_block$/, @emodes) ) {
    my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
  }
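The "22x4+7+8" spec string is opaque to SAGA itself; reading it as an X-geometry-style "WxH+X+Y" block (my interpretation, not something the proposal fixes), the server-side handler for a jpeg_block emode could start with a parser like this:

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical parser for a "WxH+X+Y" block spec as in the example
 * above: W columns by H rows, starting at column X, row Y.
 * Returns 1 on success, 0 on a malformed spec. */
int parse_block_spec (const char *spec, int *w, int *h, int *x, int *y)
{
    return sscanf (spec, "%dx%d+%d+%d", w, h, x, y) == 4;
}
```

The point of the emode design is exactly that such interpretation lives entirely on the server side: the API only ships the string through.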
I would discourage support for B, since I do not know any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on the application level, or within any SAGA implementation, there is no need to support it at the API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
I support this model in the more generic way I proposed. It's a more accurate model because you explicitly submit a job when you are doing (potentially) complex operations, and it allows for better performance. I think John is right when saying that eRead is tricky and hard to use; I also agree with Andre that a simpler model is even more limiting than eRead. Any other opinions? Andrei
E (I already wrote that on the gat-devel-list):
You could submit a process to your archive which extracts the data for you and registers the result as a new logical file. On the client side you could wrap it in a nice library hiding the job submission stuff and on the server/archive side you would prepare some executables for your tasks: extract_hyperslab_from_hdf5
<hyperslab> compress_file ... It shouldn't be that hard to prevent users from executing other executables on the server.
This method is async and you can use the job interface to check for the status of your conversion job.
I think E is a good model for large operations. Basically, you do data preprocessing on request. It is good for cases where you can accept a round trip time of some seconds to minutes. I think it won't help for interactive visualization though... Andre. Quoting [Andrei Hutanu] (Jun 14 2005):
I support this model in the more generic way I proposed. It's a more accurate model because you explicitly submit a job when you are doing (potentially) complex operations, and it allows for better performance.
I think John is right when saying that eRead is tricky and hard to use; I also agree with Andre that a simpler model is even more limiting than eRead. Any other opinions?
Andrei
Ah, by the way: (Grid)RPC is close to that model: just exchange the remote job for a remote procedure :-) Or a remote service request :-)) In fact, I think RPC lies just between eRead as in (C) and remote jobs as in (E). Cheers, Andre. Quoting [Andre Merzky] (Jun 14 2005):
I think E is a good model for large operations. Basically, you do data preprocessing on request. It is good for cases where you can accept a round trip time of some seconds to minutes. I think it won't help for interactive visualization though...
Andre.
On Jun 14, 2005, at 10:12 AM, Andrei Hutanu wrote:
I support this model in the more generic way I proposed. It's a more accurate model because you explicitly submit a job when you are doing (potentially) complex operations, and it allows for better performance.
I think John is right when saying that eread is tricky and hard to use, I also agree with Andre that a simpler model is even more limiting than eread.
I agree with that statement completely. I think pread/readv is a bit *too* simple. However, we should look at some more elemental interfaces for describing patterned read/write operations. I'm quite interested in following up on some of the leads that Thorsten gave us, for instance.
Any other opinions?
Andrei
E (I already wrote that on the gat-devel-list):
You could submit a process to your archive which extracts the data for you and registers the result as a new logical file. On the client side you could wrap it in a nice library hiding the job submission stuff and on the server/archive side you would prepare some executables for your tasks: extract_hyperslab_from_hdf5
<hyperslab> compress_file ... It shouldn't be that hard to prevent users from executing other executables on the server.
This method is async and you can use the job interface to check for the status of your conversion job.
I think there is a 4th possibility. If each of the I/O operations can be requested asynchronously, then you can get the same net effect as the ERET/ESTO functionality of the GridFTP. The advantage of simply embedding that functionality into the higher-level concept of asynchronous calls is that if the underlying library does *not* support the async operations (or some subset of the operations cannot be performed asynchronously) , you can always perform the operations synchronously and still be able present
I do not like plan A or B for the reasons you state. I do not like plan C because it is too tightly tied to a specific data transfer system implementation. I would propose a plan D that simply augments the Task interface of SAGA. For example, you could allow the user to fire off a number of async read operations:

  Task handle1 = channel.read ();
  Task handle2 = channel.read ();

  container.addTask (handle1);
  container.addTask (handle2);
  container.waitAll ();
The read operations in this example can be submitted as an eRead operation or they can be in separate threads, or they can simply be executed synchronously when you call waitAll() (this is in fact how some of the async MPI I/O was done on the first SGI origin machines... it was meant to look asynchronous, but in fact the calls did not initiate until you did a "Wait" for them).
Anyways, using the task interface provides more degrees of freedom for implementing async I/O than simply supporting the GridFTP way of doing things, and it meshes gracefully with I/O implementations that do *not* offer an underlying async execution model.
The only modification that would be useful to add to the tasking interface is a notion of "readFrom()" and "writeTo()", which allow you to specify the file offset together with the read. Otherwise, the statefulness of the read() call would make the entire "task" interface useless with respect to file I/O.
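Sketched in code, plan D with explicit offsets might look like the following. `Channel`, `TaskContainer`, and `read_from` are invented names mirroring the snippet above, and Python's thread pool merely stands in for whatever task machinery a SAGA implementation would use:

```python
# Sketch of "plan D": stateless offset reads fired as tasks and
# collected via a container. All names (Channel, TaskContainer,
# read_from) are hypothetical; only the pattern is from the mail.
import os
from concurrent.futures import ThreadPoolExecutor

class Channel:
    """Wraps a file; read_from() carries an explicit offset, so
    concurrent tasks do not race on a shared file position."""
    def __init__(self, path):
        self.fd = os.open(path, os.O_RDONLY)

    def read_from(self, offset, length):
        return os.pread(self.fd, length, offset)

class TaskContainer:
    def __init__(self):
        self.pool = ThreadPoolExecutor(max_workers=4)
        self.tasks = []

    def add_task(self, fn, *args):
        self.tasks.append(self.pool.submit(fn, *args))

    def wait_all(self):
        # Results come back in submission order, even if the backend
        # ran everything synchronously inside wait_all().
        return [t.result() for t in self.tasks]

if __name__ == "__main__":
    with open("/tmp/demo.dat", "wb") as f:
        f.write(b"0123456789")
    ch = Channel("/tmp/demo.dat")
    c = TaskContainer()
    c.add_task(ch.read_from, 0, 4)   # bytes 0..3
    c.add_task(ch.read_from, 6, 4)   # bytes 6..9
    print(c.wait_all())              # [b'0123', b'6789']
```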
-john
On Jun 12, 2005, at 11:02 AM, Andre Merzky wrote:
Hi again,
consider the following use case for remote I/O. Given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read (offset, length) is agnostic of the 2D layout.
For more complex operations (subsampling, getting a piece of a jpg file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote I/O.
For that reason, I think that the remote file I/O as specified by SAGA's Strawman as-is will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about the file structure
        - easily implementable on application level
   Con: - getting the header info of a 1 GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to the application
        - efficient
   Con: - need to know about dependencies between reads (a header read needed to determine the size of a field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) data-specific remote ops: send a high-level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
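The latency argument can be made concrete with some back-of-the-envelope arithmetic. The numbers below are purely illustrative assumptions for a wide-area link, not measurements:

```python
# Illustrative latency model: reading 512 rows of a 2D sub-block,
# comparing one remote read per row (naive use of plain read) against
# one clustered / high-level request (approaches B and C).
rtt = 0.05          # assumed 50 ms round trip
rows = 512          # one read per row of the sub-block
row_bytes = 4096    # bytes per row
bandwidth = 10e6    # assumed 10 MB/s

naive     = rows * (rtt + row_bytes / bandwidth)   # one request per row
clustered = rtt + rows * row_bytes / bandwidth     # one request total

# With these numbers: naive is about 26 s, clustered about 0.26 s;
# nearly all of the naive time is round trips, not data transfer.
print(f"naive: {naive:.1f} s, clustered: {clustered:.2f} s")
```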
The last approach (C) is the one I have the best experience with. Also, that is what GridFTP, as a common file access protocol, supports via its ERET/ESTO operations.
I want to propose to include a C-like extension to the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of approach C.
That extension would look like:
void lsEModes (out array<string> emodes  );

void eWrite   (in  string emode,
               in  string spec,
               in  string buffer,
               out long   len_out );

void eRead    (in  string emode,
               in  string spec,
               out string buffer,
               out long   len_out );

- hooks for gridftp-like opaque ERET/ESTO features
- spec:  string for pattern as in GridFTP's ESTO/ERET
- emode: string for ident.  as in GridFTP's ESTO/ERET
EMode:        a specific remote I/O command supported
lsEModes:     lists the EModes available in this implementation
eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
my $file   = new SAGA::File ("http://www.google.com/intl/en/images/logo.gif");
my @emodes = $file->lsEModes ();

if ( grep (/^jpeg_block$/, @emodes) ) {
  my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
}
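On the server side, an eRead implementation boils down to a dispatch table from emode names to handlers that parse the spec string. A minimal sketch follows; the registry, the "block" emode, and its handler are all hypothetical, and the `WxH+X+Y` spec syntax just follows the jpeg_block example above:

```python
# Minimal sketch of a server-side emode dispatch for eRead. The emode
# registry and the "block" handler are invented for illustration; the
# spec syntax "WxH+X+Y" follows the jpeg_block example in the mail.
import re

EMODES = {}

def emode(name):
    """Register a handler under an emode name."""
    def register(fn):
        EMODES[name] = fn
        return fn
    return register

@emode("block")
def read_block(data, spec, row_len):
    """Extract a WxH byte block at offset (X, Y) from row-major data."""
    w, h, x, y = map(int, re.match(r"(\d+)x(\d+)\+(\d+)\+(\d+)", spec).groups())
    return b"".join(data[(y + r) * row_len + x : (y + r) * row_len + x + w]
                    for r in range(h))

def e_read(data, emode_name, spec, row_len):
    if emode_name not in EMODES:        # client should check lsEModes first
        raise ValueError(f"unsupported emode: {emode_name}")
    return EMODES[emode_name](data, spec, row_len)

# 4x4 field, rows "abcd", "efgh", "ijkl", "mnop":
field = b"abcdefghijklmnop"
print(e_read(field, "block", "2x2+1+1", row_len=4))  # b'fgjk'
```

lsEModes would then simply return the keys of the registry, so the client can probe for support before issuing the call, exactly as the Perl example does.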
I would discourage support for B, since I do not know of any protocol supporting that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support at API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
Quoting [John Shalf] (Jun 14 2005):
I agree with that statement completely. I think pread/readv is a bit *too* simple. However, we should look at some more elemental interfaces for describing patterned read/write operations. I'm quite interested in following up on some of the leads that Thorsten gave us for instance.
If you don't mind, I can give you a short version of the technique Thorsten refers to.

Assume you have binary data which is regularly structured, e.g. an rgb image of resolution x*y. With resolution 6*4 that looks like:

  rgb rgb rgb rgb rgb rgb
  rgb rgb rgb rgb rgb rgb
  rgb rgb rgb rgb rgb rgb
  rgb rgb rgb rgb rgb rgb

Or, as file stream:

  rgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgbrgb

each element (r, g or b) being one byte, say (any fixed element size works the same way).

Now, consider the request for a subsampled and subsetted version of that image: at offset (1,1) you want a 2*2 image with half resolution:

                 11 111 111
    012 345 678 901 234 567
  0 --- --- --- --- --- ---
  1 --- RGB --- RGB --- ---
  2 --- --- --- --- --- ---
  3 --- rgb --- rgb --- ---

(The two header rows number the bytes 0..17 of each line.)

An LS (line segment) can be used to describe a single rgb triplet to be read. For example, the first RGB above is:

  (l,r) = (3,5)

  l: left-most  byte -> 3
  r: right-most byte -> 5

A FALLS (family of line segments) can be used to describe a pattern. The line of RGBs above can be described as:

  (l,r,s,n) = (3,5,6,2)

  l: left-most  byte -> 3
  r: right-most byte -> 5
  s: stride between two consecutive l elements -> 6
  n: number of consecutive line segments -> 2

FALLS can be nested: another parameter is added to the set, which is in turn a FALLS. So the above subsampled, subsetted image would be:

  (1,1,2,2,(3,5,6,2))

That gives a sequence of FALLS, starting at line 1 (not 0), ending at line 1, repeating with stride 2, for 2 times.

You see, that maps pretty well to hyperslabs in HDF5, but fits basically all regularly structured binary data. It obviously does not work for compressed data, unstructured data, etc.

For reference, see: F. Isaila and W. Tichy. Clusterfile: A flexible physical layout parallel file system. Proceedings of IEEE Cluster Computing Conference, October 2001.

Thorsten, Andrei and I implemented that one for remote file access, and called it pread (pattern_read), which indeed worked nicely.
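Spelled out in code, a nested FALLS expands into absolute byte ranges of the flat file stream. The sketch below is my own illustration, not the Clusterfile/pread implementation: a FALLS is a tuple (l, r, s, n) with an optional nested inner FALLS, outer coordinates are in lines, inner ones in bytes within a line, and the line length is passed explicitly:

```python
# Expand a (possibly nested) FALLS into absolute (first, last) byte
# ranges of the flat file stream. Illustrative sketch only; a FALLS is
# (l, r, s, n) or (l, r, s, n, inner), with l/r in lines at the outer
# level and in bytes within a line at the inner level.
def falls_segments(falls, line_len=1):
    if len(falls) == 4:                    # innermost level: byte ranges
        l, r, s, n = falls
        return [(l + i * s, r + i * s) for i in range(n)]
    l, r, s, n, inner = falls              # outer level: line ranges
    segments = []
    for i in range(n):
        for line in range(l + i * s, r + i * s + 1):
            base = line * line_len         # byte offset of this line
            segments += [(base + a, base + b)
                         for a, b in falls_segments(inner)]
    return segments

# The subsampled 6*4 rgb image from above: lines 1 and 3, bytes 3-5
# and 9-11 within each line (18 bytes per line).
print(falls_segments((1, 1, 2, 2, (3, 5, 6, 2)), line_len=18))
# -> [(21, 23), (27, 29), (57, 59), (63, 65)]
```

The resulting segment list is exactly what a server would feed into a gather read before shipping one contiguous reply back to the client.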
Obviously, it's up to taste whether the pattern gets specified as a string or as a recursive data structure...

Cheers, Andre.
Hi List,
I went through the IO thread again, and also had a chat with
John Shalf, and I'd like to summarize the outcome of the
discussion. Please consider that as a joint proposal of
John and me for inclusion in the file IO methods.
Observations:
- normal read/write has severe drawbacks for remote IO, if
used extensively, both sync and async
- external preprocessing of data for read can be accomplished
by spawning preprocessing jobs
- async is well covered by the task model
- there exist various approaches to improve throughput
for IO intensive apps, amongst them:
- (A) gather/scatter (see readv(2))
- (B) FALLS (regular patterns on binary data)
- (C) eRead (see ERET/ESTO in GridFTP)
Remarks:
- the options A, B and C offer increasingly powerful
expressions, but also require increasing coordination
between client and server side.
- A is, being POSIX, well known
- B maps to hyperslabs pretty well, a seemingly common
access pattern
- C maps to GridFTP, a commonly used protocol, very well
Proposal:
- There seem to be advantages to each of A, B and C. Also,
the need for more than simple read seems obvious. Hence
we propose to include A, B and C in the SAGA API.
void readV (in array<ivec> ivec,
out array<string> buffers );
void writeV (in array<ivec> ivec,
in array<string> buffers );
void readP (in pattern pattern,
out string buffer,
out long len_out );
void writeP (in pattern pattern,
in string buffer,
out long len_out );
void lsEModes (out array<string> emodes  );

void readE    (in  string emode,
               in  string spec,
               out string buffer,
               out long   len_out );

void writeE   (in  string emode,
               in  string spec,
               in  string buffer,
               out long   len_out );

We think that adding these 7 calls does not bloat the API
(although it increases the number of file methods
significantly), but will make the API much more usable for
the targeted use cases.

Please comment :-)

Cheers, Andre.
Of course, I like the idea of adding pattern reads to SAGA. ;-)

At the same time I have the feeling that there must be a second document, something like "The Annotated SAGA Reference Manual", a tutorial, or sample apps written in SAGA. On the one hand you should document the ideas behind the API (why did you include readE, ....), and on the other hand you should show how to solve common problems ("see how easy it is to create a module for server-side data processing in SAGA").

Thorsten

On Friday 17 June 2005 21:34, Andre Merzky wrote:
[...]
Ah, am I seeing someone volunteering here? Great! :-D

A.

Quoting [Thorsten Schuett] (Jun 20 2005):
Of course, I like the idea adding pattern reads to saga. ;-)
At the same time I have the feeling that there must be second document. Something like the "The Annotated SAGA Reference Manual", a tutorial or sample apps written in SAGA. On the one hand you should document the ideas behind the API (why did you include readE, .... ) and on the other hand you should show how to solve common problems ("see how easy it is to create a module for server-side data processing in SAGA").
[...]
On Monday 20 June 2005 09:06, Andre Merzky wrote:
Ah, am I seeing someone volunteering here? Great! :-D

Ok, you may use the pattern library from GridLab :-p
Thorsten
:-P Quoting [Thorsten Schuett] (Jun 20 2005):
On Monday 20 June 2005 09:06, Andre Merzky wrote:
Ah, am I seeing someone volonteering here? Great! :-D Ok, you may use the pattern library from GridLab :-p
Thorsten
A.
Quoting [Thorsten Schuett] (Jun 20 2005):
Of course, I like the idea adding pattern reads to saga. ;-)
At the same time I have the feeling that there must be second document. Something like the "The Annotated SAGA Reference Manual", a tutorial or sample apps written in SAGA. On the one hand you should document the ideas behind the API (why did you include readE, .... ) and on the other hand you should show how to solve common problems ("see how easy it is to create a module for server-side data processing in SAGA").
Thorsten
On Friday 17 June 2005 21:34, Andre Merzky wrote:
Hi List,
I went through the IO thread again, and also had a chat with John Shalf, and I'd like to summarize the outcome of the discussion. Please consider that as a joint proposal of John and me for inclusion in the file IO methods.
Observations:
- normal read/write has severe drawbacks on remote IO, if used extensively, both sync and async
- external preprocessing of data for read can be accomplisehd by spawning preprocessing jobs
- async is well covered by the task model
- there exists various approaches to improve throughput for IO intensive apps, amongst them:
- (A) gather/scatter (see readv (2) - (B) FALLS (regular paterns on binary data) - (C) eRead (see ERET/ESTO in gridftp)
Remarks:
- the options A, B and C show increasing powerfull expressions, but also require increasing concertation between client and server side.
- A is, being POSIX, well known
- B maps to hyperslabs pretty well, a seemingly common access pattern
- C maps GridFTP, a commonly used protocol, very well
Proposal:
- There seem advantages to A, B and C. Also, the need for more than simple read seems obvious. Hence we propose to include A, B and C into the SAGA API.
void readV (in array<ivec> ivec, out array<string> buffers ); void writeV (in array<ivec> ivec, in array<string> buffers );
void readP (in pattern pattern, out string buffer, out long len_out ); void writeP (in pattern pattern, in string buffer, out long len_out );
void lsEModes (out array<string> emodes);
void readE  (in string emode, in string spec, out string buffer, out long len_out);
void writeE (in string emode, in string spec, in  string buffer, out long len_out);
We think that adding these 7 calls does not bloat the API (although it increases the number of file methods significantly), but will make the API much more usable for the targeted use cases.
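To illustrate how the B-style pattern calls reduce to A-style vectors, here is a sketch (Python, purely illustrative; the helper name and the row-major layout are our assumptions, not part of the proposed API): a 2D sub-block expands into one (offset, length) pair per row, which a readV-style call could ship in a single round trip.

```python
# Sketch (not the SAGA API): expand a 2D sub-block request into the
# (offset, length) vector a readV-style call could take in one request.
# Assumes a row-major binary field; all names here are illustrative.

def hyperslab_to_ivec(field_width, elem_size, x, y, w, h):
    """Return one (offset, length) pair per requested row of the block."""
    row_bytes = field_width * elem_size
    return [(row * row_bytes + x * elem_size, w * elem_size)
            for row in range(y, y + h)]

# a 4x3 block at (10, 20) in a 1024-wide field of 8-byte elements:
ivec = hyperslab_to_ivec(1024, 8, 10, 20, 4, 3)
# plain read() would need 3 round trips; readV ships all 3 pairs at once
```

The same expansion done server-side is exactly what makes readP cheaper than a sequence of plain reads: the pattern travels once, the pairs never cross the wire.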
Please comment :-)
Cheers, Andre.
Quoting [Andre Merzky] (Jun 12 2005):
Hi again,
consider the following use case for remote IO. Given a large binary 2D field on a remote host, the client wants to access a 2D sub-portion of that field. Depending on the remote file layout, that usually requires more than one read operation, since the standard read (offset, length) is agnostic to the 2D layout.
For more complex operations (subsampling, getting a piece of a jpg file), the number of remote operations grows very fast. Latency then strongly discourages that type of remote IO.
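To make the latency argument concrete, a back-of-the-envelope sketch (Python; the numbers are illustrative assumptions): with a row-major layout, each requested row of a sub-block costs one round trip, so latency alone dominates quickly.

```python
# Back-of-the-envelope sketch: cost of fetching a 2D sub-block with plain
# read(offset, length) calls.  In a row-major layout each requested row is
# one contiguous region, hence one round trip.  Numbers are illustrative.

def naive_read_ops(rows_requested):
    # one (offset, length) read per row of the sub-block
    return rows_requested

def naive_latency_ms(rows_requested, rtt_ms):
    # time spent waiting on round trips alone, ignoring bandwidth
    return naive_read_ops(rows_requested) * rtt_ms

# a 512-row sub-block over a link with 50 ms round-trip time:
print(naive_latency_ms(512, 50))  # 25600 ms of pure latency
```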
For that reason, I think that the remote file IO as currently specified by SAGA's Strawman will only be usable for a limited and trivial set of remote I/O use cases.
There are three (basic) approaches:
A) get the whole thing, and do ops locally
   Pro: - one remote op
        - simple logic
        - remote side doesn't need to know about file structure
        - easily implementable on application level
   Con: - getting the header info of a 1 GB data file comes with, well, some overhead ;-)
B) clustering of calls: do many reads, but send them as a single request.
   Pro: - transparent to application
        - efficient
   Con: - need to know about dependencies between reads (a header read needed to determine the size of a field), or include explicit 'flushes'
        - need a protocol to support that
        - the remote side needs to support that
C) data specific remote ops: send a high level command, and get exactly what you want.
   Pro: - most efficient
   Con: - need a protocol to support that
        - the remote side needs to support that _specific_ command
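To illustrate approach C, a minimal sketch (Python, purely illustrative; the emode names, spec syntax and dispatch table are assumptions, not GridFTP's actual ERET/ESTO wire protocol): the server keeps a table of named commands and applies the one the client names, driven by an opaque spec string.

```python
# Illustrative sketch of approach C: the server holds a dispatch table of
# named commands (emodes) and applies the requested one to the data, with
# the spec string interpreted per emode.  All names here are made up.

EMODES = {
    # "partial": return the byte range start:stop given in the spec
    "partial": lambda data, spec: data[slice(*map(int, spec.split(":")))],
    # "subsample": return every n-th byte, with n given in the spec
    "subsample": lambda data, spec: data[::int(spec)],
}

def e_read(data, emode, spec):
    if emode not in EMODES:
        raise ValueError("emode not supported: " + emode)
    return EMODES[emode](data, spec)

payload = bytes(range(100))
print(len(e_read(payload, "partial", "10:20")))  # 10
print(len(e_read(payload, "subsample", "4")))    # 25
```

The point of the pattern is exactly the Con listed above: the client gets precisely the bytes it wants in one round trip, but only for commands the remote side already knows.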
The last approach (C) is the one I have the best experience with. Also, that is what GridFTP as a common file access protocol supports via ERET/ESTO operations.
I want to propose including a C-like extension to the File API of the strawman, which basically maps well to GridFTP, but should also map to other implementations of approach C.
That extension would look like:
void lsEModes (out array<string> emodes);
void eWrite (in string emode, in string spec, in  string buffer, out long len_out);
void eRead  (in string emode, in string spec, out string buffer, out long len_out);
- hooks for GridFTP-like opaque ERET/ESTO features
- spec: string for pattern as in GridFTP's ESTO/ERET
- emode: string for identifier as in GridFTP's ESTO/ERET
EMode: a specific remote I/O command supported
lsEModes: list the EModes available in this implementation
eRead/eWrite: read/write data according to the emode spec
Example (in perl for brevity):
my $file = SAGA::File->new ("http://www.google.com/intl/en/images/logo.gif");
my @emodes = $file->lsEModes ();
if ( grep (/^jpeg_block$/, @emodes) ) {
  my ($buff, $len) = $file->eRead ("jpeg_block", "22x4+7+8");
}
I would discourage support for B, since I do not know of any protocol that supports that approach efficiently, and it also needs approximately the same infrastructure setup as C.
As A is easily implementable on application level, or within any SAGA implementation, there is no need for support on API level -- however, A is insufficient for all but some trivial cases.
Comments welcome :-))
Cheers, Andre.
-- +-----------------------------------------------------------------+ | Andre Merzky | phon: +31 - 20 - 598 - 7759 | | Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 | | Dept. of Computer Science | mail: merzky@cs.vu.nl | | De Boelelaan 1083a | www: http://www.merzky.net | | 1081 HV Amsterdam, Netherlands | | +-----------------------------------------------------------------+
participants (6)
- Andre Merzky
- Andrei Hutanu
- Hartmut Kaiser
- John Shalf
- Jon MacLaren
- Thorsten Schuett