
Appended below is feedback on the API from six Use Case authors, along with some SAGA counter-comments, some of which were discussed in the SAGA-RG session on Mon 03 Oct. Use case authors were asked the following:

======
Subject: SAGA API specification feedback

Hi,

This mail reaches you because you are listed as contact for a Use Case [1] submitted to the SAGA Research Group [2] at GGF [3].

The SAGA API v0.2 specification, which is based on the submitted use cases (and other input), has stabilized sufficiently over the past few months and is now rapidly converging toward a GGF submission.

We would be grateful if you, as potential 'clients' of the SAGA API, could review the current draft and verify that it indeed serves your use cases. Please also let us know your frank opinion on whether the current spec satisfies the all-important "S" -- Simple -- in SAGA. Any feedback on what you might like to see done differently will also be very useful.

Also, we would like to invite you to a dedicated session at GGF15 [4] (exact time and date to be announced) to discuss the mappings of the API to the use cases.

The Use Case collection (including your use case) can be found at [5]. The SAGA API draft is available at [6]; a short version of the spec containing the API only can be found at [7]. For general information on the SAGA group, please check [8] and [9].

With best regards,

  SAGA-RG.
[1] http://wiki.cct.lsu.edu/saga/space/start/use-cases.pdf
[2] http://forge.gridforum.org/projects/saga-rg/
[3] http://www.ggf.org/
[4] http://www.ggf.org/ggf_events_ggf15.htm
[5] http://wiki.cct.lsu.edu/saga/space/start/use-cases.pdf
[6] http://wiki.cct.lsu.edu/saga/space/start/strawman-api-v0.2.pdf
[7] http://wiki.cct.lsu.edu/saga/space/start/strawman-api-v0.2.short.pdf
[8] http://forge.gridforum.org/projects/saga-rg/
[9] http://wiki.cct.lsu.edu/saga/space/start

====

And here are their responses:

Grid SuperScalar:
-----------------

The SAGA API version 0.2 is useful for our requirements, but we find some things missing. We comment on each part of the API in more detail:

- Session and Context: We don't have any special requirements in this area.

- Error: No information given in the specification.

- Task: As it is supposed to be the asynchronous version of each SAGA API method, we think this may also be useful for us for asynchronous job submission, like that achieved with globus_gram_client_register_job_request in contrast with globus_gram_client_job_request.

- Attributes and Name Spaces: No special comments.

- Files and Logical Files: These two interfaces are very good if you want to access a remote file (in the first case) or to work with a replica location system (in the second case), but nothing is provided to copy files from one machine to another if you don't want to use a replica system. In our use case, our run-time is aware of the location of files, so an easy mechanism to copy files between machines must be provided. This is not easy to do with the API as it stands. A possible solution is to include copy, move, or erase methods for files in the File API.

<<< Comment Andre:
--------------
I think there is a misunderstanding. The file package inherits from the NameSpace package. That is, a saga::file implements the saga::name_space_entry interface, and a saga::directory implements the saga::name_space_directory interface.
All operations which are agnostic to the content of the file (such as create, copy, move, open, rename, list, ...) are defined in the namespace interface. That interface is also inherited by the logical_file package, so the same methods are available there. Hence, the logical_file and file packages only define those methods which distinguish them from simple name spaces.
- Jobs: It covers our needs for describing a job (also in terms of adding restrictions to a job), and we think the job state diagram is complete. The only thing we find missing is a call for waiting until a notification arrives (whatever the notification is). We submit several jobs at the same time, and we need to receive notifications of the states of those jobs in order to take actions in our run-time. So we follow a notification model instead of a polling model (we wait for notifications to arrive, instead of polling for state changes from our run-time). In our opinion the API would be more complete if it included both models for job state control (polling and notification), as this gives more freedom to the API user.

- Streams: This API is also useful for us, to achieve communication between the workers and the master in an easy way, exchanging a reduced amount of information.

======

Pascal Kleijer:
---------------

Here are some pointers on what can be done for future revisions.

- Naming. The different APIs do not all respect the same naming pattern. Some, like files, are a direct UNIX-style command line translation (i.e. "ls" should be "list"). I would recommend uniform method, attribute, and constant naming. If you use OO design, then stick with the OO paradigm. Use full names instead of acronyms or abbreviations, unless they are commonly known and used, like URL, HTTP, CPU, etc. SAGA_NumCpus should be SAGA_NumberCpus or SAGA_CPUCount. This makes the source code much easier to read than cryptic names.

- The use of all upper case or all lower case in naming is subject to discussion. But by habit all constants are in upper case, and attributes and methods in lower case unless they are composite names.

- Use of "_" in names is C-style programming. In OO it is only used if upper/lower case mixed naming cannot be used, for example in a constant. So "byte_written" becomes "byteWritten", and "SAGA_JobCmd" would become "SAGA_JOB_COMMAND".
Depending on who writes each API, you can see the writer's main coding language influence. For more information about code conventions, Sun Microsystems has a good tutorial: http://java.sun.com/docs/codeconv/. Yes, it is for Java, but it can be applied to any OO-based or procedural language.

- Typos: well, there are a number of typos to be removed. OK, it is still a v0.2 ;)

- In the stream API, for the "write" and "read" methods: why not add an 'offset' attribute to the calls? This might be language specific, but in Java, for example, you cannot just shift the initial pointer as in C/C++, so the data always has to start at 0. Forcing buffers to be used at index 0 all the time might not be welcome, and additional programming overhead will be necessary to use the API.

======

GridLab: Application Migration
------------------------------

+-----------------------------------------------------------------+
The SAGA API allows any job it can handle with the job class to be migrated, using the migrate method. That provides an easy solution for the GridLab migration use case, if supported by the implementation/middleware/backend:

--------------------------------------------------------------
#include <saga.hpp>

#include <vector>
#include <string>

using namespace std;

int main ()
{
  saga::job_server js;
  saga::job j = js.run_job ("remote.host.net", "my_app");

  saga::job_definition jd = j.get_job_definition ();

  vector <string> hosts;
  vector <string> files;

  hosts.push_back (string ("near.host.net"));
  files.push_back (string ("http://remote.host.net/file > "
                           "http://near.host.net/file"));

  jd.set_vector_attribute ("SAGA_HostList",     hosts);
  jd.set_vector_attribute ("SAGA_FileTransfer", files);

  j.migrate (jd);

  cout << "Heureka!" << endl;

  return (0);
}
--------------------------------------------------------------

(Question: does the SAGA migrate call move checkpoint files automatically, or do they need to be specified in the new job description as above?)
However, for the complete use case to be implemented at the application level, a number of steps cannot be implemented in SAGA. The call sequence would be:

In the application instance which performs the migration on the other job:
  - trigger migration for the remote job
  - discover a new resource
  + move checkpoint data to the new resource
  + schedule the application on the new resource
  + continue computation (and discontinue the old job)

In the application instance which gets migrated:
  - get triggered for checkpointing
  = perform application level checkpointing
  - report checkpoint file location(s)

Items marked with
  + are possible to implement in SAGA
  - are impossible to implement in SAGA
  = are (currently) not related to SAGA.

For the complete implementation of the use case, SAGA misses:

1) means to communicate with the remote application instance
2) means to discover new resources

Notes:

1) Means of communication are actually given, but not per se usable for this use case. E.g. streams are definite overkill for signalling checkpointing requests. Signals (as in job.signal (int signal)) would work, but only if the remote job uses signal handling as a checkpoint trigger. That also might be difficult to use if the job is running in a wrapper script, or in a virtual machine, etc. -- that might not be transparent to SAGA, and would require direct communication. Also, the signalling method misses feedback about the success of the operation, and cannot return information such as the location of checkpoint files.

2) The current SAGA API covers job submission to specific hosts, or lets the middleware choose a suitable host for submission. However, the brokering result is not exposed at the API level, as would be necessary for this specific use case, and possibly for other dynamically active Grid applications. One way to implement that is to provide a direct interface to Grid information systems, and in that way expose information about available resources. That would actually be more flexible, as it e.g. also allows the discovery of specific services, but would also require additional semantic knowledge at the application level.
+-----------------------------------------------------------------+

======

Univ of Vienna:
Rainer Schmidt:
---------------

I had a look at the short version of the SAGA spec and send you some comments: I think the first tier covers all (and even more) of the general aspects of our use cases. Especially the file and job interfaces could be (partially) mapped well against our middleware. We require basic remote file handling, job submission, monitoring, and state inquiry. I'm not yet sure if I got Session and Context right, but I think it would be feasible to integrate with our API, and very helpful. I also like the idea of having an asynchronous API and the task interface. Of course, we would need to extend the API for accessing our VGE middleware specific services, e.g. service discovery, QoS negotiation, or resource reservation. Hope this helps a little!

<<< use GridRPC?
=====

SCOOP [LSU]:
------------

We really like the Session API and Logical Files API... they would help a lot. Everything looks OK and should be extremely valuable to SCOOP.

For SCOOP v2 our job submission chain was GRAM job manager -> Condor Master -> Condor Pool. Reason: the Portal was limited to GRAM submissions only. One of the main difficulties we faced was the interpretation of error codes. It was very confusing what the exact error code meant, since there were layers of RMs. I wonder if SAGA can help resolve such complexities. :)

<<< Comment Andre:
--------------
Yes and no -- if a SAGA implementation reports errors badly (maybe because it does not get good error messages from the middleware), we cannot do much apart from saying "Uh, bad!". However, at least you will have one, and only one, consistent and reliable way of error reporting at the application level. The submission chain will be completely hidden. However, if your grid requires the chain, you still need to implement it _somewhere_, be it in a SAGA adaptor, or behind a gatekeeper...
Feedback from Anonymous:
------------------------

We will not be using the SAGA API for our use case. SAGA evolved into something very different from what we originally found potentially useful for our use case (three calls: authorize(), copy_file(), and run_job()). We aren't interested in POSIX-like file semantics, threads, or the complexity of the API.

<<< Comment Andre:
--------------
Hmm, I checked their use case. It is not trivial (e.g. it includes database access, steering, and visualization, at least to some extent). Also, a 'three call auth' we have (actually, we have a zero call minimum and a 5 call maximum). run_job and copy_file we have as well; both need exactly 2 calls in total:

  {
    saga::directory dir;
    dir.copy (src, target);

    saga::job_server server;
    server.run_job ("remote.host.net", "/bin/myjob");
  }

So, it might be that they perceive SAGA as too complex, and that should be kept in mind, but the criticism is at least not well formulated, I think.
======