RE: [ogsa-wg] RE: Modeling State: Technical Questions

Hi Paul,
Moving the question from "can I suspend multiple jobs by sending a single message to a resource (either REST or WS-Resource)" to whether this is a good thing.
There is a balance between simplicity and efficiency - using a single message introduces more complexity, as Steve Loughran illustrated, but is potentially more efficient than sending multiple messages.
Remembering that "premature optimisation is the root of all evil" (Knuth) - is adding support for suspending multiple jobs using a single message an example of premature optimisation?
I would imagine that this should be a straightforward question since there is already considerable experience in using computational grids. Are users demanding the ability to suspend multiple jobs using a single message? Is it for improved efficiency reasons? From my experience no, but others on this list will have considerably more experience.
Could this be a case of "worse is better", where simplicity is more important than efficiency?
Perhaps there are other reasons for using a single message to interact with multiple jobs?
cheers
Mark
Ian,
I agree that this is good progress. So let's bank that and see if we can agree on one more thing, and then I'll ask a question.
Considering your list of abilities (a, b & c) below, do we agree that in terms of expressiveness, the ordering is:
c>b>a
i.e. using approach c, a client can request operations on:
a) single jobs: "where (jobid = urn:guid:364)"
b) sets of jobs: "where (jobid = urn:guid:364) or (jobid = urn:guid:401)"
If there is agreement on this, then we could move on to discussing why it is felt necessary to provide more than just c for the job submission service.
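The ordering c>b>a can be illustrated with a small sketch in which all three kinds of request reduce to a predicate over the jobs the service already tracks. This is an illustration only: GridJob, its fields, and the selector names are assumptions made for the example and do not come from any OGSA or WSRF specification.

```java
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Sketch only: GridJob and its fields are assumptions, not part of any OGSA spec.
final class JobSelectors {
    // (a) one job id
    static Predicate<GridJob> byId(String id) {
        return job -> job.id.equals(id);
    }

    // (b) a user-supplied list of job ids
    static Predicate<GridJob> byIds(Set<String> ids) {
        return job -> ids.contains(job.id);
    }

    // (c) a more abstract criterion: jobs of one owner that ran longer than a limit
    static Predicate<GridJob> ownedAndRunningLongerThan(String owner, long hours) {
        return job -> job.owner.equals(owner) && job.hoursRunning > hours;
    }

    // The service applies whichever predicate it is given to the jobs it tracks.
    static List<String> select(List<GridJob> jobs, Predicate<GridJob> criteria) {
        return jobs.stream().filter(criteria).map(job -> job.id).collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<GridJob> jobs = List.of(
                new GridJob("urn:guid:364", "paul", 2),
                new GridJob("urn:guid:401", "paul", 30),
                new GridJob("urn:guid:512", "ian", 48));
        System.out.println(select(jobs, byId("urn:guid:364")));
        System.out.println(select(jobs, byIds(Set.of("urn:guid:364", "urn:guid:401"))));
        System.out.println(select(jobs, ownedAndRunningLongerThan("paul", 24)));
    }
}

final class GridJob {
    final String id;
    final String owner;
    final long hoursRunning;

    GridJob(String id, String owner, long hoursRunning) {
        this.id = id;
        this.owner = owner;
        this.hoursRunning = hoursRunning;
    }
}
```

Cases (a) and (b) are just particular predicates, which is the sense in which (c) is the most expressive of the three.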
Regards
Paul
________________________________
From: Ian Foster [mailto:foster@mcs.anl.gov]
Sent: 05 April 2005 17:59
To: Savas Parastatidis; Steve Loughran
Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg; dave.pearson@oracle.com; gray@microsoft.com; humphrey@cs.virginia.edu; grimshaw@virginia.edu; aherbert@microsoft.com; gcf@indiana.edu; mark.linesch@hp.com; Frank Siebenlist; Tony Hey; Dave Berry; Paul Watson
Subject: RE: [ogsa-wg] RE: Modeling State: Technical Questions
[I'm feeling increasingly bad about sending email to all of the people CCed here, who may not be interested in these issues at all but got addressed by Tony long ago...]
Savas:
It seems that we are in agreement, then, that we want the ability to:
a) Request operations on individual jobs identified by some sort of "jobid"
b) Request operations on sets of jobs identified by a user-supplied list of "jobids"
c) Request operations on sets of jobs identified by more abstract criteria
We also agree that (as I expressed in the email that started this discussion) such requests can be expressed in a few different ways, with somewhat different characteristics.
That's progress I hope.
Ian.
At 02:44 PM 4/5/2005 +0100, Savas Parastatidis wrote:
Dear Ian,
I don't think that the approach I proposed forces the user to do more than they would have to do anyway if EPRs were used. It is still the case that someone has to manage the EPRs to the resources in WSRF. This is similar to what happens in the real world. The online bookstore will ask for my credit card number (a URI), or the book store will ask for an ISBN (another URI) or multiple ISBNs if I want to buy multiple books. The banking service will ask for my bank account number (another URI perhaps).
Also, there is no reason why a "kill all my jobs" message couldn't also be supported. But please note that this message is now addressed to the service (the container of resources) and not, as in the case of WSRF, to a specific resource. This is no different from what I am advocating.
Also... to Steve's point about partial failure. If one wishes atomic transaction semantics, I don't see the difference between the two approaches...
Atomic
    Msg -> resource 1
    Msg -> resource 2
    Msg -> resource 3
End Atomic
Vs
Msg
    Atomic
        Resource 1
        Resource 2
        Resource 3
    End Atomic
In fact, I would argue that the latter is better because:
1. It uses fewer messages (and, Steve, I am not assuming only HTTP and the optimisations that may be supported)
2. I can more easily deal with the failures in an application-specific manner since my atomic TX semantics do not span multiple msgs.
(Anyway... who wants to do atomic TXs over the Web anyway? :-)
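The second layout, where one message carries all the targets and the atomic scope sits entirely inside the service, might look roughly like the sketch below. It is only an illustration: AtomicJobTable, the state names, and the validate-then-apply strategy are assumptions for the example, not anything WSRF or a particular grid service defines.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical service-side state table; the state names are assumptions.
final class AtomicJobTable {
    private final Map<String, String> states = new HashMap<>();

    synchronized void register(String jobId) {
        states.put(jobId, "RUNNING");
    }

    synchronized String stateOf(String jobId) {
        return states.get(jobId);
    }

    // One message, one atomic scope: validate every target first, then apply.
    // Returns false and changes nothing if any listed job cannot be suspended.
    synchronized boolean suspendAllOrNothing(List<String> jobIds) {
        for (String id : jobIds) {
            if (!"RUNNING".equals(states.get(id))) {
                return false;
            }
        }
        for (String id : jobIds) {
            states.put(id, "SUSPENDED");
        }
        return true;
    }

    public static void main(String[] args) {
        AtomicJobTable table = new AtomicJobTable();
        table.register("resource-1");
        table.register("resource-2");
        // "resource-3" was never registered, so nothing is changed.
        System.out.println(table.suspendAllOrNothing(List.of("resource-1", "resource-3")));
        System.out.println(table.suspendAllOrNothing(List.of("resource-1", "resource-2")));
    }
}
```

Because the whole request is handled inside one call, the service can choose its own failure policy without coordinating a transaction across several messages.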
Regards,
-- Savas Parastatidis http://savas.parastatidis.name
From: Ian Foster [mailto:foster@mcs.anl.gov]
Sent: Tuesday, April 05, 2005 2:22 PM
To: Steve Loughran; Savas Parastatidis
Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg; dave.pearson@oracle.com; gray@microsoft.com; humphrey@cs.virginia.edu; grimshaw@virginia.edu; aherbert@microsoft.com; gcf@indiana.edu; mark.linesch@hp.com; Frank Siebenlist; Tony Hey; Dave Berry
Subject: Re: [ogsa-wg] RE: Modeling State: Technical Questions
Steve's note raises a key point for me: do we really want to force the user (as Savas seems to be advocating) to keep track of jobs running at a remote site?
I'd rather send a request "kill all my jobs" or "kill all my jobs that have run for more than a day" to the factory than carefully keep track of all jobs that I have active, and how long they have been running, so that I can send the big document (or stream) discussed below.
Ian.
At 02:10 PM 4/5/2005 +0100, Steve Loughran wrote:
Savas Parastatidis wrote:
Dear all,
I think something needs to be clarified with regards to handling multiple jobs with one message. The beauty of document-oriented interactions is that you can do things like...
<job-details-request>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-001</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-010</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-029</job-id>
</job-details-request>
Or
<job-suspend-request>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-008</job-id>
</job-suspend-request>
The schema for the above document can allow anything from 0 to N number of <job-id> elements.
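On the receiving side, a document like the one above can be read with the standard JAXP DOM parser. The sketch below is a minimal illustration: JobRequestParser is a hypothetical helper name, and the example does no schema validation, which a real service would probably want.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;

// Hypothetical helper: extracts the job ids from a request document.
final class JobRequestParser {
    // Returns the text of every <job-id> element in document order.
    // The list is empty when the request carries no <job-id> elements.
    static List<String> jobIds(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList nodes = doc.getElementsByTagName("job-id");
            List<String> ids = new ArrayList<>();
            for (int i = 0; i < nodes.getLength(); i++) {
                ids.add(nodes.item(i).getTextContent().trim());
            }
            return ids;
        } catch (Exception e) {
            throw new IllegalArgumentException("not a well-formed request document", e);
        }
    }

    public static void main(String[] args) {
        String request = "<job-suspend-request>"
                + "<job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>"
                + "<job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>"
                + "</job-suspend-request>";
        System.out.println(jobIds(request));
    }
}
```

The same code handles zero, one, or many ids, which is the point of letting the schema allow 0 to N elements.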
The trouble with any bulk operation is that you have to handle partial failure. You need either atomic operations (not long-lived transactions over HTTP, Savas; I wouldn't be that daft) or a way of indicating that only a bit went wrong.
Hence the 207 Multi-Status response in WebDAV, the "something failed, look in the message" response. WebDAV is still single instance (here a RESTy URL), but you can set more than one property and so have partial failure.
SOAP just has SOAPFault and extensions; no explicit multiple failure response. WS-RF-ResourceProperties has a similar problem with SetResourceProperties, but a different failure model in which any failure to set can result in a WS-BaseFault, indicating which failed, but providing no apparent information on which worked.
It seems to me that if you want to do bulk stuff, you do need ways of (a) handling partial failure and (b) declaring what happens on partial failure. For the curious, WebDAV's failure mode on file operations (MOVE, COPY) is explicitly declared to be that of failed file operations of Win98 on a FAT32 filesystem [1,2].
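One way to meet both requirements is to declare the policy as "best effort, continue past failures" and return one outcome per job, in the spirit of a multi-status response. The sketch below is an assumption-laden illustration: BulkJobService, the Outcome values, and the in-memory sets are invented for the example and are not taken from WebDAV, SOAP, or WS-RF.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical in-memory job service; names and states are illustrative assumptions.
final class BulkJobService {
    enum Outcome { SUSPENDED, NOT_FOUND, ALREADY_FINISHED }

    private final Set<String> running;
    private final Set<String> finished;

    BulkJobService(Set<String> running, Set<String> finished) {
        this.running = new HashSet<>(running);
        this.finished = new HashSet<>(finished);
    }

    // Declared policy: best effort. Every id gets its own outcome, in request
    // order, so the caller can see both what failed and what worked.
    Map<String, Outcome> suspendAll(List<String> jobIds) {
        Map<String, Outcome> results = new LinkedHashMap<>();
        for (String id : jobIds) {
            if (running.remove(id)) {
                results.put(id, Outcome.SUSPENDED);
            } else if (finished.contains(id)) {
                results.put(id, Outcome.ALREADY_FINISHED);
            } else {
                results.put(id, Outcome.NOT_FOUND);
            }
        }
        return results;
    }

    // True when some jobs were suspended and some were not.
    static boolean isPartialFailure(Map<String, Outcome> results) {
        boolean anySucceeded = results.containsValue(Outcome.SUSPENDED);
        boolean anyFailed = results.values().stream().anyMatch(o -> o != Outcome.SUSPENDED);
        return anySucceeded && anyFailed;
    }

    public static void main(String[] args) {
        BulkJobService service = new BulkJobService(Set.of("job-1", "job-2"), Set.of("job-3"));
        Map<String, Outcome> results = service.suspendAll(List.of("job-1", "job-3", "job-9"));
        System.out.println(results + " partial=" + isPartialFailure(results));
    }
}
```

The per-job map is what a single fault cannot express: it reports which operations worked as well as which failed.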
Alternatively, you don't go for bulk operations, neither on multiple jobs nor on multiple properties of a job (remember, WS-RF doesn't declare atomic/transacted property operations, so all you do here is increase the window of instability, a window that already exists). Instead you just stream a series of operations over the same HTTP1.1 connection - assuming that everything is accessible at the same far-end host - and get a series of (potentially out of order, we are talking HTTP1.1) responses.
This could be efficient, and you could do better handling of failure. But you do need a SOAP stack that can keep an HTTP1.1 channel open for multiple requests. Axis doesn't, even if you get httpclient to do the HTTP work; I don't know about .NET/WSE. You also need developers to model the communication correctly. Manipulating JAXRPC proxies as if they represent remote objects is *clearly* the wrong way to do it. You'd almost want to model a queue of requests waiting to be POSTed, a queue you can fill up then push out. Something like this, in your Java-era language of choice :-
    // different queues for SOAP, REST
    Queue q = new Soap12RequestQueue();

    q.add(new StatePut(job1.uri, Job.LIFECYCLE, Job.SUSPENDED));
    // let the queue reorder stuff if it wants to
    q.add(new StatePut(job2.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_OPTIMAL);
    q.add(new StatePut(job3.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_LAST);

    q.setEventHandler(this);
    q.nonBlockingSubmit();
No, there is no code behind this example, and I am avoiding any hints as to what the event handler would look like. I think the key point is that once you embrace remote operations as async actions, then you can model the manipulations differently. Note also that I am representing job suspension not as an explicit suspend() operation, but as a request to put a job into the suspended state. This API could work with our friend REST just as easily as with WS-RF...
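For readers who want something that compiles, here is one possible reading of the queue idea. It is a sketch under stated assumptions, not the design the message has in mind: RequestQueue, QueueEventHandler, and the fields of StatePut are hypothetical, the transport is a stand-in for a real SOAP or REST POST, and reordering hints such as POSITION_OPTIMAL are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.Consumer;

// Hypothetical names throughout; the transport stands in for a real SOAP or REST POST.
final class RequestQueue {
    private final List<StatePut> pending = new ArrayList<>();
    private final Consumer<StatePut> transport;
    private QueueEventHandler handler;

    RequestQueue(Consumer<StatePut> transport) {
        this.transport = transport;
    }

    void add(StatePut request) {
        pending.add(request);
    }

    void setEventHandler(QueueEventHandler handler) {
        this.handler = handler;
    }

    // Drains the queue on a background thread. Each request is reported on its
    // own, so a failure for one job does not hide the outcome of the others.
    CompletableFuture<Void> nonBlockingSubmit() {
        List<StatePut> batch = new ArrayList<>(pending);
        pending.clear();
        QueueEventHandler h = handler;
        return CompletableFuture.runAsync(() -> {
            for (StatePut request : batch) {
                try {
                    transport.accept(request);
                } catch (RuntimeException e) {
                    if (h != null) h.onFailed(request, e);
                    continue;
                }
                if (h != null) h.onCompleted(request);
            }
        });
    }

    public static void main(String[] args) {
        RequestQueue q = new RequestQueue(r -> System.out.println("POST " + r.jobUri));
        q.setEventHandler(new QueueEventHandler() {
            public void onCompleted(StatePut r) { System.out.println("ok " + r.jobUri); }
            public void onFailed(StatePut r, Throwable cause) { System.out.println("failed " + r.jobUri); }
        });
        q.add(new StatePut("urn:job:1", "lifecycle", "suspended"));
        q.add(new StatePut("urn:job:2", "lifecycle", "suspended"));
        q.nonBlockingSubmit().join(); // wait only so the demo prints before exiting
    }
}

interface QueueEventHandler {
    void onCompleted(StatePut request);
    void onFailed(StatePut request, Throwable cause);
}

// A request to put one property of one job into a given state.
final class StatePut {
    final String jobUri;
    final String property;
    final String value;

    StatePut(String jobUri, String property, String value) {
        this.jobUri = jobUri;
        this.property = property;
        this.value = value;
    }
}
```

The caller fills the queue, submits without blocking, and learns about each job separately through the handler, which is where per-request failure handling would live.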
Anyway Savas, to conclude: do you have any evidence that a single document is suboptimal compared to a sequence of requests over an open HTTP/1.1 connection? That is, assuming we ignore the SHOULD in the HTTP1.1 specification: "Clients SHOULD NOT pipeline requests using non-idempotent methods or non-idempotent sequences of methods" [3].
-Steve
[1] WebDav http://www.ietf.org/rfc/rfc2518.txt S8.9.2
"after encountering an error moving a non-collection resource as part of an infinite depth move, the server SHOULD try to finish as much of the original move operation as possible."
[2] http://lists.w3.org/Archives/Public/w3c-dist-auth/1997JulSep/0177.html
[3] RFC2616 HTTP1.1
_______________________________________________________________
Ian Foster                    www.mcs.anl.gov/~foster
Math & Computer Science Div.  Dept of Computer Science
Argonne National Laboratory   The University of Chicago
Argonne, IL 60439, U.S.A.     Chicago, IL 60637, U.S.A.
Tel: 630 252 4619             Fax: 630 252 1997
Globus Alliance, www.globus.org

Mark McKeown wrote:
Perhaps there are other reasons for using a single message to interact with multiple jobs?
Surely it depends on the complexity of the jobs? I can well imagine clients wanting to interact with all atomic jobs within a workflow and not knowing the details of what those atoms are. To make this a more concrete assertion, saying "suspend every job related to this workflow" is clearly of use to higher-level services, and I would not expect the owner of a complex workflow (e.g. one that does parameter-space exploration) to know the ids of everything they've kicked off. And since such things might involve hundreds of thousands of atomic work parcels, interacting with each one individually would not be a reasonable expansion in the amount of work to be performed, even with low-level protocol hacks.
Simplicity and scalability have to be balanced against each other, alas.
Donal.

But surely, if your workflow execution service was doing a good job of encapsulation, you would just be asking it to "suspend execution of every part of this workflow" - a single message to a single entity. The workflow execution service will know where all the work packages are (I hope).
Jon.
On Apr 6, 2005, at 9:07 AM, Donal K. Fellows wrote:
Mark McKeown wrote:
Perhaps there are other reasons for using a single message to interact with multiple jobs?
Surely it depends on the complexity of the jobs? I can well imagine clients wanting to interact with all atomic jobs within a workflow and not knowing the details of what those atoms are. To make this a more concrete assertion, saying "suspend every job related to this workflow" is clearly of use to higher-level services, and I would not expect the owner of a complex workflow (e.g. one that does parameter-space exploration) to know the ids of everything they've kicked off. And since such things might involve hundreds of thousands of atomic work parcels, interacting with each one individually would not be a reasonable expansion in the amount of work to be performed, even with low-level protocol hacks.
Simplicity and scalability have to be balanced against each other, alas.
Donal.

For what it's worth, the Globus user community has been running thousands of instances of our GRAM job submission service for quite a few years, with many many millions of jobs running through them, and as far as I am aware, no-one has ever asked for the ability to manage more than one job at a time. Certainly the lack of this facility hasn't seemed to stop anyone.
Lots of caveats can be applied here: maybe people did ask, and I didn't hear; maybe they didn't think to ask; maybe our workloads are special (although there is a great variety). But it is a data point.
Ian.
At 11:59 AM 4/6/2005 +0100, Mark McKeown wrote:
Hi Paul, Moving the question from "can I suspend multiple jobs by sending a single message to a resource (either REST or WS-Resource)" to whether this is a good thing.
There is a balance between simplicity and efficiency - using a single message introduces more complexity, as Steve Loughran illustrated, but is potentially more efficient than sending multiple messages.
Remembering that "premature optimisation is the root of all evil" (Knuth) - is adding support for suspending multiple jobs using a single message an example of premature optimisation?
I would imagine that this should be a straightforward question since there is already considerable experience in using computational grids. Are users demanding the ability to suspend multiple jobs using a single message? Is it for improved efficiency reasons? From my experience no, but others on this list will have considerably more experience.
Could this be a case of "worse is better", where simplicity is more important than efficiency?
Perhaps there are other reasons for using a single message to interact with multiple jobs?
cheers Mark

And people have submitted hundreds of thousands of jobs at once in LSF queues, and been delighted by the fact that "bkill 0" means kill all of them. :-)
Stepping back from the "is this a good thing" argument a bit. In order to support _basic_ execution services, I think we should focus on the fundamental operations required to meet most use cases (which I believe is control of one job at a time). As we get some implementation experience, I believe we'll see the need for additional interfaces which can provide operations on groups of jobs. This might be something like one call which gives me a handle to a group of jobs (perhaps generated from a list of resource IDs, or from some kind of query) and then the "simple" operation can be used to operate on this job group.
-- Chris
On 6/4/05 10:15, "Ian Foster" <foster@mcs.anl.gov> wrote:
For what it's worth, the Globus user community has been running thousands of instances of our GRAM job submission service for quite a few years, with many many millions of jobs running through them, and as far as I am aware, no-one has ever asked for the ability to manage more than one job at a time. Certainly the lack of this facility hasn't seemed to stop anyone.
Lots of caveats can be applied here: maybe people did ask, and I didn't hear; maybe they didn't think to ask; maybe our workloads are special (although there is a great variety). But it is a data point.
Ian.
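Chris's suggestion above, one call that returns a handle to a group of jobs so that the existing "simple" operation can act on the group, could be sketched as follows. Everything here is hypothetical: ExecutionService, the urn:group: handle format, and the state names are assumptions made for the illustration and are not drawn from LSF, GRAM, or any OGSA interface.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical service: the single-job operation also accepts a group handle.
final class ExecutionService {
    private final Map<String, String> jobStates = new HashMap<>();       // job id -> state
    private final Map<String, List<String>> groups = new HashMap<>();    // handle -> member ids
    private int nextGroup = 1;

    void submit(String jobId) {
        jobStates.put(jobId, "RUNNING");
    }

    // The one additional call: turn a list of job ids into a group handle.
    String createGroup(List<String> jobIds) {
        String handle = "urn:group:" + nextGroup++;
        groups.put(handle, new ArrayList<>(jobIds));
        return handle;
    }

    // The "simple" operation. The target is either a job id or a group handle.
    // Returns how many jobs actually moved from RUNNING to SUSPENDED.
    int suspend(String target) {
        List<String> members = groups.getOrDefault(target, List.of(target));
        int changed = 0;
        for (String id : members) {
            if ("RUNNING".equals(jobStates.get(id))) {
                jobStates.put(id, "SUSPENDED");
                changed++;
            }
        }
        return changed;
    }

    String stateOf(String jobId) {
        return jobStates.get(jobId);
    }

    public static void main(String[] args) {
        ExecutionService service = new ExecutionService();
        service.submit("job-1");
        service.submit("job-2");
        service.submit("job-3");
        String group = service.createGroup(List.of("job-1", "job-2"));
        System.out.println(service.suspend(group));   // operates on the whole group
        System.out.println(service.suspend("job-3")); // same call, single job
    }
}
```

The basic interface stays one-job-at-a-time; grouping is layered on top and could equally be built from a query instead of an explicit list.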
At 11:59 AM 4/6/2005 +0100, Mark McKeown wrote:
Hi Paul, Moving the question from can I suspend multiple jobs by sending a single message to a resource (either REST or WS-Resource) to weither this is a good thing.
There is a balance between simplicity and efficiency - using a single message intoduces more complexities, as Steve Loughran illustrated, but is potentially more efficient than sending mutliple messages.
Remembering that "Early optimisation is the root of all evil" (Knuth) - is adding support for suspending mutiple jobs using a single message an example of early optimisation?
I would imagine that this should be a straight forward question since there is already considerable experience in using computational grids. Are users demanding the ability to suspend mutliple jobs using a single message? Is it for improved efficiency reasons? From my experience no, but others on this list will have considerably more experience.
Could this be a case of "worse is better", simplicity is more important than efficiency?
Perhaps there are other reasons for using a single message to interact with multiple jobs?
cheers Mark
Ian,
I agree that this is good progress. So let's bank that and see if we can we can agree on one more thing, and then I'll ask a question.
Considering your list of abilities (a, b & c) below, do we agree that in terms of expressiveness, the ordering is:
c>b>a
i.e. using approach c, a client can request operations on:
a) single jobs: "where (jobid = urn:guid:364)"
b) sets of jobs: "where (jobid = urn:guid:364) or (jobid = urn:guid:401)"
If there is agreement on this, then we could move on to discussing why it is felt necessary to provide more than just c for the job submission service.
Regards
Paul
Ian wrote...
Savas:
It seems that we are in agreement, then, that we want the ability to:
a) Request operations on individual jobs identified by some sort of "jobid"
b) Request operations on sets of jobs identified by a user-supplied list of "jobids"
c) Request operations on sets of jobs identified by more abstract criteria
We also agree that (as I expressed in the email that started this discussion) such >requests can be expressed in a few different ways, with somewhat different >characteristics.
That's progress I hope.
Ian.
________________________________
From: Ian Foster [mailto:foster@mcs.anl.gov] Sent: 05 April 2005 17:59 To: Savas Parastatidis; Steve Loughran Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg; dave.pearson@oracle.com; gray@microsoft.com; humphrey@cs.virginia.edu; grimshaw@virginia.edu; aherbert@microsoft.com; gcf@indiana.edu; mark.linesch@hp.com; Frank Siebenlist; Tony Hey; Dave Berry; Paul Watson Subject: RE: [ogsa-wg] RE: Modeling State: Technical Questions
[I'm feeling increasingly bad about sending email to all of the people CCed here, who may not be interested in these issues at all but got addressed by Tony long ago...]
Savas:
It seems that we are in agreement, then, that we want the ability to:
a) Request operations on individual jobs identified by some sort of "jobid"
b) Request operations on sets of jobs identified by a user-supplied list of "jobids"
c) Request operations on sets of jobs identified by more abstract criteria
We also agree that (as I expressed in the email that started this discussion) such requests can be expressed in a few different ways, with somewhat different characteristics.
That's progress I hope.
Ian.
At 02:44 PM 4/5/2005 +0100, Savas Parastatidis wrote:
Dear Ian,
I dont think that the approach I proposed forces the user to do more than they would have to do anyway if EPRs were used. It is still the case that someone has to manage the EPRs to the resources in WSRF. This is similar to what happens in the real world. The online bookstore will ask for my credit card number (a URI), or the book store will as for an ISBN (another URI) or multiple ISBNs if I want to buy multiple books. The banking service will ask for my bank account number (another URI perhaps).
Also, there is no reason why a kill all my jobsmessage couldnt also be supported. But please note that this message is now addressed to the service (the container of resources) and not, as in the case of WSRF, to a specific resource. This is no different from what I am advocating.
Also& to Steves point about partial failure. If one wishes atomic transaction semantics, I dont see the difference from the two approaches&
Atomic
Msg -> resource 1
Msg -> resource 2
Msg -> resource 3
End Atomic
Vs
Msg
Atomic
Resource 1
Resource 2
Resource 3
End Atomic
In fact, I would argue that the latter is better because:
1. It uses fewer messages (and, Steve, I am not assuming only HTTP and the optimisations that may be supported)
2. I can more easily deal with the failures in an application specific-manner since my atomic TX semantics do not span multiple msgs.
(Anyway& who wants to do atomic TXs over the Web anyway? :-)
Regards,
-- Savas Parastatidis http://savas.parastatidis.name <http://savas.parastatidis.name/>
From: Ian Foster [mailto:foster@mcs.anl.gov] Sent: Tuesday, April 05, 2005 2:22 PM To: Steve Loughran; Savas Parastatidis Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg; dave.pearson@oracle.com; gray@microsoft.com; humphrey@cs.virginia.edu; grimshaw@virginia.edu; aherbert@microsoft.com; gcf@indiana.edu; mark.linesch@hp.com; Frank Siebenlist; Tony Hey; Dave Berry Subject: Re: [ogsa-wg] RE: Modeling State: Technical Questions
Steve's note raises a key point for me: do we really want to force the user (as Savas seems to be advocating) to keep track of jobs running at a remote site?
I'd rather send a request "kill all my jobs" or "kill all my jobs that have run for more than a day" to the factory than carefully keep track of all jobs that I have active, and how long they have been running, so that I can send the big document (or stream) discussed below.
Ian.
At 02:10 PM 4/5/2005 +0100, Steve Loughran wrote:
Savas Parastatidis wrote:
Dear all, I think something needs to be clarified with regards to handling multiple jobs with one message. The beauty of document-oriented interactions is that you can do things like...
<job-details-request>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-001</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-010</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-029</job-id>
</job-details-request>
Or
<job-suspend-request>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-002</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-005</job-id>
  <job-id>urn:ogsa:job:guid:bla-bla-bla-008</job-id>
</job-suspend-request>
The schema for the above document can allow anything from 0 to N number of <job-id> elements.
the trouble with any bulk operation is you have to handle partial failure. You need either atomic operations (not long lived transactions over HTTP Savas, I wouldn't be that daft), or a way of indicating that only a bit went wrong
Hence the 207 Multi-Status response in WebDav, the "something failed, look in the message". WebDav is still single instance (here a RESTy URL), but you can set >1 property and so have partial failure.
SOAP just has SOAPFault and extensions; no explicit multiple failure response. WS-RF-ResourceProperties has a similar problem with SetResourceProperties, but a different failure model in which any failure to set can result in a WS-BaseFault, indicating which failed, but providing no apparent information on which worked.
It seems to me that if you want to bulk stuff, you do need ways of (a) handling partial failure and (b) declaring what happens on partial failure. For the curious, WebDav's failure mode on file operations (MOVE, COPY) is explicitly declared to be that of failed file operations of Win98 on a FAT32 filesystem [1,2]
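One way to picture point (a) is a response that carries a separate outcome for every job id in the request, in the spirit of a multi-status reply. The sketch below is only an illustration under assumptions: the class, the Outcome enum and both methods are hypothetical names, and the "service" is just a set of known job ids.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical per-item result for a bulk suspend; not taken from any spec.
class MultiStatusSketch {
    enum Outcome { SUSPENDED, NOT_FOUND }

    // Tries each requested job independently and records what happened to it.
    static Map<String, Outcome> suspend(Set<String> knownJobs, List<String> requested) {
        Map<String, Outcome> result = new LinkedHashMap<>();
        for (String id : requested) {
            result.put(id, knownJobs.contains(id) ? Outcome.SUSPENDED : Outcome.NOT_FOUND);
        }
        return result;
    }

    // True only when every requested job was suspended.
    static boolean allSucceeded(Map<String, Outcome> result) {
        for (Outcome outcome : result.values()) {
            if (outcome != Outcome.SUSPENDED) {
                return false;
            }
        }
        return true;
    }

    public static void main(String[] args) {
        Set<String> known = new HashSet<>(Arrays.asList("urn:job:2", "urn:job:5"));
        Map<String, Outcome> result =
                suspend(known, Arrays.asList("urn:job:2", "urn:job:8", "urn:job:5"));
        // The caller learns which ids worked and which did not.
        System.out.println(result + " allSucceeded=" + allSucceeded(result));
    }
}
```

The policy for point (b) here is "carry on past failures and report each one"; a different service could just as well stop at the first failure, which is exactly why the behaviour has to be declared.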
Alternatively, you don't go for bulk operations, neither on multiple jobs nor on multiple properties of a job (remember, WS-RF doesn't declare atomic/transacted property operations, so all you do here is increase the window of instability, a window that already exists). Instead you just stream a series of operations over the same HTTP1.1 connection, assuming that everything is accessible at the same far-end host, and get a series of (potentially out of order, we are talking HTTP1.1) responses.
This could be efficient, and you could do better handling of failure. But you do need a SOAP stack that can keep an HTTP1.1 channel open for multiple requests. Axis doesn't, even if you get httpclient to do the HTTP work; I don't know about .NET/WSE. You also need developers to model the communication correctly. Manipulating JAXRPC proxies as if they represent remote objects is *clearly* the wrong way to do it. You'd almost want to model a queue of requests waiting to be POSTed, a queue you can fill up then push out. Something like this, in your Java-era language of choice :-
// different queues for SOAP, REST
Queue q = new Soap12RequestQueue();

q.add(new StatePut(job1.uri, Job.LIFECYCLE, Job.SUSPENDED));
// let the queue reorder stuff if it wants to
q.add(new StatePut(job2.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_OPTIMAL);
q.add(new StatePut(job3.uri, Job.LIFECYCLE, Job.SUSPENDED), Queue.POSITION_LAST);

q.setEventHandler(this);
q.nonBlockingSubmit();
No, there is no code behind this example, and I am avoiding any hints as to what the event handler would look like. I think the key point is that once you embrace remote operations as async actions, then you can model the manipulations differently. Note also that I am representing job suspension not as an explicit suspend() operation, but as a request to put a job into the suspended state. This API could work with our friend REST just as easily as with WS-RF...
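For readers who want something that compiles, the queue idea can be approximated as below. This is a sketch under stated assumptions, not the API from the message above: RequestQueueSketch, its StatePut class and the transport function are all hypothetical, the transport stands in for whatever SOAP or REST I/O would really be used, and the reordering hints are left out.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.function.BiConsumer;
import java.util.function.Function;

// Hypothetical request queue; the transport is a stand-in for real network I/O.
class RequestQueueSketch {
    // A request to put one property of one resource into a target state.
    static final class StatePut {
        final String uri;
        final String property;
        final String value;

        StatePut(String uri, String property, String value) {
            this.uri = uri;
            this.property = property;
            this.value = value;
        }
    }

    private final List<StatePut> pending = new ArrayList<>();
    private final Function<StatePut, Boolean> transport;

    RequestQueueSketch(Function<StatePut, Boolean> transport) {
        this.transport = transport;
    }

    void add(StatePut request) { pending.add(request); }

    // Sends the queued requests in the background; the handler receives
    // each request together with its own success flag.
    CompletableFuture<Void> nonBlockingSubmit(BiConsumer<StatePut, Boolean> handler) {
        List<StatePut> batch = new ArrayList<>(pending);
        pending.clear();
        return CompletableFuture.runAsync(() -> {
            for (StatePut request : batch) {
                handler.accept(request, transport.apply(request));
            }
        });
    }

    public static void main(String[] args) {
        // Pretend transport: every request succeeds except the one for job2.
        RequestQueueSketch q = new RequestQueueSketch(r -> !r.uri.endsWith("job2"));
        q.add(new StatePut("http://example.org/job1", "lifecycle", "suspended"));
        q.add(new StatePut("http://example.org/job2", "lifecycle", "suspended"));
        q.nonBlockingSubmit((r, ok) -> System.out.println(r.uri + " ok=" + ok)).join();
    }
}
```

Each request keeps its own outcome, so a failure on one job does not hide the result for the others; the returned future is only there so a caller can wait for the batch if it wishes.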
Anyway Savas, to conclude: do you have any evidence that a single document is suboptimal compared to a sequence of requests over an open HTTP/1.1 connection? That is, assuming we ignore the SHOULD in the HTTP1.1 specification: "Clients SHOULD NOT pipeline requests using non-idempotent methods or non-idempotent sequences of methods" [3]
-Steve
[1] WebDav http://www.ietf.org/rfc/rfc2518.txt S8.9.2
"after encountering an error moving a non-collection resource as part of an infinite depth move, the server SHOULD try to finish as much of the original move operation as possible."
[2] http://lists.w3.org/Archives/Public/w3c-dist-auth/1997JulSep/0177.html
[3] RFC2616 HTTP1.1
_______________________________________________________________ Ian Foster www.mcs.anl.gov/~foster <http://www.mcs.anl.gov/~foster> Math & Computer Science Div. Dept of Computer Science Argonne National Laboratory The University of Chicago Argonne, IL 60439, U.S.A. Chicago, IL 60637, U.S.A. Tel: 630 252 4619 Fax: 630 252 1997 Globus Alliance, www.globus.org <http://www.globus.org/> <http://www.globus.org/>

Chris, On 6 Apr 2005, at 23:20, Christopher Smith wrote:
And people have submitted hundreds of thousands of jobs at once in LSF queues, and been delighted by the fact that ‘bkill 0’ means kill all of them. :-)
And I'm sure a few were seriously distressed, having bkilled a week's worth of work :-). Otherwise +1
Stepping back from the "is this a good thing" argument a bit.
In order to support _basic_ execution services, I think we should focus on the fundamental operations required to meet most use cases (which I believe is control of one job at a time).
As we get some implementation experience, I believe we'll see the need for additional interfaces which can provide operations on groups of jobs. This might be something like one call which gives me a handle to a group of jobs (perhaps generated from a list of resource IDs, or from some kind of query) and then the "simple" operation can be used to operate on this job group.
-- Chris
On 6/4/05 10:15, "Ian Foster" <foster@mcs.anl.gov> wrote:
For what it's worth, the Globus user community has been running thousands of instances of our GRAM job submission service for quite a few years, with many many millions of jobs running through them, and as far as I am aware, no-one has ever asked for the ability to manage more than one job at a time. Certainly the lack of this facility hasn't seemed to stop anyone.
Lots of caveats can be applied here: maybe people did ask, and I didn't hear; maybe they didn't think to ask; maybe our workloads are special (although there is a great variety). But it is a data point.
Ian.
At 11:59 AM 4/6/2005 +0100, Mark McKeown wrote:
Hi Paul, Moving the question from "can I suspend multiple jobs by sending a single message to a resource (either REST or WS-Resource)" to whether this is a good thing.
There is a balance between simplicity and efficiency - using a single message introduces more complexities, as Steve Loughran illustrated, but is potentially more efficient than sending multiple messages.
Remembering that "Early optimisation is the root of all evil" (Knuth) - is adding support for suspending multiple jobs using a single message an example of early optimisation?
I would imagine that this should be a straightforward question since there is already considerable experience in using computational grids. Are users demanding the ability to suspend multiple jobs using a single message? Is it for improved efficiency reasons? From my experience no, but others on this list will have considerably more experience.
Could this be a case of "worse is better", simplicity is more important than efficiency?
Perhaps there are other reasons for using a single message to interact with multiple jobs?
cheers Mark
Ian,
I agree that this is good progress. So let's bank that and see if we can agree on one more thing, and then I'll ask a question.
Considering your list of abilities (a, b & c) below, do we agree that in terms of expressiveness, the ordering is:
c>b>a
i.e. using approach c, a client can request operations on:
a) single jobs: "where (jobid = urn:guid:364)"
b) sets of jobs: "where (jobid = urn:guid:364) or (jobid = urn:guid:401)"
If there is agreement on this, then we could move on to discussing why it is felt necessary to provide more than just c for the job submission service.
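The ordering c>b>a can be illustrated with a small sketch in which the general, criteria-based selection is the only primitive, and selection by one id or by a list of ids is expressed as a particular criterion. Everything here is an assumption made for illustration: the Job class, its hoursRunning field and the helper names do not come from any job submission interface.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.Predicate;

// Hypothetical job model and selection helpers.
class JobSelectionSketch {
    static final class Job {
        final String id;
        final long hoursRunning;

        Job(String id, long hoursRunning) {
            this.id = id;
            this.hoursRunning = hoursRunning;
        }
    }

    // (c) the general form: select jobs by an arbitrary criterion.
    static List<String> select(List<Job> jobs, Predicate<Job> where) {
        List<String> ids = new ArrayList<>();
        for (Job job : jobs) {
            if (where.test(job)) {
                ids.add(job.id);
            }
        }
        return ids;
    }

    // (a) a single job is the criterion "jobid = X".
    static Predicate<Job> byId(String id) {
        return job -> job.id.equals(id);
    }

    // (b) a user-supplied list is the criterion "jobid = X or jobid = Y ...".
    static Predicate<Job> byIds(String... ids) {
        Set<String> wanted = new HashSet<>(Arrays.asList(ids));
        return job -> wanted.contains(job.id);
    }

    public static void main(String[] args) {
        List<Job> jobs = Arrays.asList(
                new Job("urn:guid:364", 2), new Job("urn:guid:401", 30), new Job("urn:guid:512", 50));
        System.out.println(select(jobs, byId("urn:guid:364")));
        System.out.println(select(jobs, byIds("urn:guid:364", "urn:guid:401")));
        // A more abstract criterion: jobs that have run for more than a day.
        System.out.println(select(jobs, job -> job.hoursRunning > 24));
    }
}
```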
Regards
Paul
Ian wrote...
Savas:
It seems that we are in agreement, then, that we want the ability to:
a) Request operations on individual jobs identified by some sort of "jobid"
b) Request operations on sets of jobs identified by a user-supplied list of "jobids"
c) Request operations on sets of jobs identified by more abstract criteria
We also agree that (as I expressed in the email that started this discussion) such requests can be expressed in a few different ways, with somewhat different characteristics.
That's progress I hope.
Ian.
________________________________
From: Ian Foster [mailto:foster@mcs.anl.gov] Sent: 05 April 2005 17:59 To: Savas Parastatidis; Steve Loughran Cc: Mark McKeown; Karl Czajkowski; Dennis Gannon; Samuel Meder; ogsa-wg; dave.pearson@oracle.com; gray@microsoft.com; humphrey@cs.virginia.edu; grimshaw@virginia.edu; aherbert@microsoft.com; gcf@indiana.edu; mark.linesch@hp.com; Frank Siebenlist; Tony Hey; Dave Berry; Paul Watson Subject: RE: [ogsa-wg] RE: Modeling State: Technical Questions
[I'm feeling increasingly bad about sending email to all of the people CCed here, who may not be interested in these issues at all but got addressed by Tony long ago...]
Savas:
It seems that we are in agreement, then, that we want the ability to:
a) Request operations on individual jobs identified by some sort of "jobid"
b) Request operations on sets of jobs identified by a user-supplied list of "jobids"
c) Request operations on sets of jobs identified by more abstract criteria
We also agree that (as I expressed in the email that started this discussion) such requests can be expressed in a few different ways, with somewhat different characteristics.
That's progress I hope.
Ian.
-- Take care: Dr. David Snelling < David . Snelling . UK . Fujitsu . com > Fujitsu Laboratories of Europe Hayes Park Central Hayes End Road Hayes, Middlesex UB4 8FE +44-208-606-4649 (Office) +44-208-606-4539 (Fax) +44-7768-807526 (Mobile)

Hi,
In order to support _basic_ execution services, I think we should focus on the fundamental operations required to meet most use cases (which I believe is control of one job at a time).
As we get some implementation experience, I believe we'll see the need for additional interfaces which can provide operations on groups of jobs. This might be something like one call which gives me a handle to a group of jobs (perhaps generated from a list of resource IDs, or from some kind of query) and then the "simple" operation can be used to operate on this job group.
... which is pretty much what I said in my previous e-mail, just in concrete terms and with priorities in place. Needless to say, I totally agree.
As another example of a "simple" operation: how many of the zillion jobs in my array job executed successfully, had errors, are executing, and are still in the queue?
(Is it true to say that Globus is more HPC oriented (say, fewer bigger jobs), and Platform stuff is more array job oriented (say, submit half-a-million jobs at once), and this explains the difference in opinion between Ian and Chris? :-)
By the way: In the _basic_ execution services, are we modeling one single job, or a job queue also? Or that's not decided yet? (From the question you already guessed I'm thinking about resource models, right? :-).
Regards,
Fred Maciel.
participants (7)
- Christopher Smith
- David Snelling
- Donal K. Fellows
- Fred Maciel
- Ian Foster
- Jon MacLaren
- Mark McKeown