effective use of resource lifetimes in grid infrastructure

Hello,

I am trying to design the CDDLM deployment API in the WS-RF style, and am having problems understanding the best way to make effective use of resource lifetimes in a fault-tolerant environment. I would like some assistance.

Scenario
========

- We are supporting automated deployment of applications to allocated resources.
- Applications are deployed through a deployment portal; the applications may be deployed on hosts that are not running the services necessary for direct communication, or that are behind a firewall.
- The deployment portal is in a DMZ or otherwise reachable by all callers. All EPRs must therefore refer to the portal.

I had originally envisaged that to destroy an application, a <wsrl:Destroy/> call could be made. Yet I have a problem. I want to be fault tolerant without relying on load-balancing routers, round-robin DNS or mDNS hostnames.

In the stateful resources pattern, I can have EPRs that contain policy on how to renew an EPR. Presumably that would list a service group location from which any other deployment portal could be found that could renew an invalid EPR for us (e.g. by extracting application identification info and constructing a new EPR referring to the local portal).

This all seems workable, ignoring the problem of specifying the policy for updating EPRs. Presumably we'd have to define an application-specific algorithm and policy.

What is causing me trouble is resource lifetimes. Once I have a world in which you can have multiple EPRs all referring to the same resource, the same deployed application, what does that mean for WS-ResourceLifetime? If I have multiple, different EPRs, is it right that a <wsrl:Destroy/> call on one of them should terminate the actual application? What if there was state on each portal related to the EPR that needed cleaning up at the end of a conversation, and so required either that destroy call, or timeout-initiated destruction?

I think the problem here is "what should I be modelling with the implicit resource pattern"? Is it:

(a) the resource in question *is* the application. When the resource is destroyed, so is the application.

(b) the resource in question is merely a view of the application. When the resource is destroyed, the view goes away, but the application remains until destroyed by some other means.

-Steve
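A minimal sketch of the EPR-renewal step described above, assuming only that an EPR carries a portal address for routing plus a stable application key as a reference property. Every name here is a hypothetical illustration, not WS-Addressing, WS-RF or any CDDLM API; the point is just that two EPRs built by different portals can denote the same deployed application:

    import java.net.URI;

    /**
     * Hypothetical sketch: an EPR reduced to the two pieces that matter here,
     * the address used for routing (a portal) and a reference property that
     * identifies the deployed application independently of any portal.
     */
    final class AppEndpointReference {
        final URI portalAddress;      // wsa:Address analogue: which portal to talk to
        final String applicationKey;  // reference-property analogue: which application

        AppEndpointReference(URI portalAddress, String applicationKey) {
            this.portalAddress = portalAddress;
            this.applicationKey = applicationKey;
        }

        /**
         * "Renew" an EPR whose portal has failed: keep the application key,
         * swap in the address of another portal discovered from the service group.
         * Both the old and the new EPR denote the same deployed application.
         */
        AppEndpointReference renewAgainst(URI survivingPortal) {
            return new AppEndpointReference(survivingPortal, applicationKey);
        }

        @Override
        public String toString() {
            return portalAddress + " -> application " + applicationKey;
        }

        public static void main(String[] args) {
            AppEndpointReference original =
                new AppEndpointReference(URI.create("https://portal-a.example.org/deploy"), "app-42");
            // portal-a fails; a service group lookup (not shown) yields portal-b
            AppEndpointReference renewed =
                original.renewAgainst(URI.create("https://portal-b.example.org/deploy"));
            System.out.println(original);
            System.out.println(renewed);
            // Two different EPRs, one application: which is exactly what makes the
            // WS-ResourceLifetime question in the message above awkward.
        }
    }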

Steve:

A variety of semantics and connections are possible between a "WS-Resource" and an "entity that the WS-Resource represents", including both your (a) and (b) below. I don't believe that the implied resource pattern requires that one particular approach be adopted.

The following are some rough notes on how we have chosen to handle things in the GT4 GRAM service. This may be relevant to your problem. The approach that we take in GT4 GRAM is as follows:

1) A GRAM ManagedJobFactory defines a "create job" operation that:
   a) creates a job, and also
   b) creates a ManagedJob WS-Resource, which represents the resource manager's view of the job.

2) The ManagedJob WS-Resource and the job are then linked as follows:
   a) Destroying the ManagedJob WS-Resource kills the job.
   b) State changes in the job are reflected in the ManagedJob WS-Resource.
   c) Termination of the job also destroys the ManagedJob WS-Resource, but not immediately: we find that you typically want to leave the ManagedJob state around for "a while" after the job terminates, to allow clients to figure out what happened to the job after the fact.

Regards -- Ian.

At 01:11 PM 12/3/2004 +0000, Steve Loughran wrote:
Hello,
I am trying to design the CDDLM deployment API in the WS-RF style, and am having problems understanding the best way to make effective use of resource lifetimes in a fault-tolerant environment. I would like some assistance.
...
I think the problem here is "what should I be modelling with the implicit resource pattern"? Is it:
(a) the resource in question *is* the application. When the resource is destroyed, so is the application.
(b) the resource in question is merely a view of the application. When the resource is destroyed, the view goes away, but the application remains until destroyed by some other means.
-Steve
_______________________________________________________________
Ian Foster                       www.mcs.anl.gov/~foster
Math & Computer Science Div.     Dept of Computer Science
Argonne National Laboratory      The University of Chicago
Argonne, IL 60439, U.S.A.        Chicago, IL 60637, U.S.A.
Tel: 630 252 4619                Fax: 630 252 1997
Globus Alliance, www.globus.org
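A minimal sketch of the lifetime linkage described in points 2(a) to 2(c) above, using hypothetical class and method names rather than the actual GT4 GRAM code: destroying the WS-Resource kills the job, while termination of the job schedules a delayed destruction of the WS-Resource so that its state stays queryable for a while.

    import java.time.Duration;
    import java.time.Instant;

    /** Hypothetical sketch of a resource manager's view of a job (not GT4 code). */
    final class ManagedJobResource {
        private final String jobId;
        private String jobState = "Active";            // mirrored from the underlying job
        private Instant terminationTime = null;        // null = no scheduled destruction
        private static final Duration GRACE = Duration.ofHours(24); // keep state "a while"

        ManagedJobResource(String jobId) { this.jobId = jobId; }

        /** 2(a): an explicit Destroy on the WS-Resource kills the underlying job. */
        void destroy() {
            killUnderlyingJob();
            releaseResourceState();
        }

        /** 2(b): state changes in the job are reflected into the resource's state. */
        void onJobStateChange(String newState) {
            jobState = newState;
            if ("Done".equals(newState) || "Failed".equals(newState)) {
                // 2(c): termination of the job schedules, rather than performs, destruction,
                // so clients can still query what happened to the job after the fact.
                terminationTime = Instant.now().plus(GRACE);
            }
        }

        /** Called by the hosting environment's lifetime sweeper as time passes. */
        void sweep(Instant now) {
            if (terminationTime != null && now.isAfter(terminationTime)) {
                releaseResourceState();
            } else {
                System.out.println("job " + jobId + " state still queryable: " + jobState);
            }
        }

        private void killUnderlyingJob()    { System.out.println("killing job " + jobId); }
        private void releaseResourceState() { System.out.println("destroying WS-Resource for " + jobId); }

        public static void main(String[] args) {
            ManagedJobResource r = new ManagedJobResource("job-7");
            r.onJobStateChange("Done");                        // schedules delayed destruction
            r.sweep(Instant.now());                            // too early: state still queryable
            r.sweep(Instant.now().plus(GRACE).plusSeconds(1)); // past the grace period: destroyed
        }
    }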

Ian Foster wrote:
...
Ian,

What is your fault tolerance strategy here? Is every ManagedJob WS-Resource hosted on the same host (and perhaps the same process) as the job itself?

This would mean that there is no way for the ManagedJob EPR to fail without the job itself failing, but it would require the entire set of job hosts to be visible for inbound SOAP messages. It would also prevent you from moving a job from one node to another without some difficulty (the classic CORBA object-moved problem, I believe, though HTTP 301/302 redirect responses would work if only SOAP stacks processed them reliably).

I am trying to do a design which would enable (though would not require) only a subset of nodes -call them portal nodes- to be visible to outside callers, with the rest of the nodes only accessible to the portal itself. Once I assume this architecture, modelling the resources gets complex, as EPRs contain routing info that may become invalid if a portal node fails.

-steve
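A minimal sketch of the portal-node architecture described above, again with hypothetical names rather than any real toolkit API: externally visible EPRs only ever name a portal, and the portal resolves a stable application key carried in the EPR to the internal host that actually runs the application.

    import java.net.URI;
    import java.util.HashMap;
    import java.util.Map;
    import java.util.Optional;

    /**
     * Hypothetical portal-node sketch (not a real toolkit API). External callers only
     * see the portal's address; internal hosts are resolved from the application key.
     */
    final class DeploymentPortal {
        private final URI portalAddress;
        // application key -> internal host that is actually running the application
        private final Map<String, URI> registry = new HashMap<>();

        DeploymentPortal(URI portalAddress) { this.portalAddress = portalAddress; }

        /** Record where an application was deployed; return the externally usable routing info. */
        String deploy(String applicationKey, URI internalHost) {
            registry.put(applicationKey, internalHost);
            // the routing info handed to callers names the portal, never the internal host
            return portalAddress + "?app=" + applicationKey;
        }

        /** Handle an inbound operation (e.g. a destroy request) by forwarding it internally. */
        void handle(String applicationKey, String operation) {
            Optional<URI> target = Optional.ofNullable(registry.get(applicationKey));
            if (target.isPresent()) {
                System.out.println("forwarding " + operation + " for " + applicationKey
                        + " to internal host " + target.get());
            } else {
                // the application key is stable, so another portal (found via a service
                // group) could rebuild this mapping and issue a fresh EPR if this portal fails
                System.out.println("unknown application " + applicationKey);
            }
        }

        public static void main(String[] args) {
            DeploymentPortal portal =
                new DeploymentPortal(URI.create("https://portal-a.example.org/deploy"));
            String epr = portal.deploy("app-42", URI.create("http://node17.internal:8080"));
            System.out.println("caller holds: " + epr);
            portal.handle("app-42", "wsrl:Destroy");
        }
    }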

Steve:

You can indeed put the ManagedJob WS-Resource on a different host. You might find this URL relevant:

http://www-unix.globus.org/toolkit/docs/development/3.9.3/execution/wsgram/W...

Regards -- Ian.

At 01:12 PM 12/6/2004 +0000, Steve Loughran wrote:
...
participants (2)
- Ian Foster
- Steve Loughran