Re: [ogsa-wg] effective use of resource lifetimes in grid infrastructure

Steve's email bounced. ----

Hiro Kishimoto

Subject: Re: [ogsa-wg] effective use of resource lifetimes in grid infrastructure
From: Steve Tuecke <tuecke@mcs.anl.gov>
To: Steve Loughran <steve_loughran@hpl.hp.com>
Cc: ogsa-wg@ggf.org
Date: Mon, 6 Dec 2004 09:52:17 -0600

Steve,

The ManagedJob WS-Resource need not be hosted on the same host as the job itself. Here is a draft document that you might find useful; it describes the GT4 WS GRAM approach in more detail:

http://www-unix.globus.org/toolkit/docs/development/3.9.3/execution/wsgram/WS_GRAM_Approach.html

-Steve

On Dec 6, 2004, at 7:12 AM, Steve Loughran wrote:
Ian Foster wrote:
Steve:

A variety of semantics and connections are possible between a "WS-Resource" and an "entity that the WS-Resource represents", including both your (a) and (b) below. I don't believe that the implied resource pattern requires that one particular approach be adopted.

The following are some rough notes on how we have chosen to handle things in the GT4 GRAM service, which may be relevant to your problem. The approach that we take in GT4 GRAM is as follows (a sketch follows the list):

1) A GRAM ManagedJobFactory defines a "create job" operation that:
   a) creates a job, and also
   b) creates a ManagedJob WS-Resource, which represents the resource manager's view of the job.

2) The ManagedJob WS-Resource and the job are then linked as follows:
   a) Destroying the ManagedJob WS-Resource kills the job.
   b) State changes in the job are reflected in the ManagedJob WS-Resource.
   c) Termination of the job also destroys the ManagedJob WS-Resource, but not immediately: we find that you typically want to leave the ManagedJob state around for "a while" after the job terminates, to allow clients to figure out what happened to the job after the fact.

Regards -- Ian.
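A minimal sketch of the lifecycle Ian describes, under stated assumptions: the names (ManagedJobFactory, ManagedJob, createJob, the in-memory resource map) are invented for illustration and do not come from the GT4 source; in GT4 the linkage is to an external scheduler, not a map.

    import java.util.Map;
    import java.util.UUID;
    import java.util.concurrent.ConcurrentHashMap;
    import java.util.concurrent.Executors;
    import java.util.concurrent.ScheduledExecutorService;
    import java.util.concurrent.TimeUnit;

    // Illustrative only: these names are not GT4 API.
    class ManagedJob {
        private volatile String state = "Pending";    // 2b) job state is mirrored here
        ManagedJob(String jobSpec) { /* submit jobSpec to the local scheduler */ }
        void setState(String s) { state = s; }        // invoked on scheduler callbacks
        String getState() { return state; }
        void kill() { /* cancel the underlying job via the scheduler */ }
    }

    class ManagedJobFactory {
        private final Map<String, ManagedJob> resources = new ConcurrentHashMap<>();
        private final ScheduledExecutorService reaper =
            Executors.newSingleThreadScheduledExecutor();

        // 1) "create job": starts the job AND creates the ManagedJob WS-Resource.
        String createJob(String jobSpec) {
            String key = UUID.randomUUID().toString();   // resource key the EPR will carry
            resources.put(key, new ManagedJob(jobSpec));
            return key;
        }

        // 2a) Destroying the WS-Resource kills the job.
        void destroy(String key) {
            ManagedJob mj = resources.remove(key);
            if (mj != null) mj.kill();
        }

        // 2c) When the job terminates on its own, leave the resource around for
        // "a while" so clients can still query what happened, then destroy it.
        void onJobTerminated(String key, long lingerMinutes) {
            reaper.schedule(() -> { resources.remove(key); }, lingerMinutes, TimeUnit.MINUTES);
        }
    }

The point of the sketch is only the pairing of job creation with resource creation, and the deferred destruction in 2c.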
Ian,
What is your fault tolerance strategy here?
Is every ManagedJob WS-Resource hosted on the same host (and perhaps the same process) as the job itself?
This would mean that there is no way for the ManagedJob EPR to fail without the job itself failing, but it would require the entire set of job hosts to be visible for inbound SOAP messages. It would also prevent you from moving a job from one node to another without some difficulty (the classic CORBA object-moved problem, I believe, though HTTP 301 "moved" redirects would work if only SOAP stacks processed them reliably).
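As Steve Tuecke's reply above notes, the ManagedJob WS-Resource need not be co-hosted with the job. A hedged sketch of that decoupling, with invented names (SchedulerClient, ManagedJobResource, and the contact URI are illustrative, not GT4 or WSRF API): the resource holds only a scheduler contact and a job id, so the service container and the compute node can fail independently.

    import java.net.URI;

    // Illustrative sketch, not GT4 source: the WS-Resource representing a job
    // holds a scheduler contact and a job id rather than a local process handle,
    // so it can live in a container on a different machine from the job.
    interface SchedulerClient {
        void cancel(String jobId);       // ask the resource manager to kill the job
        String status(String jobId);     // poll the resource manager for job state
    }

    class ManagedJobResource {
        private final String jobId;          // scheduler-assigned identifier
        private final URI schedulerContact;  // the scheduler head node, not the compute node
        private final SchedulerClient lrm;   // client bound to schedulerContact

        ManagedJobResource(String jobId, URI schedulerContact, SchedulerClient lrm) {
            this.jobId = jobId;
            this.schedulerContact = schedulerContact;
            this.lrm = lrm;
        }

        // WS-ResourceLifetime-style destroy: kills the remote job through the
        // scheduler; the resource host and the job host fail independently.
        void destroy() {
            lrm.cancel(jobId);
        }

        String currentState() {
            return lrm.status(jobId);    // job state reflected into the resource
        }
    }

With this split, only the service container needs to accept inbound SOAP; the compute nodes can stay private, which is close to the portal-node arrangement described next.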
I am trying to produce a design which would enable (though not require) only a subset of nodes, call them portal nodes, to be visible to outside callers, with the rest of the nodes accessible only to the portal itself. Once I assume this architecture, modelling the resources gets complex, as EPRs contain routing info that may become invalid if a portal node fails (a sketch of this failure mode, and one possible mitigation, follows below).
-steve
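To make the failure mode concrete, here is a minimal sketch under stated assumptions: the names (JobEndpointReference, EprResolver, resourceKey, internalNode) are invented for illustration and are not WS-Addressing or GT4 API. The EPR a client holds bakes in a portal address plus routing parameters; if that portal dies, the EPR becomes unusable even though the job is still running. One possible mitigation is for clients to retain only a stable resource key and re-resolve a fresh EPR on failure.

    import java.net.URI;
    import java.util.List;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Invented names; this mirrors the shape of a WS-Addressing EPR only loosely.
    class JobEndpointReference {
        final URI portalAddress;     // the externally visible portal (wsa:Address analogue)
        final String internalNode;   // routing info: the node actually running the job
        final String resourceKey;    // stable identity of the ManagedJob resource

        JobEndpointReference(URI portalAddress, String internalNode, String resourceKey) {
            this.portalAddress = portalAddress;
            this.internalNode = internalNode;
            this.resourceKey = resourceKey;
        }
    }

    // Clients keep only resourceKey and ask a resolver for a fresh EPR, so a
    // dead portal costs one lookup rather than losing the job reference.
    class EprResolver {
        private final Map<String, String> keyToNode = new ConcurrentHashMap<>();
        private final List<URI> livePortals;   // assumed maintained by some membership service

        EprResolver(List<URI> livePortals) {
            this.livePortals = livePortals;
        }

        void register(String resourceKey, String internalNode) {
            keyToNode.put(resourceKey, internalNode);
        }

        JobEndpointReference resolve(String resourceKey) {
            URI portal = livePortals.get(0);   // any portal that is still alive
            return new JobEndpointReference(portal, keyToNode.get(resourceKey), resourceKey);
        }
    }

The trade-off, of course, is that the resolver itself must now be reachable and replicated, which moves the availability problem rather than removing it.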