-----Original Message----- From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Daniel Templeton Sent: Wednesday, March 30, 2005 3:34 AM To: DRMAA Working Group Subject: [drmaa-wg] Questions
In working on a remote implementation of the Java binding, I have run into a couple of interesting questions. What happens when, during a call to drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), the implementation fails to perform the given action on more than one job, for different reasons? For example, if I try to hold all jobs, but one job is already in a hold state, three jobs work OK, and the DRM goes down before acting on the last job, what is the return code?
The routine return code would need to indicate a compound error (BTW, we do not have such an error code defined), and the detailed error message would need to detail what happened.
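To make the "compound error" idea concrete, here is a minimal sketch of the hold-all example above. It is a toy simulation, not the DRMAA API: `CompoundResult`, `HoldError`, `hold_job`, `hold_all` and the `drm_dies_before` parameter are all hypothetical names invented for illustration, and the spec defines no such compound return code.

```python
from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):
    RUNNING = "running"
    HELD = "held"


class HoldError(Exception):
    """Hypothetical per-job failure; not a DRMAA error code."""


@dataclass
class CompoundResult:
    """Hypothetical compound outcome of one bulk control call."""
    succeeded: list = field(default_factory=list)
    failed: dict = field(default_factory=dict)  # job id -> reason

    @property
    def ok(self):
        return not self.failed


def hold_job(jobs, job_id, drm_up):
    # Each job can fail for its own reason.
    if not drm_up:
        raise HoldError("DRM unreachable")
    if jobs[job_id] is JobState.HELD:
        raise HoldError("job already on hold")
    jobs[job_id] = JobState.HELD


def hold_all(jobs, drm_dies_before=None):
    """Try to hold every job; record per-job outcomes instead of aborting."""
    result = CompoundResult()
    drm_up = True
    for job_id in sorted(jobs):
        if job_id == drm_dies_before:
            drm_up = False  # simulate the DRM going down mid-call
        try:
            hold_job(jobs, job_id, drm_up)
            result.succeeded.append(job_id)
        except HoldError as exc:
            result.failed[job_id] = str(exc)
    return result


# One job already held, three succeed, DRM goes down before the last one.
jobs = {
    "1": JobState.HELD,
    "2": JobState.RUNNING,
    "3": JobState.RUNNING,
    "4": JobState.RUNNING,
    "5": JobState.RUNNING,
}
result = hold_all(jobs, drm_dies_before="5")
print(result.ok, result.succeeded, result.failed)
```

Note that this sketch is not transactional: the three jobs that were held stay held even though the call as a whole reports failure.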
When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the contract on failure, i.e. in what state will the jobs be left? In the case of a job failure, does that mean that all jobs will be left in the state that they were in before the call? If so, that's going to cause serious implementation problems. If not, that's going to cause serious usability problems.
Transactional interface would be quite useful here ... If a routine exits/fails during the call there is no good recourse. Job failure? Is this a separate question? One analogy would be teaching a university course. There would be students dropping the course, but the rest goes ahead. In case of absences things also go ahead, and when the students reappear the regime is known.
What happens when a job ends after a thread has called drmaa_synchronize (DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit info with a call to drmaa_wait()? I would assume that the synchronize thread should just assume that the job finished, even though its job record is gone. That is what the SGE implementation does.
Ha, races with job reaping info. The developers would need to be careful in multithreaded environments ... some guidelines would be necessary, but preferably outside of the normative docs. Hrabri
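A small sketch of the race described above, assuming the behaviour Daniel attributes to the SGE implementation: synchronize treats a job as finished once it is no longer running, whether or not its exit info is still there. The `Session` class and its methods are hypothetical stand-ins, not DRMAA calls.

```python
import threading


class Session:
    """Toy stand-in for a DRMAA session; all names here are hypothetical."""

    def __init__(self, job_ids):
        self._cond = threading.Condition()
        self._all = set(job_ids)      # every job in the session
        self._running = set(job_ids)  # jobs that have not ended yet
        self._exit_info = {}          # ended but not yet reaped

    def job_finished(self, job_id, status):
        with self._cond:
            self._running.discard(job_id)
            self._exit_info[job_id] = status
            self._cond.notify_all()

    def wait(self, job_id):
        """Block until job_id ends, then reap (remove) its exit info."""
        with self._cond:
            self._cond.wait_for(lambda: job_id in self._exit_info)
            return self._exit_info.pop(job_id)

    def synchronize_all(self):
        """Block until no session job is running.

        Only the running set is consulted, so a job whose exit info
        was already reaped by another thread still counts as finished.
        """
        with self._cond:
            self._cond.wait_for(lambda: not self._running)
            return sorted(self._all)


session = Session(["42"])
done = []
sync_thread = threading.Thread(
    target=lambda: done.append(session.synchronize_all())
)
sync_thread.start()

session.job_finished("42", status=0)
stolen = session.wait("42")  # another caller "steals" the exit info
sync_thread.join(timeout=5)
print(stolen, done)
```

In this toy model a second `wait("42")` would block forever, which is the "waiting for the same job twice is bad" case; the synchronize call is unaffected because it never looks at the exit info.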
Daniel
--
Daniel Templeton ERGB01 x60220
Staff Engineer, Sun N1 Grid Engine
"Roads? Where we're going we don't need roads."
-Dr. Emmett Brown, Back to the Future (1985)
Rajic, Hrabri wrote:
-----Original Message----- From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Daniel Templeton Sent: Wednesday, March 30, 2005 3:34 AM To: DRMAA Working Group Subject: [drmaa-wg] Questions
In working on a remote implementation of the Java binding, I have run into a couple of interesting questions. What happens when, during a call to drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), the implementation fails to perform the given action on more than one job, for different reasons? For example, if I try to hold all jobs, but one job is already in a hold state, three jobs work OK, and the DRM goes down before acting on the last job, what is the return code?
The routine return code would need to indicate a compound error (BTW, we do not have such an error code defined), and the detailed error message would need to detail what happened.
In other words, the spec completely fails to address this case. Something to keep in mind for 1.1 or 2.0.
When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the contract on failure, i.e. in what state will the jobs be left? In the case of a job failure, does that mean that all jobs will be left in the state that they were in before the call? If so, that's going to cause serious implementation problems. If not, that's going to cause serious usability problems.
Transactional interface would be quite useful here ... If a routine exits/fails during the call there is no good recourse.
Exactly the point I'm making. Without transactions, it's hard to use. With transactions, it's hard to implement.
Job failure? Is this a separate question? One analogy would be teaching a university course. There would be students dropping the course, but the rest goes ahead. In case of absences things also go ahead, and when the students reappear the regime is known.
That's a typo. I meant operation failure.
What happens when a job ends after a thread has called drmaa_synchronize (DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit info with a call to drmaa_wait()? I would assume that the synchronize thread should just assume that the job finished, even though its job record is gone. That is what the SGE implementation does.
Ha, races with job reaping info. The developers would need to be careful in multithreaded environments ... some guidelines would be necessary, but preferably outside of the normative docs.
The reason I bring it up is that this particular case is non-obvious. It's clear that waiting for the same job twice is bad, but it's not so clear when waiting for any or all.
Daniel
The routine return code would need to indicate a compound error (BTW, we do not have such an error code defined), and the detailed error message would need to detail what happened.
In other words, the spec completely fails to address this case. Something to keep in mind for 1.1 or 2.0.
I added a tracker item for the issue.
When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the contract on failure, i.e. in what state will the jobs be left? In the case of a job failure, does that mean that all jobs will be left in the state that they were in before the call? If so, that's going to cause serious implementation problems. If not, that's going to cause serious usability problems.
Transactional interface would be quite useful here ... If a routine exits/fails during the call there is no good recourse.
Exactly the point I'm making. Without transactions, it's hard to use. With transactions, it's hard to implement.
Demanding transactional behavior seems unrealistic to me. Most other groups (e.g. OGSA) have similar problems; take, for example, the SetResourceProperties operation in the WS-ResourceProperties specification (chapter 7). The usual approach is to declare the problem implementation-dependent.
What happens when a job ends after a thread has called drmaa_synchronize (DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit info with a call to drmaa_wait()? I would assume that the synchronize thread should just assume that the job finished, even though its job record is gone. That is what the SGE implementation does.
Ha, races with job reaping info. The developers would need to be careful in multithreaded environments ... some guidelines would be necessary, but preferably outside of the normative docs.
The reason I bring it up is that this particular case is non-obvious. It's clear that waiting for the same job twice is bad, but it's not so clear when waiting for any or all.
The result seems to be that we need more clarification about multithreading issues in the spec. Is it worthwhile to open a tracker item for this, in order to collect all the specific findings?
Regards, Peter
On Thu, 7 Apr 2005, Peter Troeger wrote:
The routine return code would need to indicate a compound error (BTW, we do not have such an error code defined), and the detailed error message would need to detail what happened.
In other words, the spec completely fails to address this case. Something to keep in mind for 1.1 or 2.0.
I added a tracker item for the issue.
When doing a drmaa_control(DRMAA_JOB_IDS_SESSION_ALL), what is the contract on failure, i.e. in what state will the jobs be left? In the case of a job failure, does that mean that all jobs will be left in the state that they were in before the call? If so, that's going to cause serious implementation problems. If not, that's going to cause serious usability problems.
Transactional interface would be quite useful here ... If a routine exits/fails during the call there is no good recourse.
Exactly the point I'm making. Without transactions, it's hard to use. With transactions, it's hard to implement.
Demanding transactional behavior seems unrealistic to me. Most other groups (e.g. OGSA) have similar problems; take, for example, the SetResourceProperties operation in the WS-ResourceProperties specification (chapter 7). The usual approach is to declare the problem implementation-dependent.
What happens when a job ends after a thread has called drmaa_synchronize (DRMAA_JOB_IDS_SESSION_ALL), but another thread "steals" the job exit info with a call to drmaa_wait()? I would assume that the synchronize thread should just assume that the job finished, even though its job record is gone. That is what the SGE implementation does.
Ha, races with job reaping info. The developers would need to be careful in multithreaded environments ... some guidelines would be necessary, but preferably outside of the normative docs.
The reason I bring it up is that this particular case is non-obvious. It's clear that waiting for the same job twice is bad, but it's not so clear when waiting for any or all.
The result seems to be that we need more clarification about multithreading issues in the spec. Is it worthwhile to open a tracker item for this, in order to collect all the specific findings?
DRMAA specifies drmaa_control(DRMAA_JOB_IDS_SESSION_ALL) as an atomic call. In case of an error, one of the DRMAA error codes is to be returned to indicate the failure. If so, the call could be repeated. I don't see a reasonable means to improve the DRMAA spec for that call, so I would argue against filing a tracker item.
Regards, Andreas
participants (4): Andreas Haas; Daniel Templeton; Peter Troeger; Rajic, Hrabri