Re: [saga-rg] Job States

4 Aug 2005


      Quoting [Christopher Smith] (Aug 04 2005):
...
On 29/7/05 10:40, "Andre Merzky" <andre@merzky.net> wrote:
...
SAGA Jobs currently have the following states:
[chop]
...
I got the comment from colleagues that PreStaging and
PostStaging are missing.  Indeed these stages do not seem to
fit into any of the above ones.  Running would be a
candidate, but since the remote resource is not necessarily
used anymore, that might be confusing.  Should these stages
be added?  However, they also do not appear in the DRMAA
specification AFAIK.
Any thoughts?
These states can be added.
Also, there is a more complicated state model for "activities" emerging from
the OGSA-BES work, that also includes sub-states for file staging, etc, etc.
We can perhaps incorporate some of that as well, although I'm happy with
general Pre-execution and Post-execution states to cover all of this.
Pre-Execution/Post-Execution sounds good to me.  I guess we
don't want too complex a state model, and these two
can incorporate whatever SAGA or the backend deems necessary
to do before/after the job is actually running...
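
To make the proposal concrete, here is a minimal Python sketch
of a state model with general Pre-Execution and Post-Execution
states.  Only Running and DoneFail are named in this thread
(the full list was chopped), so DONE_OK and the transition
table are assumptions for illustration, not part of any SAGA
draft.

```python
from enum import Enum


class JobState(Enum):
    # Running and DoneFail appear in the thread; the rest are illustrative.
    PRE_EXECUTION = "PreExecution"    # proposed: staging-in, setup, ...
    RUNNING = "Running"
    POST_EXECUTION = "PostExecution"  # proposed: staging-out, cleanup, ...
    DONE_OK = "DoneOK"                # assumed name for successful completion
    DONE_FAIL = "DoneFail"


# Hypothetical transition table: any non-final state may fail.
ALLOWED = {
    JobState.PRE_EXECUTION: {JobState.RUNNING, JobState.DONE_FAIL},
    JobState.RUNNING: {JobState.POST_EXECUTION, JobState.DONE_FAIL},
    JobState.POST_EXECUTION: {JobState.DONE_OK, JobState.DONE_FAIL},
    JobState.DONE_OK: set(),
    JobState.DONE_FAIL: set(),
}


def can_transition(src: JobState, dst: JobState) -> bool:
    """Return True if the sketch model allows moving from src to dst."""
    return dst in ALLOWED[src]
```

With only two extra states the model stays small, and whatever
the backend does before or after execution is hidden behind
them.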
...
Perhaps we can discuss on the call tomorrow.
Great.
...
...
Another question: Assume I check a job status and find it
'DoneFail' - how can I determine the reason for the failure?  It
would be useful to know the state the job was in before it
failed (e.g. if it was prestaging, I know then that staging
failed, and the job never really started).  Also it would be
nice to be able to query for any error message.
There is the getJobExitStatus method on the Job interface so that you can
get things like the exit code and the signal number that caused termination.
As for querying the state which preceded the failure, it sounds like a good
idea (LSF does this by keeping a history log for jobs that can be queried
via a "bhist" command). Perhaps adding an optional string to the
JobExitStatus class would be sufficient for this kind of extended
information? The problem is that this stuff is not particularly standardized
across resource managers I think.
I think a (potentially) extensive error message on the exit
status object is the simplest solution - if a job failed,
look there to find some information about the reason, if
available.  Nice.
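
A rough Python sketch of that idea follows.  The field names
(exit_code, signal, message, previous_state) and the helper
describe_failure are hypothetical; the thread only establishes
that JobExitStatus carries an exit code and signal number and
might gain an optional string.  Every field is optional because
this information is not standardized across resource managers.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobExitStatus:
    # All fields are optional: a backend may not be able to provide them.
    exit_code: Optional[int] = None
    signal: Optional[int] = None
    message: Optional[str] = None         # free-form, backend-specific text
    previous_state: Optional[str] = None  # state the job was in before DoneFail


def describe_failure(status: JobExitStatus) -> str:
    """Build a readable failure summary from whatever details are present."""
    parts = []
    if status.previous_state is not None:
        parts.append(f"failed while in {status.previous_state}")
    if status.exit_code is not None:
        parts.append(f"exit code {status.exit_code}")
    if status.signal is not None:
        parts.append(f"signal {status.signal}")
    if status.message:
        parts.append(status.message)
    return "; ".join(parts) if parts else "no failure details available"
```

A caller that finds a job in DoneFail would inspect this object
rather than catch an exception, and must cope with it being
empty.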

Cheers, Andre.
...
...
I think that the error query is distinct from the exception
mechanism we will have: a job entering DoneFail should NOT
throw an exception in my opinion - but that leads to the
question above: how can I query the error leading to the DoneFail
state?
I agree.
-- Chris
-- 
+-----------------------------------------------------------------+
| Andre Merzky                      | phon: +31 - 20 - 598 - 7759 |
| Vrije Universiteit Amsterdam (VU) | fax : +31 - 20 - 598 - 7653 |
| Dept. of Computer Science         | mail: merzky@cs.vu.nl       |
| De Boelelaan 1083a                | www:  http://www.merzky.net |
| 1081 HV Amsterdam, Netherlands    |                             |
+-----------------------------------------------------------------+