DRMAA2: TERMINATED vs. FAILED state
Dear all,

this discussion thread is intended to finalize the discussion about job states after execution end in DRMAA2. In DRMAA1, there is only the FAILED state, expressing that the job was running but did not finish successfully for some reason. Piotr proposed a separation between FAILED and TERMINATED jobs:

http://www.ogf.org/pipermail/drmaa-wg/2009-January/000985.html

Meanwhile, we have had several proposals regarding this idea:

Option 1) TERMINATED state = resubmission might help, FAILED state = resubmission unlikely to help (machine problem, misconfiguration)
Option 2) TERMINATED state = triggered by an external entity, FAILED state = job terminated by itself
Option 3) FAILED state = job command line could not be executed, TERMINATED state = something else happened
Option 4) Stick with FAILED only, and express special circumstances via the new job sub-state information

Issue #5875 (originally from the PBS experience report) criticizes that FAILED currently expresses both user-requested termination and job failure. How is this issue related to the problem?

Another question is the relation to the wif_* functions.

Please contribute your opinion.

Thanks,
Peter.
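As a rough illustration of how Options 2 and 4 differ, here is a minimal Python sketch. It is not taken from any DRMAA binding: the `JobEnd` record, the `killed_externally` flag, the `sub_state` string, and both function names are hypothetical, and treating a non-zero exit status as "did not finish successfully" is an assumption made only for this example.

```python
from dataclasses import dataclass
from enum import Enum


class JobState(Enum):
    # Only the end-of-execution states discussed in this thread.
    DONE = "DONE"
    FAILED = "FAILED"
    TERMINATED = "TERMINATED"


@dataclass
class JobEnd:
    """Hypothetical record of how a job ended (not part of any DRMAA API)."""
    exit_status: int          # exit code reported for the job
    killed_externally: bool   # assumed flag: user, admin, or DRM system ended the job
    sub_state: str = ""       # assumed free-form, implementation-specific detail


def final_state_option2(end: JobEnd) -> JobState:
    """Option 2: TERMINATED = external trigger, FAILED = job ended by itself."""
    if end.killed_externally:
        return JobState.TERMINATED
    if end.exit_status != 0:  # assumption: non-zero exit means failure
        return JobState.FAILED
    return JobState.DONE


def final_state_option4(end: JobEnd) -> tuple[JobState, str]:
    """Option 4: FAILED only, with the circumstances carried in the sub-state."""
    if end.killed_externally or end.exit_status != 0:
        return JobState.FAILED, end.sub_state
    return JobState.DONE, ""
```

Under Option 2 the caller can tell an external kill from a self-inflicted failure by the state alone; under Option 4 the same distinction has to be read from the sub-state, whose content would be implementation-specific.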
I think ultimately the purpose here is to be able to tell when a job was killed by a user, an administrator, a forced migration, or some such. The internal/external explanation (Option 2) captures that best for me. I think the other subtleties of how exactly a job failed should be expressed another way, such as the substate information.

Daniel

Peter Tröger wrote:
Dear all,
this discussion thread is intended to finalize the discussion about job states after execution end in DRMAA2. In DRMAA1, there is only the FAILED state, expressing that the job was running but did not finish successfully for some reason. Piotr proposed a separation between FAILED and TERMINATED jobs:
http://www.ogf.org/pipermail/drmaa-wg/2009-January/000985.html
Meanwhile, we have had several proposals regarding this idea:
Option 1) TERMINATED state = resubmission might help, FAILED state = resubmission unlikely to help (machine problem, misconfiguration)
Option 2) TERMINATED state = triggered by an external entity, FAILED state = job terminated by itself
Option 3) FAILED state = job command line could not be executed, TERMINATED state = something else happened
Option 4) Stick with FAILED only, and express special circumstances via the new job sub-state information
Issue #5875 (originally from the PBS experience report) criticizes that FAILED currently expresses both user-requested termination and job failure. How is this issue related to the problem?
Another question is the relation to the wif_* functions.
Please contribute your opinion.
Thanks,
Peter.
--
drmaa-wg mailing list
drmaa-wg@ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
participants (2): Daniel Templeton, Peter Tröger