DRMAA2: Job State Model Discussion
Dear all,

this discussion thread is intended to clarify the new job state model in DRMAA2. DRMAA1 allows the following job states:

UNDETERMINED
QUEUED_ACTIVE
SYSTEM_ON_HOLD
USER_ON_HOLD
USER_SYSTEM_ON_HOLD
RUNNING
SYSTEM_SUSPENDED
USER_SUSPENDED
USER_SYSTEM_SUSPENDED
DONE
FAILED

The following questions are already identified, please provide your expertise:

- Which states need to be removed / added from the viewpoint of PBS, LSF, Condor, SGE, Unicore, GridWay, OGSA-BES and SAGA?
- Which of the job states were never realized in DRMAA implementations? Can we remove them?
- Which of the states are too generic? How do we resolve this?
- Which of the existing state transitions are a problem? Missing ones?
- Do we want an extensible job state model as in OGSA-BES? If yes, how to realize?

More questions are welcome, but first try to answer the given ones.

Thanks,
Peter.
On Sun, Jan 18, 2009 at 9:18 PM, Peter Tröger wrote:
> DRMAA1 allows the following job states: UNDETERMINED, QUEUED_ACTIVE, SYSTEM_ON_HOLD, USER_ON_HOLD, USER_SYSTEM_ON_HOLD, RUNNING, SYSTEM_SUSPENDED, USER_SUSPENDED, USER_SYSTEM_SUSPENDED, DONE, FAILED
> [...]
> Which states need to be removed / added from the viewpoint of PBS, LSF, Condor, SGE, Unicore, GridWay, OGSA-BES and SAGA?
Some DRMSs had neither a suspend nor a hold state -- I don't remember exactly which ones (Condor?). What is important, in my opinion, is to allow drmaa_control() to return some kind of "not implemented" error.
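To make the "not implemented" idea concrete, here is a minimal sketch, not a proposal for the actual binding. The five actions mirror the DRMAA1 control operations; `UnsupportedOperationError`, `Backend` and the capability set of `example-drms` are hypothetical names made up for illustration.

```python
from enum import Enum


class ControlAction(Enum):
    SUSPEND = "suspend"
    RESUME = "resume"
    HOLD = "hold"
    RELEASE = "release"
    TERMINATE = "terminate"


class UnsupportedOperationError(Exception):
    """Hypothetical error: the DRMS has no equivalent of the requested action."""


class Backend:
    """Hypothetical backend; `supported` lists the actions this DRMS can perform."""

    def __init__(self, name, supported):
        self.name = name
        self.supported = frozenset(supported)

    def control(self, job_id, action):
        # Fail loudly instead of silently ignoring an action the DRMS lacks.
        if action not in self.supported:
            raise UnsupportedOperationError(
                f"{self.name} does not implement {action.value}"
            )
        return f"{action.value} sent to job {job_id}"


# Made-up capability set: a DRMS without hold/release support.
no_hold = Backend(
    "example-drms",
    {ControlAction.SUSPEND, ControlAction.RESUME, ControlAction.TERMINATE},
)
```

With this shape, a caller can distinguish "the DRMS cannot do this at all" from a transient failure and react accordingly.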
> Which of the job states were never realized in DRMAA implementations? Can we remove them?
I think the differentiation between system and user hold/suspend is very much SGE-specific, as far as I remember. Personally, I also hate the UNDETERMINED state. If this were only up to me, I'd surely remove it. Not being able to get the job state is an error in most cases, so having a special state for that is useless. For example, in our case, when implementing BES on top of DRMAA, I loop a few times when I get the UNDETERMINED state and eventually throw a fault. I'd rather have the DRMAA implementation loop a few times and, if that gives nothing, return an error.
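The retry-then-fail behaviour described above could look roughly like the following sketch. It assumes a caller-supplied `query` function that returns a state name as a string; `job_state_with_retry`, `JobStateUnavailableError` and the default of three attempts are hypothetical choices, not part of any DRMAA binding.

```python
import time


class JobStateUnavailableError(Exception):
    """Hypothetical error raised when the state stays undetermined."""


def job_state_with_retry(query, job_id, attempts=3, delay=0.0):
    """Call query(job_id) up to `attempts` times; raise if no real state is obtained."""
    for attempt in range(attempts):
        state = query(job_id)
        if state != "UNDETERMINED":
            return state
        # Wait before asking again, except after the last attempt.
        if attempt < attempts - 1:
            time.sleep(delay)
    raise JobStateUnavailableError(
        f"state of job {job_id} unknown after {attempts} attempts"
    )


# Fake DRMS that answers UNDETERMINED twice before reporting a real state.
answers = iter(["UNDETERMINED", "UNDETERMINED", "RUNNING"])
print(job_state_with_retry(lambda job_id: next(answers), "42"))  # prints RUNNING
```

If this loop lived inside the DRMAA implementation, applications would only ever see a real state or an error, and UNDETERMINED could be dropped from the public state model.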
> Which of the states are too generic? How do we resolve this?
I don't like the FAILED state meaning both failed and terminated. It could be split into FAILED and TERMINATED states. But then we need to discuss how this maps to what drmaa_wait() returns. By the way, did you have an opportunity to discuss the future of the drmaa_wif*() functions?
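One possible way to relate a split FAILED/TERMINATED model to the wait result is sketched below. `WaitInfo` is a hypothetical record that only mirrors the kind of information the drmaa_wif*() helpers expose, and the mapping itself is an assumption for discussion: in particular, treating a non-zero exit status as FAILED, and treating a job that never ran as FAILED rather than TERMINATED, are choices the thread has not settled.

```python
from dataclasses import dataclass


@dataclass
class WaitInfo:
    """Hypothetical record of what a wait call reported about a finished job."""
    exited: bool = False     # cf. drmaa_wifexited()
    exit_status: int = 0     # cf. drmaa_wexitstatus()
    signaled: bool = False   # cf. drmaa_wifsignaled()
    aborted: bool = False    # cf. drmaa_wifaborted()


def final_state(info):
    """Map wait information to DONE, FAILED or TERMINATED (assumed mapping)."""
    if info.signaled:
        # The job was killed by a signal: report it as terminated.
        return "TERMINATED"
    if info.exited and not info.aborted and info.exit_status == 0:
        return "DONE"
    # Everything else, including jobs that never started, counts as failed here.
    return "FAILED"
```

Whatever mapping is chosen, the final state and the wait result should be derivable from each other so that the two views of a finished job cannot disagree.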
> Do we want an extensible job state model as in OGSA-BES? If yes, how to realize?
I think this would be cool to have. DRMAA is mostly viewed as a low-level API to a local DRMS. For those scenarios, the current state model is OK. On the other hand, there are scenarios where DRMAA is used to add a simple API on top of some higher-level system -- as in GridWay, AFAIR. We were also actually thinking of implementing a DRMAA interface to a remote OGSA-BES service. For that kind of scenario, an extensible job state model (e.g. for stage in/out states, which are rarely observable in a local DRMS) would be useful. If there are many votes for that, we might start a discussion on how to actually specify/implement it. Until then, I don't think there's any use going into the details.

--
Piotr Domagalski
FedStage Systems Ltd.
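As a starting point for such a discussion, an extensible model could pair a fixed base state with an optional implementation-specific sub-state, in the spirit of the OGSA-BES approach. The sketch below is only an illustration: `JobState`, `is_a` and the `STAGE_IN` sub-state name are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class JobState:
    base: str                        # one of the fixed, portable states
    sub_state: Optional[str] = None  # optional implementation-specific refinement

    def is_a(self, base):
        """True if this state belongs to the given base state, whatever the sub-state."""
        return self.base == base

    def __str__(self):
        return self.base if self.sub_state is None else f"{self.base}:{self.sub_state}"


# A higher-level system could expose staging as a refinement of RUNNING.
staging = JobState("RUNNING", "STAGE_IN")
print(staging)                  # prints RUNNING:STAGE_IN
print(staging.is_a("RUNNING"))  # prints True
```

Portable applications would test only the base state, while clients that know a particular backend could additionally inspect the sub-state.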
participants (2)
- Peter Tröger
- Piotr Domagalski