Quoting [Daniel Templeton] (Feb 02 2009):
Since I won't make the meeting, here's my feedback.
Peter Tröger wrote:
To represent the temporarily undetermined state, we expand the TryAgainLaterException to apply to drmaa_job_ps() as well.
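From the client side, that would probably mean polling with a retry. A rough sketch only: `TryAgainLaterError` and the `query_state` callable are stand-ins invented for this example, not names from any DRMAA binding.

```python
import time


class TryAgainLaterError(Exception):
    """Hypothetical stand-in for DRMAA's TryAgainLaterException."""


def poll_state(query_state, attempts=3, delay=0.0):
    """Call query_state() until it returns a state or the attempts run out.

    query_state is an assumed zero-argument callable wrapping something
    like drmaa_job_ps(); it raises TryAgainLaterError while the job state
    is temporarily undetermined.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return query_state()
        except TryAgainLaterError:
            if attempt == attempts - 1:
                raise  # still undetermined after the last attempt
            time.sleep(delay)
```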
3. Voting about separate "TERMINATED" vs. "FAILED" state - Semantics
A job that exits via the terminated state has the potential to succeed if resubmitted. It entered the terminated state due to an action taken by the job owner, an administrator, or the DRM system itself, possibly on behalf of the terminated job. A job that exits via the failed state is unlikely to succeed if resubmitted. It entered the failed state due to an error in the job or a misconfiguration of the machine on which it ran.
You can't always know whether a resubmit has a chance of success: a broken file system, or insufficient or bad memory, may lead to internal fail states, and may well allow the job to succeed next time. An endless loop in the application may always occur, and trigger the scheduler or the system to eventually kill the job, without any chance of a later instance doing any better. So, maybe it is better to distinguish based on the information you _do_ have?
- FAILED: the job terminated for internal reasons (i.e. the application met an internal error condition)
- TERMINATED: the job termination was triggered by an external entity (e.g. by the user, scheduler, system, ...)
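To make that distinction concrete, here is a minimal sketch. It assumes the DRM system can report who, if anyone, ended the job. The cause labels and function name are made up for illustration and do not come from the DRMAA spec.

```python
from enum import Enum
from typing import Optional


class FinalState(Enum):
    DONE = "DONE"
    FAILED = "FAILED"
    TERMINATED = "TERMINATED"


# Hypothetical labels for an external entity that ended the job.
EXTERNAL_CAUSES = {"user", "administrator", "scheduler", "system"}


def classify_exit(exit_ok: bool, ended_by: Optional[str]) -> FinalState:
    """Pick a final state from what is actually known about the exit.

    ended_by is None when the job ended on its own.
    """
    if ended_by in EXTERNAL_CAUSES:
        return FinalState.TERMINATED  # triggered from outside the job
    if exit_ok:
        return FinalState.DONE
    return FinalState.FAILED  # the application hit an internal error
```

Note that this says nothing about whether a resubmit would succeed, which is the point: it only uses information the system has.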
There is a problem with my clean could-succeed/won't-succeed division. What if a job failed because the machine it ran on was wonky? That is clearly a failure, not a termination, but if the job were resubmitted and landed on any other machine, it would succeed. In that case, do we actually care if there was a difference between failure and termination?
- Resulting new job state transitions
There's one more thing we may want to consider. In SGE, a job can exit one of four ways. It can succeed. It can fail, which includes termination. It can request to be rescheduled. And it can be set into error state. The first two are handled fine by drmaa_wait(). The third can be recognized by drmaa_job_ps(), but it's not ideal. The fourth is completely unknowable from DRMAA. To the DRMAA client, it will look like the job was requeued to be rescheduled, but is never actually scheduled to run again. We might want to consider supporting some additional states, such as rescheduled or error, or maybe those states are something that the state/substate model would enable.
I vote for making the substate as generic as possible. I think forcing it to be an integer is unnecessarily limiting. Taking some Java APIs as examples, sometimes the substates are really just text messages that explain what's going on. I think that's valid and something we should allow.
"If all the tools you have is a hammer, every problem starts to look like a nail." So, my apologies for pulling the same string every time I post to this list *blush* Anyway, you may want to have a look at the SAGA state model, again: substates are defined as strings, but SAGA implementations are encouraged to define these strings, and to adhere to a namespace. So, an SGE implementation would document the substates of RUNNING as
  SGE:RESCHEDULED
  SGE:ERROR
Well, SGE:ERROR should go into a final state, not into RUNNING, right? But you got the picture. (GFD-90 p.65, last paragraph).
Cheers, Andre.
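A minimal sketch of how a client might handle such namespaced string substates. The `JobStatus` type and `split_substate` helper are invented for this example and are not taken from SAGA or DRMAA.

```python
from typing import NamedTuple, Optional, Tuple


class JobStatus(NamedTuple):
    state: str               # generic, portable state, e.g. "RUNNING"
    substate: Optional[str]  # implementation-defined, e.g. "SGE:RESCHEDULED"


def split_substate(substate: str) -> Tuple[str, str]:
    """Split 'NAMESPACE:DETAIL' into its two parts.

    A substate without a colon is treated as free text with an empty
    namespace, so plain explanatory messages remain valid.
    """
    namespace, sep, detail = substate.partition(":")
    if not sep:
        return ("", substate)
    return (namespace, detail)
```

A portable client would switch on `state` only and treat `substate` as extra detail that it may display or ignore.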
4. Further DRMAA2 discussion
See the attached email from a few weeks ago.
Daniel
Date: Tue, 20 Jan 2009 08:46:24 -0800
From: Daniel Templeton <Dan.Templeton@Sun.COM>
Subject: DRMAA v2
To: DRMAA Working Group <drmaa-wg@gridforum.org>
A few proposals for the meeting today:
PT12:
< A language binding SHOULD specify numeric values for all DRMAA error constants.
---
> Such a language binding SHOULD specify numeric values for all DRMAA error constants.
PT13: I definitely agree that PartialTimestamp is a boondoggle. I'm not sure I agree with using ISO8601, though, mostly because it presupposes a date/time *string*. In a high-level language, I want to be able to use the native date/time object. How about specifying that a language should use a date/time object or primitive if it has one, and an ISO8601 string if it doesn't?
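Roughly what that rule could look like, sketched in Python, which does have a native date/time type. The two helper names and the `native_supported` flag are assumptions made for this example.

```python
from datetime import datetime


def encode_timestamp(ts: datetime, native_supported: bool):
    """Return the native object if the binding has one, else ISO 8601 text."""
    return ts if native_supported else ts.isoformat()


def decode_timestamp(value) -> datetime:
    """Accept either form and hand back a native datetime."""
    if isinstance(value, datetime):
        return value
    return datetime.fromisoformat(value)
```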
PT20: I think we can handle the resource request pretty easily, and I think we need it. We just need to add a resourceRequest attribute of type Dictionary and treat any such resource request as a hard request. Alternatively, we could have a hardResourceRequest and a softResourceRequest. The former is simpler, but the latter saves us from talking about this again for DRMAAv3. :)
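A sketch of the two-dictionary variant, to show how little it adds. The class, the field names, and the exact-match comparison are all assumptions for illustration; a real DRM system would have its own matching rules.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class JobTemplate:
    remote_command: str
    # Hypothetical attributes mirroring hardResourceRequest / softResourceRequest.
    hard_resource_request: Dict[str, str] = field(default_factory=dict)
    soft_resource_request: Dict[str, str] = field(default_factory=dict)


def satisfies_hard(template: JobTemplate, host: Dict[str, str]) -> bool:
    """A host is eligible only if every hard request matches exactly."""
    return all(host.get(name) == value
               for name, value in template.hard_resource_request.items())


def soft_matches(template: JobTemplate, host: Dict[str, str]) -> int:
    """Count satisfied soft requests; usable as a preference score."""
    return sum(1 for name, value in template.soft_resource_request.items()
               if host.get(name) == value)
```

With a single resourceRequest attribute, only the hard check would remain.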
Thinking about whether a resource request should be an optional attribute raised a doubt in my mind about the value of the UnsupportedAttributeException. Should it be possible to have the implementation just ignore unsupported optional attributes? It would certainly be easier than repeatedly attempting to submit until all the offending attributes are removed from the template. Maybe it would help to have the exception detail *all* unsupported attributes at once. Just thinking out loud here...
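Both options could be sketched along these lines. The exception class, the `check_template` function, and the `strict` flag are hypothetical names for this example, not an existing DRMAA API.

```python
class UnsupportedAttributeError(Exception):
    """Hypothetical exception that reports every offending attribute at once."""

    def __init__(self, attributes):
        self.attributes = sorted(attributes)
        super().__init__("unsupported attributes: " + ", ".join(self.attributes))


def check_template(requested: dict, supported: set, strict: bool = True) -> dict:
    """Validate optional attributes before submission.

    strict=True  raises once, listing all unsupported attributes.
    strict=False silently drops them and returns what is left.
    """
    unsupported = set(requested) - supported
    if unsupported and strict:
        raise UnsupportedAttributeError(unsupported)
    return {name: value for name, value in requested.items()
            if name in supported}
```

Either way the client needs at most one failed submit to learn everything it has to fix.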
Daniel -- Nothing is ever easy.