Attendees: Peter Troeger, Dan Templeton, Hrabri Rajic, Daniel Gruber

1. Meeting secretary for this meeting?
   Daniel Gruber

2. Next IDL language binding?
   Python binding was finished within 3 months -> efficient
   Do we want to do it again for Java / C binding?
   -> low priority
   -> DRMAA2 is currently more important

3. DRMAA2 schedule (requested by Steven)
   -> doable within 6 months
   -> some time for approval after that
   -> there are two OGF meetings in this time (26th and 27th)
   -> earlier agreement: fix everything we identified

4. DRMAA2 feature discussion
   Do we need a "System Hold" / "User Hold" differentiation?
   -> idea: give the reason as an exception?
   -> agreed that it is not really necessary
   Undetermined: is it a valid job state?
   -> Yes! Undetermined = Error
   -> Condor: it is sometimes permanent
   -> Need to clarify whether this means "don't try again" or "try it again"
   Distinction between state "failed" and "terminated"
   -> "Failed" := error the user can fix (for example, through changes to the job template)
   -> "Terminated" := error the user can't fix
   TODO later: the DRMAA wait state will be discussed another time
   Why should we support an extensible state?
   -> basically for reporting
   -> problem: difficult to implement in C
   Solution for more fine-grained states (and the "User/System Hold" problem)
   -> add another output parameter (a DRM-specific string) or constants for the job state
   -> hence "Hold" could be one state
   -> fine-grained states could additionally be retrieved via such a DRM-specific string or constant
   -> proposal: one string and a reference to a generic object/struct
   -> will appear in the draft later on!
   Problem with the state transition in Condor from "Suspend" to "Queued Active"!
   -> should be supported (at least 3 people agreed)

5. Continue the discussion in the email threads.

6. Next meeting: 3rd February 2009
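The minutes propose collapsing user and system hold into a single "Hold" state and exposing finer detail through an additional DRM-specific string output. Below is a minimal C sketch of that idea. Every name in it (job_state_t, fake_drm_job_t, get_job_state) is hypothetical and is not taken from any DRMAA2 draft; the fake job record merely stands in for whatever the DRM reports.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical job states, for illustration only (not the DRMAA2 draft). */
typedef enum {
    JOB_UNDETERMINED,
    JOB_QUEUED,
    JOB_HOLD,        /* one state covers both user and system hold */
    JOB_RUNNING,
    JOB_SUSPENDED,
    JOB_DONE,
    JOB_FAILED,
    JOB_TERMINATED
} job_state_t;

/* Hypothetical stand-in for a DRM's internal job record. */
typedef struct {
    job_state_t state;
    const char *drm_detail;   /* e.g. "user hold" or "system hold" */
} fake_drm_job_t;

/*
 * Hypothetical query: returns the standard state and, if the caller
 * supplies a buffer, copies the DRM-specific detail string into it.
 * Callers that only understand the standard states pass NULL.
 */
job_state_t get_job_state(const fake_drm_job_t *job,
                          char *detail, size_t detail_len)
{
    assert(job != NULL);
    if (detail != NULL && detail_len > 0) {
        snprintf(detail, detail_len, "%s",
                 job->drm_detail != NULL ? job->drm_detail : "");
    }
    return job->state;
}
```

A portable client would only compare the return value against JOB_HOLD. A client that knows the DRM could additionally inspect the detail string, for example with strcmp(detail, "user hold").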
On Tue, Jan 20, 2009 at 7:50 PM, Daniel Gruber <D.Gruber@sun.com> wrote:
Undetermined: is it a valid job state? -> Yes! Undetermined = Error -> Condor: it is sometimes permanent -> Need to clarify whether this means "don't try again" or "try it again"
But does that mean that the undetermined state will go away and the function will return an error?
Distinction between state "failed" and "terminated" -> "Failed" := error the user can fix (for example, through changes to the job template) -> "Terminated" := error the user can't fix
I thought of Terminated as the state the job gets into if it was drmaa_control()'ed, or possibly deleted locally in the DRMS (by an admin or user), but the latter may be optional functionality.
Why should we support an extensible state? -> basically for reporting -> problem: difficult to implement in C
It might be modelled similarly to BES, so that there are standard states that one can additionally inherit from to get more detailed states. In C it might be done in the following way (a kind of OOP programming in C):

    typedef struct {
        int standard_state;
    } drmaa_state_t;

That would be standardised. But the implementation might want to extend it, and then it might actually return:

    typedef struct {
        drmaa_state_t super;
        int my_own_specific_state;
    } drmaa_sge_state_t;

If the "client" wants to use only standard states, it uses a pointer to the first structure and thus doesn't see the detailed state (e.g. the general hold state plus the implementation-specific user/admin hold). But when he knows he's using a specific DRMAA implementation, he may cast the general structure to the implementation-specific one. Kind of a hack, but AFAIR it is C standards compliant. Pointers to these two structures should be interchangeable, because they point to the same place in memory.

-- 
Piotr Domagalski
FedStage Systems Ltd.
Piotr Domagalski wrote:
On Tue, Jan 20, 2009 at 7:50 PM, Daniel Gruber <D.Gruber@sun.com> wrote:
Undetermined: is it a valid job state? -> Yes! Undetermined = Error -> Condor: it is sometimes permanent -> Need to clarify whether this means "don't try again" or "try it again"
But does that mean that the undetermined state will go away and the function will return an error?
It probably means that UNDETERMINED becomes one state, and an exception becomes the other.
Distinction between state "failed" and "terminated" -> "Failed" := error the user can fix (for example, through changes to the job template) -> "Terminated" := error the user can't fix
I thought of Terminated as the state the job gets into if it was drmaa_control()'ed, or possibly deleted locally in the DRMS (by an admin or user), but the latter may be optional functionality.
Exactly. A failed job needs to be fixed before being resubmitted. A terminated job could succeed as-is if resubmitted.
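The FAILED/TERMINATED distinction mainly matters to a client deciding whether to resubmit. Here is a small sketch of that decision, following the reading in this reply. The enum and the function name may_resubmit_unchanged are hypothetical and not part of any DRMAA API.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical end states, following the distinction in this thread. */
typedef enum { END_DONE, END_FAILED, END_TERMINATED } end_state_t;

/*
 * FAILED: something about the job itself (e.g. the job template) must
 * be fixed before resubmitting.
 * TERMINATED: the job was stopped from outside (drmaa_control() or a
 * deletion in the DRMS), so resubmitting it unchanged might succeed.
 */
bool may_resubmit_unchanged(end_state_t s)
{
    switch (s) {
    case END_TERMINATED:
        return true;
    case END_FAILED:
        return false;
    case END_DONE:
        return false;   /* finished normally, nothing to retry */
    }
    return false;
}
```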
Why should we support an extensible state? -> basically for reporting -> problem: difficult to implement in C
It might be modelled similarly to BES, so that there are standard states that one can additionally inherit from to get more detailed states. In C it might be done in the following way (a kind of OOP programming in C):
typedef struct { int standard_state; } drmaa_state_t;
That would be standardised. But the implementation might want to extend it and then it might actually return:
typedef struct { drmaa_state_t super; int my_own_specific_state; } drmaa_sge_state_t;
If the "client" wants to use only standard states, it uses a pointer to the first structure and thus doesn't see the detailed state (e.g. the general hold state plus the implementation-specific user/admin hold). But when he knows he's using a specific DRMAA implementation, he may cast the general structure to the implementation-specific one. Kind of a hack, but AFAIR it is C standards compliant. Pointers to these two structures should be interchangeable, because they point to the same place in memory.
I was thinking about something more along the lines of:

    typedef struct {
        int state;
        void *substate;
    } drmaa_state_t;

There is then no confusion for the caller about what he gets back. He only needs to check whether the substate is non-null *if* he knows enough about the DRMAA implementation to be able to understand it. It also leaves the format of the substate up to the implementation. Maybe an int just isn't enough data. Or maybe the substate is a string message.

Daniel
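This sketch shows how the opaque-substate layout might be used. The struct matches the proposal. STATE_HOLD, example_query_state, substate_as_string, and the choice of a C string as the substate are all hypothetical, picked only to show a client that understands the implementation and one that ignores the substate.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Proposed layout: a standard state plus an opaque substate. */
typedef struct {
    int state;
    void *substate;   /* NULL, or implementation-defined data */
} drmaa_state_t;

/* Hypothetical: this implementation uses a C string as its substate. */
enum { STATE_HOLD = 3 };
static char user_hold_msg[] = "user hold";

/* Hypothetical query an implementation might provide. */
drmaa_state_t example_query_state(void)
{
    drmaa_state_t s = { STATE_HOLD, user_hold_msg };
    return s;
}

/* Only a client that knows this implementation interprets the substate;
   a portable client looks at s->state alone. */
const char *substate_as_string(const drmaa_state_t *s)
{
    assert(s != NULL);
    return s->substate != NULL ? (const char *)s->substate : "";
}
```

Compared with the struct-embedding variant, the standard part never has to be cast. The trade-off is that the type and lifetime of whatever substate points to still have to be defined by each implementation.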
On 21.01.2009 at 12:59, Piotr Domagalski wrote:
On Tue, Jan 20, 2009 at 7:50 PM, Daniel Gruber <D.Gruber@sun.com> wrote:
Undetermined: is it a valid job state? -> Yes! Undetermined = Error -> Condor: it is sometimes permanent -> Need to clarify whether this means "don't try again" or "try it again"
But does that mean that the undetermined state will go away and the function will return an error?
We will vote about this on the next conf call. /Peter.
participants (4)
Daniel Gruber
Daniel Templeton
Peter Tröger
Piotr Domagalski