Re: [SAGA-RG] SAGA Job State Model

23 Mar 2010

      Quoting [Ole Christian Weidner] (Mar 22 2010):
...
On Mar 22, 2010, at 8:25 AM, Andre Merzky wrote:
...
Quoting [Ole Christian Weidner] (Mar 22 2010):
...
Aloha,
what was the reason again *not* to have a "pending" state in the
saga job model?
The decision on what states are on the top level of the SAGA state
model was based on the operations available in the API: only those
states got added which were explicitly reachable via some API
method.
Ok, but why? This is IMHO a pretty random decision.
Might as well be, true - some decision had to be made though, and
that seems as good a guideline as any other.
...
...
A 'Pending' state cannot be reached (or left, depending on
semantics) by any SAGA API call, thus that is only available as
state detail, not as top level state.
What about saga::job::New --run()--> saga::job::Pending. Also, you
could say the same thing about Done and Failed: these states are
not explicitly reachable via a call... and wait() doesn't really
count! If it does, you could also use it to transition from
Pending to Running, Failed, etc.
New->run()->Pending:  Sure, but then you have a transition
Pending->Running which is not expressed at API level.

Done/Failed: correct of course, but yes, we counted wait() to be the
point where the application can sync with the job state.
...
...
...
I'm implementing the third job adaptor (gLite
CREAM) for saga and again, I don't know if I should map gLite's
"pending" state to saga::job::New or saga::job::Running.
It should go to Running (as almost all substates IMHO).  New is
usually defined so that a job does not yet have a backend
representation.
Usually?
Yes: most states in middleware systems are assigned to jobs which
have a backend representation, and are thus in Running state from
the SAGA point of view.  An exception I can think of are substates to
Suspended (UserSuspended, SystemSuspended etc), and substates to the
final states (UserFailed, ApplicationFailed, SystemFailed,
UserCanceled, SystemCanceled etc).  Most other states we encountered
and which are specified for the various systems describe details of
a live job (after being accepted by the backend, before being
suspended or finished), and can thus be mapped to Running.
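
The mapping rule described above can be sketched in a few lines of
Python. This is only an illustration, not SAGA adaptor code: the
names SagaState, SUBSTATE_PARENT and map_backend_state are
hypothetical, the backend state strings are just the examples named
in this mail, and the has_backend_id flag stands in for "the backend
accepted the job and assigned a job-id".

```python
from enum import Enum


class SagaState(Enum):
    """Top-level states discussed in this thread (illustrative only)."""
    NEW = "New"
    RUNNING = "Running"
    SUSPENDED = "Suspended"
    DONE = "Done"
    FAILED = "Failed"
    CANCELED = "Canceled"


# Substates that do NOT map to Running: substates of Suspended and of
# the final states. The keys are the example names from the mail, not
# the states of any particular middleware.
SUBSTATE_PARENT = {
    "UserSuspended": SagaState.SUSPENDED,
    "SystemSuspended": SagaState.SUSPENDED,
    "Done": SagaState.DONE,
    "UserFailed": SagaState.FAILED,
    "ApplicationFailed": SagaState.FAILED,
    "SystemFailed": SagaState.FAILED,
    "UserCanceled": SagaState.CANCELED,
    "SystemCanceled": SagaState.CANCELED,
}


def map_backend_state(backend_state: str, has_backend_id: bool) -> SagaState:
    """Map a backend state name to a top-level state.

    A job without a backend representation is New. Every other state
    is Running unless it is a known substate of Suspended or of a
    final state.
    """
    if not has_backend_id:
        return SagaState.NEW
    return SUBSTATE_PARENT.get(backend_state, SagaState.RUNNING)
```

Under these assumptions, a backend 'Pending' job that already has a
job-id ends up in Running, which is the mapping suggested above for
the gLite CREAM adaptor.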
...
...
In Pending states however, most middleware do
already maintain job state.
What do you mean by maintaining a job state?
A better way to express this may be to say: the job has a
representation in the backend.  I.e., the backend accepted the job
creation request and a job-id exists which uniquely identifies the
job.
...
...
...
Most of the middleware API's out there come with a plethora of
states (e.g. gLite: 11), but most of them map naturally to
one of the saga job model's states. "Pending" is a state pretty
much used by everyone (Condor, PBS, LSF, Globus, gLite,
GridSAM) and it really doesn't map to saga's model. IMHO it's a
major design flaw - how could this fall through the sieve? Or
is there a reason behind this?
See above.  As you say, there is a plethora of states, and many
are important for specific use cases.  Other states have been
candidates for SAGA, such as StageIn and StageOut, or Hold, for
all of which exist interesting use cases.  But again: it did not
seem very useful to expose states on the top level which cannot
be reached via API calls - they are then only useful for
informational purposes.  As such, they are still available in
the state details.
But again: why didn't it seem very useful? ;-)
I would be perfectly happy using the state detail. The only
problem with them is that they're absolutely useless without any
formalization. Do you think it would make sense to define an
extended state model (on implementation level) for the state
details? This is IMHO the only way to make use of it
programmatically.
The state detail format is specified in GFD.90, as

  State details in SAGA SHOULD be formatted as follows:
    "<model>:<state>"
  with valid models being "BES", "DRMAA", or other implementation
  specific models. For example, a state detail for the BES state
  "StagingIn" would be rendered as "BES:StagingIn", and would be a
  substate of Running. If no state details are available, the metric
  is still available, but it has always an empty string value.

So, 'gLite:Pending' would be what you are looking for, and it should
be possible for the application to interpret it (it needs to have a
notion of what 'Pending' means, and needs to look at the second part
of the state detail).

The only more convenient way to expose the state detail I could
think of would be to expose the state detail's components
individually:

  state_detail_model = gLite
  state_detail_value = Pending
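
On the application side, splitting the state detail into those two
components takes only a few lines. The following Python sketch
assumes the "<model>:<state>" format quoted from GFD.90 above; the
names StateDetail and parse_state_detail are hypothetical and not
part of any SAGA binding. Since the spec only says SHOULD, a real
application may prefer to tolerate malformed values instead of
raising an error.

```python
from typing import NamedTuple, Optional


class StateDetail(NamedTuple):
    model: str  # e.g. "gLite", "BES", "DRMAA"
    value: str  # e.g. "Pending", "StagingIn"


def parse_state_detail(detail: str) -> Optional[StateDetail]:
    """Split a '<model>:<state>' string into its two components.

    Returns None for the empty string, which is the value the metric
    has when no state detail is available.
    """
    if not detail:
        return None
    # Split on the first colon only, so the state part may contain ':'.
    model, sep, value = detail.partition(":")
    if not sep or not model or not value:
        raise ValueError(f"not a '<model>:<state>' state detail: {detail!r}")
    return StateDetail(model, value)
```

An application that cares about pending jobs could then test for
parse_state_detail(d) == StateDetail("gLite", "Pending") while the
top-level state is Running.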
...
...
Also, as a last point: the more states we add to SAGA, the more
difficult it is to map to a specific backend state model (DRMAA,
AWS, local, ssh and BES come to my mind which do not have a
Pending, for example).
I don't think that this is a valid point. Why does it become more
difficult? Especially if we're talking about a state that cannot
be reached explicitly: you don't have to worry about it at all. If
SSH doesn't have a "PENDING" state, it will simply never reach it!
The state model is getting more complicated, as you need to allow
state transitions from New to Running to cater for those backends.

For example, we initially considered using the DRMAA.v1 state
model, as that was the state of the art at that point in time
(a long time ago).  DRMAA has the following states:

 UNDETERMINED, 
 QUEUED_ACTIVE, 
 SYSTEM_ON_HOLD, 
 USER_ON_HOLD, 
 USER_SYSTEM_ON_HOLD, 
 RUNNING, 
 SYSTEM_SUSPENDED, 
 USER_SUSPENDED, 
 USER_SYSTEM_SUSPENDED, 
 DONE, 
 FAILED

It turned out to be hard to map the Globus or gLite states to that
model without ending up with insanely complex state mapping rules.
Thus we went for the simplest state model possible.
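
For comparison, going the other way (from the DRMAA states listed
above onto the simple model) is a flat table. The sketch below is one
plausible reading of the "everything with a backend representation is
Running" rule, not a mapping taken from the spec or from any adaptor.
The names DRMAA_TO_SAGA and to_saga are hypothetical, and mapping
UNDETERMINED to None is an assumption meaning "no top-level state can
be derived".

```python
from typing import Optional, Tuple

# DRMAA state -> top-level state name. Queued and held jobs already
# have a backend representation, so under the rule above they count
# as Running; this choice is an assumption.
DRMAA_TO_SAGA = {
    "UNDETERMINED": None,
    "QUEUED_ACTIVE": "Running",
    "SYSTEM_ON_HOLD": "Running",
    "USER_ON_HOLD": "Running",
    "USER_SYSTEM_ON_HOLD": "Running",
    "RUNNING": "Running",
    "SYSTEM_SUSPENDED": "Suspended",
    "USER_SUSPENDED": "Suspended",
    "USER_SYSTEM_SUSPENDED": "Suspended",
    "DONE": "Done",
    "FAILED": "Failed",
}


def to_saga(drmaa_state: str) -> Tuple[Optional[str], str]:
    """Return (top-level state, state detail) for a DRMAA state.

    The state detail keeps the original backend state in the
    '<model>:<state>' format, so no information is lost.
    """
    return DRMAA_TO_SAGA[drmaa_state], f"DRMAA:{drmaa_state}"
```

The queued and held states collapse into Running at the top level,
while the state detail still tells them apart.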

Let me turn the question around: what exactly is the use case you
need the Pending state for, and why can't that be solved with the
state_detail?

Finally: if you and others strongly feel that the SAGA state model is
too simple, or the state detail is not accessible enough, we should
certainly reopen the discussion on how those are rendered in the
API.  I doubt that it would be prudent to just change our
implementation though, w/o revising the spec first.

Cheers, Andre.

-- 
Nothing is ever easy.