
Hi;

I have no doubt that it would be relatively easy to add transactional semantics to most, if not all, job schedulers. In a separate email to Ian and this mailing list I discuss the potential challenge of doing so in a manner that is efficient enough to support the "ultra-high-throughput" HPC use cases that I'm aware of. ASSUMING that it is indeed difficult to support these existing use cases, I argue that it's better to support transactional job-submission semantics as an almost universally used extension than to simply exclude the use case by requiring those semantics in the base case. As I point out in that email, my assumption may be wrong: the main scheduler vendors/suppliers may all (or mostly all) say that supporting transactional semantics is either something they already do or would have no objection to adding. In that case, we should definitely add the requirement to the base case and happily move forward.

Regarding your concern that I'm trying to define as small a base case as possible, I'm not sure how to respond. An important thing to keep in mind is that I want to define an HPC profile that covers the common HPC use cases, not just the common HPC grid use cases. If the HPC grid profile doesn't cover the common "in-house" use cases, then a second set of web service protocols will be needed to cover those cases (interoperability among heterogeneous clusters within an organization is definitely a common case). If that happens, we risk almost certain failure, because vendors will not be willing to support two separate protocol sets, and the in-house use cases are currently far more common than the grid use cases. Vendors will extend the in-house protocol set to cover grid use cases, and "grid-only" protocols will very likely be ignored.

That said, I agree with your last paragraph about the requirements for a design, namely the need for an interoperable interface subset plus a robust extensibility mechanism that covers the topics you listed.
But I will argue that transactional semantics are not a REQUIREMENT for interoperability -- merely something that in MOST cases is enormously useful.

Marvin.

-----Original Message-----
From: Karl Czajkowski [mailto:karlcz@univa.com]
Sent: Tuesday, March 21, 2006 12:50 PM
To: Marvin Theimer
Cc: Carl Kesselman; humphrey@cs.virginia.edu; ogsa-wg@ggf.org
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"

On Mar 21, Marvin Theimer modulated:
Hi;
Whereas I agree with you that at-most-once semantics are very desirable, I would like to point out that not all existing job schedulers implement them. I know that both LSF and CCS (the Microsoft HPC job scheduler) don't. I've been trying to find out whether PBS and SGE do or don't.
Aside from Ian's comment that at-most-once message semantics are potentially useful even when there is some possibility of a local failure in message processing (the handoff from the message layer to the local scheduler), I believe LSF does support "hold" states, whereby a job can be submitted and later released as a two-phase interaction. Such a mechanism is sufficient to implement complete end-to-end at-most-once submission: the message engine logs the association between the client message and a local job handle before submitting. Most schedulers also support job naming/annotation fields that are exposed through the job query interface. These fields can be used to maintain a reliable correlation between message/request IDs and the local job, and hence to synthesize at-most-once semantics in front of the scheduler by checking whether a local job with the given name already exists before resubmitting. This behavior can be hidden in the message engine and "local adapter".
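To make the pattern concrete, here is a minimal sketch of such a "local adapter". The scheduler API (`find_job_by_name`, `submit`, `release`) is hypothetical, standing in for whatever the real scheduler (e.g. LSF) exposes; the point is the ordering of the steps, not the names.

```python
# Illustrative sketch (hypothetical scheduler API) of the at-most-once
# submission pattern described above: tag each job with the client's
# request ID, check for an existing job with that tag before submitting,
# and use a two-phase hold/release handoff around the durable log write.

class LocalSchedulerAdapter:
    """Wrapper around any scheduler object that supports job naming
    and hold/release submission."""

    def __init__(self, scheduler):
        self.scheduler = scheduler  # any object with the methods used below

    def submit_at_most_once(self, request_id, job_spec):
        # 1. Idempotency check: has this request already produced a job?
        existing = self.scheduler.find_job_by_name(request_id)
        if existing is not None:
            return existing  # duplicate message; return the prior handle

        # 2. Submit in a held state so the job cannot run before the
        #    message engine has durably recorded the correlation.
        handle = self.scheduler.submit(job_spec, name=request_id, hold=True)

        # 3. Durably record request_id -> handle (the "logging in the
        #    message engine" step), then release the held job.
        self.log_correlation(request_id, handle)
        self.scheduler.release(handle)
        return handle

    def log_correlation(self, request_id, handle):
        pass  # stand-in for a write to the message engine's durable log
```

If the adapter crashes between steps 2 and 3, recovery can re-run the name lookup of step 1 and either complete or discard the held job, so no job ever runs twice for one request ID.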
So, this brings up the following slightly more general question: should the simplest base case be the simplest case that does something useful, or should it be more complicated than that? I can see good arguments on both sides:
I find it a little disconcerting that this question is still being asked about job systems, because there is a history of having made and retracted this decision before. We did it in Globus with GRAM, and I think several of the other Grid projects did as well... The subset interface is not sufficient for users. A solution MUST incorporate an interoperable subset plus a robust extensibility mechanism to allow any of:

1. incremental evolution of the core subset
2. vendor-specific localization/extension
3. community/site-specific localization/extension
4. discovery of extended mode support
5. graceful degradation in the absence of extended mode support

In my opinion, anything short of this will just add another non-interoperable interface to the hodge-podge of non-interoperable solutions that already exist.

karl

--
Karl Czajkowski
karlcz@univa.com
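Items 4 and 5 of the list above can be sketched as follows. All names here (the extension URI, the client class, how extensions are advertised) are made up for illustration; in a real profile the advertised extensions would come from a metadata/discovery query against the service.

```python
# Minimal sketch (hypothetical names throughout) of extension discovery
# with graceful degradation: the client checks whether the service
# advertises an extension and falls back to base behavior if not.

TRANSACTIONAL_SUBMIT = "urn:example:ext:transactional-submit"  # made-up URI

class JobServiceClient:
    def __init__(self, advertised_extensions):
        # Assumed to be populated from a discovery query in practice.
        self.extensions = set(advertised_extensions)

    def supports(self, extension_uri):
        return extension_uri in self.extensions

    def submit(self, job_spec, request_id=None):
        if self.supports(TRANSACTIONAL_SUBMIT) and request_id is not None:
            # Extended mode: the service guarantees at-most-once submission.
            return ("transactional", request_id, job_spec)
        # Base mode: plain submit; the caller must tolerate possible
        # duplicates (e.g. by querying job state before retrying).
        return ("plain", None, job_spec)
```

The interoperable subset is whatever `submit` does in the base branch; everything else is negotiated per-service, which is what keeps vendor and site extensions from fragmenting the core protocol.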