Re: [graap-wg] A Highly available, Fault tolerant Co-scheduling System

24 Oct 2005

Hi Karl,

I guess that, ultimately, I don't have a lot of enthusiasm for the
combination you propose in the message.  The messaging in Paxos has
been well thought out.  I don't think layering it on top of the
WS-Agreement call-response pattern would work well (at least, it would
not work as well in failure cases; when everything is working it'd
be fine).  Also, it uses more messages than are necessary.

The devil, as always, is in the details.  Indeed, it's when you
consider the messaging patterns that you see the problems with what  
you proposed before about using WS-Agreement between the Acceptors  
and RMs.  Although you wouldn't make Paxos inconsistent (this is  
impossible), you would raise the chances of the transaction being  
needlessly aborted in certain failure modes.

As for mentioning the user, I make no apology for this.  I don't  
think that it's "telling" at all.  The protocol we proposed can do  
machine-to-machine stuff.  I know this, because I have written  
clients for the protocol, and it's very easy.  It should be clear  
from the slides I presented that this is the case.  (There is
certainly nothing as complicated as the Template stuff in
WS-Agreement, where a client has to download a template, understand the
format and then construct a document based upon that.)

Cheers,

Jon.

On Oct 11, 2005, at 12:03 AM, Karl Czajkowski wrote:
...
On Oct 10, Jon MacLaren modulated:
...
Karl,
Thanks for the email.  I was sorry that you weren't at the
presentation.
I've replied to stuff inline below.  However, a couple of general
points/observations.
First, what would I gain from using WS-Agreement in the way you
propose?  At the moment, we have a nice co-allocation scheme, where the
co-allocators don't need to know anything about the payload of the
message - it can even be encrypted.  (An important separation, in my
opinion.)  Also, my scheme currently uses XML over HTTP.  It could use
WS-I, if we wanted to add SOAP.  But it's just XML messages,
essentially.  If WS-Agreement was "merged" with this scheme, I'd then
need to use WS-RF, which not everyone is amenable to.  (In any case,
the impression from reading this below is that to use WS-Agreement in
this way feels like a bit of a hack.)
<putting on my GRAAP-takes-the-world hat>
What you would gain is access to a world full of resources managed by
WS-Agreement!
<taking off my hat>
What the community would gain is precisely the goal of GRAAP: to
improve resource federation in practice by having common/normalized RM
services that are able to support a range of different virtual
organizations and distributed management strategies.
...
Also, below, you are talking about using WS-Agreement as the protocol
between the entity doing the co-scheduling and the resource managers
(RMs).  I'm not envisaging the user doing co-scheduling directly - it's
complex, and I'd rather encapsulate it, as in my implementation.  If
you had a service doing this for you, your scheme would need two levels
of WS-Agreement: between the user and the co-scheduler, and between the
co-scheduler and the RMs.  Is this what you imagine, or do you think the
user should take this on directly?
Yes, I imagine two (or more) levels.
I find it telling that you brought "user" into the picture.  The core
thrust of SNAP and all my inputs to GRAAP have been on the
machine-to-machine RM agreement angle.  I completely expect to see
WS-Agreement in the layer between brokers and resources.  That it
might also serve in the layer between users/clients and the "first"
broker is almost accidental.  I say almost, because the ability for
the model to recurse cleanly was always important in its design.
...
I remember.  As I pointed out, this hides the nature of what is going
on (co-allocation) from the resource manager (RM).  In my
implementation, the RM is aware of the difference between a
prepare->prepared->abort sequence and a normal reserve followed by
cancellation.  Hiding this difference, as you propose, has implications
for charging schemes, allocation quotas, etc. - it is, I believe, too
restrictive.
I agree it has implications, but I think it is also a prerequisite to
true federation of resources.  Just as I make reservations with
multiple providers when I co-allocate my travel itinerary, I think
users (or their agents/services) in a large-scale Grid are going to
have to make agreements with autonomous resource providers.  These
resource providers may have no interest (or trust) in each other, but
only in their relationship to the user/consumer.
I would point out though, that the domain-specific semantics of the
agreement could be extended to include contextual information about
the co-allocation goals.  This could be useful for audit,
authorization, or even differentiated pricing if the co-allocation is
being managed by a broker/transaction manager that the resources trust
to do a good job and not cause thrashing!  On the other extreme, you
could imagine some RMs who _only_ want to hear from accredited
brokers, because they do not trust end-users to make worthwhile
requests.
...
You can't rely on the creation of the RP thing in order to discover the
decision later on.  What if the RM is down?
I guess I don't understand Paxos from my quick reading... how is the
RM being down any different than message loss that is supposed to be
tolerated?  Any remote process has a tri-state understanding of the
RM's status in the transaction: prepared, not prepared, or unknown.
Right?
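
The three-way view of a remote RM described above can be written down
directly.  This is only a minimal sketch and not part of either
proposal: the names RMStatus and observed_status are hypothetical, and
it assumes the observer either receives a yes/no reply or receives
nothing.  It shows why a crashed RM and a lost message look the same
from outside: both leave the observer in the "unknown" state.

```python
from enum import Enum
from typing import Optional


class RMStatus(Enum):
    """A remote process's view of one RM within a transaction."""
    PREPARED = "prepared"
    NOT_PREPARED = "not prepared"
    UNKNOWN = "unknown"


def observed_status(reply: Optional[bool]) -> RMStatus:
    """Map what was actually heard from the RM to a tri-state view.

    reply is True or False if a response arrived, and None if nothing
    arrived (RM down, message lost, or message still in flight; the
    observer cannot tell these apart).
    """
    if reply is None:
        return RMStatus.UNKNOWN
    return RMStatus.PREPARED if reply else RMStatus.NOT_PREPARED
```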
I think it would be interesting to see how something like Paxos can be
layered on top of "normal" RM messages instead of deploying
Paxos-specific entities to each resource.  It's all just a matter of
syntax, assuming the semantics of preparation and commitment can be
mapped appropriately.  I understand there could be significant legwork
to re-validate the formal proofs about Paxos, and no I am not
volunteering. :-)
...
This is true.  However, Paxos handles message delays/non-arrival by
having subsequent ballots.  It recovers automatically from this - it
doesn't just block.  So individual messages being delayed is not a
problem.  For Paxos not to make progress, you need to engineer a
situation where there is no majority of acceptors still working.  What
do you think the chances are of messages being systematically delayed
between a number of processes?
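
The progress condition described above (retry with further ballots
until a majority of acceptors answers) can be illustrated with a toy
model.  This is a sketch under strong assumptions and is not Paxos
itself: there are no proposal numbers, values, or leader election, and
the function names and acceptor labels are made up for illustration.
It only shows that a delayed message costs an extra ballot rather than
blocking, provided a majority eventually responds.

```python
from typing import Iterable, Optional, Sequence


def majority(n_acceptors: int) -> int:
    """Smallest number of acceptors that forms a majority."""
    return n_acceptors // 2 + 1


def first_successful_ballot(
    responders_per_ballot: Sequence[Iterable[str]], n_acceptors: int
) -> Optional[int]:
    """Return the index of the first ballot in which a majority of
    acceptors responded, or None if no ballot reached a majority."""
    needed = majority(n_acceptors)
    for ballot, responders in enumerate(responders_per_ballot):
        if len(set(responders)) >= needed:
            return ballot
    return None


# Hypothetical run with 5 acceptors: in ballot 0 only two replies arrive
# in time, so a second ballot is started, and three replies arrive.
ballots = [["a1", "a2"], ["a1", "a3", "a5"]]
chosen = first_successful_ballot(ballots, n_acceptors=5)
```

In this run the second ballot (index 1) succeeds.  If every ballot in
the list had fewer than three responders, the function would return
None, which corresponds to the "no majority of acceptors still working"
case.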
It isn't a concern of blocking that I raised.  It is that
co-allocation of real resources, e.g. simultaneous use of computers,
is based in wall-clock time and the abstract ACID transactions only
hold water as long as the commit phase completes before the actual
wall-clock time when the co-allocation is meant to commence.  You
cannot "fix it up" with persistent logs after the fact, if the time
actually elapsed and the resources were not operating in the allocated
mode. (Conversely, you cannot do speculative allocation and then undo
in the event of a transaction abort. There is an opportunity cost
either way.)
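
The wall-clock constraint above can be stated as a one-line check.
This is a minimal sketch with hypothetical names (commit_is_useful,
decided_at, reservation_start); it assumes all parties share a
comparable clock, which neither message claims.  It illustrates that a
commit decision reached after the reserved start time is of no use,
however correct the transaction is in the abstract.

```python
from datetime import datetime, timedelta


def commit_is_useful(decided_at: datetime, reservation_start: datetime) -> bool:
    """A co-allocation commit only helps if the decision is reached
    no later than the wall-clock start of the reserved slot."""
    return decided_at <= reservation_start


# Hypothetical reservation starting at 12:00.
start = datetime(2005, 10, 24, 12, 0)
on_time = commit_is_useful(start - timedelta(minutes=5), start)
too_late = commit_is_useful(start + timedelta(minutes=5), start)
```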
This is the core concern I have been raising for years now about the
inherent hazard of distributed resource management.  I keep raising it
because I think people focus on the wrong kinds of "reliability" and
"correctness" metrics when talking about things like co-scheduling of
distributed computations and data-paths. I worry that we're somehow
talking past each other if you still think I am just talking about
"progress" in the abstract transactional sense.  Failed transactions
that lead to idle resources are a potential livelock hazard for the
resource operators, no matter how elegantly the consensus problem is
phrased to suggest it completed. :-)
One solution to this problem is markets: specifically by having cost
models for reservation and cancellation, market forces can push the
risks out to the coordinators who are trying to make risky
transactions.  This gives them incentive to act as efficiently as they
know how.
...
If you crunch the numbers on all these failures (I used an example of
acceptors being inoperable for one hour out of 24 hours), you find that
the likelihood of a 5-acceptor Paxos round blocking is very, very small
(once in a number of years).
That's good enough for me.
Jon.
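
The calculation behind this estimate is not given in the message, so
the following is only a sketch of one way to approach it.  It assumes
each acceptor is down independently with probability 1/24 (one hour out
of 24) and computes the chance that, at a given instant, fewer than a
majority of the 5 acceptors are up.  The function name is hypothetical.
Turning this snapshot probability into "once in a number of years"
would need further assumptions (outage durations, how often rounds are
run) that the message does not state.

```python
from math import comb


def prob_no_majority(n: int, p_down: float) -> float:
    """Probability that fewer than a majority of n acceptors are up,
    assuming each is down independently with probability p_down."""
    majority = n // 2 + 1
    max_down_ok = n - majority  # failures that still leave a majority
    return sum(
        comb(n, k) * p_down**k * (1 - p_down) ** (n - k)
        for k in range(max_down_ok + 1, n + 1)
    )


# 5 acceptors, each down 1 hour in 24: three or more must be down at once.
p = prob_no_majority(5, 1 / 24)
```

Under these assumptions p is 5406/7962624, roughly 6.8e-4.  The
independence assumption is what drives the result: if failures are
correlated (for example, acceptors sharing a network link), the figure
no longer applies.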
I have my doubts about the failure model, e.g. pairwise messaging
failures between an RM and an acceptor are not really independent in
the Internet, since quite likely a large number of acceptors are using
the same network links to talk to a particular RM.  Partitioning can
be unkind.
However, I am not really trying to debate the pros or cons of Paxos
per se, but to understand how we can get to a world where standard,
normalized, and interoperable RM services can be deployed and shared
by different brokers, VOs, and coordination strategies.  I think the
architecture should be very agnostic and policy-free so that different
policies and "markets" can evolve.  Your pursuit of this other
coordination strategy makes you an interesting candidate to talk to
about WS-Agreement mechanisms... it isn't so interesting to preach to
the choir.
I think the future of Grid computing is in the human policies and
federating models, and not in the plumbing.  The plumbing just needs
to be there and be well behaved, without obstructing the kinds of
experimental and production policies that organizations wish to
deploy.
karl
-- 
Karl Czajkowski
karlcz@univa.com