Hmm, I probably shouldn’t send long
messages from my blackberry. Hopefully, you can get through the typos and get
my point.
Carl
Dr. Carl Kesselman
USC/Information Sciences Institute
4676 Admiralty Way, Suite 1001
email: carl@isi.edu
WWW: http://www.isi.edu/~carl
Phone: (310) 448-9338
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org] On Behalf Of Carl Kesselman
Sent: Thursday, March 16, 2006 12:49 AM
To: theimer@microsoft.com
Cc: humphrey@cs.virginia.edu; ogsa-wg@ggf.org
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi,
In the interest of furthering agreement, I was not arguing that the application
had to be restartable. Rather, what has been shown to be important is that the
protocol be restartable in the following sense: if you submit a job and
the far-end server fails, is the job running or not? If you resubmit, do you
get another job instance? The GT submission protocol and Condor have
transactional semantics, so that you get at-most-once submit semantics
regardless of client and server failures. The fact that your application may
be non-idempotent is exactly why having well-defined semantics in this case is
important.
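To illustrate the idea, here is a minimal Python sketch of one common way to
obtain at-most-once submission semantics: the client mints the submission ID,
so retries after a failure cannot create a second job instance. This is not
the actual GT or Condor protocol, just the shape of the technique; the server
contract it assumes is spelled out in the comments.

    import uuid

    class JobSubmissionClient:
        """Sketch of a submission protocol with at-most-once semantics."""

        def __init__(self, server):
            # `server` is any object honoring the assumed contract below.
            self.server = server

        def submit(self, job_description, max_retries=3):
            # The client, not the server, mints the submission ID. If the
            # server fails after creating the job but before replying, a
            # retry with the same ID is recognized as a duplicate rather
            # than starting a second (possibly non-idempotent) job instance.
            submission_id = str(uuid.uuid4())
            for _ in range(max_retries):
                try:
                    # Assumed server contract: submit() is idempotent in
                    # submission_id -- it either creates the job or returns
                    # the handle of the job already created under that ID.
                    return self.server.submit(submission_id, job_description)
                except ConnectionError:
                    continue  # safe to retry: same ID, at most one job
            raise RuntimeError(
                "submission state unknown; query the server by submission_id")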
So what is the next step?
Carl
Dr. Carl Kesselman
USC/Information Sciences Institute
4676 Admiralty Way, Suite 1001
email: carl@isi.edu
WWW: http://www.isi.edu/~carl
Phone: (310) 448-9338
-----Original Message-----
From: Marvin Theimer <theimer@microsoft.com>
To: Carl Kesselman <carl@isi.edu>
CC: Marvin Theimer <theimer@microsoft.com>; Marty Humphrey <humphrey@cs.virginia.edu>; ogsa-wg@ggf.org <ogsa-wg@ggf.org>
Sent: Wed Mar 15 14:26:36 2006
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi;
I suspect that we’re mostly in agreement on things. In particular,
I think your list of four core aspects is a great starting point for a
discussion on the topic.
I just replied to an earlier email with the following list of activities:
· Identification of the simplest base case that everyone will have to implement.
· Identification of common cases we want to optimize.
· Identification of how evolution and selective extension will work.
I totally agree with you that the base use case I described isn’t really
a “grid” use case. But it is an HPC use case – in fact
it is arguably the most common use case in current existence. :-) So I
think it’s important that we understand how to seamlessly integrate and
support that common – and very simple – use case.
I also totally agree with you that we can’t let a solution to the
simplest HPC use case paint us into a corner that prevents supporting the
richer use cases that grid computing is all about. That’s why
I’d like to spend significant effort exploring and understanding the
issues of how to support evolution and selective extension. In an ideal
world a legacy compute cluster job scheduler could have a simple “grid
shim” that let it participate at a basic level, in a natural manner, in a
grid environment, while smarter clients and HPC services could interoperate
with each other in various selectively richer manners by means of extensions to
the basic HPC grid design.
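To make the “grid shim” idea concrete, here is a rough Python sketch of what
such a shim might look like, assuming (purely for illustration) a PBS-style
legacy scheduler driven through its qsub/qstat/qdel command-line tools; the
state mapping is invented.

    import subprocess

    class LegacySchedulerShim:
        """Illustrative "grid shim": exposes basic submit/status/cancel
        operations on top of a legacy PBS-style scheduler's command-line
        tools, letting it participate in a grid at a basic level."""

        # Map legacy scheduler states onto a minimal state model.
        STATE_MAP = {"Q": "queued", "R": "running", "C": "finished"}

        def submit(self, script_path):
            # The job ID printed by qsub doubles as the grid-level job ID.
            out = subprocess.run(["qsub", script_path],
                                 capture_output=True, text=True, check=True)
            return out.stdout.strip()

        def status(self, job_id):
            out = subprocess.run(["qstat", "-f", job_id],
                                 capture_output=True, text=True, check=True)
            for line in out.stdout.splitlines():
                if "job_state" in line:
                    legacy_state = line.split("=")[1].strip()
                    return self.STATE_MAP.get(legacy_state, "unknown")
            return "unknown"

        def cancel(self, job_id):
            subprocess.run(["qdel", job_id], check=True)

A shim like this is the floor, not the ceiling: smarter clients and services
would interoperate through the richer extensions.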
One place where I disagree with you is your assertion that everything needs to
be designed to be restartable. While that’s a good goal to pursue, I’m
not convinced that you can achieve it in all cases. In
particular, there are at least two cases that I claim we want to support that
aren’t restartable:
· We want to be able to run
applications that aren’t restartable; for example, because they perform
non-idempotent operations on the external physical environment. If such
an application fails during execution then the only one who can figure out what
the proper next steps are is the end user.
· We want to be able to include
(often-times legacy) systems that aren’t fault tolerant, such as simple
small compute clusters where the owners didn’t think that fault tolerance
was worth paying for.
Of course any acceptable design will have to enable systems that are fault
tolerant to export/expose that capability. To my mind it’s more a
matter of ensuring that non-fault-tolerant systems aren’t excluded from
participation in a grid.
Other things we agree on:
· We should certainly examine what
remote job submission systems do. We should certainly look at existing
systems like Globus, Unicore, and Legion. In general, we should be
looking at everything that has any actual experience that we can learn from and
everything that is actually deployed and hence represents a system that we
potentially need to interoperate with. (Whether a final design is
actually able to interoperate at any but the most basic level with various
exotic existing systems is a separate issue.)
· We should absolutely focus on
codifying what we know how to do and avoid doing research as part of a
standards process. I believe that thinking carefully about how to support
evolution and extension is our best hope for allowing people to defer trying to
bake their pet research topic into standards since it provides a story for why
today’s standards don’t preclude tomorrow’s improvements.
So I would propose that next steps are:
· Continue to explore and classify
various HPC use cases of various differing levels of complexity.
· Describe the requirements –
and limitations – of existing job scheduling and remote job submission
systems.
· Continue identifying and discussing
key “features” of use cases and potential design solutions, such as
the four that you identified in your last email.
Marvin.
________________________________
From: Carl Kesselman [mailto:carl@isi.edu]
Sent: Tuesday, March 14, 2006 7:50 AM
To: Marty Humphrey; ogsa-wg@ggf.org
Cc: Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi,
Just to be clear, I’m not trying to suggest that the scope be expanded. I
agree that the approach of focusing on a baby step is a good one, and I am in
total agreement with many of the assumptions stated in Marvin’s list.
However, in taking baby steps I think it is important that we end up
walking; in defining the use case, one can easily create solutions
that will not get you to the next step. This is my point about looking at what
we know how to do and have been doing in production settings for many years
now. In my mind, one of the problems of overly grand scope has been
far too little focus on codifying what we know how to do, in favor of using
a standards process as an excuse to design new things. So at the risk of
sounding partisan, the simplified use case that Marvin is proposing is exactly
the use case that GRAM has been handling for over ten years now (I think the
same can be said about UNICORE and Legion).
So let me try to be constructive. One of the things that falls out
of Marvin’s list could be a set of basic concepts/operations that need to
be defined (a rough sketch follows the list). These include:
1) A way of describing “local” job configuration, i.e. where to
find the executable, data files, etc. This should be very conservative in its
assumptions about shared file systems and accessibility. In general, what needs
to be stated here is which aspects of the underlying resource are exposed to
the outward-facing interface.
2) A way of naming a submission point (should probably have a way of modeling queues).
3) A core set of job management operations: submit, status, kill. These need to
be defined in such a way as to be tolerant of a variety of failure scenarios,
in that the state needs to be well defined in the case of failure.
4) A state model that one can use to describe what is going on with the jobs,
and a way to access that state. It can be simple (queued, running, done)
but may need to be extensible. One can view the accounting information as
being exposed through this state model as well.
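To make these four items concrete, here is a minimal Python sketch of the
shape they might take together. It is illustrative only: the names are
invented, and the three-state model is just the simple one from item 4.

    from dataclasses import dataclass, field
    from enum import Enum

    # (1) "Local" job configuration, conservative about shared file
    # systems: everything the service needs is stated explicitly.
    @dataclass
    class JobConfiguration:
        executable: str                  # path visible on the resource
        arguments: list[str] = field(default_factory=list)
        data_files: list[str] = field(default_factory=list)

    # (4) A simple state model (queued, running, done); extensibility
    # would mean allowing states beyond these three.
    class JobState(Enum):
        QUEUED = "queued"
        RUNNING = "running"
        DONE = "done"

    @dataclass
    class JobStatus:
        state: JobState
        accounting: dict = field(default_factory=dict)  # e.g. wall-clock time

    # (2) A named submission point, optionally modeling a queue.
    class SubmissionPoint:
        def __init__(self, endpoint_url, queue=None):
            self.endpoint_url = endpoint_url
            self.queue = queue

        # (3) Core operations; each must leave the job state well
        # defined under client or server failure.
        def submit(self, config):
            """Returns a job ID; needs at-most-once semantics."""
            raise NotImplementedError

        def status(self, job_id):
            raise NotImplementedError

        def kill(self, job_id):
            """Must be safe to repeat (idempotent)."""
            raise NotImplementedError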
So, one thing to do would be to agree that these are (or are not) the right
four things that need to be defined and, if so, start to flesh these out in a
way that supports the core use case but doesn’t introduce assumptions
that would preclude more complex use cases in the future.
Carl
________________________________
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Marty Humphrey
Sent: Tuesday, March 14, 2006 6:32 AM
To: ogsa-wg@ggf.org
Cc: 'Marvin Theimer'
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Carl,
Your comments are very important. We would love to have your active
participation in this effort. Your experience is, of course, matched by few!
I re-emphasize that this represents (my words, not anyone else’s)
“baby steps” that are necessary and important for the Grid
community. In my opinion, the biggest challenge will be to fight the urge
to expand the scope beyond a small size. You cannot ignore the possibility
that the GGF has NOT made as much progress as it should have to date, and one
plausible explanation is that the scope has been too grand.
-- Marty
________________________________
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Carl Kesselman
Sent: Tuesday, March 14, 2006 8:47 AM
To: Marvin Theimer; Ian Foster; ogsa-wg@ggf.org
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi,
While I have no wish to engage in the “what is a Grid” argument,
there are some elements of your base use case that I would be concerned
about. Specifically, the assumption that the submission is into a
“local cluster” on which there is an existing account may lead one
to a solution that does not generalize to submission across autonomous
policy domains. I would also argue that ignoring issues of fault
tolerance from the beginning is problematic. One must at least design
operations that are restartable (for example, at-most-once submission
semantics).
I would finally suggest that while examining existing job scheduling systems
is a good thing to do, we should also examine existing remote submission
systems (dare I say Grid systems). The basic HPC use case is one in which
there is a significant amount of implementation and usage experience.
Thanks,
Carl
________________________________
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Marvin Theimer
Sent: Monday, March 13, 2006 2:42 PM
To: Ian Foster; ogsa-wg@ggf.org
Cc: Marvin Theimer
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi;
Ian, you are correct that I view job submission to a cluster as being one of
the simplest, and hence most basic, HPC use cases to start with. Or, to
be slightly more general, I view job submission to a “black box”
that can run jobs – be it a cluster or an SMP or an SGI NUMA machine or
what-have-you – as being the simplest and hence most basic HPC use case
to start with. The key distinction for me is that the internals of the
“box” are for the most part not visible to the client, at least as
far as submitting and running compute jobs is concerned. There may well
be a separate interface for dealing with things like system management, but I
want to explicitly separate those things out in order to allow for use of
“boxes” that might be managed by proprietary means or by means
obeying standards that a particular job submission client is unfamiliar with.
I think the use case that Ravi Subramaniam posted to this mailing list back on
2/17 is a good one to start a discussion around. However, I’d like
to present it from a different point-of-view than he did. The manner in
which the use case is currently presented emphasizes all the capabilities and
services needed to handle the fully general case of submitting a batch job to a
computing utility/service. That’s a great way of producing a
taxonomy against which any given system or design can be compared to see what
it has to offer. I would argue that the next step is to ask what’s
the simplest subset that represents a useful system/design and how should one
categorize the various capabilities and services he has identified so as to
arrive at meaningful components that can be selectively used to obtain progressively
more capable systems.
Another useful exercise to do is to examine existing job scheduling systems in
order to understand what they provide. Since in the real world we will
have to deal with the legacy of existing systems it will be important to understand
how they relate to the use cases we explore. In the same vein, it will be
important to take into account and understand other existing infrastructures
that people use that are related to HPC use cases. I’m thinking of
things like security infrastructures, directory services, and so forth.
From the point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent to which
existing infrastructure and services can be reused rather than reinvented.
To kick off a discussion around the topic of a minimalist HPC use case, I
present a straw man description of such below and then present a first attempt
at categorizing various areas of extension. The categorization of
extension areas is not meant to be complete or even all that carefully
thought-out as far as componentization boundaries are concerned; it is merely
meant to be a first contribution to get the discussion going.
A basic HPC use case: Compute cluster embedded within an organization.
· This is your basic batch job scheduling
scenario. Only a very basic state transition diagram is visible to the
client, with the following states for a job: queued, running, finished.
Additional states -- and associated state transition request operations and
functionality -- are not supported. Examples of additional states and
associated functionality include suspension of jobs and migration of jobs.
· Only "standard" resources can be described,
for example: number of cpus/nodes needed, memory requirements, disk
requirements, etc. (think resources that are describable by JSDL).
· Once a job has been submitted it can be cancelled,
but its resource requests can't be modified.
· A distributed file system is accessible from client
desktop machines and client file servers, as well as compute nodes of the
compute cluster. This implies that no data staging is required, that
programs can be (for the most part) executed from existing file system
locations, and that no program "provisioning" is required (since you
can execute them from wherever they are already installed). Thus in this
use case all data transfer and program installation operations are the
responsibility of the user.
· Users already have accounts within the existing security
infrastructure (e.g. Kerberos). They would like to use these and not have
to create/manage additional authentication/authorization credentials (at least
at the level that is visible to them).
· The job scheduling service resides at a well-known
network name and it is aware of the compute cluster and its resources by
"private" means (e.g. it runs on the head node of the cluster and
employs private means to monitor and control the resources of the
cluster). This implies that there is no need for any sort of directory
services for finding the compute cluster or the resources it represents other
than basic DNS.
· Compute cluster system management is opaque to users
and is the concern of the compute cluster's owners. This implies that
system management is not part of the compute cluster's public job scheduling
interface. This also implies that there is no need for a logging
interface to the service. I assume that application-level logging can be
done by means of libraries that write to client files; i.e. that there is no
need for any sort of special system support for logging.
· A simple polling-based interface is the simplest form
of interface to something like a job scheduling service. However, a
simple call-back notification interface is a very useful addition that
potentially provides substantial performance benefits, since it can avoid
a lot of unnecessary network traffic. Only job state changes
result in notification messages. (A sketch of this, together with job
lifetimes, follows this list.)
· There are no notions of fault tolerance. Jobs
that fail must be resubmitted by the client. Neither the cluster head
node nor its compute nodes are fault tolerant. I do expect the client
software to return an indication of failure-due-system-fault when appropriate.
(Note that this may also occur when things like network partitions occur.)
· One does need some notion of how to deal with
orphaned resources and jobs. The notion of job lifetime and
post-expiration garbage collection is a natural approach here.
· The scheduling service provides a fixed set of
scheduling policies, with only a few basic choices (or maybe even just one),
such as FIFO or round-robin. There is no notion, in general, of SLAs
(which are a form of scheduling policy).
· Enough information must be returned to the client
when a job finishes to enable basic accounting functionality. This means
things like total wall-clock time the job ran and a summary of resources
used. There is not a need for the interface to support any sort of
grouping of accounting information. That is, jobs do not need to be
associated with projects, groups, or other accounting entities and the job
scheduling service is not responsible for tracking accounting information
across such entities. As long as basic resource utilization information
is returnable for each job, accounting can be done externally to the job
scheduling service. I do assume that jobs can be uniquely identified by
some means and can be uniquely associated with some principal entity existing
in the overall system, such as a user name.
· Just as there is no notion of requiring the job
scheduling service to track any but the most basic job-level accounting
information, there is no notion of the service enforcing quotas on jobs.
· Although it is generally useful to separate the
notions of resource reservation from resource usage (e.g. to enable interactive
and debugging use of resources), it is not a necessity for the most basic of
job scheduling services.
· There is no notion of tying multiple jobs together,
either to support things like dependency graphs or to support things like
workflows. Such capabilities must be implemented by clients of the job
scheduling service.
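Two of the mechanisms in the list above lend themselves to a short
illustration: the call-back notification interface (notify only on state
changes, instead of polling) and job lifetime with post-expiration garbage
collection (to deal with orphaned jobs). The Python sketch below is
illustrative only; all names are invented and nothing here comes from an
existing specification.

    import time

    class Job:
        def __init__(self, job_id, lifetime_seconds):
            self.job_id = job_id
            self.state = "queued"
            # A lifetime bounds how long the job's record may linger;
            # after expiration the service may garbage-collect it even
            # if the submitting client has vanished.
            self.expires_at = time.time() + lifetime_seconds
            self.callbacks = []

        def subscribe(self, callback):
            # Call-back notification: the client registers once instead
            # of repeatedly polling for status.
            self.callbacks.append(callback)

        def set_state(self, new_state):
            # Only actual state changes generate notification messages,
            # which is where the network-traffic savings come from.
            if new_state != self.state:
                self.state = new_state
                for cb in self.callbacks:
                    cb(self.job_id, new_state)

    def garbage_collect(jobs):
        # Drop records of expired (possibly orphaned) jobs.
        now = time.time()
        return [j for j in jobs if j.expires_at > now]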
Interesting extension areas:
· Additional scheduling policies
o Weighted fair-share, …
o Multiple queues
o SLAs
o ...
· Extended resource descriptions
o Additional resource types, such as GPUs
o Additional types of compute resources, such as
desktop computers
o Condor-style class ads
· Extended job descriptions (as returned to
requesting clients and sys admins)
· Additional classes of security credentials
· Reservations separated from execution
o Enabling interactive and debugging jobs
o Support for multiple competing schedulers (incl.
desktop cycle stealing and market-based approaches to scheduling compute
resources)
· Ability to modify jobs during their existence
· Fault tolerance
o Automatic rescheduling of jobs that failed due to
system faults
o Highly available resources: This is partly a
policy statement by a scheduling service about its characteristics and partly
the ability to rebind clients to migrated service endpoints
· Extended state transition diagrams and
associated functionalities
o Job suspension
o Job migration
o …
· Accounting & quotas
· Operating on arrays of jobs
· Meta-schedulers, multiple schedulers, and
ecologies and hierarchies of multiple schedulers
o Meta-schedulers
· Hierarchical job scheduling with a
meta-scheduler as the only entry point; forwarding jobs to the meta-scheduler
from other subsidiary schedulers
o Condor-style matchmaking
· Directory services
o Using existing directory services
o Abstract directory service interface(s)
· Data transfer topics
o Application data staging
· Naming
· Efficiency
· Convenience
· Cleanup
o Program staging/provisioning
· Description
· Installation
· Cleanup
Marvin.
________________________________
From: Ian Foster [mailto:foster@mcs.anl.gov]
Sent: Monday, February 20, 2006 9:20 AM
To: Marvin Theimer; ogsa-wg@ggf.org
Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey; gcf@grids.ucs.indiana.edu
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Dear All:
The most important thing to understand at this point (IMHO) is the scope of
this "HPC use case," as this will determine just how minimal we can
be.
I get the impression that the principal goal may be "job submission to a
cluster." Is that correct? How do we start to circumscribe the scope more
explicitly?
Ian.
At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:
Enclosed is a paper that advocates an additional set of activities that the
authors believe that the OGSA working groups should engage in.
Broadly speaking, the OGSA and related working groups are already doing a bunch
of important things:
· There is broad exploration of
the big picture, including enumeration of use cases, taxonomy of areas,
identification of research issues, etc.
· There is work going on in each of the horizontal areas that have been
identified.
· There is work going on around individual specifications, such as BES,
JSDL, etc.
Given that individual specifications are beginning to come to fruition, the
authors believe it is time to also start defining vertical profiles that
precisely describe how groups of individual specifications should be employed
to implement specific use cases in an interoperable manner. The authors
also believe that the process of defining these profiles offers an opportunity
to close the design loop by relating the various on-going protocol and standards
efforts back to the use cases in a very concrete manner. This provides an
end-to-end setting in which to identify holes and issues that might require
additional protocols and/or (incremental) changes to existing protocols.
The paper introduces the general notion of doing focused vertical design
efforts and then focuses on a specific vertical design effort, namely a minimal
HPC design.
The paper derives a specific HPC design in a first-principles manner since the
authors believe that this increases the chances of identifying issues. As
a consequence, existing specifications and the activities of existing working
groups are not mentioned and this paper is not an attempt to actually define a
specifications profile. Also, the absence of references to existing work
is not meant to imply that such work is in any way irrelevant or
inappropriate. The paper should be viewed as a first abstract attempt to
propose a new kind of activity within OGSA. The expectation is that
future open discussions and publications will explore the concrete details of
such a proposal.
This paper was recently sent to a few key individuals in order to get feedback
from them before submitting it to the wider GGF community. Unfortunately
that process took longer than intended and some members of the community may
have already seen a copy of the paper without knowing the context within which
it was written. This email should hopefully dispel any misconceptions that may
have occurred.
For those people who will be around for the F2F meetings on Friday, Marvin
Theimer will be giving a talk on the contents of this paper at a time and place
to be announced.
Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey, Geoffrey Fox
_______________________________________________________________
Ian Foster                          www.mcs.anl.gov/~foster
Math & Computer Science Div.        Dept of Computer Science
Argonne National Laboratory         The University of Chicago
Tel: 630 252 4619
Fax: 630 252 1997
Globus Alliance, www.globus.org