It may well be that things fit nicely into what BES and RSS are targeting, and that would be great. But before we get to that discussion, I’d still like to spend some more time looking at use cases. In particular, there are several goals that I think we need to pursue via an examination of use cases:
· Identify the simplest case that we expect everyone to support. The simpler this case is, the easier it will be for interested parties – including, most importantly, existing legacy parties – to participate in an HPC grid environment, at least at some basic level.
· Identify common cases. One complaint I’ve repeatedly heard from various people in the distributed systems and grid communities is that most existing and proposed solutions don’t make it easy to implement or employ the common cases. We should be identifying which use cases represent important common cases so that we can then ask whether and how our solutions will make those as easy as possible to do.
· Understand how various more complicated use cases relate to each other. I assert that the key to designing a successful HPC grid solution will be to come up with an approach that enables both evolution and selective extension to occur in a straightforward and well-defined manner. The combination of a simple base case and a solid approach to evolution and extensibility will enable everyone to participate while allowing organic growth of richer functionality among those parties that need it or are interested in it.
So, I propose that the next step we should take is to start exploring various extended HPC use cases, with the intent of identifying common cases and the
relationships among them. My bulleted list of extensions was intended as
an initial unprioritized list of extension subjects, derived in part from the
compendium of capabilities listed in the use case you originally posted.
I will post my own thoughts on common cases and relationships among extensions
in a future email.
Marvin.
From: Subramaniam, Ravi [mailto:ravi.subramaniam@intel.com]
Sent: Monday, March 13, 2006 8:40 PM
To:
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi Marvin,
Thanks for documenting and explaining what you see as the basic HPC case. I agree with your observations, and what you have listed is what most, if not all, of the HPC cluster systems I know of implement.
I think there is something else that may be apparent. If I may suggest, the bulleted list, with the exception of the “11th (?)” bullet, would fit nicely into the BES situation. If I look at the cluster you describe (i.e. the client perspective with queued, running and finished), it seems to me that it would be a BES container (at least from an interface perspective, where the scheduler that you are submitting the jobs to is the abstract representation of the BES “container”). We had discussed this recursive container concept in the
Thanks!
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org] On Behalf Of
Sent: Monday, March 13, 2006 2:42 PM
To: Ian Foster; ogsa-wg@ggf.org
Cc:
Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Hi;
Ian, you are correct that I view job submission to a cluster as being
one of the simplest, and hence most basic, HPC use cases to start with.
Or, to be slightly more general, I view job submission to a “black
box” that can run jobs – be it a cluster or an SMP or an SGI NUMA
machine or what-have-you – as being the simplest and hence most basic HPC
use case to start with. The key distinction for me is that the internals
of the “box” are for the most part not visible to the client, at
least as far as submitting and running compute jobs is concerned. There
may well be a separate interface for dealing with things like system
management, but I want to explicitly separate those things out in order to
allow for use of “boxes” that might be managed by proprietary means
or by means obeying standards that a particular job submission client is
unfamiliar with.
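To make that separation concrete, here is a minimal sketch of the idea in Python. All of the names (JobSubmission, SystemManagement, submit, cancel, status) are hypothetical illustrations, not drawn from BES, JSDL, or any other specification; the point is only that the client-facing job submission surface can be kept entirely distinct from whatever management interface the owners of the "box" happen to use.

from typing import Any, Mapping, Protocol


class JobSubmission(Protocol):
    """The only surface a submitting client needs to see."""

    def submit(self, job_description: Mapping[str, Any]) -> str:
        """Submit a job; returns an opaque job identifier."""
        ...

    def cancel(self, job_id: str) -> None:
        """Cancel a previously submitted job."""
        ...

    def status(self, job_id: str) -> str:
        """Report one of: 'queued', 'running', 'finished'."""
        ...


class SystemManagement(Protocol):
    """Deliberately separate: the box's owners may implement this by
    proprietary means, or by standards the submission client knows
    nothing about."""

    def drain_node(self, node_name: str) -> None:
        ...

    def node_health(self) -> Mapping[str, str]:
        ...

A client written against JobSubmission never needs to know whether SystemManagement even exists, which is exactly the opacity the "black box" view calls for.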
I think the use case that Ravi Subramaniam posted to this mailing list
back on 2/17 is a good one to start a discussion around. However,
I’d like to present it from a different point-of-view than he did.
The manner in which the use case is currently presented emphasizes all the
capabilities and services needed to handle the fully general case of submitting
a batch job to a computing utility/service. That’s a great way of
producing a taxonomy against which any given system or design can be compared
to see what it has to offer. I would argue that the next step is to ask what the simplest subset is that represents a useful system/design, and how one should categorize the various capabilities and services he has identified so as to arrive at meaningful components that can be selectively used to obtain progressively more capable systems.
Another useful exercise is to examine existing job scheduling systems in order to understand what they provide. Since in the real world we will have to deal with the legacy of existing systems, it will be important to understand how they relate to the use cases we explore. In the same vein, it will be important to take into account and understand other existing infrastructures that people use and that are related to HPC use cases.
I’m thinking of things like security infrastructures, directory services,
and so forth. From the point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent to which
existing infrastructure and services can be reused rather than reinvented.
To kick off a discussion around the topic of a minimalist HPC use case, I present a straw man description of one below, followed by a first attempt at categorizing various areas of extension. The categorization of
extension areas is not meant to be complete or even all that carefully
thought-out as far as componentization boundaries are concerned; it is merely
meant to be a first contribution to get the discussion going.
A basic HPC use case: Compute cluster embedded within an organization.
· This is your
basic batch job scheduling scenario. Only a very basic state transition
diagram is visible to the client, with the following states for a job: queued,
running, finished. Additional states -- and associated state transition
request operations and functionality -- are not supported. Examples of additional states and associated functionality include suspension of jobs and migration of jobs. (A minimal code sketch of this basic interface appears immediately after this bullet list.)
· Only
"standard" resources can be described, for example: number of
cpus/nodes needed, memory requirements, disk requirements, etc. (think
resources that are describable by JSDL).
· Once a job has
been submitted it can be cancelled, but its resource requests can't be
modified.
· A distributed
file system is accessible from client desktop machines and client file servers,
as well as compute nodes of the compute cluster. This implies that no
data staging is required, that programs can be (for the most part) executed
from existing file system locations, and that no program
"provisioning" is required (since you can execute them from wherever
they are already installed). Thus in this use case all data transfer and
program installation operations are the responsibility of the user.
· Users already
have accounts within the existing security infrastructure (e.g.
Kerberos). They would like to use these and not have to create/manage
additional authentication/authorization credentials (at least at the level that
is visible to them).
· The job
scheduling service resides at a well-known network name and it is aware of the
compute cluster and its resources by "private" means (e.g. it runs on
the head node of the cluster and employs private means to monitor and control
the resources of the cluster). This implies that there is no need for any
sort of directory services for finding the compute cluster or the resources it
represents other than basic DNS.
· Compute cluster
system management is opaque to users and is the concern of the compute
cluster's owners. This implies that system management is not part of the
compute cluster's public job scheduling interface. This also implies that
there is no need for a logging interface to the service. I assume that
application-level logging can be done by means of libraries that write to
client files; i.e. that there is no need for any sort of special system support
for logging.
· A polling-based interface is the simplest form of interface to something like a job scheduling service. However, a simple call-back notification interface is a very useful addition that potentially provides substantial performance benefits, since it can avoid a great deal of unnecessary network traffic. Only job state changes result in notification messages.
· There is no notion of fault tolerance. Jobs that fail must be resubmitted by the client. Neither the cluster head node nor its compute nodes are fault tolerant. I do expect the client software to return an indication of failure-due-to-system-fault when appropriate. (Note that this may also occur when things like network partitions occur.)
· One does need
some notion of how to deal with orphaned resources and jobs. The notion
of job lifetime and post-expiration garbage collection is a natural approach
here.
· The scheduling
service provides a fixed set of scheduling policies, with only a few basic
choices (or maybe even just one), such as FIFO or round-robin. There is
no notion, in general, of SLAs (which are a form of scheduling policy).
· Enough
information must be returned to the client when a job finishes to enable basic
accounting functionality. This means things like total wall-clock time
the job ran and a summary of resources used. There is no need for the interface to support any sort of grouping of accounting information. That
is, jobs do not need to be associated with projects, groups, or other
accounting entities and the job scheduling service is not responsible for
tracking accounting information across such entities. As long as basic
resource utilization information is returnable for each job, accounting can be
done externally to the job scheduling service. I do assume that jobs can
be uniquely identified by some means and can be uniquely associated with some
principal entity existing in the overall system, such as a user name.
· Just as there
is no notion of requiring the job scheduling service to track any but the most
basic job-level accounting information, there is no notion of the service
enforcing quotas on jobs.
· Although it is
generally useful to separate the notions of resource reservation from resource
usage (e.g. to enable interactive and debugging use of resources), it is not a
necessity for the most basic of job scheduling services.
· There is no notion
of tying multiple jobs together, either to support things like dependency
graphs or to support things like workflows. Such capabilities must be
implemented by clients of the job scheduling service.
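To make the straw man above concrete, here is a minimal sketch (in Python) of the client-visible interface it implies. It is illustrative only: the names (MinimalScheduler, JobRecord, submit, cancel, poll, on_state_change, and so on) are hypothetical and are not taken from JSDL, BES, or any other specification; it simply restates the bullets above in code form.

import uuid
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class JobState(Enum):
    QUEUED = "queued"
    RUNNING = "running"
    FINISHED = "finished"   # no suspended or migrating states in the basic case


@dataclass
class JobRecord:
    job_id: str
    owner: str                          # principal from the existing security infrastructure
    resources: Dict[str, int]           # "standard" resources only, e.g. {"cpus": 8, "memory_mb": 4096}
    state: JobState = JobState.QUEUED
    lifetime_seconds: int = 86400       # basis for garbage-collecting orphaned jobs
    wall_clock_seconds: Optional[float] = None          # filled in on finish, for basic accounting
    resources_used: Dict[str, float] = field(default_factory=dict)


class MinimalScheduler:
    """Straw man job scheduling service: fixed FIFO-style policy, no SLAs,
    no quotas, no job modification after submission, no fault tolerance."""

    def __init__(self) -> None:
        self._jobs: Dict[str, JobRecord] = {}
        self._listeners: List[Callable[[str, JobState], None]] = []

    def submit(self, owner: str, resources: Dict[str, int]) -> str:
        """Accept a job described by 'standard' resources and return an opaque id."""
        job_id = str(uuid.uuid4())
        self._jobs[job_id] = JobRecord(job_id=job_id, owner=owner, resources=resources)
        return job_id

    def cancel(self, job_id: str) -> None:
        """Cancel is the only mutation allowed after submission."""
        self._set_state(job_id, JobState.FINISHED)

    def poll(self, job_id: str) -> JobRecord:
        """Simple polling interface: the client asks, the service answers."""
        return self._jobs[job_id]

    def on_state_change(self, callback: Callable[[str, JobState], None]) -> None:
        """Optional call-back notification; only state changes generate messages."""
        self._listeners.append(callback)

    def _set_state(self, job_id: str, new_state: JobState) -> None:
        self._jobs[job_id].state = new_state
        for notify in self._listeners:
            notify(job_id, new_state)

Everything in the extension list below (suspension, SLAs, quotas, reservations, and so on) would be layered on top of, rather than baked into, an interface of roughly this size.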
Interesting extension areas:
· Additional scheduling policies
  o Weighted fair-share, …
  o Multiple queues
  o SLAs
  o ...
· Extended resource descriptions
  o Additional resource types, such as GPUs
  o Additional types of compute resources, such as desktop computers
  o Condor-style class ads
· Extended job descriptions (as returned to requesting clients and sys admins)
· Additional classes of security credentials
· Reservations separated from execution
  o Enabling interactive and debugging jobs
  o Support for multiple competing schedulers (incl. desktop cycle stealing and market-based approaches to scheduling compute resources)
· Ability to modify jobs during their existence
· Fault tolerance
  o Automatic rescheduling of jobs that failed due to system faults
  o Highly available resources: This is partly a policy statement by a scheduling service about its characteristics and partly the ability to rebind clients to migrated service endpoints
· Extended state transition diagrams and associated functionalities (see the layering sketch after this list)
  o Job suspension
  o Job migration
  o …
· Accounting & quotas
· Operating on arrays of jobs
· Meta-schedulers, multiple schedulers, and ecologies and hierarchies of multiple schedulers
  o Meta-schedulers
    · Hierarchical job scheduling with a meta-scheduler as the only entry point; forwarding jobs to the meta-scheduler from other subsidiary schedulers
  o Condor-style matchmaking
· Directory services
  o Using existing directory services
  o Abstract directory service interface(s)
· Data transfer topics
  o Application data staging
    · Naming
    · Efficiency
    · Convenience
    · Cleanup
  o Program staging/provisioning
    · Description
    · Installation
    · Cleanup
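As a purely illustrative example of the "evolution and selective extension" idea, and referring back to the extended state transition diagrams item above, here is how an extension such as job suspension might layer onto the minimal sketch given earlier. Again, every name here is hypothetical rather than taken from any existing specification.

class SuspendableScheduler(MinimalScheduler):
    """Extension layered on the base interface: clients that only understand
    the basic queued/running/finished contract keep working unchanged against
    the MinimalScheduler surface, while extension-aware clients gain suspend
    and resume operations."""

    def __init__(self) -> None:
        super().__init__()
        self._suspended: set = set()   # job ids currently suspended

    def suspend(self, job_id: str) -> None:
        # Only meaningful for running jobs in this sketch.
        if self.poll(job_id).state is JobState.RUNNING:
            self._suspended.add(job_id)

    def resume(self, job_id: str) -> None:
        self._suspended.discard(job_id)

    def is_suspended(self, job_id: str) -> bool:
        return job_id in self._suspended

The design point is that the base queued/running/finished contract is untouched: a legacy client submitting through the basic operations neither sees nor needs the new capability, which is the kind of organic, selective growth argued for at the top of this thread.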
Marvin.
From: Ian Foster [mailto:foster@mcs.anl.gov]
Sent: Monday, February 20, 2006 9:20 AM
To:
Cc:
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"
Dear All:
The most important thing to understand at this point (IMHO) is the scope of
this "HPC use case," as this will determine just how minimal we can
be.
I get the impression that the principal goal may be "job submission to a
cluster." Is that correct? How do we start to circumscribe the scope more
explicitly?
Ian.
At 05:45 AM 2/16/2006 -0800,
Enclosed is a paper that advocates an additional set of activities that the authors believe the OGSA working groups should engage in.
Broadly speaking, the OGSA and related working groups are already doing a bunch
of important things:
· There is broad exploration of the big picture, including enumeration of use cases, taxonomy of areas, identification of research issues, etc.
· There is work going on in each of the horizontal areas that have been identified, such as
· There is work going on around individual specifications, such as BES, JSDL, etc.
Given that individual specifications are beginning to come to fruition, the authors believe it is time to also start defining vertical profiles that precisely describe how groups of individual specifications should be employed to implement specific use cases in an interoperable manner. The authors also believe that the process of defining these profiles offers an opportunity to close the design loop by relating the various on-going protocol and standards efforts back to the use cases in a very concrete manner. This provides an end-to-end setting in which to identify holes and issues that might require additional protocols and/or (incremental) changes to existing protocols.
The paper introduces the general notion of doing focused vertical design efforts and then focuses on a specific vertical design effort, namely a minimal HPC design.
The paper derives a specific HPC design in a first-principles manner, since the authors believe that this increases the chances of identifying issues. As a consequence, existing specifications and the activities of existing working groups are not mentioned, and this paper is not an attempt to actually define a specification profile. Also, the absence of references to existing work
is not meant to imply that such work is in any way irrelevant or
inappropriate. The paper should be viewed as a first abstract attempt to
propose a new kind of activity within OGSA. The expectation is that
future open discussions and publications will explore the concrete details of
such a proposal.
This paper was recently sent to a few key individuals in order to get feedback from them before submitting it to the wider GGF community. Unfortunately that process took longer than intended, and some members of the community may have already seen a copy of the paper without knowing the context within which it was written. This email should hopefully dispel any misconceptions that may have arisen.
For those people who will be around for the F2F meetings on Friday,
Math & Computer Science Div., Argonne National Laboratory
Dept of Computer Science, The University of Chicago
Tel: 630 252 4619
Fax: 630 252 1997
Globus Alliance, www.globus.org