Hi,
One thing that I think might be useful is to enumerate use cases to try
to identify the simple, common use cases. In particular, I suspect that looking
at what various existing production (as compared to research) meta-schedulers
support will be instructive since they already face the task of trying to
schedule against programmatically defined interfaces for existing BES-like
services (as compared to human clients, who can eyeball a given service's
resource descriptions/policies and then perform scheduling decisions in
wetware).
My first cut at looking at various meta-schedulers – including LSF
and Condor – suggests that the following types of resource descriptions get
used:
· Simple aggregate descriptions, such as the number of available CPUs in a
subsidiary scheduling service (e.g. a compute cluster), the average CPU load,
and the job queue length.
· Named queues that jobs can be submitted/forwarded to. LSF, in
particular, allows for the definition of a variety of queues that are
effectively globally visible and that an LSF (meta-)scheduler can forward jobs
to. A concrete example is a “fan-in” scenario, in which a large central
compute cluster accepts “large jobs” from a set of distributed workgroup
clusters to which human users submit their various jobs. A “large job” queue
is defined that all the workgroup cluster schedulers are aware of, and users
then submit all their jobs to their local workgroup cluster. Large jobs are
submitted to the “large job” queue, and the workgroup scheduler forwards jobs
received on that queue to the central compute cluster’s job scheduler.
· “Full” descriptions of subsidiary system compute resources. In this
case the meta-scheduler receives “full” information about all the compute
resources that exist in all of the subsidiary scheduling systems. LSF
supports this with a notion of “resource leasing”, whereby a compute
cluster’s LSF scheduler can lease some (or all) of its compute nodes to a
remote LSF scheduler. In that case all the state information that would
normally go to the local LSF scheduler about the leased nodes is also
forwarded to the remote scheduler owning the lease. Condor supports
something similar with its class-ads design: a meta-scheduler receives
class-ad descriptions for all the compute nodes that it may do match-making
for. In this case, a “full” description consists of whatever has been put
into the class-ads by the individual compute nodes participating in the
system.
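To make the three description styles concrete, here is a minimal sketch in
Python. This is purely illustrative: the attribute names ("TotalCpus",
"LargeJobQueue", etc.) are my own invention, not actual LSF or Condor
vocabulary.

```python
# Illustrative only: all attribute and queue names below are invented,
# not standard LSF or Condor terms.

# 1. Simple aggregate description of a subsidiary cluster.
aggregate = {"TotalCpus": 128, "FreeCpus": 64, "AvgLoad": 0.42, "QueueLength": 17}

# 2. Named queues that jobs can be submitted/forwarded to (the "fan-in" scenario):
#    each queue name maps to the scheduler that ultimately runs its jobs.
queues = {"large_job": "central-cluster.example.org", "default": "localhost"}

# 3. "Full" description: one class-ad-like record per compute node,
#    i.e. an array of descriptions.
nodes = [
    {"Name": f"node{i:03d}", "Cpus": 4, "Memory": 8192, "State": "Idle"}
    for i in range(4)
]

def route(job):
    """Forward a job to the scheduler that owns its queue."""
    return queues.get(job.get("Queue", "default"), "localhost")

print(route({"Queue": "large_job"}))  # central-cluster.example.org
```

The point of the sketch is only that all three styles reduce to name-value
data of differing granularity: one summary record, a queue-name mapping, or
an array of per-node records.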
I would love to hear from other members of the community how they would
characterize the common simple use cases. Also, it would be great if people
who can provide additional characterizations of various existing production
meta-schedulers would post that information to the mailing list (or point me
to where it has already been posted if I’m unaware of it :-)).
Several things leap to my mind from looking at these usage examples:
· A few relatively simple, standardized aggregate description quantities can
enable one of the most desirable common use cases, namely spreading volumes
of high-throughput jobs across multiple clusters in an approximately
load-balanced manner.
· Condor’s extensible class-ad design – arbitrary name-value pairs with some
number of standardized names – has been fairly successful and provides a lot
of flexibility. As an example, note that LSF’s job forwarding queues can be
implemented as class-ad elements. Note that the open-ended nature of
class-ads means that any installation can define its own queues (with
associated semantics) that are meaningful within, for example, a particular
organization.
· Efficiently describing something like the leased compute nodes of an LSF
cluster, or the class-ads for an entire compute cluster, may require
introducing the notion of arrays of descriptions/class-ads.
· The key to interoperability is to define a useful set of standard elements
that clients can specify in job submission requests and that resource
management services (including compute nodes and schedulers) can advertise.
The interesting/key question is how small this set can be while still
enabling an interesting set of actual usage scenarios. It would be
interesting to know what LSF exports when leasing compute nodes to remote
LSF schedulers, and what the “commonly used” set of class-ad terms is across
representative sets of Condor installations. (I’m guessing that the JSDL
working group has already looked at questions like this and has some sense
of what the answers are?)
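As a rough sketch of the class-ad-style matching described above: a small
standardized vocabulary plus open-ended site-defined names, with a queue
implemented as an ordinary class-ad element. All names here ("Cpus",
"LargeJobQueue", etc.) are illustrative assumptions, not an agreed base set,
and the matching rule is deliberately simplified compared to real Condor
match-making.

```python
# Simplified class-ad-style match-making. Attribute names are invented
# for illustration; real Condor class-ads use a richer expression language.

def matches(job_ad, resource_ad):
    """A resource matches if it satisfies every requirement in the job ad.
    Numeric requirements are treated as minimums; all others must match
    exactly. Attributes absent from the resource ad fail the match."""
    for name, required in job_ad.items():
        offered = resource_ad.get(name)
        if offered is None:
            return False
        if isinstance(required, (int, float)) and not isinstance(required, bool):
            if offered < required:
                return False
        elif offered != required:
            return False
    return True

# A site-defined "large job" queue advertised as a plain class-ad element,
# alongside a few standardized-looking resource attributes.
cluster = {"Cpus": 128, "Memory": 262144, "Arch": "x86_64", "LargeJobQueue": True}
job = {"Cpus": 64, "Arch": "x86_64", "LargeJobQueue": True}

print(matches(job, cluster))  # True
```

The open-ended dictionary is what lets an installation add its own elements
(like the queue flag) without any change to the matching logic – which is
essentially the flexibility argument made above.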
I know that the JSDL WG is already discussing the topic of class-ad-like
approaches. I guess I’m placing a vote in favor of looking at such a design
approach, and of asking what a beginning “base” set of standardized class-ad
names should be. This would be one approach that might be workable without
requiring that we first solve one or more research problems. If JSDL is
structured to allow for multiple approaches, then it would allow for progress
now while not excluding more ambitious approaches of the kind Karl outlined
in his email.
Marvin.
-----Original Message-----
From: owner-ogsa-bes-wg@ggf.org [mailto:owner-ogsa-bes-wg@ggf.org] On Behalf Of
Karl Czajkowski
Sent: Saturday, June 10, 2006 10:39 PM
To: Marvin Theimer
Cc: Michel Drescher; Donal K. Fellows; JSDL Working Group; ogsa-bes-wg@ggf.org;
Ed Lassettre; Ming Xu (WINDOWS)
Subject: Re: [ogsa-bes-wg] Re: [jsdl-wg] Questions and potential changes to
JSDL, as seen from HPC Profile point-of-view
Marvin:
I think one decision to make is whether BES services are homogeneous
or not. I think Donal is advocating homogeneity. However, I do not
think this is the main source of complexity. In either case, I agree
with you that JSDL ought to be usable as a core syntax for describing
the "resources available from a BES instance" as well as the
"resources required for an activity". As you describe it, this is
sort of a "class ad" in the Condor sense of the word. The problem
comes from trying to advertise a resource that can handle multiple
jobs simultaneously.
The tricky part is that this is not just "nodes free", but must be
intersected with policies such as maximum job size. Should there be a
vocabulary for listing the total free resources and the job sizing
policies directly? Or should the advertisement list a set of jobs
that can be supported simultaneously, e.g. I publish 512 nodes as
quantity 4 128-node job availability slots? The latter is easier to
match, but probably doesn't work in the simple case because of the
combinatoric problem of grouping jobs which are not maximal. How does
a user know whether they can have quantity 8 64-node jobs or not?
Also, I am ignoring the very real problem of capturing per-user
policies. I do not think it is as simple as returning a customized
response for the authenticating client. How is middleware supposed to
layer on top of BES here? How does a meta-scheduler know whether
quantity 8 64-node jobs can be accepted for one user? For 8 distinct
users? Does a (shared) meta-scheduler now need to make separate
queries for every client? How does it understand the interference of
multiple user jobs? I think there is really a need for a composite
availability view so such meta-schedulers can reasonably think about a
tentative future, in which they try to subdivide and claim parts of
the BES resource for multiple jobs. Can this be handled with a
declarative advertisement, or does it require some transactional
dialogue? The transactional approach seems too tightly coupled to me,
i.e. I should be able to compute a sensible candidate plan before I
start negotiating.
If we say all of this is too "researchy" for standardization, then I
am not sure what the standard will really support. Perhaps the best
approach is the first one I mentioned, where relatively raw data is
exposed on several extensible axes (subject to authorization checks):
overall resource pool descriptions, job sizing policies, user rights
information, etc. The simple users may only receive a simple subset
of this information, which requires minimal transformation to tell them
what they can submit. The middleware clients receive more elaborate
data (if trusted) and can do more elaborate transformation of the data
to help their planning.
The only alternative I can imagine, right now, would be a very
elaborate resource description language utilizing the JSDL "range
value" concept to expose some core policy limits, as well as a number
of extensions to express overall constraints which define the outer
bounds of the combinatoric solution space. This DOES seem pretty
"researchy" to me... but maybe someone else sees a more appealing
middle ground?
karl
--
Karl Czajkowski
karlcz@univa.com