
Hi,

One thing that I think might be useful is to enumerate use cases in order to identify the simple, common ones. In particular, I suspect that looking at what various existing production (as opposed to research) meta-schedulers support will be instructive, since they already face the task of scheduling against programmatically defined interfaces of existing BES-like services (as opposed to human clients, who can eyeball a given service's resource descriptions/policies and then make scheduling decisions in wetware).

My first cut at looking at various meta-schedulers, including LSF and Condor, is that the following types of resource descriptions get used:

* Simple aggregate descriptions, such as the number of available CPUs in a subsidiary scheduling service (e.g. a compute cluster), the average CPU load, and the job queue length. (A rough sketch of this style appears below.)

* Named queues that jobs can be submitted/forwarded to. LSF, in particular, allows for the definition of a variety of queues that are effectively globally visible and that an LSF (meta-)scheduler can forward jobs to. A concrete example is a "fan-in" scenario, in which a large central compute cluster accepts "large jobs" from a set of distributed workgroup clusters to which human users submit their various jobs. A "large job" queue is defined that all the workgroup cluster schedulers are aware of, and users then submit all their jobs to their local workgroup cluster; large jobs are submitted to the "large job" queue, and the workgroup scheduler forwards jobs received on that queue to the central compute cluster's job scheduler.

* "Full" descriptions of subsidiary system compute resources. In this case the meta-scheduler receives "full" information about all the compute resources that exist in all of the subsidiary scheduling systems. LSF supports this with a notion of "resource leasing", where a compute cluster's LSF scheduler can lease some (or all) of its compute nodes to a remote LSF scheduler. In that case, all the state information that would normally go to the local LSF scheduler about the leased nodes is also forwarded to the remote scheduler owning the lease. Condor supports something similar with its class-ad design: a meta-scheduler receives class-ad descriptions for all the compute nodes that it may do match-making for. Here a "full" description consists of whatever has been put into the class-ads by the individual compute nodes participating in the system.

I would love to hear from other members of the community what their characterization of common simple use cases is. It would also be great if people who can provide additional characterization of various existing production meta-schedulers would post that information to the mailing list (or point me to where it has already been posted, if I'm unaware of it :-)).
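To make the first description style above concrete, here is a minimal sketch (in Python, with invented attribute names; not any actual LSF, Condor, or BES interface) of how a meta-scheduler might spread high-throughput jobs across clusters using only simple aggregate quantities:

# Hypothetical aggregate advertisements, one per subsidiary cluster.
# The attribute names ("free_cpus", "avg_load", "queue_length") are
# invented for illustration only.
cluster_ads = [
    {"name": "clusterA", "free_cpus": 64,  "avg_load": 0.40, "queue_length": 12},
    {"name": "clusterB", "free_cpus": 256, "avg_load": 0.85, "queue_length": 3},
    {"name": "clusterC", "free_cpus": 32,  "avg_load": 0.10, "queue_length": 0},
]

def pick_cluster(cpus_needed, ads):
    """Pick the cluster with the shortest job queue (ties broken by
    average load) among those advertising enough free CPUs."""
    candidates = [ad for ad in ads if ad["free_cpus"] >= cpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda ad: (ad["queue_length"], ad["avg_load"]))

# Spread a stream of small, identical jobs across the clusters.
for job_id in range(5):
    target = pick_cluster(cpus_needed=8, ads=cluster_ads)
    if target is None:
        print("job", job_id, "held locally: no advertisement fits")
        continue
    target["free_cpus"] -= 8      # crude local model of the job's effect,
    target["queue_length"] += 1   # pending the next advertisement refresh
    print("job", job_id, "->", target["name"])

The point is not the particular heuristic, but that a handful of standardized aggregate quantities is already enough to drive this kind of approximate load balancing.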
Several things leap to mind from looking at these usage examples:

* A few relatively simple, standardized aggregate description quantities can enable one of the most desirable common use cases, namely spreading volumes of high-throughput jobs across multiple clusters in an approximately load-balanced manner.

* Condor's extensible class-ad design, of arbitrary name-value pairs with some number of standardized names, has been fairly successful and provides a lot of flexibility. As an example, note that LSF's job-forwarding queues could be implemented as class-ad elements. The open-ended nature of class-ads means that any installation can define its own queues (with associated semantics) that are meaningful within, for example, a particular organization.

* Efficiently describing something like the leased compute nodes of an LSF cluster, or the class-ads for an entire compute cluster, may require introducing the notion of arrays of descriptions/class-ads.

* The key to interoperability is to define a useful set of standard elements that clients can specify in job submission requests and that resource management services (including compute nodes and schedulers) can advertise. The interesting/key question is how small this set can be while still enabling an interesting set of actual usage scenarios. It would be interesting to know what LSF exports when leasing compute nodes to remote LSF schedulers, and what the "commonly used" set of class-ad terms is across representative Condor installations. (I'm guessing that the JSDL working group has already looked at questions like this and has some sense of what the answers are?)

I know that the JSDL WG is already discussing class-ad-like approaches. I guess I'm placing a vote in favor of looking at such a design approach, and adding the question of what a beginning "base" set of standardized class-ad names should be. This would be "one" approach that might be workable without requiring that we first solve one or more research problems. If JSDL is structured to allow for multiple approaches, then it would allow progress now while not excluding more ambitious approaches of the kind Karl outlined in his email.

Marvin.

-----Original Message-----
From: owner-ogsa-bes-wg@ggf.org [mailto:owner-ogsa-bes-wg@ggf.org] On Behalf Of Karl Czajkowski
Sent: Saturday, June 10, 2006 10:39 PM
To: Marvin Theimer
Cc: Michel Drescher; Donal K. Fellows; JSDL Working Group; ogsa-bes-wg@ggf.org; Ed Lassettre; Ming Xu (WINDOWS)
Subject: Re: [ogsa-bes-wg] Re: [jsdl-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view

Marvin:

I think one decision to make is whether BES services are homogeneous or not. I think Donal is advocating homogeneity. However, I do not think this is the main source of complexity. In either case, I agree with you that JSDL ought to be usable as a core syntax for describing the "resources available from a BES instance" as well as the "resources required for an activity". As you describe it, this is sort of a "class ad" in the Condor sense of the word.

The problem comes from trying to advertise a resource that can handle multiple jobs simultaneously. The tricky part is that this is not just "nodes free", but must be intersected with policies such as maximum job size. Should there be a vocabulary for listing the total free resources and the job sizing policies directly? Or should the advertisement list a set of jobs that can be supported simultaneously, e.g. I publish 512 nodes as quantity 4 128-node job availability slots? The latter is easier to match, but probably doesn't work in the simple case because of the combinatoric problem of grouping jobs which are not maximal. How does a user know whether they can have quantity 8 64-node jobs or not?
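As a rough illustration (the numbers and field names below are made up, not a proposed syntax), here is the kind of mismatch I mean between a slot-style advertisement and a raw totals-plus-policy description:

# Hypothetical slot-style advertisement: 512 nodes published as
# quantity 4 x 128-node availability slots (field names invented).
slot_ad = {"slots": [{"nodes": 128, "count": 4}]}

# Hypothetical raw-style advertisement: totals plus a sizing policy.
raw_ad = {"total_free_nodes": 512, "max_nodes_per_job": 128}

def fits_slots(job_nodes, job_count, ad):
    # Naive slot matching: each requested job must occupy one advertised
    # slot; slots are not subdivided or recombined.
    usable = sum(s["count"] for s in ad["slots"] if s["nodes"] >= job_nodes)
    return usable >= job_count

def fits_raw(job_nodes, job_count, ad):
    # Matching against raw totals intersected with the sizing policy.
    return (job_nodes <= ad["max_nodes_per_job"]
            and job_nodes * job_count <= ad["total_free_nodes"])

# 8 x 64-node jobs: the raw description says yes, but the naive slot
# reading says no (only 4 slots), even though 8 * 64 = 512 nodes fit.
print(fits_slots(64, 8, slot_ad))  # False
print(fits_raw(64, 8, raw_ad))     # True

A smarter matcher could of course split the 128-node slots, but then the client is back to solving exactly the grouping problem the slot advertisement was supposed to avoid.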
Also, I am ignoring the very real problem of capturing per-user policies. I do not think it is as simple as returning a customized response for the authenticating client. How is middleware supposed to layer on top of BES here? How does a meta-scheduler know whether quantity 8 64-node jobs can be accepted for one user? For 8 distinct users? Does a (shared) meta-scheduler now need to make separate queries for every client? How does it understand the interference of multiple users' jobs?

I think there is really a need for a composite availability view, so that such meta-schedulers can reasonably think about a tentative future in which they try to subdivide and claim parts of the BES resource for multiple jobs. Can this be handled with a declarative advertisement, or does it require some transactional dialogue? The transactional approach seems too tightly coupled to me, i.e. I should be able to compute a sensible candidate plan before I start negotiating.

If we say all of this is too "researchy" for standardization, then I am not sure what the standard will really support. Perhaps the best approach is the first one I mentioned, where relatively raw data is exposed on several extensible axes (subject to authorization checks): overall resource pool descriptions, job sizing policies, user rights information, etc. Simple users may receive only a small subset of this information, which requires minimal transformation to tell them what they can submit. Middleware clients receive more elaborate data (if trusted) and can do more elaborate transformations of the data to help their planning.

The only alternative I can imagine right now would be a very elaborate resource description language using the JSDL "range value" concept to expose some core policy limits, plus a number of extensions to express overall constraints that define the outer bounds of the combinatoric solution space. This DOES seem pretty "researchy" to me... but maybe someone else sees a more appealing middle ground?

karl

--
Karl Czajkowski
karlcz@univa.com
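As a rough sketch of the layered, "raw data on several extensible axes" advertisement Karl describes above (all field names invented; this is not a proposed BES or JSDL schema), the idea might look something like:

# Hypothetical advertisement exposing raw data on several extensible
# axes; which axes a client sees is subject to authorization checks.
full_ad = {
    "resource_pool": {"total_nodes": 512, "free_nodes": 480, "node_memory_mb": 4096},
    "job_sizing_policy": {"max_nodes_per_job": 128, "max_wallclock_hours": 24},
    "user_rights": {"alice": {"max_concurrent_jobs": 4}, "bob": {"max_concurrent_jobs": 1}},
}

def view_for(client, ad):
    """Return the subset of axes a given class of client is allowed to see."""
    if client == "simple_user":
        # Minimal transformation needed to learn what can be submitted.
        return {"resource_pool": ad["resource_pool"],
                "job_sizing_policy": ad["job_sizing_policy"]}
    if client == "trusted_metascheduler":
        return ad  # full data, for more elaborate planning
    return {}

print(view_for("simple_user", full_ad))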