Hi,


One thing that I think might be useful is to enumerate use cases in order to identify the simple, common ones.  In particular, I suspect that looking at what various existing production (as opposed to research) meta-schedulers support will be instructive, since they already face the task of scheduling against programmatically defined interfaces to existing BES-like services (as opposed to human clients, who can eyeball a given service's resource descriptions and policies and then make scheduling decisions in wetware).


My first cut at surveying various meta-schedulers – including LSF and Condor – suggests that the following types of resource descriptions get used:

·        Simple aggregate descriptions, such as the number of available CPUs in a subsidiary scheduling service (e.g. a compute cluster), the average CPU load, and the job queue length.

·        Named queues that jobs can be submitted/forwarded to.  LSF, in particular, allows for the definition of a variety of queues that are effectively globally visible and that an LSF (meta-)scheduler can forward jobs to.  A concrete example is a “fan-in” scenario, in which a large central compute cluster accepts “large jobs” from a set of distributed workgroup clusters to which human users submit their various jobs.  A “large job” queue is defined that all the workgroup cluster schedulers are aware of.  Users then submit all of their jobs to their local workgroup cluster; large jobs go to the “large job” queue, and the workgroup scheduler forwards jobs received on that queue to the central compute cluster’s job scheduler.

·        “Full” descriptions of subsidiary system compute resources.  In this case the meta-scheduler receives “full” information about all the compute resources that exist in all of the subsidiary scheduling systems.  LSF supports this with a notion of “resource leasing”, in which a compute cluster’s LSF scheduler can lease some (or all) of its compute nodes to a remote LSF scheduler.  In that case all the state information about the leased nodes that would normally go to the local LSF scheduler is also forwarded to the remote scheduler owning the lease.  Condor supports something similar with its class-ad design: a meta-scheduler receives class-ad descriptions for all the compute nodes that it may do match-making for.  In this case, a “full” description consists of whatever has been put into the class-ads by the individual compute nodes participating in the system.  (A small sketch contrasting these three description styles follows this list.)

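To make the contrast between these three description styles concrete, here is a minimal sketch using plain Python dictionaries; all of the attribute names are illustrative stand-ins rather than element names taken from JSDL, LSF, or Condor:

    # Illustrative stand-ins only -- not standardized names.

    # 1. Simple aggregate description of a subsidiary scheduling service.
    cluster_summary = {
        "TotalCPUs": 512,
        "FreeCPUs": 96,
        "AverageLoad": 0.75,
        "QueuedJobs": 42,
    }

    # 2. Named queues that jobs can be submitted/forwarded to.
    forwarding_queues = ["normal", "large_job"]

    # 3. "Full" description: one class-ad-like record per compute node.
    node_descriptions = [
        {"Name": "node001", "Arch": "x86_64", "MemoryMB": 2048, "LoadAvg": 0.10},
        {"Name": "node002", "Arch": "x86_64", "MemoryMB": 4096, "LoadAvg": 0.90},
        # ... one record for every node the meta-scheduler may match against
    ]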

I would love to hear from other members of the community what their characterization of common simple use cases is.  Also, it would be great if people who can provide additional characterizations of various existing production meta-schedulers would post that information to the mailing list (or point me to where it has already been posted, if I’m unaware of it :-)).


Several things leap to my mind from looking at these usage examples:

·        A few relatively simple, standardized aggregate description quantities can enable one of the most desirable common use cases, namely spreading volumes of high-throughput jobs across multiple clusters in an approximately load-balanced manner.

·        Condor’s extensible class-ad design – arbitrary name-value pairs with some number of standardized names – has been fairly successful and provides a lot of flexibility.  As an example, note that LSF’s job forwarding queues can be implemented as class-ad elements (see the sketch after this list).  The open-ended nature of class-ads means that any installation can define its own queues (with associated semantics) that are meaningful within, for example, a particular organization.

·        Efficiently describing something like the leased compute nodes of an LSF cluster, or the class-ads for an entire compute cluster, may require introducing the notion of arrays of descriptions/class-ads.

·        The key to interoperability is to define a useful set of standard elements that clients can specify in job submission requests and that resource management services (including compute nodes and schedulers) can advertise.  The key question is how small this set can be while still enabling an interesting range of actual usage scenarios.  It would be helpful to know what LSF exports when leasing compute nodes to remote LSF schedulers and what the “commonly used” set of class-ad terms is across representative Condor installations.  (I’m guessing that the JSDL working group has already looked at questions like this and has some sense of what the answers are?)

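As a deliberately toy illustration of the point about queues above, here is a sketch in which an installation-defined queue name is carried as an ordinary class-ad-style attribute and matched by a trivial requirements check (the attribute names are hypothetical, not actual Condor or LSF terms):

    # Hypothetical attribute names; toy matching logic only.
    central_cluster_ad = {
        "Name": "central-cluster",
        "Queues": ["large_job"],   # installation-defined queue, advertised as data
        "FreeCPUs": 384,
    }

    job_request = {
        "Queue": "large_job",
        "CPUCount": 128,
    }

    def can_forward(resource_ad, job):
        # A workgroup scheduler's forwarding decision, reduced to its essence:
        # does the target advertise the requested queue, and does the job fit?
        return (job["Queue"] in resource_ad.get("Queues", [])
                and job["CPUCount"] <= resource_ad.get("FreeCPUs", 0))

    assert can_forward(central_cluster_ad, job_request)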

I know that the JSDL WG is already discussing the topic of class-ad-like approaches.  I guess I’m placing a vote in favor of looking at such a design approach and adding to it the question of what a beginning “base” set of standardized class-ad names should be.  This would be “one” approach that might be workable without requiring that we first solve one or more research problems.  If JSDL is structured to allow for multiple approaches, then we could make progress now without excluding more ambitious approaches of the kind Karl outlined in his email.
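Purely to illustrate the scale I have in mind – this is not a proposal – a beginning base set might be no bigger than something like the following (several of these concepts already exist in JSDL and in common Condor class-ads):

    # Illustrative only -- not a proposed standard vocabulary.
    candidate_base_terms = [
        "TotalCPUCount",
        "PhysicalMemoryPerNode",
        "OperatingSystemName",
        "CPUArchitectureName",
        "QueueName",
    ]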


Marvin.


-----Original Message-----
From: owner-ogsa-bes-wg@ggf.org [mailto:owner-ogsa-bes-wg@ggf.org] On Behalf Of Karl Czajkowski
Sent: Saturday, June 10, 2006 10:39 PM
To: Marvin Theimer
Cc: Michel Drescher; Donal K. Fellows; JSDL Working Group; ogsa-bes-wg@ggf.org; Ed Lassettre; Ming Xu (WINDOWS)
Subject: Re: [ogsa-bes-wg] Re: [jsdl-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view


Marvin:


I think one decision to make is whether BES services are homogeneous or not.  I think Donal is advocating homogeneity.  However, I do not think this is the main source of complexity.  In either case, I agree with you that JSDL ought to be usable as a core syntax for describing the "resources available from a BES instance" as well as the "resources required for an activity".  As you describe it, this is sort of a "class ad" in the Condor sense of the word.  The problem comes from trying to advertise a resource that can handle multiple jobs simultaneously.

The tricky part is that this is not just "nodes free", but must be intersected with policies such as maximum job size.  Should there be a vocabulary for listing the total free resources and the job sizing policies directly?  Or should the advertisement list a set of jobs that can be supported simultaneously, e.g. I publish 512 nodes as quantity 4 128-node job availability slots?  The latter is easier to match, but probably doesn't work in the simple case because of the combinatorial problem of grouping jobs which are not maximal.  How does a user know whether they can have quantity 8 64-node jobs or not?
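To make that contrast concrete, here is a toy sketch of the two advertisement styles (all names and numbers are invented for illustration):

    # Invented names and numbers, for illustration only.
    # Style A: raw totals plus an explicit sizing policy.
    raw_ad = {"free_nodes": 512, "max_nodes_per_job": 128}

    # Style B: pre-packaged availability slots.
    slot_ad = {"slots": [{"nodes_per_slot": 128, "count": 4}]}

    def fits_raw(ad, nodes_per_job, job_count):
        # "Can I run job_count jobs of nodes_per_job nodes each?" is
        # directly answerable from the raw view.
        return (nodes_per_job <= ad["max_nodes_per_job"]
                and nodes_per_job * job_count <= ad["free_nodes"])

    print(fits_raw(raw_ad, 64, 8))   # True: 64 <= 128 and 8 * 64 <= 512

    # From slot_ad alone the same question has no direct answer: 64-node
    # slots are not listed, and whether two 64-node jobs may share one
    # 128-node slot is a policy the advertisement does not express.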


Also, I am ignoring the very real problem of capturing per-user policies.  I do not think it is as simple as returning a customized response for the authenticating client.  How is middleware supposed to layer on top of BES here?  How does a meta-scheduler know whether quantity 8 64-node jobs can be accepted for one user?  For 8 distinct users?  Does a (shared) meta-scheduler now need to make separate queries for every client?  How does it understand the interference of multiple users' jobs?  I think there is really a need for a composite availability view so that such meta-schedulers can reasonably think about a tentative future, in which they try to subdivide and claim parts of the BES resource for multiple jobs.  Can this be handled with a declarative advertisement, or does it require some transactional dialogue?  The transactional approach seems too tightly coupled to me, i.e. I should be able to compute a sensible candidate plan before I start negotiating.

If we say all of this is too "researchy" for standardization, then I am not sure what the standard will really support.  Perhaps the best approach is the first one I mentioned, where relatively raw data is exposed on several extensible axes (subject to authorization checks): overall resource pool descriptions, job sizing policies, user rights information, etc.  The simple users may receive only a simple subset of this information, which requires minimal transformation to tell them what they can submit.  The middleware clients receive more elaborate data (if trusted) and can do more elaborate transformations of the data to help their planning.
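A rough sketch of what such a raw, multi-axis view might contain (all of the field names are invented for illustration):

    # Invented field names, for illustration only.
    composite_view = {
        "resource_pool":     {"total_nodes": 512, "free_nodes": 320},
        "job_sizing_policy": {"max_nodes_per_job": 128, "max_jobs_per_user": 4},
        "user_rights": {
            "alice": {"max_nodes": 256},
            "bob":   {"max_nodes": 64},
        },
    }

    # A simple client might look only at free_nodes and max_nodes_per_job to
    # decide what it can submit; a meta-scheduler planning jobs for several
    # users can combine all three axes before it starts any negotiation.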


The only alternative I can imagine, right now, would be a very elaborate resource description language utilizing the JSDL "range value" concept to expose some core policy limits, as well as a number of extensions to express overall constraints which define the outer bounds of the combinatorial solution space.  This DOES seem pretty "researchy" to me... but maybe someone else sees a more appealing middle ground?

karl

--
Karl Czajkowski
karlcz@univa.com