
Hi,

One thing that I think might be useful is to enumerate use cases in order to identify the simple, common ones. In particular, I suspect that looking at what various existing production (as opposed to research) meta-schedulers support will be instructive, since they already face the task of scheduling against programmatically defined interfaces of existing BES-like services (as opposed to human clients, who can eyeball a given service's resource descriptions/policies and then make scheduling decisions in wetware).

My first cut at looking at various meta-schedulers, including LSF and Condor, is that the following types of resource descriptions get used:

* Simple aggregate descriptions, such as the number of available CPUs in a subsidiary scheduling service (e.g. a compute cluster), the average CPU load, and the job queue length. (A rough sketch of this style appears below.)

* Named queues that jobs can be submitted/forwarded to. LSF, in particular, allows for the definition of a variety of queues that are effectively globally visible and that an LSF (meta-)scheduler can forward jobs to. A concrete example is a "fan-in" scenario, in which a large central compute cluster accepts "large jobs" from a set of distributed workgroup clusters to which human users submit their various jobs. A "large job" queue is defined that all the workgroup cluster schedulers are aware of, and users then submit all their jobs to their local workgroup cluster; large jobs are submitted to the "large job" queue, and the workgroup scheduler forwards jobs received on that queue to the central compute cluster's job scheduler.

* "Full" descriptions of subsidiary system compute resources. In this case the meta-scheduler receives "full" information about all the compute resources that exist in all of the subsidiary scheduling systems. LSF supports this with a notion of "resource leasing", where a compute cluster's LSF scheduler can lease some (or all) of its compute nodes to a remote LSF scheduler. In that case, all the state information that would normally go to the local LSF scheduler about the leased nodes is also forwarded to the remote scheduler owning the lease. Condor supports something similar with its class-ad design: a meta-scheduler receives class-ad descriptions for all the compute nodes that it may do match-making for. Here a "full" description consists of whatever has been put into the class-ads by the individual compute nodes participating in the system.

I would love to hear from other members of the community what their characterization of common simple use cases is. It would also be great if people who can provide additional characterization of various existing production meta-schedulers would post that information to the mailing list (or point me to where it has already been posted, if I'm unaware of it :-)).
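To make the first description style above concrete, here is a minimal sketch (in Python, with invented attribute names; not any actual LSF, Condor, or BES interface) of how a meta-scheduler might spread high-throughput jobs across clusters using only simple aggregate quantities:

# Hypothetical aggregate advertisements, one per subsidiary cluster.
# The attribute names ("free_cpus", "avg_load", "queue_length") are
# invented for illustration only.
cluster_ads = [
    {"name": "clusterA", "free_cpus": 64,  "avg_load": 0.40, "queue_length": 12},
    {"name": "clusterB", "free_cpus": 256, "avg_load": 0.85, "queue_length": 3},
    {"name": "clusterC", "free_cpus": 32,  "avg_load": 0.10, "queue_length": 0},
]

def pick_cluster(cpus_needed, ads):
    """Pick the cluster with the shortest job queue (ties broken by
    average load) among those advertising enough free CPUs."""
    candidates = [ad for ad in ads if ad["free_cpus"] >= cpus_needed]
    if not candidates:
        return None
    return min(candidates, key=lambda ad: (ad["queue_length"], ad["avg_load"]))

# Spread a stream of small, identical jobs across the clusters.
for job_id in range(5):
    target = pick_cluster(cpus_needed=8, ads=cluster_ads)
    if target is None:
        print("job", job_id, "held locally: no advertisement fits")
        continue
    target["free_cpus"] -= 8      # crude local model of the job's effect,
    target["queue_length"] += 1   # pending the next advertisement refresh
    print("job", job_id, "->", target["name"])

The point is not the particular heuristic, but that a handful of standardized aggregate quantities is already enough to drive this kind of approximate load balancing.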
Several things leap to mind from looking at these usage examples:

* A few relatively simple, standardized aggregate description quantities can enable one of the most desirable common use cases, namely spreading volumes of high-throughput jobs across multiple clusters in an approximately load-balanced manner.

* Condor's extensible class-ad design, of arbitrary name-value pairs with some number of standardized names, has been fairly successful and provides a lot of flexibility. As an example, note that LSF's job-forwarding queues could be implemented as class-ad elements. The open-ended nature of class-ads means that any installation can define its own queues (with associated semantics) that are meaningful within, for example, a particular organization.

* Efficiently describing something like the leased compute nodes of an LSF cluster, or the class-ads for an entire compute cluster, may require introducing the notion of arrays of descriptions/class-ads.

* The key to interoperability is to define a useful set of standard elements that clients can specify in job submission requests and that resource management services (including compute nodes and schedulers) can advertise. The interesting/key question is how small this set can be while still enabling an interesting set of actual usage scenarios. It would be interesting to know what LSF exports when leasing compute nodes to remote LSF schedulers, and what the "commonly used" set of class-ad terms is across representative Condor installations. (I'm guessing that the JSDL working group has already looked at questions like this and has some sense of what the answers are?)

I know that the JSDL WG is already discussing class-ad-like approaches. I guess I'm placing a vote in favor of looking at such a design approach, and adding the question of what a beginning "base" set of standardized class-ad names should be. This would be "one" approach that might be workable without requiring that we first solve one or more research problems. If JSDL is structured to allow for multiple approaches, then it would allow progress now while not excluding more ambitious approaches of the kind Karl outlined in his email.

Marvin.

-----Original Message-----
From: owner-ogsa-bes-wg@ggf.org [mailto:owner-ogsa-bes-wg@ggf.org] On Behalf Of Karl Czajkowski
Sent: Saturday, June 10, 2006 10:39 PM
To: Marvin Theimer
Cc: Michel Drescher; Donal K. Fellows; JSDL Working Group; ogsa-bes-wg@ggf.org; Ed Lassettre; Ming Xu (WINDOWS)
Subject: Re: [ogsa-bes-wg] Re: [jsdl-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view

Marvin:

I think one decision to make is whether BES services are homogeneous or not. I think Donal is advocating homogeneity. However, I do not think this is the main source of complexity. In either case, I agree with you that JSDL ought to be usable as a core syntax for describing the "resources available from a BES instance" as well as the "resources required for an activity". As you describe it, this is sort of a "class ad" in the Condor sense of the word.

The problem comes from trying to advertise a resource that can handle multiple jobs simultaneously. The tricky part is that this is not just "nodes free", but must be intersected with policies such as maximum job size. Should there be a vocabulary for listing the total free resources and the job sizing policies directly? Or should the advertisement list a set of jobs that can be supported simultaneously, e.g. I publish 512 nodes as quantity 4 128-node job availability slots? The latter is easier to match, but probably doesn't work in the simple case because of the combinatoric problem of grouping jobs which are not maximal. How does a user know whether they can have quantity 8 64-node jobs or not?
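As a rough illustration (the numbers and field names below are made up, not a proposed syntax), here is the kind of mismatch I mean between a slot-style advertisement and a raw totals-plus-policy description:

# Hypothetical slot-style advertisement: 512 nodes published as
# quantity 4 x 128-node availability slots (field names invented).
slot_ad = {"slots": [{"nodes": 128, "count": 4}]}

# Hypothetical raw-style advertisement: totals plus a sizing policy.
raw_ad = {"total_free_nodes": 512, "max_nodes_per_job": 128}

def fits_slots(job_nodes, job_count, ad):
    # Naive slot matching: each requested job must occupy one advertised
    # slot; slots are not subdivided or recombined.
    usable = sum(s["count"] for s in ad["slots"] if s["nodes"] >= job_nodes)
    return usable >= job_count

def fits_raw(job_nodes, job_count, ad):
    # Matching against raw totals intersected with the sizing policy.
    return (job_nodes <= ad["max_nodes_per_job"]
            and job_nodes * job_count <= ad["total_free_nodes"])

# 8 x 64-node jobs: the raw description says yes, but the naive slot
# reading says no (only 4 slots), even though 8 * 64 = 512 nodes fit.
print(fits_slots(64, 8, slot_ad))  # False
print(fits_raw(64, 8, raw_ad))     # True

A smarter matcher could of course split the 128-node slots, but then the client is back to solving exactly the grouping problem the slot advertisement was supposed to avoid.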
Also, I am ignoring the very real problem of capturing per-user policies. I do not think it is as simple as returning a customized response for the authenticating client. How is middleware supposed to layer on top of BES here? How does a meta-scheduler know whether quantity 8 64-node jobs can be accepted for one user? For 8 distinct users? Does a (shared) meta-scheduler now need to make separate queries for every client? How does it understand the interference of multiple users' jobs?

I think there is really a need for a composite availability view, so that such meta-schedulers can reasonably think about a tentative future in which they try to subdivide and claim parts of the BES resource for multiple jobs. Can this be handled with a declarative advertisement, or does it require some transactional dialogue? The transactional approach seems too tightly coupled to me, i.e. I should be able to compute a sensible candidate plan before I start negotiating.

If we say all of this is too "researchy" for standardization, then I am not sure what the standard will really support. Perhaps the best approach is the first one I mentioned, where relatively raw data is exposed on several extensible axes (subject to authorization checks): overall resource pool descriptions, job sizing policies, user rights information, etc. Simple users may receive only a small subset of this information, which requires minimal transformation to tell them what they can submit. Middleware clients receive more elaborate data (if trusted) and can do more elaborate transformations of the data to help their planning.

The only alternative I can imagine right now would be a very elaborate resource description language using the JSDL "range value" concept to expose some core policy limits, plus a number of extensions to express overall constraints that define the outer bounds of the combinatoric solution space. This DOES seem pretty "researchy" to me... but maybe someone else sees a more appealing middle ground?

karl

--
Karl Czajkowski
karlcz@univa.com
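As a rough sketch of the layered, "raw data on several extensible axes" advertisement Karl describes above (all field names invented; this is not a proposed BES or JSDL schema), the idea might look something like:

# Hypothetical advertisement exposing raw data on several extensible
# axes; which axes a client sees is subject to authorization checks.
full_ad = {
    "resource_pool": {"total_nodes": 512, "free_nodes": 480, "node_memory_mb": 4096},
    "job_sizing_policy": {"max_nodes_per_job": 128, "max_wallclock_hours": 24},
    "user_rights": {"alice": {"max_concurrent_jobs": 4}, "bob": {"max_concurrent_jobs": 1}},
}

def view_for(client, ad):
    """Return the subset of axes a given class of client is allowed to see."""
    if client == "simple_user":
        # Minimal transformation needed to learn what can be submitted.
        return {"resource_pool": ad["resource_pool"],
                "job_sizing_policy": ad["job_sizing_policy"]}
    if client == "trusted_metascheduler":
        return ad  # full data, for more elaborate planning
    return {}

print(view_for("simple_user", full_ad))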