
As promised on the telecon, I will try to summarize the "topology thread." The basic status is that Chris and I seem to be in agreement about a range of resource constraints that need expression, and there is a minor open question about how (if at all) the binding of application to resource is expressed. I have talked to Chris and others from Platform a lot over the years, so I guess it is not clear whether we've made progress explaining this position to others. :-/

PROPOSED SET OF CONCEPTS

What Chris and I think is that there are three critical concepts that need expression, and they all relate to what allocation the job will receive. We think these should be expressed in the resource section:

  A. Total Number of CPUs for job
  B. Number of CPUs per resource (e.g. per LSF host)
  C. Total Number of resources

I think both (B) and (C) are tiling constraints and need to default to "unconstrained" if left out of the JSDL doc.

It is important to note that by expressing all of (A), (B), and (C) one can overconstrain the selection, e.g. describe redundant things or impossible things. We think this is OK and maybe even necessary in order to allow expression of the important things. All three values should be range-values, to allow generalized constraints with step values, limits, etc.

Another minor point is whether we want to express shared/exclusive access to resources. Saying "4 CPUs per resource" DOES NOT mean that all allocated hosts will have 4 installed CPUs. It means the job is granted the right to use 4 of the CPUs, but there could be more which are kept idle or allocated to other jobs.

REPRESENTATION OF CONCEPTS

The total value (A) is the difficult one, because it applies across all Resource elements but does not really belong in the Application section in our view. I suggest that a wrapping layer could be placed around Resource:

  <jsdl:Resources>
    <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
    ... other totals for RAM, VM, etc. ...
    <jsdl:Resource>
      <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
      <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
      ... other per-resource constraints ...
    </jsdl:Resource> +
  </jsdl:Resources>

Make jsdl:TotalCPUs default to "<exact>1</exact>". Make jsdl:Count and jsdl:CPUs default to unconstrained.

The remaining question, again, is how to represent the binding of application to resource. This is where the GRAM single/multi/mpi/condor stuff fits in to state different launch techniques. It might be appropriate to say the default is "site specific" and a reasonable assumption is that a real resource topology is selected by the scheduler and somehow communicated to the job in the form of a resources file, environment setting, or actual process set startup. One can imagine arbitrarily strange and complex scenarios here for the interesting cases.

USE CASES TO ILLUSTRATE CONCEPTS

Chris made a nice concise statement of use cases that need to be addressed, and I think the implication is that JSDL examples should use these cases rather than existing confusing ones:

1. Simple MPI job. Wants 32 processors with 1 processor per resource (in JSDL, a host is a "resource").

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>32</exact></jsdl:Count>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>32</exact></jsdl:Count>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

All three of these lead to exactly the same set of possible allocations. I will not show equivalent permutations for the rest of the examples but this gives the basic idea.

2. OpenMPI job. Wants 32 processors with 8 processors per resource.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:CPUs><exact>8</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

3. An OpenMP job. Wants 32 processors. Shared mem of course, so one resource.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>1</exact></jsdl:Count>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

4. A "homegrown" master/slave parallel job (say a ligand docking job). Wants 32 processors. No tiling constraints at all.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

* Note that I'm specifically leaving out the Naregi "coupled simulation" use case (sorry guys), since we determined at the last GGF that it was a case which could be decomposed into multiple JSDL documents.

Of course, these use cases leave out many interesting "loose" constraints where range-value is used to allow many possibly heterogeneous solutions to the constraint system of the JSDL doc.

I would add that I think the "coupled simulation" use case can be addressed by the resource constraints we are proposing, but it is in the "binding of application to resource" where multiple JSDL docs are required. For example, we can select heterogeneous resources if the single application knows how to run each different "simulation" on subsets of the allocation without expressing that binding in JSDL:

  <jsdl:Resources>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>8</exact></jsdl:Count>
      <jsdl:CPUs><exact>2</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
    <jsdl:Resource>
      <jsdl:Count><exact>16</exact></jsdl:Count>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

If we wanted to address such complex scenarios more elegantly, we could allow more than one Resources element or even allow Resources elements to recursively contain Resources elements in order to express logical assemblies of resources which scope the "total" constraints expressed within them as a summation of the contained assemblies or "leaf" resource elements. I do not know if anyone cares about this sort of hierarchical resource map.

karl

-- 
Karl Czajkowski
karlcz@univa.com
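
For anyone who wants to check the tiling arithmetic in the use cases, here is a minimal Python sketch of how (A), (B), and (C) interact for a single Resource element. It is not part of JSDL or of any scheduler: the names `allocations` and `exact` are hypothetical, a range-value is modeled as an inclusive (low, high) tuple with None meaning "unconstrained", step values are ignored, every allocated resource is assumed to be granted at least one CPU, and an allocation is treated as an unordered list of per-resource CPU grants.

```python
from typing import Iterator, List, Optional, Tuple

# Assumption: a range-value is an inclusive (low, high) pair; None = unconstrained.
Range = Optional[Tuple[int, int]]


def exact(n: int) -> Tuple[int, int]:
    """Model <exact>n</exact> as the degenerate range (n, n)."""
    return (n, n)


def allocations(total: int, cpus: Range = None,
                count: Range = None) -> Iterator[Tuple[int, ...]]:
    """Yield every allocation, as a non-increasing tuple of per-resource
    CPU grants, whose grants sum to `total` (A), each lie in `cpus` (B),
    and whose number of resources lies in `count` (C)."""
    low, high = cpus if cpus is not None else (1, total)
    low = max(low, 1)  # assumption: each resource gets at least one CPU
    min_n, max_n = count if count is not None else (1, total)

    def extend(remaining: int, largest: int,
               parts: List[int]) -> Iterator[Tuple[int, ...]]:
        if remaining == 0:
            if min_n <= len(parts) <= max_n:
                yield tuple(parts)
            return
        if len(parts) >= max_n:
            return  # no room for another resource
        # Try grants in non-increasing order so each multiset appears once.
        for grant in range(min(largest, remaining), low - 1, -1):
            yield from extend(remaining - grant, grant, parts + [grant])

    yield from extend(total, high, [])


# Use case 1: the three forms admit the same single allocation.
forms = [
    list(allocations(32, cpus=exact(1))),
    list(allocations(32, count=exact(32))),
    list(allocations(32, cpus=exact(1), count=exact(32))),
]
print(all(form == [(1,) * 32] for form in forms))  # True

# An over-constrained request: 32 CPUs, 8 per resource, exactly 3 resources.
print(list(allocations(32, cpus=exact(8), count=exact(3))))  # []
```

Under the same assumptions, use case 2 (`cpus=exact(8)`) yields only four resources of 8 CPUs each, use case 3 (`count=exact(1)`) yields one resource of 32 CPUs, and use case 4 (no tiling constraints) yields every way of splitting 32 CPUs across resources.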