
As promised on the telecon, I will try to summarize the "topology thread." The basic status is that Chris and I seem to be in agreement about a range of resource constraints that need expression, and there is a minor open question about how (if at all) the binding of application to resource is expressed. I have talked to Chris and others from Platform a lot over the years, so I guess it is not clear whether we've made progress explaining this position to others. :-/

PROPOSED SET OF CONCEPTS

What Chris and I think is that there are three critical concepts that need expression, and they all relate to what allocation the job will receive. We think these should be expressed in the resource section:

A. Total Number of CPUs for job
B. Number of CPUs per resource (e.g. per LSF host)
C. Total Number of resources

I think both (B) and (C) are tiling constraints and need to default to "unconstrained" if left out of the JSDL doc.

It is important to note that by expressing all of (A), (B), and (C) one can overconstrain the selection, e.g. describe redundant things or impossible things. We think this is OK and maybe even necessary in order to allow expression of the important things. All three values should be range-value for generalized constraints w/ step values, limits, etc.

Another minor point is whether we want to express shared/exclusive access to resources. Saying "4 CPUs per resource" DOES NOT mean that all allocated hosts will have 4 installed CPUs. It means the job is granted the right to use 4 of the CPUs, but there could be more which are kept idle or allocated to other jobs.

REPRESENTATION OF CONCEPTS

The total value (A) is the difficult one, because it applies across all Resource elements but does not really belong in the Application section in our view. I suggest that a wrapping layer could be placed around Resource:

  <jsdl:Resources>
    <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
    ... other totals for RAM, VM, etc. ...
    <jsdl:Resource>
      <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
      <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
      ... other per-resource constraints ...
    </jsdl:Resource> +
  </jsdl:Resources>

Make jsdl:TotalCPUs default to "<exact>1</exact>". Make jsdl:Count and jsdl:CPUs default to unconstrained.

The remaining question, again, is how to represent the binding of application to resource. This is where the GRAM single/multi/mpi/condor stuff fits in to state different launch techniques. It might be appropriate to say the default is "site specific" and a reasonable assumption is that a real resource topology is selected by the scheduler and somehow communicated to the job in the form of a resources file, environment setting, or actual process set startup. One can imagine arbitrarily strange and complex scenarios here for the interesting cases.

USE CASES TO ILLUSTRATE CONCEPTS

Chris made a nice concise statement of use cases that need to be addressed, and I think the implication is that JSDL examples should use these cases rather than existing confusing ones:

1. Simple MPI job. Wants 32 processors with 1 processor per resource (in JSDL, a host is a "resource").

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>32</exact></jsdl:Count>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>32</exact></jsdl:Count>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

All three of these lead to exactly the same set of possible allocations. I will not show equivalent permutations for the rest of the examples but this gives the basic idea.

2. OpenMPI job. Wants 32 processors with 8 processors per resource.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:CPUs><exact>8</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

3. An OpenMP job. Wants 32 processors. Shared mem of course, so one resource.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>1</exact></jsdl:Count>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

4. A "homegrown" master/slave parallel job (say a ligand docking job). Wants 32 processors. No tiling constraints at all.

  <jsdl:Resources>
    <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
    ...
    <jsdl:Resource>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

* Note that I'm specifically leaving out the Naregi "coupled simulation" use case (sorry guys), since we determined at the last GGF that it was a case which could be decomposed into multiple JSDL documents.

Of course, these use cases leave out many interesting "loose" constraints where range-value is used to allow many possibly heterogeneous solutions to the constraint system of the JSDL doc.

I would add that I think the "coupled simulation" use case can be addressed by the resource constraints we are proposing, but it is in the "binding of application to resource" where multiple JSDL docs are required. For example, we can select heterogeneous resources if the single application knows how to run each different "simulation" on subsets of the allocation without expressing that binding in JSDL:

  <jsdl:Resources>
    ...
    <jsdl:Resource>
      <jsdl:Count><exact>8</exact></jsdl:Count>
      <jsdl:CPUs><exact>2</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
    <jsdl:Resource>
      <jsdl:Count><exact>16</exact></jsdl:Count>
      <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
      ...
    </jsdl:Resource>
  </jsdl:Resources>

If we wanted to address such complex scenarios more elegantly, we could allow more than one Resources element, or even allow Resources elements to recursively contain Resources elements, in order to express logical assemblies of resources which scope the "total" constraints expressed within them as a summation of the contained assemblies or "leaf" resource elements. I do not know if anyone cares about this sort of hierarchical resource map.

karl

-- Karl Czajkowski karlcz@univa.com
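To make the interplay of (A), (B) and (C) concrete, here is a minimal Python sketch of the constraint system described above. It is an illustration only, not part of any JSDL proposal: the function names `allowed` and `tilings` are hypothetical, a range-value is modeled simply as a set of permitted integers (None meaning "unconstrained"), and allocations are assumed homogeneous (the same number of CPUs granted on every host).

```python
from itertools import product

def allowed(value, constraint):
    """A constraint is None (unconstrained) or a set of permitted integers."""
    return constraint is None or value in constraint

def tilings(total_cpus=frozenset({1}), count=None, cpus=None, max_hosts=64):
    """Enumerate (hosts, cpus_per_host) pairs satisfying TotalCPUs (A),
    CPUs per resource (B) and Count (C).

    Defaults follow the message: TotalCPUs is exactly 1, the two tiling
    constraints are unconstrained. max_hosts only bounds the search.
    """
    result = set()
    for hosts, per_host in product(range(1, max_hosts + 1), repeat=2):
        if (allowed(hosts * per_host, total_cpus)
                and allowed(hosts, count)
                and allowed(per_host, cpus)):
            result.add((hosts, per_host))
    return result

# Use case 1, written in the three equivalent ways shown in the message.
by_cpus = tilings({32}, cpus={1})
by_count = tilings({32}, count={32})
by_both = tilings({32}, count={32}, cpus={1})
print(by_cpus == by_count == by_both)  # True

# Use case 4: no tiling constraints, so every factorization of 32 qualifies.
print(sorted(tilings({32})))

# Over-constrained: 5 hosts of 8 CPUs can never total 32 CPUs.
print(tilings({32}, count={5}, cpus={8}))  # set()
```

The empty result in the last call is the "impossible things" case: the document is still well formed, it simply has no satisfying allocation.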

On Apr 27, Karl Czajkowski loaded a tape reading: ...
  <jsdl:Resources>
    <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
    ... other totals for RAM, VM, etc. ...
    <jsdl:Resource>
      <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
      <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
      ... other per-resource constraints ...
    </jsdl:Resource> *
    <jsdl:Resources /> *
  </jsdl:Resources> *
I have changed the cardinality as I think it should be if we want the general case:

- 0-N Resources clauses, since a job may have none?
- 0-N Resource clauses per Resources, because it may have only "global" constraints
- 0-N nested Resources if you want the hierarchical model. in this case, we need to define whether "global" attributes summarize all Resource + Resources children or ONLY Resource children. I advocate the first: total summation of all children.

to address Andreas's problem, I think we should add an attribute to the Resources element:

  <jsdl:Resources resourceModel="spaceshare"> ...

we need to define a few values and also assert a default model to assume if it is not present. I suggest spaceshare as the default because I am biased towards batch jobs. :-)

- spaceshare: the resources content describes the virtual "portion" of resource that the job gets to use, which MAY be a small part of a larger physical resource complex
- physical: the resources content describes a real physical resource complex (for Andreas's provisioning use case)
- virtual: should we presume a virtual machine version of the physical scenario? this seems to be a new popular datacenter trick to get "mainframe" like behavior from commodity hardware...
- other: some other extended content SHOULD identify the model interpretation to apply.

karl

-- Karl Czajkowski karlcz@univa.com
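The two readings of a "global" total in the hierarchical model can be compared with a small Python sketch. Everything here is hypothetical scaffolding for illustration: the `Resource` and `Resources` classes, their field names, and the idea that each clause has already been resolved to a single granted count and CPU number are assumptions, not anything defined by the JSDL draft.

```python
from dataclasses import dataclass, field
from typing import List, Union

@dataclass
class Resource:
    count: int  # hosts granted for this clause (assumed already resolved)
    cpus: int   # CPUs granted per host

    def total_cpus(self) -> int:
        return self.count * self.cpus

@dataclass
class Resources:
    children: List[Union["Resources", Resource]] = field(default_factory=list)
    resource_model: str = "spaceshare"  # the default suggested in the message

    def total_cpus(self) -> int:
        # Reading 1 (the one advocated): sum over ALL children,
        # nested Resources as well as leaf Resource clauses.
        return sum(child.total_cpus() for child in self.children)

    def leaf_only_cpus(self) -> int:
        # Reading 2: only direct Resource children contribute.
        return sum(c.total_cpus() for c in self.children
                   if isinstance(c, Resource))

# 8 hosts x 2 CPUs at the top level, plus a nested assembly of 16 hosts x 1 CPU.
nested = Resources(children=[Resource(count=16, cpus=1)])
top = Resources(children=[Resource(count=8, cpus=2), nested])

print(top.total_cpus())      # 32
print(top.leaf_only_cpus())  # 16
```

Under the first reading a TotalCPUs constraint on the outer element covers the whole assembly; under the second it would silently ignore anything placed in a nested Resources element.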

I thought a little bit more about this problem Andreas has and I think I jumped to a solution too quickly.

Presumably, notions like whether you are getting "bare metal" or "virtual machines" or "posix process startup" are indicated in the application type. So, the only remaining question is whether there is some cross-cutting option that could change the behavior for a given application model.

Let me give a sort of weak "proof by contradiction". Assuming there is some cross-cutting concept here, I think it would be called something like "allocation model" and it has two enumerated values I can think of right now:

"total" allocations are what Andreas wants for his provisioning of node(s) for OS images.

"partial" allocations are what Chris wants for placement of jobs onto shared resources.

The difference is that the partial allocation says "job is allocated an abstract resource R1 which satisfies the resource selection criteria and MAY be part of a larger resource R2". The total allocation says that "job is allocated a resource R and that's that".

Here's the contradiction: the difference between these is really subtle and has to do with the "opacity" of the resource abstraction (the so-called "job container") within which the job is run. If the job container is opaque enough, the job cannot tell whether it got a total or partial allocation: a scheduler for POSIX jobs could use system partitioning controls to create what looks like a smaller SMP out of a larger SMP, for example! However, a typical POSIX shared resource has a transparent job container where jobs can detect the larger, shared resource's characteristics. In addition, whether the allocation is advisory or enforced is another related but separate issue. I do not see how we can generically model either of these aspects of allocation without citing behavioral aspects of a particular application type's environment.

For me, one question remains: do we think it is in scope to define different types of resource abstraction for the posix application type? In other words, does the JSDL posix application element need a way to express whether the apparent machine (visible using posix system interfaces) is allowed to be "larger" than the allocation given to the job or not? I am not sure there is a need. The scientific community has been using the transparent abstraction for years, where the system MAY be larger and it is best for the application to "avert its eyes". :-) Besides, we've already punted on how the application learns what its allocation is and how the application processes get started. This allocation opacity issue seems quite metaphysical in comparison...

I definitely think it is out of scope whether an "opaque" allocation is using real "bare hardware" or some sort of virtualized machine, as the distinction comes down to very different QoS concerns like processor and memory performance. This should be handled through extensions and additional context like WS-Agreement, etc.

karl

-- Karl Czajkowski karlcz@univa.com
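As a rough illustration of the "transparent job container" point, the following Python sketch shows how an ordinary POSIX-style process can see both the size of the apparent machine and, on some systems, the subset of CPUs it is actually permitted to run on. This is only one possible mechanism and an assumption on my part: CPU affinity is used here as a stand-in for an enforced allocation, `os.sched_getaffinity` is available on Linux but not on every platform, and a scheduler that makes purely advisory allocations would leave no trace here at all. The helper name `visible_and_usable_cpus` is hypothetical.

```python
import os

def visible_and_usable_cpus():
    """Return (CPUs the host reports, CPUs this process may run on)."""
    visible = os.cpu_count() or 1  # cpu_count() may return None
    if hasattr(os, "sched_getaffinity"):
        # Linux: the set of CPUs the current process is restricted to.
        usable = len(os.sched_getaffinity(0))
    else:
        # No affinity interface: assume the process may use everything it sees.
        usable = visible
    return visible, usable

visible, usable = visible_and_usable_cpus()
print(f"host reports {visible} CPUs; this process may run on {usable}")
```

When the two numbers differ, the job is in exactly the situation described above: it can detect the larger shared resource and has to "avert its eyes".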

Karl,

To use the example in the teleconference, all I want to do is be able to make the distinction between the following statements in the resource description:

1. X CPUs any way you can
2. X CPUs on the same machine
3. X CPUs on the same machine and that's all that machine should have

and similarly for memory, for example.

I take a very narrow, perhaps simplistic view, that this is merely syntax. And so I find it very hard to accept arguments that effectively come down to saying that making the distinction between statements 1 and 2 is good and making the distinction between statements 2 and 3 is not.

Andreas

-- Andreas Savva Fujitsu Laboratories Ltd

On May 12, Andreas Savva loaded a tape reading:
Karl,
To use the example in the teleconference, all I want to do is be able to make the distinction between the following statements in the resource description:

1. X CPUs any way you can
2. X CPUs on the same machine
3. X CPUs on the same machine and that's all that machine should have
and similarly for memory, for example.
I take a very narrow, perhaps simplistic view, that this is merely syntax. And so I find it very hard to accept arguments that effectively come down to saying that making the distinction between statements 1 and 2 is good and making the distinction between statements 2 and 3 is not.
Andreas
The distinctions 1 vs. 2 and 2 vs. 3 are not even on the same conceptual "axis"! We already have the ability to allocate whole clusters of nodes so there is:

1. X CPUs any way you can
2. X CPUs in some complex partially constrained topology
3. X CPUs on one host

and the notion of "I got all CPUs on the/each host" is an orthogonal question demanding an orthogonal syntax. I think all 6 combinations are meaningful in the abstract:

1. any way you can and I mind/don't mind sharing node(s)
2. some complex topology and I mind/don't mind sharing node(s)
3. all on one host and I mind/don't mind sharing the host

Clearly, one does not demand that they get all hosts in the cluster, e.g. if they get the M nodes they want they do not care if they were allocated from a larger pool of P nodes. At least, no scheduler makes this an explicit mode of "get all nodes" versus "get N and I just so happen to know that N=P". This distinction you are after is of the same vein---did I get all CPUs per node or just _enough_ CPUs to match my constraint.

I am sure we can model this concept with another "total/partial" or "shared/unshared" attribute in the resource clause, but I am less sure that this is actually a good concept to model. As I said in the previous message, I do not really see how to separate it cleanly from the application model. What does it mean to share or not share the node? Except in your extreme case of loading an OS image (clearly, not a POSIX app), and some old Cray computers that didn't really timeshare, even a supposedly dedicated compute node usually has some other system processes on it that are not part of the job! Where is the line drawn between "background system processes I will ignore" and "other processes that annoy me"? I think this requires some more precise QoS terminology in the job description.

karl

-- Karl Czajkowski karlcz@univa.com
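The "two orthogonal axes" argument can be spelled out in a few lines of Python. This is only a sketch under stated assumptions: the labels are paraphrased from the message, and `satisfies_sharing` is a hypothetical, deliberately naive check that treats "not sharing" as "granted CPUs equal installed CPUs on every host". It ignores background system processes entirely, which is precisely the fuzziness the message objects to.

```python
from itertools import product

# Axis 1: topology constraint. Axis 2: sharing preference.
TOPOLOGIES = ("any way you can",
              "complex partially constrained topology",
              "all on one host")
SHARING = ("don't mind sharing", "mind sharing")

combinations = list(product(TOPOLOGIES, SHARING))
for topology, sharing in combinations:
    print(f"{topology:40} | {sharing}")
print(len(combinations))  # 6

def satisfies_sharing(hosts, exclusive):
    """hosts is a list of (installed_cpus, granted_cpus) pairs.

    Naive reading: an exclusive request is met only if the job was granted
    every installed CPU on every host it touches.
    """
    if not exclusive:
        return True
    return all(granted == installed for installed, granted in hosts)

print(satisfies_sharing([(8, 4)], exclusive=False))         # True
print(satisfies_sharing([(8, 4)], exclusive=True))          # False
print(satisfies_sharing([(4, 4), (4, 4)], exclusive=True))  # True
```

Because the sharing check never looks at the topology, either axis can be varied without touching the other, which is the sense in which the message calls for an orthogonal syntax.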

Karl Czajkowski wrote:
The distinctions 1 vs. 2 and 2 vs. 3 are not even on the same conceptual "axis"! [...] Where is the line drawn between "background system processes I will ignore" and "other processes that annoy me"? I think this requires some more precise QoS terminology in the job description.
I'm wondering what is the minimum we can describe in order to get a nicely usable JSDL 1.0 spec? I do not mind if there are distinctions that we omit for 1.0 on the grounds of getting the spec done sooner rather than later, but I'll merrily admit to not understanding all the ramifications of the processor allocation models. Simple to describe and simple to implement feel to me like key goals to aim for. :-)

In other words, let us finalize a version for 1.0 and then revisit the area in more detail later on if it is warranted.

Donal.

Donal K. Fellows wrote:
Karl Czajkowski wrote:
The distinctions 1 vs. 2 and 2 vs. 3 are not even on the same conceptual "axis"!
[...]
Where is the line drawn between "background system processes I will ignore" and "other processes that annoy me"? I think this requires some more precise QoS terminology in the job description.
I'm wondering what is the minimum we can describe in order to get a nicely usable JSDL 1.0 spec? I do not mind if there are distinctions that we omit for 1.0 on the grounds of getting the spec done sooner rather than later, but I'll merrily admit to not understanding all the ramifications of the processor allocation models. Simple to describe and simple to implement feel to me like key goals to aim for. :-)
In other words, let us finalize a version for 1.0 and then revisit the area in more detail later on if it is warranted.
Right. I would be satisfied with a suitably named(*) attribute/element in the 'Resource' section that allows me to make the distinction I described; and if not present defaults to the behaviour Karl wants.

This isn't an attempt to wreak havoc on people's resource allocation policies or somehow bring the grid to a grinding halt. :-)

(*) Sorry, it's been a long day (and week) and I have no facility to come up with suitable names at the moment.

-- Andreas Savva Fujitsu Laboratories Ltd

Hi Karl,

Karl Czajkowski wrote:
On Apr 27, Karl Czajkowski loaded a tape reading: ...
  <jsdl:Resources>
    <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
    ... other totals for RAM, VM, etc. ...
    <jsdl:Resource>
      <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
      <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
      ... other per-resource constraints ...
    </jsdl:Resource> *
    <jsdl:Resources /> *
  </jsdl:Resources> *
I have changed the cardinality as I think it should be if we want the general case:
- 0-N Resources clauses, since a job may have none?
- 0-N Resource clauses per Resources, because it may have only "global" constraints
- 0-N nested Resources if you want the hierarchical model. in this case, we need to define whether "global" attributes summarize all Resource + Resources children or ONLY Resource children. I advocate the first: total summation of all children.
I can appreciate the generality of this proposal but I'm wondering if we could get away with a smaller change to the spec at this stage? For example, keep the <jsdl:Resource> sections as in the current spec and introduce a separate <jsdl:AggregateResources> (or similarly named) section instead. I vaguely recall (but couldn't locate) an email where you proposed something like that. So, how about

  <jsdl:AggregateResources>
    <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
    ... other totals for RAM, VM, etc. ...
  </jsdl:AggregateResources> ?

  <jsdl:Resource>
    <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
    <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
    ... other per-resource constraints ...
  </jsdl:Resource> *

I think doing it this way would also allow us, if we want, to structure the Aggregate Resources section as an extension to the base JSDL spec and include it in the specification in the same way as POSIXApplication.
to address Andreas's problem, I think we should add an attribute to the Resources element:
<jsdl:Resources resourceModel="spaceshare"> ...
And put this attribute on the <jsdl:Resource> element.
we need to define a few values and also assert a default model to assume if it is not present. I suggest spaceshare as the default because I am biased towards batch jobs. :-)
I like the values and their definitions. And I second the proposal that spaceshare should be the default. :-)
- spaceshare: the resources content describes the virtual "portion" of resource that the job gets to use, which MAY be a small part of a larger physical resource complex
- physical: the resources content describes a real physical resource complex (for Andreas's provisioning use case)
- virtual: should we presume a virtual machine version of the physical scenario? this seems to be a new popular datacenter trick to get "mainframe" like behavior from commodity hardware...
- other: some other extended content SHOULD identify the model interpretation to apply.
karl
Cheers, Andreas
participants (3)
- Andreas Savva
- Donal K. Fellows
- Karl Czajkowski