[jsdl-wg] summary of topology thread

27 Apr 2005

As promised on the telecon, here is my attempt to summarize the
"topology thread."  The basic status is that Chris and I seem to
agree on the range of resource constraints that need expression, and
there is a minor open question about how (if at all) the binding of
application to resource is expressed.  I have talked with Chris and
others from Platform a lot over the years, so it is not clear whether
we have made progress explaining this position to others. :-/

PROPOSED SET OF CONCEPTS

Chris and I think there are three critical concepts that need
expression, all of which relate to the allocation the job will
receive.  We think these should be expressed in the resource section:

   A. Total Number of CPUs for job

   B. Number of CPUs per resource (e.g. per LSF host)

   C. Total Number of resources

I think both (B) and (C) are tiling constraints and need to default to
"unconstrained" if left out of the JSDL doc.

It is important to note that by expressing all of (A), (B), and (C)
one can overconstrain the selection, e.g. describe redundant things or
impossible things.  We think this is OK and maybe even necessary in
order to allow expression of the important things.  All three values
should be of range-value type, to allow generalized constraints with
step values, limits, etc.
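
To make the redundant/impossible distinction concrete, here is a
rough sketch (not part of the proposal).  It assumes a single Resource
element with uniform tiling, so that total = count * CPUs per
resource, and it models a range-value as any Python iterable of
integers.  The function name and the search bound are mine.

```python
from typing import Iterable, Optional, Set, Tuple

def feasible(total: Iterable[int],
             cpus: Optional[Iterable[int]] = None,
             count: Optional[Iterable[int]] = None) -> Set[Tuple[int, int]]:
    """Return all (count, cpus_per_resource) pairs allowed by (A), (B), (C)."""
    totals = set(total)                     # (A), assumed non-empty here
    anything = range(1, max(totals) + 1)    # "unconstrained" default for (B), (C)
    cpus_ok = set(anything if cpus is None else cpus)
    count_ok = set(anything if count is None else count)
    # Assumption: uniform tiling, i.e. total = count * cpus_per_resource.
    return {(n, c) for n in count_ok for c in cpus_ok if n * c in totals}

redundant = feasible([32], cpus=[8], count=[4])    # (C) adds nothing new
impossible = feasible([32], cpus=[8], count=[5])   # no allocation fits
stepped = feasible(range(16, 33, 16), cpus=[8])    # total of 16 or 32

print(redundant, impossible, stepped)
```

An empty result corresponds to an "impossible" description; a result
that does not shrink when a constraint is added corresponds to a
"redundant" one.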

Another minor point is whether we want to express shared versus
exclusive access to resources.  Saying "4 CPUs per resource" DOES NOT
mean that every allocated host has exactly 4 installed CPUs.  It means
the job is granted the right to use 4 of the CPUs on that host; there
could be more, which are kept idle or allocated to other jobs.

REPRESENTATION OF CONCEPTS

The total value (A) is the difficult one, because it applies across
all Resource elements but does not really belong in the Application
section in our view.  I suggest that a wrapping layer could be
placed around Resource:

  <jsdl:Resources>
     <jsdl:TotalCPUs>jsdl:rangeValue</jsdl:TotalCPUs> ?
     ... other totals for RAM, VM, etc. ...
     <jsdl:Resource>
        <jsdl:Count>jsdl:rangeValue</jsdl:Count> ?
        <jsdl:CPUs>jsdl:rangeValue</jsdl:CPUs> ?
        ... other per-resource constraints ...
     </jsdl:Resource> +
  </jsdl:Resources>

Make jsdl:TotalCPUs default to "<exact>1</exact>".

Make jsdl:Count and jsdl:CPUs default to unconstrained.
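
A minimal sketch of how a consumer might apply these defaults when
reading such a document.  The namespace URI "urn:example:jsdl" is a
placeholder I made up, the helper names are hypothetical, and only the
<exact> form of a range-value is handled; None stands for
"unconstrained".

```python
import xml.etree.ElementTree as ET

NS = {"jsdl": "urn:example:jsdl"}  # placeholder namespace, not the real one

EMPTY = ('<jsdl:Resources xmlns:jsdl="urn:example:jsdl">'
         '<jsdl:Resource/></jsdl:Resources>')

TILED = """
<jsdl:Resources xmlns:jsdl="urn:example:jsdl">
  <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
  <jsdl:Resource>
    <jsdl:CPUs><exact>8</exact></jsdl:CPUs>
  </jsdl:Resource>
</jsdl:Resources>
"""

def exact_or_default(parent, tag, default):
    """Read <jsdl:tag><exact>N</exact></jsdl:tag>, or fall back to default."""
    node = parent.find(f"jsdl:{tag}/exact", NS)
    return default if node is None else int(node.text)

def read_constraints(xml_text):
    root = ET.fromstring(xml_text)
    total = exact_or_default(root, "TotalCPUs", 1)       # default: exact 1
    resources = [
        {"count": exact_or_default(r, "Count", None),    # default: unconstrained
         "cpus": exact_or_default(r, "CPUs", None)}      # default: unconstrained
        for r in root.findall("jsdl:Resource", NS)
    ]
    return total, resources

print(read_constraints(EMPTY))
print(read_constraints(TILED))
```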

The remaining question, again, is how to represent the binding of
application to resource.  This is where the GRAM
single/multi/mpi/condor distinction fits in, as a way to state
different launch techniques.  It might be appropriate to say the
default is "site specific", with the reasonable assumption that a real
resource topology is selected by the scheduler and somehow
communicated to the job in the form of a resources file, environment
setting, or actual process set startup.  One can imagine arbitrarily
strange and complex scenarios here for the interesting cases.

USE CASES TO ILLUSTRATE CONCEPTS

Chris made a nice concise statement of use cases that need to be
addressed, and I think the implication is that JSDL examples should
use these cases rather than the existing, confusing ones:

   1. Simple MPI job. Wants 32 processors with 1 processor per
      resource (in JSDL, a host is a "resource").

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            <jsdl:Count><exact>32</exact></jsdl:Count>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            <jsdl:Count><exact>32</exact></jsdl:Count>
            <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

      All three of these lead to exactly the same set of possible
      allocations.  I will not show equivalent permutations for the
      rest of the examples but this gives the basic idea.

   2. OpenMPI job. Wants 32 processors with 8 processors per resource.

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            <jsdl:CPUs><exact>8</exact></jsdl:CPUs>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

   3. An OpenMP job. Wants 32 processors. Shared mem of course, so one
      resource.

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            <jsdl:Count><exact>1</exact></jsdl:Count>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

   4. A "homegrown" master/slave parallel job (say a ligand docking
      job). Wants 32 processors. No tiling constraints at all.

      <jsdl:Resources>
         <jsdl:TotalCPUs><exact>32</exact></jsdl:TotalCPUs>
         ...
         <jsdl:Resource>
            ...
         </jsdl:Resource>
      </jsdl:Resources>

   * Note that I'm specifically leaving out the Naregi "coupled
   simulation" use case (sorry guys), since we determined at the last
   GGF that it was a case which could be decomposed into multiple JSDL
   documents.
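
The use cases above can be checked mechanically.  The sketch below is
only an illustration: it handles exact values, assumes a single
Resource element with uniform tiling (total = count * CPUs per
resource), and the function name is hypothetical.  It fills in
whichever of Count or CPUs is implied by the values that were given.

```python
from typing import Optional, Tuple

Triple = Tuple[int, Optional[int], Optional[int]]

def resolve(total: int, count: Optional[int] = None,
            cpus: Optional[int] = None) -> Triple:
    """Return (total, count, cpus) with any implied value filled in."""
    if count is None and cpus is None:
        return (total, None, None)       # no tiling constraint: left open
    if count is None:
        count, rem = divmod(total, cpus)
    elif cpus is None:
        cpus, rem = divmod(total, count)
    else:
        rem = total - count * cpus
    if rem != 0:
        raise ValueError("constraints cannot be satisfied")
    return (total, count, cpus)

# Use case 1: all three spellings resolve to the same allocation.
case1 = {resolve(32, cpus=1), resolve(32, count=32),
         resolve(32, count=32, cpus=1)}
case2 = resolve(32, cpus=8)    # 8 CPUs per resource
case3 = resolve(32, count=1)   # one shared-memory resource
case4 = resolve(32)            # no tiling constraints

print(case1, case2, case3, case4)
```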

Of course, these use cases leave out many interesting "loose"
constraints where range-value is used to allow many possibly
heterogeneous solutions to the constraint system of the JSDL doc.

I would add that I think the "coupled simulation" use case can be
addressed by the resource constraints we are proposing; it is only
the "binding of application to resource" that requires multiple JSDL
docs.  For example, we can select heterogeneous resources if the
single application knows how to run each different "simulation" on
subsets of the allocation without expressing that binding in JSDL:

      <jsdl:Resources>
         ...
         <jsdl:Resource>
            <jsdl:Count><exact>8</exact></jsdl:Count>
            <jsdl:CPUs><exact>2</exact></jsdl:CPUs>
            ...
         </jsdl:Resource>
         <jsdl:Resource>
            <jsdl:Count><exact>16</exact></jsdl:Count>
            <jsdl:CPUs><exact>1</exact></jsdl:CPUs>
            ...
         </jsdl:Resource>
      </jsdl:Resources>
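
As a quick arithmetic check of what this heterogeneous example asks
for, reading the second group as 16 resources with 1 CPU each (my
assumption), and using a hypothetical list-of-dicts representation:

```python
# Each entry mirrors one Resource element: Count resources, CPUs on each.
groups = [
    {"count": 8, "cpus": 2},
    {"count": 16, "cpus": 1},
]

total_resources = sum(g["count"] for g in groups)
total_cpus = sum(g["count"] * g["cpus"] for g in groups)

print(total_resources, total_cpus)
```

That is 24 resources and 32 CPUs in total, even though no TotalCPUs
constraint is stated.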

If we wanted to address such complex scenarios more elegantly, we
could allow more than one Resources element or even allow Resources
elements to recursively contain Resources elements in order to express
logical assemblies of resources which scope the "total" constraints
expressed within them as a summation of the contained assemblies or
"leaf" resource elements.  I do not know if anyone cares about this
sort of hierarchical resource map.
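
To show what I mean by scoped totals, here is a hedged sketch of such
a hierarchy.  The class names (Leaf, Assembly) and the required_total
field are invented for illustration; nothing like them exists in the
JSDL draft.  Each assembly sums the CPUs of whatever it contains and
checks its own total against that sum.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Union

@dataclass
class Leaf:
    """One Resource element: `count` resources with `cpus` CPUs each."""
    count: int
    cpus: int

    def total_cpus(self) -> int:
        return self.count * self.cpus

@dataclass
class Assembly:
    """One Resources element, possibly nested inside another."""
    children: List[Union["Assembly", Leaf]] = field(default_factory=list)
    required_total: Optional[int] = None   # None means unconstrained

    def total_cpus(self) -> int:
        return sum(child.total_cpus() for child in self.children)

    def satisfied(self) -> bool:
        own_ok = (self.required_total is None
                  or self.required_total == self.total_cpus())
        nested_ok = all(c.satisfied() for c in self.children
                        if isinstance(c, Assembly))
        return own_ok and nested_ok

sim_a = Assembly([Leaf(count=8, cpus=2)], required_total=16)
sim_b = Assembly([Leaf(count=16, cpus=1)], required_total=16)
job = Assembly([sim_a, sim_b], required_total=32)

print(job.total_cpus(), job.satisfied())
```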

karl

-- 
Karl Czajkowski
karlcz@univa.com