
On Apr 19, Christopher Smith loaded a tape reading:
If we need to specify different mechanisms for starting up tasks of a parallel job a la the RSL jobType, then I'd like that to be separate from the description of the resource allocation required.
For what it's worth, queuing systems like LSF/PBS/SGE don't handle this startup phase (it's up to the job), so I'd like to see some example terms describing job process topology (basically simple|multi|mpi use cases), since I'm not too sure what they would look like, or what semantics would be required.
Allocate "as a unit" just means that if I'm going to allocate any cpus from a resource, I have to allocate "tileSize" cpus.
-- Chris
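The "allocate as a unit" rule Chris describes can be sketched as a small helper that rounds a CPU request up to a whole number of tiles. This is purely illustrative; the function name and shape are not part of any JSDL or scheduler API.

```python
# Sketch of the "allocate as a unit" rule: if any CPUs are allocated from a
# resource, they must come in whole multiples of tileSize. Names here are
# invented for illustration.
import math

def cpus_to_allocate(requested_cpus: int, tile_size: int) -> int:
    """Round a CPU request up to the nearest multiple of tile_size."""
    if requested_cpus <= 0:
        return 0
    return math.ceil(requested_cpus / tile_size) * tile_size

# e.g. requesting 6 CPUs from a resource with tileSize=4 allocates 8
```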
Well, I am struggling because I do not want to propose creeping featurism for JSDL... if possible, I think the startup mechanism should be left to extensions because it is such a rich and messy thing, as I will try to describe below.

What I am struggling to understand w.r.t. JSDL is whether there is some aspect of job layout that is a meaningful part of the job definition but not as simple as the resource topology stuff you were discussing. Because of the RSL legacy, I keep wanting to see some generic concepts for process count etc. that are orthogonal to the specific startup mechanism but which in essence parameterize both allocation and job startup.

Perhaps if resource topology is precise enough, there is nothing more needed? Maybe a precise description of allocated resources defines a "job shaped hole" into which an implied job topology would fit? :-) The constellation of resource requirements and posix limits (and any other extensions?) is what defines the virtual resource or "job shaped hole" within which the executable is activated.

A practical runtime environment feature might be for a job system like GRAM to expose a "resource map" in the form of JSDL resource syntax in a file or environment variable, so the job can introspect on the actual allocation it received... this is a different but related portability/interop problem for job execution systems when you include runtime middleware in the executable. For example, if a future MPICH-Gx release supports the dynamic task features of MPI, the runtime implementation might require this sort of information from the scheduler so it can work within its allocation?

karl

OK, here is the messy stuff I hope can remain somehow out of scope but still feasible. Basically, GRAM is a higher-level job submission model than what you describe for LSF/PBS/SGE, where we try to provide a more generic user-oriented job model instead of the very low-level "job script" model of the local scheduler.
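The "resource map" idea above might look something like the following sketch: the job system publishes the actual allocation as a JSDL-like resource fragment in an environment variable, and the job introspects on it at startup. The variable name `JSDL_RESOURCE_MAP` and the element names are invented for this example; no GRAM release defines them.

```python
# Hedged sketch of the proposed "resource map" feature. The env var name and
# the XML vocabulary are hypothetical, invented for illustration only.
import os
import xml.etree.ElementTree as ET

def read_resource_map(env=os.environ):
    """Parse a hypothetical JSDL-like allocation description from the env."""
    doc = env.get("JSDL_RESOURCE_MAP")
    if doc is None:
        return None  # scheduler did not publish an allocation map
    root = ET.fromstring(doc)
    return {host.get("name"): int(host.get("cpus"))
            for host in root.findall("Host")}

# Example: a two-host allocation that runtime middleware (say, a future
# MPI implementation with dynamic tasks) could use to size its layout.
example = ('<Resources>'
           '<Host name="n0" cpus="4"/><Host name="n1" cpus="4"/>'
           '</Resources>')
print(read_resource_map({"JSDL_RESOURCE_MAP": example}))
```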
Job types in RSL are different activation methods:

single: one instance of the executable is activated, and it must, through site-specific means, do whatever else is needed; for example, read a scheduler-specific HOSTS file and use some site-specific launch mechanism to start tasks on each allocated host.

multi: all "count" instances of the executable are activated, so the job need not do anything but compute; for example, GRAM generates a job script that does the site-specific work described under "single".

mpirun: the parameters are mapped through to an 'mpirun' invocation to launch the runtime required for the job. In practice, I think this is a wrapped form of "single" where the user executable is mapped to an argument and the scheduled executable is mpirun, but I'd need to check to be sure.

condor: the job is submitted to a condor flock, if my memory serves me.

Missing is a funny SMP-aware hybrid one can imagine:

spreadsingle: one instance of the executable is activated on each allocated resource, but it must start additional parallel tasks itself if it wants parallelism on a resource. So: handle site-specific resource activation for the job, but leave the job to "expand" on each host (node).

-- Karl Czajkowski karlcz@univa.com