
Hi;

You're right that I wasn't thinking about the most general cases that can occur in workflow and parallel/distributed programs, such as MPI or UPC programs. In particular, I was only thinking about things that can be declaratively described and that don't require supplying code that must run "inside" the scheduling infrastructure.

Let's consider workflow and job dependencies first. Static workflows and job dependencies can be described declaratively as XML infosets using standardized terminology, and for these I claim that the extension mechanisms I've described are sufficient. That is, supporting static workflows and job dependencies is a matter of defining the appropriate standardized description syntax and semantics; supporting extended versions of them is a matter of agreeing on how the descriptions may be extended (which is covered by the mechanisms I've already included).

Dynamic workflows and job dependencies require that a client supply application-specific code that runs inside the scheduling infrastructure in order to supply dynamically computed decisions. This requires an additional extension mechanism beyond the ones I listed. Note that whether the client expresses the decision-making code as something like BPEL or as something like a Java servlet that the scheduling infrastructure runs is a second-order issue: in both cases the client is supplying code that gets run inside the scheduling infrastructure. So, you are right that we need the ability to run client-supplied code inside the scheduling infrastructure for some of the extensions we might contemplate. That said, I would argue that we should save these kinds of extensions for later in our deliberations, since they are far more complicated to get right than the ones I listed. But we should definitely keep them in mind.

Regarding MPI, UPC, and other forms of parallel/distributed programs (e.g. PVM): there is a declarative aspect that is visible to clients, and an internal implementation aspect that I would argue should not be visible in the interface between clients and schedulers. Consider an MPI program based on the MPICH infrastructure, which consists of SMPD daemons running on each compute node used by the program. A client will specify the MPI program to run, which MPI infrastructure it expects, and the relevant MPI-related arguments to supply (as well as other arguments, environment variables, etc.). All of this can be described and encoded as an XML infoset. The scheduling infrastructure will internally need to implement the MPICH SMPD daemon-based infrastructure, but the details of that aren't visible in the job scheduling interface.

An interesting question is whether MPICH's implementation aspects need to become visible when we consider scheduler-scheduler interactions. If an MPI program can span multiple clusters, then the relevant SMPD daemons from the multiple clusters need to be put in touch with each other. In the case of MPICH, I believe the main requirement is that the root SMPD daemon receive a list of the IP addresses of all the compute nodes that will participate in a given MPI program. In that case, the server-server aspects of scheduling an MPI program mainly have to do with allocating the appropriate number of compute nodes - and obtaining their names - from compute clusters that have indicated that they support the MPICH SMPD infrastructure.
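To make the declarative, client-visible aspect concrete, here is a minimal sketch of what such a job-description infoset might look like. This is purely illustrative - the element names (HPCJob, MPIInfrastructure, etc.) and the namespace are invented placeholders, not a proposed schema:

    <!-- Hypothetical sketch only; names are illustrative, not a proposed schema. -->
    <HPCJob xmlns="urn:example:hpcprofile:draft">
      <Executable>/apps/bin/climate-sim</Executable>
      <Arguments>
        <Argument>-iterations</Argument>
        <Argument>5000</Argument>
      </Arguments>
      <Environment>
        <Variable name="SCRATCH_DIR">/scratch/jobs</Variable>
      </Environment>
      <!-- Declarative statement of which MPI infrastructure the job expects.
           How the scheduler wires up the SMPD daemons to satisfy this
           remains internal and invisible to the client. -->
      <MPIInfrastructure type="MPICH-SMPD">
        <ProcessCount>64</ProcessCount>
      </MPIInfrastructure>
    </HPCJob>

The point is simply that everything the client needs to say fits in a declarative document; nothing in it exposes how the scheduling infrastructure implements the daemon plumbing.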
So I hypothesize that support for parallel/distributed programs is mainly a matter of defining the appropriate declarative standards and doesn't require any additional extension mechanisms beyond those I've already described. I would, of course, be very interested to learn of examples where this is not enough.

Marvin.

________________________________
From: Balle, Susanne [mailto:Susanne.Balle@hp.com]
Sent: Monday, May 01, 2006 7:42 AM
To: Marvin Theimer
Cc: ogsa-wg@ggf.org
Subject: RE: [ogsa-wg] Thoughts on extensions mechanisms for the HPC profile work

Marvin,

I think this is a good start. I did find some areas missing, such as workflow and support for job dependencies, as well as extensions for MPI/UPC programs.

I like the "object-oriented" approach and agree with Dave that being able to specify more complex expressions is important and is, in my opinion, a requirement for ease of use. If you launch 1000 jobs, you want to be able to query groups of jobs without having to specify the individual jobs.

Regards,

Susanne

-----Original Message-----
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org] On Behalf Of Marvin Theimer
Sent: Friday, April 28, 2006 10:06 PM
To: ogsa-wg@ggf.org
Subject: [ogsa-wg] Thoughts on extensions mechanisms for the HPC profile work

Hi;

This email is intended to describe my views of the set of extension mechanisms that are both necessary and sufficient to implement the common cases that we have identified for the HPC profile work (see the document "HPC Use Cases - Base Case and Common Cases", a preliminary draft of which I sent to the ogsa-wg mailing list several weeks ago). These views are in large part derived from ongoing discussions that Chris Smith and I have been having on the subject of interoperable job scheduling designs.

This email is intended to start a discussion about extension mechanisms rather than to define the "answer" to this topic. So please do reply with suggestions for any changes and extensions (:-)) you feel are needed.

Marvin.

Additive vs. modifying extensions

At a high level, there are two types of extensions that one might consider:

* Purely additive extensions.
* Extensions that modify the semantics of the underlying base-level design.

Purely additive extensions - for example, ones that add strictly new functionality to an interface or that define additional resource types that clients and schedulers can refer to - seem fairly straightforward to support. Modifying extensions fall into two categories:

* Base-case semantics remain unchanged for parties operating at the base (i.e. un-extended) level.
* Base-case semantics change for parties operating at the base level.

Modifying extensions that leave the base-level semantics unchanged are straightforward to incorporate. An example is adding at-most-once semantics to interface requests: these operations now have more tightly defined failure semantics, but their functional semantics remain unchanged, and base-level clients can safely ignore the extended semantics.

Extensions that change base-level semantics should be disallowed, since they violate the fundamental premise of base-level interoperability. An example of such an extension would be having the creation of jobs at a particular (extended) scheduler require that the client issue an additional explicit resource-deallocation request once a job has terminated. Base-level clients would not know to do this, and the result would be an incorrectly functioning system.
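As a concrete illustration of a purely additive extension, a scheduler's WSDL might gain a new operation without touching any existing ones. The fragment below is a hypothetical sketch (the port-type, operation, and message names are made up for illustration, and the usual wsdl: and tns: namespace bindings are assumed):

    <!-- Hypothetical WSDL fragment: a purely additive Suspend operation.
         Existing base-level operations are untouched, so base-level
         clients can simply ignore it. -->
    <wsdl:portType name="ExtendedJobScheduler">
      <wsdl:operation name="SuspendJob">
        <wsdl:input  message="tns:SuspendJobRequest"/>
        <wsdl:output message="tns:SuspendJobResponse"/>
      </wsdl:operation>
    </wsdl:portType>

By contrast, an extension that quietly changed what an existing base-level operation means for un-extended clients could not be expressed this innocuously - which is exactly why it should be disallowed.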
Types of extensions

I believe the following types of extensions are both necessary and sufficient to meet the needs of the HPC profile work:

* Addition of new WSDL operations.
  - This is needed to support additional new functionality, such as the addition of suspend/resume operations (as sketched above). As long as base-level semantics aren't modified, this form of extension seems straightforward.
* Addition of parameters to existing WSDL operations.
  - As long as base-level semantics are maintained, this form of extension is also straightforward. An example is adding a notification-callback parameter to job-creation requests. However, it is not clear whether all tooling can readily handle this form of "operation overloading". It may be better - from a pragmatic point of view - to define new WSDL operations (with appropriately chosen names) that achieve the same effect.
* Support for array operations and other forms of batching (a sketch follows this item).
  - When thousands of jobs are involved, the efficiency gains of employing array operations for things like queries or abort requests are too significant to ignore. Hence a model in which every job must be interacted with on a strictly individual basis via an EPR is arguably unacceptable.
  - One approach would be to simply add array operations alongside the corresponding individual operations, so that one can selectively interact with jobs (as well as things like data files) in either an "object-oriented" fashion or a "bulk-array" fashion. One could observe that the array operations subsume the corresponding individual operations as a trivial special case, but relying on that would arguably violate the principle of defining a minimalist base case and then employing only extensions (rather than replacements).
  - Array operations are an example of a service-oriented rather than a resource-oriented form of interaction: clients send a single request to a job scheduler (service) that refers to an array of many resources, such as jobs. This raises the question of whether things like jobs should be referred to via EPRs or via unique "abstract" names that are independent of any given service's contact address. At a high level the choice is unimportant, since a client submitting an array-operation request is simply using either one as a unique (and opaque) identifier for the relevant resource. On a pragmatic level, one might argue that abstract names are easier and more efficient to deal with than EPRs, since the receiving scheduler would otherwise need to parse EPRs to extract what is essentially the abstract name of each resource. (Using arrays of abstract names rather than arrays of EPRs is also more efficient from a size point of view.)
  - If abstract names are used in array operations, then it will be necessary that individual operations return the abstract name, and not just an EPR, for a given resource such as a job. If this approach is chosen, then the base-case design and implementation must return abstract names and not just EPRs for things like jobs.
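Here is a hypothetical sketch of what such a bulk request might look like, using abstract job names rather than EPRs. Again, the element names and namespace are invented for illustration:

    <!-- Hypothetical sketch: one request aborts many jobs at once.
         The abstract names are opaque identifiers, independent of any
         particular service's contact address. -->
    <AbortJobsRequest xmlns="urn:example:hpcprofile:draft">
      <JobName>job-000017</JobName>
      <JobName>job-000018</JobName>
      <JobName>job-002041</JobName>
      <!-- ...potentially thousands of entries in a single message... -->
    </AbortJobsRequest>

A corresponding response would presumably need to report per-job status, since partial failure is the norm at this scale.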
* Extensions to state diagrams.
  - Chris Smith is in the process of writing up this topic.
* Standardized extensions to things like resource definitions and other declarative definitions (e.g. about provisioning); a matchmaking sketch follows this item.
  - The base use case assumes a small, fixed set of "standard" resources and other concepts (e.g. working directory) that may be described/requested. The simplest extension approach is to define additional specific "standard sets" that clients and services can refer to by a global name (e.g. the POSIX resource-description set or the Windows resource-description set), of which they pick exactly one to use for any given interaction.
  - The problem with this simplest form of extension is that it provides only a very crude form of extensibility, with no notion of composition or incremental extension of existing definition sets. It is sufficient for very coarse-grained characterizations, such as "Windows environment" versus "POSIX environment", but not for finer-grained resource extensions. An alternative is to define composable sets that each cover a specific "subject" (e.g. GPUs); in the extreme, these sets could be of size 1. This implies that clients and services need to be able to deal with the power set of all possible meaningful combinations of these sets. As long as individual definitions are independent of each other (i.e. the semantics of specifying A are unchanged by specifying B in the same description), this isn't a big problem. Allowing the presence of different items in a description to affect each other's semantics is arguably a variation on modifying the base-level semantics of a design via an extension, and hence should be disallowed.
  - If resource descriptions are used only for "matchmaking" against other resource descriptions, then another approach is to allow arbitrary resource types whose semantics are not understood by the HPC infrastructure, which deals with them only as abstract entities whose names can be compared textually and whose associated values can be compared textually or numerically, depending on their data types. It is important to understand that, whereas the "mechanical" aspects of an HPC infrastructure can mostly be built without knowing the semantics of these abstract resource types, their semantics must still be standardized and well known at the level of the human beings using and programming the system. Both the descriptions of available computational resources and the client requests for reserving and using such resources must be specified in a manner that causes the underlying HPC "matchmaking" infrastructure to do the right thing. This matchmaking approach is exemplified by systems such as Condor's class-ads system.
  - It should be noted that a generalized matchmaking system is not a trivial thing to implement efficiently, and hence one can reasonably imagine extensions based on any of the above approaches to extending resource (and other) definitions.
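To illustrate the matchmaking idea, the pair of fragments below sketches a resource advertisement and a client request mentioning a resource type (a hypothetical GPU attribute) that the infrastructure itself need not understand - it only compares names and typed values. All element and attribute names here are invented for illustration:

    <!-- Hypothetical sketch: the infrastructure matches these purely
         syntactically; only the humans involved need to agree on what
         "GPUModel" actually means. -->

    <!-- Advertised by a compute cluster: -->
    <ResourceDescription xmlns="urn:example:hpcprofile:draft">
      <Resource name="GPUModel"    type="string"  value="VendorX-9000"/>
      <Resource name="GPUMemoryMB" type="integer" value="16384"/>
    </ResourceDescription>

    <!-- Requested by a client; a match holds if the names agree and the
         value comparisons (textual or numeric, per the declared type) hold: -->
    <ResourceRequest xmlns="urn:example:hpcprofile:draft">
      <Requirement name="GPUModel"    type="string"  compare="equals"  value="VendorX-9000"/>
      <Requirement name="GPUMemoryMB" type="integer" compare="atLeast" value="8192"/>
    </ResourceRequest>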
* Hierarchical and extended representations of information.
  - XML infosets provide a very convenient way to represent extended descriptions of a particular piece of information.
  - Another form of hierarchical information shows up when multi-level scheduling systems are involved. In this case it may be desirable to represent information either in a form that hides the scheduling hierarchy or in a form that reflects it. Consider how to represent the list of compute nodes for a job running across multiple clusters: a flat view might list all compute nodes in an undifferentiated list, while a hierarchical view might provide a list of clusters, each of which describes information about that cluster, including the subset of its compute nodes that the job is running on. Both views have their uses. XML infosets are convenient for encoding the syntax of either view, but an extension supporting information representation in these sorts of systems will also have to define the semantics of all allowed hierarchies.
* Decomposition of functionality into "micro" protocols.
  - Micro protocols should reflect things that must occur at different times (e.g. resource reservation/allocation vs. resource use/job execution) or that can be employed in a stand-alone manner (e.g. job execution vs. data transfer). The decomposition that seems relevant for the HPC use cases (i.e. visible to clients) is the following:
    - The base case involves interaction between a client and a scheduler for the purpose of executing a job.
    - A client may wish to independently reserve, or pre-allocate, resources for later and/or guaranteed use. Note that this is different from simply submitting a job for execution to a scheduler that then queues the job for later execution - perhaps at a specific time requested by the client. For example, a meta-scheduler might wish to reserve resources so that it can make informed scheduling decisions about which "subsidiary" scheduler to send various jobs to. Similarly, a client might wish to reserve resources so as to run two separate jobs in succession, with the first job writing output to a scratch storage system and the second job reading that output as its input, without having to worry that the data might vanish during the interval between the execution of the two jobs.
    - A client may wish to query a scheduler to learn what resources might be available to it, without actually laying claim to any resources as part of the query (let alone executing anything using those resources). Scheduling candidate-set generators or matchmaking services such as Condor would want this functionality.
    - A client may need to transfer specific data objects (e.g. files) to and from a system that is under the control of a job-scheduling service.
  - Micro protocols may have relationships to each other. For example, job execution will need to be able to accept a handle of some sort to resources that have already been allocated to the requesting client (see the sketch below).
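As a final illustration of how two micro protocols might compose, the fragments below sketch a reservation response handing back an opaque handle that a subsequent job-execution request then presents. As with the earlier sketches, all names and the handle format are hypothetical:

    <!-- Hypothetical sketch: the reservation micro protocol returns an
         opaque handle to the allocated resources... -->
    <ReserveResourcesResponse xmlns="urn:example:hpcprofile:draft">
      <ReservationHandle>rsv-5c1f9a</ReservationHandle>
      <Expires>2006-05-02T00:00:00Z</Expires>
    </ReserveResourcesResponse>

    <!-- ...which the job-execution micro protocol later consumes, tying
         the two otherwise independent protocols together: -->
    <CreateJobRequest xmlns="urn:example:hpcprofile:draft">
      <ReservationHandle>rsv-5c1f9a</ReservationHandle>
      <JobDescription>
        <Executable>/apps/bin/postprocess</Executable>
      </JobDescription>
    </CreateJobRequest>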