Thoughts on extensions mechanisms for the HPC profile work

29 Apr 2006

      Hi;

This email is intended to describe my views of the set of extension
mechanisms that are both necessary and sufficient to implement the
common cases that we have identified for the HPC profile work (see the
document "HPC Use Cases - Base Case and Common Cases", a preliminary
draft of which I sent out to the ogsa-wg mailing list several weeks
ago).  These views are in large part derived from ongoing discussions
that Chris Smith and I have been having about the subject of
interoperable job scheduling designs.

This email is intended to start a discussion about extension mechanisms
rather than define the "answer" to this topic.  So please do reply with
suggestions for any changes and extensions (:-)) you feel are needed.

Marvin.

Additive vs. modifying extensions

At a high level, there are two types of extensions that one might
consider:

*        Purely additive extensions.

*        Extensions that modify the semantics of the underlying
base-level design.

Purely additive extensions that, for example, add strictly new
functionality to an interface or that define additional resource types
that clients and schedulers can refer to, seem fairly straight-forward
to support.  Modifying extensions fall into two categories:

*        Base case semantics remain unchanged to parties operating at
the base (i.e. un-extended) level.

*        Base case semantics change for parties operating at the base
level.

Modifying extensions that leave the base-level semantics unchanged are
straight-forward to incorporate.  An example is adding at-most once
semantics to interface requests.  These operations now have more tightly
defined failure semantics, but their functional semantics remain
unchanged and base-level clients can safely ignore the extended
semantics.

Extensions that change base-level semantics should be disallowed since
they violate the fundamental premise of base-level interoperability.  An
example of such an extension would be having the creation of jobs at a
particular (extended) scheduler require that the client issue an
additional explicit resource deallocation request once a job has
terminated.  Base-level clients would not know to do this and the result
would be an incorrectly functioning system.

Types of extensions

I believe the following types of extensions are both necessary and
sufficient to meet the needs of the HPC profile work:

*        Addition of new WSDL operations.

*        This is needed to support additional new functionality, such as
the addition of suspend/resume operations.  As long as base-level
semantics aren't modified, this form of extension seems to be
straight-forward.

*        Addition of additional parameters to existing WSDL operations.

*        As long as base-level semantics are maintained, this form of
extension is also straight-forward.  An example is adding a notification
callback parameter to job creation requests.  However, it is not clear
whether all tooling can readily handle this form of "operation
overloading".  It may be better - from a pragmatic point-of-view - to
define new WSDL operations (with appropriately defined names) that
achieve the same effect.

*        Support for array operations and other forms of batching.

*        When 1000's of jobs are involved the efficiency gains of
employing array operations for things like queries or abort requests are
too significant to ignore.  Hence a model in which every job must be
interacted with on a strictly individual basis via an EPR is arguably
unacceptable.

*        One approach would be to simply add array operations alongside
the corresponding individual operations, so that one can selectively
interact with jobs (as well as things like data files) in either an
"object-oriented" fashion or in "bulk-array" fashion.

One could observe that the array operations enable the corresponding
individual operations as a trivial special case, but this would arguably
violate the principle of defining a minimalist base case and then
employing only extensions (rather than replacements).

*        Array operations are an example or a service-oriented rather
than a resource-oriented form of interaction: clients send a single
request to a job scheduler (service) that refers to an array of many
resources, such as jobs.  This raises the question of whether things
like jobs should be referred to via EPRs or via unique "abstract" names
that are independent of any given service's contact address.  At a high
level, the choice is unimportant since the client submitting an array
operation request is simply using either one as a unique (and opaque)
identifier for the relevant resource.  On a pragmatic level one might
argue that abstract names are easier and more efficient to deal with
than EPRs since the receiving scheduler will need to parse EPRs to
extract what is essentially the abstract name for each resource.  (Using
arrays of abstract names rather than arrays of EPRs is also more
efficient from a size point-of-view.)

*        If abstract names are used in array operations then it will
necessary that individual operations return the abstract name and not
just an EPR for a given resource, such as a job.  If this approach is
chosen then this implies that the base case design and implementation
must return abstract names and not just EPRs for things like jobs.

*        Extensions to state diagrams.

*        Chris Smith is in the process of writing up this topic.

*        Standardized extensions to things like resource definitions and
other declarative definitions (e.g. about provisioning).

*        The base use case assumes a small, fixed set of "standard"
resources and other concepts (e.g. working directory) that may be
described/requested.  The simplest extension approach is to define
additional specific "standard sets" that clients and services can refer
to by their global name (e.g. the posix resource description set or the
Windows resource description set) and of which they pick exactly one to
use for any given interaction.  

*        The problem with this simplest form of extension is that it
provides only a very crude form of extensibility with no notion of
composition or incremental extension of existing definition sets.  This
is sufficient for very course-grained characterizations, such as
"Windows environment" versus "Posix environment", but not for
finer-grained resource extensions.  An alternative is to define
composable sets that cover specific "subjects" (e.g. GPUs).  In the
extreme, these sets could be of size 1.  This implies that clients and
services need to be able to deal with the power set of all possible
meaningful combinations of these sets.  As long as individual
definitions are independent of each other (i.e. the semantics of
specifying A is unchanged by specifying B in the same description) this
isn't a big problem.  Allowing the presence of different items in a
description to affect each other's semantics is arguably a variation on
modifying the base-level semantics of a design via some extension to the
design and hence should be disallowed.

*        If resource descriptions are used only for "matchmaking"
against other resource descriptions then another approach is to allow
arbitrary resource types whose semantics are not understood by the HPC
infrastructure, which deals with them only as abstract entities whose
names can be compared textually and whose associated values can be
compared textually or numerically depending on their data type.  It is
important to understand that, whereas the "mechanical" aspects of an HPC
infrastructure can mostly be built without having to know the semantics
of these abstract resource types, their semantics must still be
standardized and well-known at the level of the human beings using and
programming the system.  Both the descriptions of available
computational resources and of client requests for reserving and using
such resources must be specified in a manner that will cause the
underlying HPC "matchmaking" infrastructure to do the right thing.  This
matchmaking approach is exemplified by systems such Condor's class ads
system.

*        It should be noted that a generalized matchmaking system is not
a trivial thing to implement efficiently and hence one can reasonably
imagine extensions based on any of the above approaches to extending
resource (and other) definitions.  

*        Hierarchical and extended representations of information.

*        XML infosets provide a very convenient way to represent
extended descriptions of a particular piece of information.

*        Another form of hierarchical information display shows up when
multi-level scheduling systems are involved.  In this case it may be
desirable to represent information either in a form that hides the
scheduling hierarchy or in a form that reflects it.  Consider how to
represent the list of compute nodes for a job running across multiple
clusters: A flat view might list all compute nodes in an
undifferentiated list.  A hierarchical view might provide a list of
clusters, each of which describes information about a cluster, including
a list of the compute nodes in that cluster that the job is running on.
Both views have their uses.  XML infosets are convenient for encoding
the syntax of either view, but an extension supporting information
representation in these sorts of systems will also have to define the
semantics of all allowed hierarchies.  

*        Decomposition of functionality into "micro" protocols.

*        Micro protocols should reflect things that must occur at
different times (e.g. resource reservation/allocation vs. resource
use/job-execution) or that can be employed in a stand-alone manner (e.g.
job execution vs. data transfer).  The decomposition that seems relevant
for the HPC use cases (i.e. are visible to clients) is the following:

*        The base case involves interaction between a client and a
scheduler for purposes of executing a job.  

*        A client may wish to independently reserve, or pre-allocate
resources for later and/or guaranteed use.  Note that this is different
from simply submitting a job for execution to a scheduler that then
queues the job for later execution - perhaps at a specific time
requested by the client.  For example, a meta-scheduler might wish to
reserve resources so that it can make informed scheduling decisions
about which "subsidiary" scheduler to send various jobs to.  Similarly,
a client might wish to reserve resources so as to run two separate jobs
in succession to each other, with one job writing output to a scratch
storage system and the second job reading that output as its input
without having to worry that the data might have vanished during the
interval that occurs between the execution of the two jobs.

*        A client may wish to query a scheduler to learn what resources
might be available to it, without actually laying claim to any resources
as part of the query (let alone execute anything using those resources).
Scheduling candidate set generators or matchmaking services such as
Condor would want this functionality.

*        A client may need to transfer specific data objects (e.g.
files) to and from a system that is under the control of a job
scheduling service.

*        Micro protocols may have relationships to each other.  For
example, job execution will need to be able to accept a handle of some
sort to resources that have already been allocated to the requesting
client.

Marvin Theimer

Karl Czajkowski

Marvin Theimer

Karl Czajkowski

Donal K. Fellows

Marvin Theimer

Donal K. Fellows

Karl Czajkowski

Marvin Theimer

Steve Loughran

Marvin Theimer

Balle, Susanne

Marvin Theimer

Donal K. Fellows

tags

participants (5)