Hi,
This email is intended to describe my views of the set of extension
mechanisms that are both necessary and sufficient to implement the common cases
that we have identified for the HPC profile work (see the document “HPC
Use Cases – Base Case and Common Cases”, a preliminary draft of
which I sent out to the ogsa-wg mailing list several weeks ago). These
views are in large part derived from ongoing discussions that Chris Smith and I
have been having about the subject of interoperable job scheduling designs.
This email is intended to start a discussion about extension mechanisms
rather than define the “answer” to this topic. So please do
reply with suggestions for any changes and extensions :-) you feel are
needed.
Marvin.
At a high level, there are two types of extensions that one might
consider:
· Purely additive extensions.
· Extensions that modify the semantics of the underlying base-level design.
Purely additive extensions that, for example, add strictly new
functionality to an interface or that define additional resource types that clients
and schedulers can refer to, seem fairly straight-forward to support. Modifying
extensions fall into two categories:
· Base case semantics remain unchanged for parties operating at the base (i.e. un-extended) level.
· Base case semantics change for parties operating at the base level.
Modifying extensions that leave the base-level semantics unchanged are
straight-forward to incorporate. An example is adding at-most-once
semantics to interface requests. These operations now have more tightly
defined failure semantics, but their functional semantics remain unchanged and
base-level clients can safely ignore the extended semantics.
Extensions that change base-level semantics should be disallowed since they
violate the fundamental premise of base-level interoperability. An
example of such an extension would be having the creation of jobs at a
particular (extended) scheduler require that the client issue an additional
explicit resource deallocation request once a job has terminated. Base-level
clients would not know to do this and the result would be an incorrectly
functioning system.
I believe the following types of extensions are both necessary and
sufficient to meet the needs of the HPC profile work:
· Addition of new WSDL operations.
· This is needed to support new functionality, such as the addition of suspend/resume
operations. As long as base-level semantics aren’t modified, this
form of extension seems to be straight-forward.
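As a rough illustration (not a proposal for actual syntax), such a purely additive extension might look something like the following WSDL fragment; the port type, operation, and message names are invented here and are not taken from any agreed profile:

  <!-- Hypothetical sketch only: names and namespace prefixes are illustrative. -->
  <wsdl:portType name="JobControlExtension">
    <wsdl:operation name="SuspendJob">
      <wsdl:input  message="tns:SuspendJobRequest"/>
      <wsdl:output message="tns:SuspendJobResponse"/>
    </wsdl:operation>
    <wsdl:operation name="ResumeJob">
      <wsdl:input  message="tns:ResumeJobRequest"/>
      <wsdl:output message="tns:ResumeJobResponse"/>
    </wsdl:operation>
  </wsdl:portType>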
· Addition of new parameters to existing WSDL operations.
· As long as base-level semantics
are maintained, this form of extension is also straight-forward. An
example is adding a notification callback parameter to job creation requests. However,
it is not clear whether all tooling can readily handle this form of “operation
overloading”. It may be better – from a pragmatic
point-of-view – to define new WSDL operations (with appropriately defined
names) that achieve the same effect.
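For instance, rather than overloading an existing job creation operation, a separate operation could carry the extra notification parameter. The following is only a hypothetical sketch; the operation and message names are invented for illustration:

  <!-- Hypothetical sketch: a distinct operation instead of overloading CreateJob. -->
  <wsdl:message name="CreateJobWithNotificationRequest">
    <wsdl:part name="jobDescription"       element="tns:JobDescription"/>
    <wsdl:part name="notificationConsumer" element="wsa:EndpointReference"/>
  </wsdl:message>
  <wsdl:portType name="JobFactoryExtension">
    <wsdl:operation name="CreateJobWithNotification">
      <wsdl:input  message="tns:CreateJobWithNotificationRequest"/>
      <wsdl:output message="tns:CreateJobResponse"/>
    </wsdl:operation>
  </wsdl:portType>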
· Support for array operations and
other forms of batching.
· When thousands of jobs are involved, the efficiency gains of employing array operations for things like
queries or abort requests are too significant to ignore. Hence a model in
which every job must be interacted with on a strictly individual basis via an
EPR is arguably unacceptable.
· One approach would be to simply
add array operations alongside the corresponding individual operations, so that
one can selectively interact with jobs (as well as things like data files) in
either an “object-oriented” fashion or in “bulk-array”
fashion.
One could observe that the array operations subsume the
corresponding individual operations as a trivial special case (an array of
size one), so the individual operations could in principle be dropped; but
doing so would arguably violate the principle of defining a minimalist base
case and then employing only extensions (rather than replacements).
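To make the bulk style concrete, a batched request might look roughly like the sketch below; the element names and namespace are hypothetical:

  <!-- Hypothetical sketch: one request referring to many jobs at once. -->
  <hpc:TerminateJobsRequest xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:JobName>job-0001</hpc:JobName>
    <hpc:JobName>job-0002</hpc:JobName>
    <hpc:JobName>job-0003</hpc:JobName>
  </hpc:TerminateJobsRequest>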
· Array operations are an example of a service-oriented rather than a resource-oriented form of interaction: clients
send a single request to a job scheduler (service) that refers to an array of
many resources, such as jobs. This raises the question of whether things
like jobs should be referred to via EPRs or via unique “abstract” names
that are independent of any given service’s contact address. At a
high level, the choice is unimportant since the client submitting an array
operation request is simply using either one as a unique (and opaque) identifier
for the relevant resource. On a pragmatic level one might argue that abstract
names are easier and more efficient to deal with than EPRs since the receiving
scheduler will need to parse EPRs to extract what is essentially the abstract
name for each resource. (Using arrays of abstract names rather than
arrays of EPRs is also more efficient from a size point-of-view.)
· If abstract names are used in array operations then it will be necessary that individual operations return the
abstract name and not just an EPR for a given resource, such as a job. If
this approach is chosen then this implies that the base case design and
implementation must return abstract names and not just EPRs for things like
jobs.
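One way to satisfy this would be for the base-level job creation response to carry both identifiers, roughly along the lines of the following hypothetical sketch (element names, namespace, and URN scheme are illustrative only):

  <!-- Hypothetical sketch: the response returns both an EPR and an abstract name. -->
  <hpc:CreateJobResponse xmlns:hpc="http://example.org/hpc-profile/extensions"
                         xmlns:wsa="http://www.w3.org/2005/08/addressing">
    <hpc:JobEPR>
      <wsa:Address>https://scheduler.example.org/jobs</wsa:Address>
      <wsa:ReferenceParameters>
        <hpc:JobKey>4711</hpc:JobKey>
      </wsa:ReferenceParameters>
    </hpc:JobEPR>
    <hpc:AbstractJobName>urn:example:hpc:job:4711</hpc:AbstractJobName>
  </hpc:CreateJobResponse>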
· Extensions to state diagrams.
· Chris Smith is in the process of
writing up this topic.
· Standardized extensions to things
like resource definitions and other declarative definitions (e.g. about
provisioning).
· The base use case assumes a small,
fixed set of “standard” resources and other concepts (e.g. working directory)
that may be described/requested. The simplest extension approach is to
define additional specific "standard sets" that clients and services
can refer to by their global name (e.g. the POSIX resource description set or
the Windows resource description set) and of which they pick exactly one to use
for any given interaction.
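For instance, a description might name exactly one standard set and then use only terms drawn from it, along the lines of the following hypothetical sketch (the element names and the URN are invented for illustration):

  <!-- Hypothetical sketch: the description declares which single standard set it uses. -->
  <hpc:ResourceDescription
      xmlns:hpc="http://example.org/hpc-profile/extensions"
      descriptionSet="urn:example:hpc:resource-set:posix">
    <hpc:OperatingSystem>LINUX</hpc:OperatingSystem>
    <hpc:TotalCPUCount>64</hpc:TotalCPUCount>
    <hpc:WorkingDirectory>/scratch/run17</hpc:WorkingDirectory>
  </hpc:ResourceDescription>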
· The problem with this simplest
form of extension is that it provides only a very crude form of extensibility
with no notion of composition or incremental extension of existing definition
sets. This is sufficient for very coarse-grained characterizations, such
as “Windows environment” versus “POSIX environment”,
but not for finer-grained resource extensions. An alternative is to
define composable sets that cover specific “subjects” (e.g. GPUs). In
the extreme, these sets could be of size 1. This implies that clients and
services need to be able to deal with the power set of all possible meaningful combinations
of these sets. As long as individual definitions are independent of each
other (i.e. the semantics of specifying A is unchanged by specifying B in the
same description) this isn’t a big problem. Allowing the presence
of different items in a description to affect each other’s semantics is arguably
a variation on modifying the base-level semantics of a design via some
extension to the design and hence should be disallowed.
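Under the composable approach, a single description might draw on several independently defined sets, for example a POSIX set plus a GPU set, as in the following hypothetical sketch (names and URNs again invented for illustration):

  <!-- Hypothetical sketch: two independent definition sets composed in one description. -->
  <hpc:ResourceDescription xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:Uses set="urn:example:hpc:resource-set:posix"/>
    <hpc:Uses set="urn:example:hpc:resource-set:gpu"/>
    <hpc:TotalCPUCount>8</hpc:TotalCPUCount>
    <hpc:GPUCount>2</hpc:GPUCount>
  </hpc:ResourceDescription>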
· If resource descriptions are used
only for “matchmaking” against other resource descriptions then another
approach is to allow arbitrary resource types whose semantics are not
understood by the HPC infrastructure, which deals with them only as abstract
entities whose names can be compared textually and whose associated values can
be compared textually or numerically depending on their data type. It is
important to understand that, whereas the “mechanical” aspects of
an HPC infrastructure can mostly be built without having to know the semantics
of these abstract resource types, their semantics must still be standardized
and well-known at the level of the human beings using and programming the system.
Both the descriptions of available computational resources and of client
requests for reserving and using such resources must be specified in a manner
that will cause the underlying HPC “matchmaking” infrastructure to
do the right thing. This matchmaking approach is exemplified by systems
such as Condor’s ClassAds system.
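In that style, a client request might contain nothing but abstract name/value constraints that the matchmaker compares without understanding them, roughly as in this hypothetical sketch (terms such as "Interconnect" and "ScratchDiskMB" carry meaning only to the humans who standardized them):

  <!-- Hypothetical sketch: the infrastructure only compares names and values. -->
  <hpc:ResourceRequest xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:Requirement name="Interconnect"  test="equals">InfiniBand</hpc:Requirement>
    <hpc:Requirement name="ScratchDiskMB" test="atLeast">20480</hpc:Requirement>
  </hpc:ResourceRequest>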
· It should be noted that a
generalized matchmaking system is not a trivial thing to implement efficiently
and hence one can reasonably imagine extensions based on any of the above
approaches to extending resource (and other) definitions.
· Hierarchical and extended representations
of information.
· XML infosets provide a very
convenient way to represent extended descriptions of a particular piece of
information.
· Another form of hierarchical
information display shows up when multi-level scheduling systems are involved. In
this case it may be desirable to represent information either in a form that hides
the scheduling hierarchy or in a form that reflects it. Consider how to
represent the list of compute nodes for a job running across multiple clusters:
A flat view might list all compute nodes in an undifferentiated list. A
hierarchical view might provide a list of clusters, each of which describes
information about a cluster, including a list of the compute nodes in that
cluster that the job is running on. Both views have their uses. XML
infosets are convenient for encoding the syntax of either view, but an
extension supporting information representation in these sorts of systems will
also have to define the semantics of all allowed hierarchies.
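As a small illustration, the two views of the same job's compute nodes might be encoded roughly as follows (all element names are hypothetical):

  <!-- Hypothetical sketch: flat view of the job's compute nodes. -->
  <hpc:ComputeNodes xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:Node>node07</hpc:Node>
    <hpc:Node>node12</hpc:Node>
    <hpc:Node>node31</hpc:Node>
  </hpc:ComputeNodes>

  <!-- Hypothetical sketch: the same nodes grouped by the clusters that schedule them. -->
  <hpc:ComputeNodesByCluster xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:Cluster name="clusterA">
      <hpc:Node>node07</hpc:Node>
      <hpc:Node>node12</hpc:Node>
    </hpc:Cluster>
    <hpc:Cluster name="clusterB">
      <hpc:Node>node31</hpc:Node>
    </hpc:Cluster>
  </hpc:ComputeNodesByCluster>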
· Decomposition of functionality
into “micro” protocols.
· Micro protocols should reflect
things that must occur at different times (e.g. resource reservation/allocation
vs. resource use/job-execution) or that can be employed in a stand-alone manner
(e.g. job execution vs. data transfer). The decomposition that seems
relevant for the HPC use cases (i.e. the protocols that are visible to clients) is the following:
· The base case involves interaction
between a client and a scheduler for purposes of executing a job.
· A client may wish to independently reserve, or pre-allocate, resources for later and/or
guaranteed use. Note that this is different from simply submitting a job
for execution to a scheduler that then queues the job for later execution –
perhaps at a specific time requested by the client. For example, a
meta-scheduler might wish to reserve resources so that it can make informed
scheduling decisions about which “subsidiary” scheduler to send
various jobs to. Similarly, a client might wish to reserve resources so
as to run two separate jobs in succession to each other, with one job writing
output to a scratch storage system and the second job reading that output as
its input without having to worry that the data might have vanished during the
interval that occurs between the execution of the two jobs.
· A client may wish to query a scheduler to learn what resources might be
available to it, without actually laying claim to any resources as part of
the query (let alone executing anything using
those resources). Scheduling candidate set generators or matchmaking
services such as Condor would want this functionality.
· A client may need to transfer
specific data objects (e.g. files) to and from a system that is under the
control of a job scheduling service.
· Micro protocols may have
relationships to each other. For example, job execution will need to be
able to accept a handle of some sort to resources that have already been
allocated to the requesting client.
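As a hypothetical sketch of such a relationship, a job creation request might carry a handle obtained earlier from a reservation micro protocol (element names and the URN scheme are invented for illustration):

  <!-- Hypothetical sketch: job execution refers to a previously obtained reservation. -->
  <hpc:CreateJobRequest xmlns:hpc="http://example.org/hpc-profile/extensions">
    <hpc:ReservationHandle>urn:example:hpc:reservation:98231</hpc:ReservationHandle>
    <hpc:JobDescription>
      <hpc:Executable>/home/alice/bin/simulate</hpc:Executable>
      <hpc:Argument>--steps=10000</hpc:Argument>
    </hpc:JobDescription>
  </hpc:CreateJobRequest>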