Hi;
A couple more questions:
·
Although sec. 8 says that the
interface for controlling/managing an existing activity is out-of-scope, it
clearly overlaps with the BES interface (in particular, the query and cancel
operations). Given this overlap I’m wondering how much of the
activity’s WSDL interface will effectively be redundant with the BES
interface. Moreover, services wanting to support array operations on
activities will effectively need to support the full activity WSDL in any case
since otherwise it will be impossible to achieve the desired batching for those
operations.
·
This leads me to wonder whether a
separate WSDL for activity interaction is really appropriate since it will
require that the two specifications be kept continuously synchronized and one
will effectively be a strict subset of the other.
This issue seems like one that is more general than just HPC. I’m
curious to hear what other people have to say about it.
Marvin.
From: Marvin Theimer
Sent: Monday, June 05, 2006 4:40 PM
To: 'ogsa-bes-wg@ggf.org'
Cc:
Subject: RE: Questions and potential changes to BES, as seen from HPC Profile point-of-view
Hi;
One point of clarification on the straw man proposal: For operations
such as
·
GetActivity(EPR) -> activityState
·
GetActivityProvenance(EPR) -> either JSDL doc (if that can describe all the
necessary provenance info) or JSDL+
·
CancelActivity(EPR)
The EPR is a parameter supplied to the BES service; it is NOT the
endpoint to which the request is being sent. (After all, the endpoint that
EPR refers to has a different WSDL altogether.)
In general, this raises the question of whether it would be useful to
have a more compact “activityID” (akin to the abstract name in a
WS-Name) that can be used instead of a fairly bulky EPR that the BES service
then needs to parse in order to obtain the activityID that is somewhere
embedded in the EPR.
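A minimal Python sketch of the parsing step described above. The EPR layout assumes the WS-Addressing 1.0 namespace; the `bes:ActivityId` reference parameter, its namespace URI, and the sample address are hypothetical and are not defined by BES.

```python
import xml.etree.ElementTree as ET

WSA_NS = "http://www.w3.org/2005/08/addressing"  # WS-Addressing 1.0 (assumed version)
BES_NS = "http://example.org/bes"                # hypothetical namespace

# Hypothetical EPR: the activity ID is buried in the reference parameters.
SAMPLE_EPR = f"""
<wsa:EndpointReference xmlns:wsa="{WSA_NS}" xmlns:bes="{BES_NS}">
  <wsa:Address>http://example.org/bes/activities</wsa:Address>
  <wsa:ReferenceParameters>
    <bes:ActivityId>activity-42</bes:ActivityId>
  </wsa:ReferenceParameters>
</wsa:EndpointReference>
"""


def activity_id_from_epr(epr_xml: str) -> str:
    """Parse the whole EPR just to recover the embedded activity ID."""
    root = ET.fromstring(epr_xml.strip())
    node = root.find(f"{{{WSA_NS}}}ReferenceParameters/{{{BES_NS}}}ActivityId")
    if node is None or not node.text:
        raise ValueError("EPR carries no activity ID")
    return node.text.strip()
```

With a compact activityID the service would receive the string `activity-42` directly and could skip this step.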
Marvin.
From: Marvin Theimer
Sent: Monday, June 05, 2006 4:19 PM
To: ogsa-bes-wg@ggf.org
Cc: Marvin Theimer;
Subject: Questions and potential changes to BES, as seen from HPC Profile point-of-view
Hi;
Coming from the point-of-view of the HPC Profile working group, I have
several questions about BES (including recent discussions on the mailing list),
as well as some straw man thoughts about how BES should relate to the HPC
profile spec.
Based on the BES-1.3 spec that Andrew Grimshaw recently sent out, at an
abstract level, there seem to be the following aspects to BES:
·
A core set of operations around
activities:
·
CreateActivityFromJSDL
·
GetActivityStatus
·
RequestActivityStateChange
·
GetActivityJSDLDocuments
·
A set of BES factory-specific
system management operations and resource properties (RPs):
·
StartAcceptingNewActivities
·
StopAcceptingNewActivities
·
IsAcceptingNewActivities RP
·
Support for notifications.
·
Support for various resource
properties (or their equivalent in a non-WSRF version) having to do with an
information model for describing various things about a BES factory, the
associated container it represents, and any activities it is currently running.
·
An extensible activity state
model.
Things explicitly NOT in the BES specification are:
·
Generic system management
interface.
·
Security design.
·
Interface for directly
controlling/manipulating an activity once it has been created.
Things that used to be in the BES spec but now seem to be extensions (please correct me if I’m wrong here!):
·
Data staging
·
Suspension
I have the following questions about BES and the various discussions
that have recently occurred (including the ESI integration):
·
Extensibility:
·
Given that BES has bought into the
notion of an extensible activity state diagram, it needs to also normatively
define how clients can learn of the extensions that a given BES service
supports. Is that something that will be added to the BES specification?
Or will the specification point to some other place where notions of
extensibility are defined more generically? (Personally, I’d vote
for the former approach.)
·
Is the “base case” for
BES now fig.2, which shows states of {new, pending, running, canceled, failed,
finished}?
·
Previously included states, such
as Execution-Pending, will presumably be defined in suitable extension
profiles?
·
Assuming that data staging and
suspension are now extensions to the base BES spec, will they be defined as
such in an appendix of the spec, or as a separate extension profile?
·
The original BES spec describes a
fairly sophisticated data staging design that supports parallelism. Is
there any interest in defining a second, simpler data staging extension that
avoids the complexity of the parallelism support?
·
Will the suspension extension be
the simple one that is currently presented in sec. 4 as an example? Or do
people feel that a more complicated version, such as the ESI one, is
necessary/important? Can/should we define both?
·
Given that suspension is no longer
in the base design, presumably the createInSuspendedState parameter to CreateActivityFromJSDL
should disappear?
·
RequestActivityStateChange: I
believe this operation will pose challenges in an extensible design. The
current design is imperative by nature: it specifies an explicit state to move
an activity to. However, a client who does not know of all the extensions
that a BES service implements may not know how to pick the appropriate state to
transition to. It seems better to introduce a more declarative approach
in which clients specify “actions” they wish to occur, such as ‘CancelActivity’.
This approach would allow the BES service to make the appropriate state
transition in response to a desired action requested by a client.
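The action-based approach can be sketched in Python as follows. This is only an illustration: the class and method names are hypothetical, "suspended" stands in for an extension state the client may not know about, and the rule that any non-terminal state may be canceled is an assumption rather than something the BES spec states.

```python
# Terminal states taken from the base state set {new, pending, running,
# canceled, failed, finished} mentioned in the message.
TERMINAL_STATES = {"canceled", "failed", "finished"}


class ToyBesService:
    """Hypothetical in-memory service that maps client actions to state changes."""

    def __init__(self):
        self._states = {}  # activity ID -> current state name

    def register(self, activity_id, state):
        self._states[activity_id] = state

    def cancel_activity(self, activity_id):
        """Declarative request: the client names the action, not the target state."""
        current = self._states[activity_id]
        if current in TERMINAL_STATES:
            return current  # already done; nothing to cancel
        # Assumption: every non-terminal state, including extension states the
        # client has never heard of, may move to "canceled".
        self._states[activity_id] = "canceled"
        return "canceled"
```

The client never has to name a target state, so it works unchanged against a service whose state diagram has been extended.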
·
Information model:
·
JSDL seems to inherently be
focused on describing a single job or a single computational resource.
For example, it has no notion of describing all the differing compute nodes of
a (heterogeneous) compute cluster. By incorporating JSDL elements into
the BES information model it seems that BES is foreclosing the ability to
describe things like compute clusters. This issue also affects what can
get returned from GetActivityJSDLDocuments. If I’m wrong about
this, then it seems like it would be worth having an explicit explanation about
how to achieve this functionality somewhere in the specification.
·
The BES information model now
includes various posix-specific elements of JSDL. How would other systems
– such as a Windows system – be described?
·
The spec requires that all BES
services “support” all the various attributes listed in sec. 5, but
they don’t have to implement them. What exactly does that
mean? For example, if a JSDL doc specifies a CPU-Speed requirement and a
particular BES service doesn’t implement it (meaning it doesn’t
keep track of it), then does the associated CreateActivityFromJSDL request have
to fail? If so, then do clients have to figure out what the minimal set
of implemented attributes are in a system and then only use those in job
descriptions? Is there a notion of “optional” attributes
that can be ignored, that specify desired attribute values rather than required
ones?
·
Is there any notion of specifying
that all compute nodes should have the same
value for some attribute (e.g. CPU architecture, CPU speed, NIC card)?
This seems to be missing from the JSDL specification, but seems very
important for BES if it is to support things like compute clusters.
·
Some of the elements seem either
incompletely specified, have definitions that are open to multiple
interpretations, or have definitions that would be very difficult to implement
in practice. In particular:
·
CPU architecture seems like it
can’t describe all the variations, let alone all the peripherals
such as GPUs, that a single computing resource might have (never mind a
cluster).
·
CPU speed seems like the tip of an
iceberg having to do with characterizing the performance of a system, which
will depend on all manner of things like details of the processor chip used,
cache sizes, bus used, etc.
·
Network bandwidth: is this the theoretical
maximum of the NIC on a compute node or is it the current bandwidth actually
available in a (shared) system? Note that the latter is difficult to
measure in a practically useful way. Note also that network bandwidth
only describes one aspect of communications performance and that several others
are arguably equally important (e.g. latency).
All this leads to the question of whether BES will
have a notion of extending the information model that is supplied. If so, then
that leads to the question of what the base case should be and whether it
should include a smaller set of things than is currently listed in the spec.
Are there any plans to tighten the definitions of some
of the more vague information elements? (I guess this really is an issue
more for the JSDL WG than for BES.)
·
GetActivityJSDLDocuments returns a
JSDL document for each specified activity. Is this sufficient to capture
the entire “provenance” for what has happened to the activity?
In particular, would it be sufficient to allow someone to (a) run the
same activity on another BES service (assuming same hardware and software) and
get the same results and (b) debug what has happened to an errant
activity? I would argue that both capabilities have proven to be
important in actual systems.
·
System management operations:
·
Currently BES supports 2 specific
system management operations: StartAcceptingNewActivities and
StopAcceptingNewActivities. Most
schedulers support a variety of scheduling-specific system management
operations and I’m wondering why these two operations were singled out in
particular to be part of the base case?
·
These operations seem to require a
different set of authorization credentials than the other interface operations
since they should be invoked by system administrators rather than random users.
How will that work, given that these operations are in the same WSDL as
the other operations? Wouldn’t this argue for moving these
operations to a separate system management interface?
·
Array operations:
·
Currently one can create a single
activity, but all other operations accept an array of AEDs as input. Was
there some reason why an array creation operation wasn’t included so
that, for example, parameter sweep applications can be created with a single
request instead of N requests (where N can be in the thousands)?
·
Given that BES seems to have
bought into the notion of extensibility, should the base case be a
“non-array” one? For example, currently if you want to handle
a fault for a RequestActivityStateChange operation on a single activity you need
to look inside the returned array of results to see if a fault infoset was
returned. All the exception handling machinery that modern tooling
provides can’t get used because RequestActivityStateChange never returns
an actual fault message (as compared to a fault infoset for the appropriate
array elements that are returned).
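The difference between the two fault-handling styles can be sketched in Python. All names here are hypothetical, and the result records are plain dictionaries standing in for the fault infosets that the array-style operation returns.

```python
class ActivityFault(Exception):
    """Stands in for a SOAP fault surfaced as an exception by client tooling."""


def request_state_change_array(activity_ids, known_activities):
    """Array style: never raises; each result element may carry a fault record."""
    results = []
    for activity_id in activity_ids:
        fault = None if activity_id in known_activities else "UnknownActivity"
        results.append({"activity": activity_id, "fault": fault})
    return results


def cancel_activity(activity_id, known_activities):
    """Non-array style: a failure is raised, so ordinary try/except applies."""
    if activity_id not in known_activities:
        raise ActivityFault(f"unknown activity: {activity_id}")


known = {"a1"}

# Array style, even for a single activity: the caller must inspect the result.
result = request_state_change_array(["a2"], known)[0]
array_failed = result["fault"] is not None

# Non-array style: the language's own exception machinery reports the failure.
try:
    cancel_activity("a2", known)
    single_failed = False
except ActivityFault:
    single_failed = True
```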
·
Other questions:
·
An entire (small) section is
devoted to talking about the optional use of WS-Names. However, since the
specification doesn’t require
them, it’s unclear to me whether BES needs to say anything about
WS-Names. As far as I understand things, whether an EPR is a WS-Name or
not can be determined by inspecting it. Hence the only reason to have a
special property on a BES service that indicates what kind of AEDs it returns
is to alert potential clients ahead of time about this feature of the service.
But it’s not clear to me what a client would do with that
information, as compared to deciding opportunistically to exploit a WS-Name AED
for, e.g. resolution, at the time that that would be necessary. Is there
a use case that describes how clients would exploit the AED-type resource
property?
·
Since JSDL documents are
self-describing, a BES service can figure out by inspection whether the job
description infoset parameter to CreateActivityFromJSDL is JSDL or something
else. This would seem to imply that naming the operation CreateActivity
would lose no information and would allow for transparent extension to other
job description infosets simply by using them (assuming they are
self-describing).
·
Container attributes that I have
questions about:
·
LocalResourceManagerType: where do
these get defined normatively?
·
Job Credential Service and File
Credential Service: these imply a specific security model. Given
that security is undefined in the BES spec, is this appropriate –
especially given the rather vague definition of both?
Given these questions, as well as the mandate for the HPC profile to
define a simple base interface, I would like to present the following straw man
proposal for a modified BES specification for feedback from this community:
·
Operations:
·
CreateActivity(jsdlDoc) -> EPR
·
GetActivity(EPR) -> activityState
·
GetActivityProvenance(EPR) -> either JSDL doc (if that can describe all the
necessary provenance info) or JSDL+
·
CancelActivity(EPR)
·
For non-WSRF versions:
QueryResources() -> schedulerResourcesInfoset
·
‘schedulerResourcesInfoset’
is essentially the union of the RPs that would be exported in a WSRF-based
version for describing the resources that are available for use at this BES
service. Note that a BES service might also want to expose other kinds of
information that would not be returned from this operation – this
operation is there so that clients can determine whether or not a BES service
could potentially meet their needs and is necessary for meta-scheduling
scenarios.
·
One might argue that one could use
WS-Transfer for this operation. However, since a BES service might want
to export other kinds of information, this would require an extra level of
indirection so that the BES service could expose which EPRs to use for
retrieving which kinds of information.
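The straw man operations listed above can be sketched as a Python interface with a toy in-memory implementation. This is an illustration only: the Python names, the use of an opaque string in place of an EPR, the `urn:example` identifiers, and the choice of "new" as the initial state are all assumptions.

```python
from abc import ABC, abstractmethod
from itertools import count


class StrawManBes(ABC):
    """The five straw man operations, with an EPR modeled as an opaque string."""

    @abstractmethod
    def create_activity(self, jsdl_doc: str) -> str: ...

    @abstractmethod
    def get_activity(self, epr: str) -> str: ...

    @abstractmethod
    def get_activity_provenance(self, epr: str) -> str: ...

    @abstractmethod
    def cancel_activity(self, epr: str) -> None: ...

    @abstractmethod
    def query_resources(self) -> dict: ...


class InMemoryBes(StrawManBes):
    """Toy implementation that only tracks state; it runs nothing."""

    def __init__(self, resources):
        self._resources = dict(resources)
        self._activities = {}
        self._ids = count(1)

    def create_activity(self, jsdl_doc):
        epr = f"urn:example:activity:{next(self._ids)}"  # hypothetical ID scheme
        self._activities[epr] = {"state": "new", "jsdl": jsdl_doc}
        return epr

    def get_activity(self, epr):
        return self._activities[epr]["state"]

    def get_activity_provenance(self, epr):
        # Returns only the submitted JSDL; a fuller provenance record ("JSDL+")
        # is left open here, as it is in the proposal.
        return self._activities[epr]["jsdl"]

    def cancel_activity(self, epr):
        self._activities[epr]["state"] = "canceled"

    def query_resources(self):
        return dict(self._resources)
```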
·
Additional topics/summary:
·
Simple state diagram and no notion
of array operations, data staging, suspension, or notifications in base BES
case.
·
Extensions defined as separate
profiles for array operations, data staging, suspension, and notifications.
·
RequestActivityStateChange
replaced by operations specifying desired actions rather than states.
Base case supports activity cancellation; extensions can define
additional operations (e.g. SuspendActivity).
·
Information model: small base set
plus extensions model (which ones to include in the base set TBD)
·
All system management functions
moved out to a separate interface.
Thanks for any and all feedback on these questions and this straw man
proposal,
Marvin.