Chris:
Trying to keep things concrete rather than philosophical, are you
asserting on behalf of Platform that at-most-once execution semantics are
not important for your customers? They don't care that there is no way of
telling whether a job was submitted successfully or not?
I'd like to emphasize that this is not at all hard to do, and is standard
distributed computing practice: we've had it in Globus for many years; it's
just a question of including a sequence number in each request (for
example). I must admit I'm puzzled why there is this strong reaction to a
simple feature.
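To make that concrete, here is a minimal sketch of the idea in Python. It is
only an illustration (not the actual Globus/GRAM or Condor wire protocol), and
all class and function names are hypothetical: the client attaches a unique
request identifier to each submission and reuses it on retries, while the
server remembers which identifiers it has already acted on, so a retried
request returns the existing job instead of creating a duplicate.

import uuid


class SchedulerStub:
    """Toy in-memory scheduler showing idempotent (at-most-once) submission."""

    def __init__(self):
        self.jobs = {}           # job_id -> job description
        self.seen_requests = {}  # request_id -> job_id already created for it

    def submit(self, request_id, job_description):
        # A request id we have already processed returns the original job id
        # rather than starting a second copy of the job.
        if request_id in self.seen_requests:
            return self.seen_requests[request_id]
        job_id = str(uuid.uuid4())
        self.jobs[job_id] = job_description
        self.seen_requests[request_id] = job_id
        return job_id


def submit_with_retries(scheduler, job_description, attempts=3):
    """Client side: pick one request id and reuse it on every retry, so a
    lost reply followed by a resend cannot create a duplicate job."""
    request_id = str(uuid.uuid4())
    last_error = None
    for _ in range(attempts):
        try:
            return scheduler.submit(request_id, job_description)
        except IOError as err:  # e.g. a lost reply; safe to retry, same id
            last_error = err
    raise RuntimeError("submission failed after %d attempts" % attempts) from last_error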
Ian.
At 11:40 AM 3/21/2006 -0800, Christopher Smith wrote:
I'll just add (in a "me too" type of answer) that from the point of view of
Platform, if the interface proposed doesn't map very well to our existing
capabilities, the chances of getting it implemented start getting lower, since
the amount of time we need to take to develop these capabilities gets larger.
This is the standard approach in a software company ... I don't think I'm
stating anything new here.
So we have lots of customers, and lots of ISV partners, for whom these
extended capabilities are not really that important. They generally just
want simple interfaces that reflect the capabilities they are used to
seeing in the middleware stacks they have had in production for many years.
So I vote for the simple case. But let's make sure that it's extensible so
that we can add extended capabilities easily in future iterations. To me,
standards are part of the evolution towards interoperability, not the end
goal in and of themselves.
-- Chris
On 21/3/06 10:28, "Marvin Theimer"
<theimer@microsoft.com> wrote:
- Hi;
-
- Whereas I agree with you that at-most-once semantics are very
desirable, I would like to point out that not all existing job schedulers
implement them. I know that both LSF and CCS (the Microsoft HPC job
scheduler) don't. I've been trying to find out whether PBS and SGE do
or don't.
-
- So, this brings up the following slightly more general question:
should the simplest base case be the simplest case that does something
useful, or should it be more complicated than that? I can see good
arguments on both sides:
- · Whittling things down to the simplest possible base case maximizes the
likelihood that parties can participate. Every feature added represents one
more feature that some existing system may not be able to support or that a
new system has to provide even when it's not needed in the context of that
system. Suppose, for example, that PBS and SGE don't provide transactional
semantics of the type you described. Then 4 of the 6 most common job
scheduling systems would not have this feature and would need to somehow add
it to their implementations. In this particular case it might be too
difficult to add in practice, but in general there might be problems.
- · On the other hand, since there are many clients and arguably far fewer
server implementations, features that substantially simplify client
behavior/programming and that are not too onerous to implement in existing
and future systems should be part of the base case. The problem, of course,
is that this is a slippery slope at the end of which lies the number 42
(ignore that last phrase if you're not a fan of The Hitchhiker's Guide to
the Galaxy).
- Personally, the slippery slope argument makes me lean towards
defining the simplest possible base use case, since otherwise we'll spend
a (potentially very) long time arguing about which features are important
enough to justify being in the base case. One possible way forward
on this issue is to have people come up with lists of features that they
feel belong in the base use case and then we agree to include only those
that have a large majority of the community arguing for their inclusion
in the base case.
-
- Unfortunately, defining what "large majority" should be is also not easy
or obvious. Indeed, one can argue that we can't even afford to let
all votes be equal. Consider the following hypothetical (and
contrived) case: 100 members of a particular academic research community
show up and vote that the base case must include support for a particular
complicated scheduling policy and the less-than-ten suppliers of existing
job schedulers with significant numbers of users all vote against
it. Should it be included in the base case? What happens if
the major scheduler vendors/suppliers decide that they can't justify
implementing it and therefore can't be GGF spec-compliant and therefore go
off and define their own job scheduling standard? The hidden issue
is, of course, whether those voting are representative of the overall HPC
user population. I can't personally answer that question, but it
does again lead me to want to minimize the number of times I have to ask
that question, i.e. the number of features that I have to consider for
inclusion in the base case.
-
- So this brings me to the question of next steps. Recall that
the approach I'm advocating, and that others have bought in to as far as I
can tell, is that we define a base case and the mechanisms and approach to
how extensions of the base case are done. I assert that the
absolutely most important part of defining how extension should work is
ensuring that multiple extensions don't end up producing a hairball that's
impossible to understand, implement, or use. In practice this means
coming up with a restricted form of extension, since history is pretty
clear on the pitfalls of trying to support arbitrarily general extension
schemes.
-
- This is one of the places where identification of common use cases
comes in. If we define the use cases that we think might actually
occur then we can ask whether a given approach to extension has a
plausible way of achieving all the identified use cases. Of course,
future desired use cases might not be achievable by the extension schemes
we come up with now, but that possibility is inevitable given anything
less than a fully general extension scheme. Indeed, even among the
common use cases we identify now, we might discover that there are
trade-offs where a simpler (and hence probably more understandable and
easier to implement and use) extension scheme can cover 80% of the use
cases while a much more complicated scheme is required to cover 100% of
the use cases.
-
- Given all this, here are the concrete next steps I'd like to propose:
- · Everyone who is participating in this design effort should define what
they feel should be the HPC base use case. This represents the simplest use
case and associated features (like transactional submit semantics) that you
feel everyone in the HPC grid world must implement. We will take these use
case candidates and debate which one to actually settle on.
- · Everyone should define the set of HPC use cases that they believe might
actually occur in practice. I will refer to these as the common use cases,
in contrast to the base use case. The goal here is not to define the most
general HPC use case, but rather the more restricted use cases that might
occur in real life. For example, not all systems will support job migration,
so whereas a fully general HPC use case would include the notion of job
migration, I argue that one or more common use cases will not include job
migration.
- Everyone should also
prioritize and rank their common use cases so that we can discuss
80/20-style trade-offs concerning which use cases to support with any
given approach to extension. This prioritization should include the
notion of how common you think a use case will actually be, and hence how
important it will be to actually support that use case.
- · Everyone should start thinking about what kinds of extension approaches
they believe we should define, given the base use case and common use cases
that they have identified.
- As multiple people have pointed out, an exploration of common HPC use
cases has already been done one or several times before, including in the
EMS working group. I'm still catching up on reading GGF documents,
so I don't know how much those prior efforts explored the issue from the
point-of-view of base case plus extensions. If these prior
explorations did address the topic of base-plus-extensions and you agree
with the specifics that were arrived at, then this exercise will be a
quick-and-easy one for you: you can simply publish the appropriate links
to prior material in an email to this mailing list. I will
personally be sending in my list independent of prior efforts in order to
provide a newcomer's perspective on the subject. It will be interesting
to see how much overlap there is.
-
- One very important point that I'd like to raise is the following: Time
is short and "best" is the enemy of "good enough". Microsoft is planning to
provide a Web services-based interoperability interface to its job
scheduler sometime in the next year or two. I know that many of the
other job scheduler vendors/suppliers are also interested in having an
interoperability story in place sooner rather than later. To meet
this schedule on the Microsoft side will require locking down a first
fairly complete draft of whatever design will be shipped by essentially
the end of August. That's so that we can do all the necessary
debugging, interoperability testing, security threat modeling, etc. that
goes with shipping an actual finished product. What that means for
the HPC profile work is that, come the end of August, Microsoft and
possibly other scheduler vendors/suppliers will need to lock down and
start coding some version of Web Services-based job scheduling and data
transfer protocols. If there is a fairly well-defined, feasible set
of specs/profile coming out of the GGF HPC working group (for
recommendation NOT yet for actual standards approval) that has some
reasonable level of consensus by then, then that's what Microsoft will
very likely go with. Otherwise Microsoft will need to defer the
idea of shipping anything that might be GGF compliant to version 3 of our
product, which will probably ship about 4 years from now.
-
- The chances of coming up with the "best" HPC profile by the end of
August are slim. The chances of coming up with a fairly simple
design that is "good enough" to cover the most important common cases by
means of a relatively simple, restricted form of extension seem much
more feasible. Covering a richer set of use cases would need to be
deferred to a future version of the profile, much in the manner that BES
has been defined to cover an important sub-category of use cases now,
with a fuller EMS design being done in parallel as future work. So
I would argue that perhaps the most important thing this design effort
and the planned HPC profile working group that will be set up in Tokyo
can do is to identify what a "good enough" version 1 HPC profile should
be.
-
- Marvin.
-
- From: Carl Kesselman
[mailto:carl@isi.edu]
- Sent: Thursday, March 16, 2006 12:49 AM
- To: Marvin Theimer
- Cc: humphrey@cs.virginia.edu; ogsa-wg@ggf.org
- Subject: Re: [ogsa-wg] Paper proposing "evolutionary
vertical design efforts"
- Hi,
- In the interest of furthering agreement, I was not arguing that the
application had to be restartable. Rather, what has been shown to be
important is that the protocol be restartable in the following
sense: if you submit a job and the far-end server fails, is the job
running or not, and if you resubmit, do you get another job instance? The GT
submission protocol and Condor have transactional semantics so that you
can have at-most-once submit semantics regardless of client and server
failures. The fact that your application may be non-idempotent is exactly
why having well-defined semantics in this case is important.
- So what is the next step?
- Carl
- Dr. Carl Kesselman                    email: carl@isi.edu
- USC/Information Sciences Institute    WWW: http://www.isi.edu/~carl
- 4676 Admiralty Way, Suite 1001        Phone: (310) 448-9338
- Marina del Rey, CA 90292-6695         Fax: (310) 823-6714
- -----Original Message-----
- From: Marvin Theimer <theimer@microsoft.com>
- To: Carl Kesselman <carl@isi.edu>
- CC: Marvin Theimer <theimer@microsoft.com>; Marty Humphrey
<humphrey@cs.virginia.edu>; ogsa-wg@ggf.org
<ogsa-wg@ggf.org>
- Sent: Wed Mar 15 14:26:36 2006
- Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Hi;
- I suspect that we're mostly in agreement on things. In
particular, I think your list of four core aspects is a great starting
point for a discussion on the topic.
- I just replied to an earlier email from Ravi with a description of
what I'm hoping to get out of examining various HPC use cases:
- · Identification of the
simplest base case that everyone will have to implement.
- · Identification of common
cases we want to optimize.
- · Identification of how
evolution and selective extension will work.
- I totally agree with you that the base use case I described isn't
really a "grid" use case. But it is an HPC use case; in fact it is
arguably the most common use case in current existence. :-) So I
think it's important that we understand how to seamlessly integrate and
support that common and very simple use case.
- I also totally agree with you that we can't let a solution to the
simplest HPC use case paint us into a corner that prevents supporting the
richer use cases that grid computing is all about. That's why I'd
like to spend significant effort exploring and understanding the issues
of how to support evolution and selective extension. In an ideal
world a legacy compute cluster job scheduler could have a simple grid
"shim" that lets it participate at a basic level, in a natural manner, in a
grid environment, while smarter clients and HPC services could
interoperate with each other in various selectively richer manners by
means of extensions to the basic HPC grid design.
- One place where I disagree with you is your assertion that everything
needs to be designed to be restartable. While that's a good goal to
pursue, I'm not convinced that you can achieve it in all cases. In
particular, there are at least two cases that I claim we want to support
that aren't restartable:
- · We want to be able to run applications that aren't restartable; for
example, because they perform non-idempotent operations on the external
physical environment. If such an application fails during execution then
the only one who can figure out what the proper next steps are is the end
user.
- · We want to be able to include (often-times legacy) systems that aren't
fault tolerant, such as simple small compute clusters where the owners
didn't think that fault tolerance was worth paying for.
- Of course any acceptable design will have to enable systems that are
fault tolerant to export/expose that capability. To my mind it's
more a matter of ensuring that non-fault-tolerant systems aren't excluded
from participation in a grid.
- Other things we agree on:
- · We should certainly
examine what remote job submission systems do. We should certainly
look at existing systems like Globus, Unicore, and Legion. In
general, we should be looking at everything that has any actual
experience that we can learn from and everything that is actually
deployed and hence represents a system that we potentially need to
interoperate with. (Whether a final design is actually able to
interoperate at any but the most basic level with various exotic existing
systems is a separate issue.)
- · We should absolutely focus on codifying what we know how to do and avoid
doing research as part of a standards process. I believe that thinking
carefully about how to support evolution and extension is our best hope for
allowing people to defer trying to bake their pet research topic into
standards, since it provides a story for why today's standards don't
preclude tomorrow's improvements.
- So I would propose that next steps are:
- · Continue to explore and
classify various HPC use cases of various differing levels of
complexity.
- · Describe the requirements
and limitations of existing job scheduling and remote job submission
systems.
- · Continue identifying and discussing key "features" of use cases and
potential design solutions, such as the four that you identified in your
last email.
- Marvin.
- ________________________________
- From: Carl Kesselman
[mailto:carl@isi.edu]
- Sent: Tuesday, March 14, 2006 7:50 AM
- To: Marty Humphrey; ogsa-wg@ggf.org
- Cc: Marvin Theimer
- Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Hi,
- Just to be clear, I'm not trying to suggest that the scope be
expanded. I agree that the approach of focusing on a baby step is a good
one, and I am in total agreement with many of the assumptions stated in
Marvin's list. However, in taking baby steps I think that it is
important that we end up walking, and that in defining the use case, one
can easily create solutions that will not get you to the next step. This
is my point about looking at what we know how to do and have been doing
in production settings for many years now. In my mind, one of the scope
grandness problems has been that there has been far too little focus on
codifying what we know how to do, in favor of using a standards process as
an excuse to design new things. So at the risk of sounding
partisan, the simplified use case that Marvin is proposing is exactly the
use case that GRAM has been doing for over ten years now (I think the
same can be said about UNICORE and Legion).
- So let me try to be constructive. One of the things that
falls out of Marvin's list could be a set of basic concepts/operations
that need to be defined. These include:
- 1) A way of describing "local" job configuration, i.e. where to find the
executable, data files, etc. This should be very conservative with its
assumptions about shared file systems and accessibility. In general, what
needs to be stated here is which aspects of the underlying resource are
exposed to the outward-facing interface.
- 2) A way of naming a submission point (should probably have a way of
modeling queues).
- 3) A core set of job management operations: submit, status, kill.
These need to be defined in such a way as to be tolerant of a variety of
failure scenarios, in that the state needs to be well defined in the case
of failure.
- 4) A state model that one can use to describe what is going on with
the jobs and a way to access that state. It can be simple (queued,
running, done), but may need to be extensible. One can view the
accounting information as being exposed via this state as well.
- So, one thing to do would be to agree that these are (or are not) the
right four things that need to be defined and, if so, start to flesh
these out in a way that supports the core use case but doesn't introduce
assumptions that would preclude more complex use cases in the
future.
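- As a rough illustration of how these four pieces might fit together, here is
a small Python sketch. It is not drawn from any existing spec or
implementation, and all names (JobDescription, SubmissionPoint, etc.) are
hypothetical; the numbered comments refer back to the four items above.

from dataclasses import dataclass, field
from enum import Enum


class JobState(Enum):            # 4) a simple state model; may need extension
    QUEUED = "queued"
    RUNNING = "running"
    DONE = "done"


@dataclass
class JobDescription:            # 1) "local" job configuration
    executable: str
    arguments: list = field(default_factory=list)
    working_directory: str = "."


class SubmissionPoint:           # 2) a named submission point, possibly a queue
    def __init__(self, endpoint_url, queue=None):
        self.endpoint_url = endpoint_url
        self.queue = queue

    # 3) core operations: submit, status, kill.  Each must leave the job in a
    #    well-defined state even if the call itself fails partway through.
    def submit(self, description):
        raise NotImplementedError("wire protocol/profile still to be defined")

    def status(self, job_id):
        raise NotImplementedError

    def kill(self, job_id):
        raise NotImplementedError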
- Carl
- ________________________________
- From: owner-ogsa-wg@ggf.org
[mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Marty Humphrey
- Sent: Tuesday, March 14, 2006 6:32 AM
- To: ogsa-wg@ggf.org
- Cc: 'Marvin Theimer'
- Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Carl,
- Your comments are very important. We would love to have your active
participation in this effort. Your experience is, of course, matched by
few!
- I re-emphasize that this represents (my words, not anyone else's) "baby
steps" that are necessary and important for the Grid community. In my
opinion, the biggest challenge will be to fight the urge to expand the
scope beyond a small size. You cannot ignore the possibility that the GGF
has NOT made as much progress as it should have to date. Furthermore, one
plausible explanation is that the scope has been too grand.
- -- Marty
- ________________________________
- From: owner-ogsa-wg@ggf.org
[mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Carl Kesselman
- Sent: Tuesday, March 14, 2006 8:47 AM
- To: Marvin Theimer; Ian Foster; ogsa-wg@ggf.org
- Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Hi,
- While I have no wish to engage in the "what is a Grid" argument, there
are some elements of your base use case that I would be concerned
about. Specifically, the assumption that the submission is into a
"local cluster" on which there is an existing account may lead one to a
solution that does not generalize to the case of submission across
autonomous policy domains. I would also argue that ignoring issues of
fault tolerance from the beginning is problematic. One must at least
design operations that are restartable (for example, at-most-once
submission semantics).
- I would finally suggest that while examining existing job scheduling
systems is a good thing to do, we should also examine existing remote
submission systems (dare I say Grid systems). The basic HPC use
case is one in which there is a significant amount of implementation and
usage experience.
- Thanks,
- Carl
- ________________________________
- From: owner-ogsa-wg@ggf.org
[mailto:owner-ogsa-wg@ggf.org]
On Behalf Of Marvin Theimer
- Sent: Monday, March 13, 2006 2:42 PM
- To: Ian Foster; ogsa-wg@ggf.org
- Cc: Marvin Theimer
- Subject: RE: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Hi;
- Ian, you are correct that I view job submission to a cluster as being
one of the simplest, and hence most basic, HPC use cases to start
with. Or, to be slightly more general, I view job submission to a
"black box" that can run jobs (be it a cluster or an SMP or an SGI NUMA
machine or what-have-you) as being the simplest and hence most basic HPC
use case to start with. The key distinction for me is that the
internals of the "box" are for the most part not visible to the client, at
least as far as submitting and running compute jobs is concerned.
There may well be a separate interface for dealing with things like
system management, but I want to explicitly separate those things out in
order to allow for use of "boxes" that might be managed by proprietary means
or by means obeying standards that a particular job submission client is
unfamiliar with.
- I think the use case that Ravi Subramaniam posted to this mailing
list back on 2/17 is a good one to start a discussion around.
However, I'd like to present it from a different point-of-view than he
did. The manner in which the use case is currently presented
emphasizes all the capabilities and services needed to handle the fully
general case of submitting a batch job to a computing
utility/service. That's a great way of producing a taxonomy against
which any given system or design can be compared to see what it has to
offer. I would argue that the next step is to ask what's the
simplest subset that represents a useful system/design and how one should
categorize the various capabilities and services he has identified so as
to arrive at meaningful components that can be selectively used to obtain
progressively more capable systems.
- Another useful exercise to do is to examine existing job scheduling
systems in order to understand what they provide. Since in the real
world we will have to deal with the legacy of existing systems it will be
important to understand how they relate to the use cases we
explore. In the same vein, it will be important to take into
account and understand other existing infrastructures that people use
that are related to HPC use cases. I'm thinking of things like security
infrastructures, directory services, and so forth. From the
point-of-view of managing complexity and reducing
total-cost-of-ownership, it will be important to understand the extent to
which existing infrastructure and services can be reused rather than
reinvented.
- To kick off a discussion around the topic of a minimalist HPC use
case, I present a straw man description of such below and then present a
first attempt at categorizing various areas of extension. The
categorization of extension areas is not meant to be complete or even all
that carefully thought-out as far as componentization boundaries are
concerned; it is merely meant to be a first contribution to get the
discussion going.
- A basic HPC use case: Compute cluster embedded within an
organization.
- · This is your basic batch job scheduling
scenario. Only a very basic state transition diagram is visible to
the client, with the following states for a job: queued, running,
finished. Additional states -- and associated state transition request
operations and functionality -- are not supported. Examples of
additional states and associated functionality include suspension of jobs
and migration of jobs.
- · Only "standard" resources can be
described, for example: number of cpus/nodes needed, memory requirements,
disk requirements, etc. (think resources that are describable by
JSDL).
- · Once a job has been submitted it can be
cancelled, but its resource requests can't be modified.
- · A distributed file system is accessible
from client desktop machines and client file servers, as well as compute
nodes of the compute cluster. This implies that no data staging is
required, that programs can be (for the most part) executed from existing
file system locations, and that no program "provisioning" is
required (since you can execute them from wherever they are already
installed). Thus in this use case all data transfer and program
installation operations are the responsibility of the user.
- · Users already have accounts within the
existing security infrastructure (e.g. Kerberos). They would like
to use these and not have to create/manage additional
authentication/authorization credentials (at least at the level that is
visible to them).
- · The job scheduling service resides at a
well-known network name and it is aware of the compute cluster and its
resources by "private" means (e.g. it runs on the head node of
the cluster and employs private means to monitor and control the
resources of the cluster). This implies that there is no need for
any sort of directory services for finding the compute cluster or the
resources it represents other than basic DNS.
- · Compute cluster system management is opaque
to users and is the concern of the compute cluster's owners. This
implies that system management is not part of the compute cluster's
public job scheduling interface. This also implies that there is no
need for a logging interface to the service. I assume that
application-level logging can be done by means of libraries that write to
client files; i.e. that there is no need for any sort of special system
support for logging.
- · A simple polling-based interface is the simplest form of interface to
something like a job scheduling service. However, a simple call-back
notification interface is a very useful addition that potentially provides
substantial performance benefits, since it can avoid a lot of unnecessary
network traffic. Only job state changes result in notification messages.
(A minimal sketch of such a polling loop appears at the end of this list.)
- · There are no notions of fault tolerance.
Jobs that fail must be resubmitted by the client. Neither the
cluster head node nor its compute nodes are fault tolerant. I do
expect the client software to return an indication of
failure-due-to-system-fault when appropriate. (Note that this may also
occur when things like network partitions occur.)
- · One does need some notion of how to deal
with orphaned resources and jobs. The notion of job lifetime and
post-expiration garbage collection is a natural approach here.
- · The scheduling service provides a fixed set
of scheduling policies, with only a few basic choices (or maybe even just
one), such as FIFO or round-robin. There is no notion, in general,
of SLAs (which are a form of scheduling policy).
- · Enough information must be returned to the
client when a job finishes to enable basic accounting
functionality. This means things like total wall-clock time the job
ran and a summary of resources used. There is not a need for the
interface to support any sort of grouping of accounting
information. That is, jobs do not need to be associated with
projects, groups, or other accounting entities and the job scheduling
service is not responsible for tracking accounting information across
such entities. As long as basic resource utilization information is
returnable for each job, accounting can be done externally to the job
scheduling service. I do assume that jobs can be uniquely
identified by some means and can be uniquely associated with some
principal entity existing in the overall system, such as a user
name.
- · Just as there is no notion of requiring the
job scheduling service to track any but the most basic job-level
accounting information, there is no notion of the service enforcing
quotas on jobs.
- · Although it is generally useful to separate
the notions of resource reservation from resource usage (e.g. to enable
interactive and debugging use of resources), it is not a necessity for
the most basic of job scheduling services.
- · There is no notion of tying multiple jobs
together, either to support things like dependency graphs or to support
things like workflows. Such capabilities must be implemented by
clients of the job scheduling service.
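- The following small Python sketch illustrates the polling-based client
behavior and the job-lifetime idea described above. It is an illustration
only; the scheduler object and its status/accounting_info methods are
assumed for the sake of the example and are not part of any existing
interface. A call-back notification variant would simply replace the
sleep/poll loop with a handler invoked on each job state change.

import time


def wait_for_job(scheduler, job_id, poll_interval=30.0, lifetime=24 * 3600):
    """Poll until the job reaches 'finished'; give up once the job's
    lifetime expires and let the service garbage-collect it."""
    deadline = time.time() + lifetime
    while time.time() < deadline:
        state = scheduler.status(job_id)  # one of: queued, running, finished
        if state == "finished":
            # Enough information for basic accounting (wall-clock time,
            # summary of resources used).
            return scheduler.accounting_info(job_id)
        time.sleep(poll_interval)
    raise TimeoutError("job %s did not finish within its lifetime" % job_id)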
- Interesting extension areas:
- · Additional scheduling
policies
- o Weighted fair-share, ...
- o Multiple queues
- o SLAs
- o ...
- · Extended resource
descriptions
- o Additional resource types, such as
GPUs
- o Additional types of compute resources, such
as desktop computers
- o Condor-style class ads
- · Extended job descriptions (as
returned to requesting clients and sys admins)
- · Additional classes of security
credentials
- · Reservations separated from
execution
- o Enabling interactive and debugging
jobs
- o Support for multiple competing schedulers
(incl. desktop cycle stealing and market-based approaches to scheduling
compute resources)
- · Ability to modify jobs during their
existence
- · Fault tolerance
- o Automatic rescheduling of jobs that failed
due to system faults
- o Highly available resources: This is
partly a policy statement by a scheduling service about its
characteristics and partly the ability to rebind clients to migrated
service endpoints
- · Extended state transition diagrams
and associated functionalities
- o Job suspension
- o Job migration
- o ...
- · Accounting & quotas
- · Operating on arrays of jobs
- · Meta-schedulers, multiple schedulers,
and ecologies and hierarchies of multiple schedulers
- o Meta-schedulers
- · Hierarchical job scheduling with a
meta-scheduler as the only entry point; forwarding jobs to the
meta-scheduler from other subsidiary schedulers
- o Condor-style matchmaking
- · Directory services
- o Using existing directory services
- o Abstract directory service
interface(s)
- · Data transfer topics
- o Application data staging
- · Naming
- · Efficiency
- · Convenience
- · Cleanup
- o Program staging/provisioning
- · Description
- · Installation
- · Cleanup
- Marvin.
- ________________________________
- From: Ian Foster
[mailto:foster@mcs.anl.gov]
- Sent: Monday, February 20, 2006 9:20 AM
- To: Marvin Theimer; ogsa-wg@ggf.org
- Cc: Marvin Theimer; Savas Parastatidis; Tony Hey; Marty Humphrey;
gcf@grids.ucs.indiana.edu
- Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical
design efforts"
- Dear All:
- The most important thing to understand at this point (IMHO) is the
scope of this "HPC use case," as this will determine just how
minimal we can be.
- I get the impression that the principal goal may be "job
submission to a cluster." Is that correct? How do we start to
circumscribe the scope more explicitly?
- Ian.
- At 05:45 AM 2/16/2006 -0800, Marvin Theimer wrote:
- Enclosed is a paper that advocates an additional set of activities
that the authors believe that the OGSA working groups should engage
in.
- Broadly speaking, the OGSA and related working groups are already
doing a bunch of important things:
- · There is broad
exploration of the big picture, including enumeration of use cases,
taxonomy of areas, identification of research issues, etc.
- · There is work going
on in each of the horizontal areas that have been identified, such as
EMS, data services, etc.
- · There is work going on around individual specifications, such as BES,
JSDL, etc.
- Given that individual specifications are beginning to come to
fruition, the authors believe it is time to also start defining vertical
"profiles" that precisely describe how groups of individual specifications
should be employed to implement specific use cases in an interoperable
manner. The authors also believe that the process of defining these
profiles offers an opportunity to "close the design loop" by relating the
various on-going protocol and standards efforts back to the use cases in
a very concrete manner. This provides an end-to-end setting in
which to identify holes and issues that might require additional
protocols and/or (incremental) changes to existing protocols. The paper
introduces the general notion of doing focused "vertical design
efforts" and then focuses on a specific vertical design effort, namely a
minimal HPC design.
- The paper derives a specific HPC design in a "first principles" manner
since the authors believe that this increases the chances of identifying
issues. As a consequence, existing specifications and the
activities of existing working groups are not mentioned and this paper is
not an attempt to actually define a specifications profile. Also,
the absence of references to existing work is not meant to imply that
such work is in any way irrelevant or inappropriate. The paper
should be viewed as a first abstract attempt to propose a new kind of
activity within OGSA. The expectation is that future open
discussions and publications will explore the concrete details of such a
proposal.
- This paper was recently sent to a few key individuals in order to get
feedback from them before submitting it to the wider GGF community.
Unfortunately that process took longer than intended and some members of
the community may have already seen a copy of the paper without knowing
the context within which it was written. This email should hopefully
dispel any misconceptions that may have occurred.
- For those people who will be around for the F2F meetings on
Friday, Marvin Theimer will be giving a talk on the contents of this
paper at a time and place to be announced.
- Marvin Theimer, Savas Parastatidis, Tony Hey, Marty Humphrey,
Geoffrey Fox
- _______________________________________________________________
- Ian Foster                     www.mcs.anl.gov/~foster
- Math & Computer Science Div.   Dept of Computer Science
- Argonne National Laboratory    The University of Chicago
- Argonne, IL 60439, U.S.A.      Chicago, IL 60637, U.S.A.
- Tel: 630 252 4619              Fax: 630 252 1997
- Globus Alliance, www.globus.org <http://www.globus.org/>
_______________________________________________________________
Ian Foster www.mcs.anl.gov/~foster
Math & Computer Science Div. Dept of Computer Science
Argonne National Laboratory The University of Chicago
Argonne, IL 60439, U.S.A. Chicago, IL 60637, U.S.A.
Tel: 630 252 4619 Fax: 630 252 1997
Globus Alliance, www.globus.org