[ogsa-bes-wg] RE: [ogsa-wg] More comments: HPC Use Cases -- Base Case and Common Cases

28 Apr 2006

      Ian,

Thanks for your response. 

I joined the ogsa-bes-wg mailing list last week and I am looking forward
to hearing more about the progress made in this area. I will look for the
ESI specification in the ogsa-bes-wg mailing archive.

Thanks Susanne

	-----Original Message-----
	From: Ian Foster [mailto:foster@mcs.anl.gov] 
	Sent: Friday, April 28, 2006 12:46 PM
	To: Balle, Susanne; theimer@microsoft.com
	Cc: ogsa-wg@ggf.org; OGSA-BES-wg@ggf.org
	Subject: Re: [ogsa-wg] More comments: HPC Use Cases -- Base Case and Common Cases

	Susanne:

	I'd like to respond to your comments.

	I believe that the reference to "network partitions" refers to
the fact that in a distributed environment, unlike a single machine
environment, we cannot be sure that messages will be delivered: a
network failure can result in any message being lost. Thus, a job
submission may not receive a response, and in that case, we cannot know
whether the job was submitted (i.e., the request got through, but the
response was lost) or not (i.e., the request was lost).

	One convenient way of dealing with this problem is to allow
users to associate a "unique job id" with a job request. A scheduler
that receives a second or subsequent submission with the same job id
should simply return the response it provided to the first request
received.
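
	As a rough illustration of this idea, here is a minimal Python
sketch. The Scheduler class, its submit method, and the
submit_with_retry helper are hypothetical names made up for this
example; they are not part of ESI, JSDL, or BES. The lost response is
simulated rather than caused by a real network failure.

```python
import uuid
from dataclasses import dataclass, field


@dataclass
class Scheduler:
    """Toy in-memory scheduler (hypothetical, not a real BES service)."""
    responses: dict = field(default_factory=dict)  # job id -> first response
    launched: int = 0                              # jobs actually started

    def submit(self, job_id: str, command: str) -> dict:
        # A repeated job id returns the stored response instead of
        # launching the job a second time.
        if job_id in self.responses:
            return self.responses[job_id]
        self.launched += 1
        response = {"job_id": job_id, "command": command, "state": "queued"}
        self.responses[job_id] = response
        return response


def submit_with_retry(scheduler, command, lost_responses=1, max_attempts=5):
    """Resubmit under one client-chosen job id until a response arrives."""
    job_id = str(uuid.uuid4())
    for attempt in range(max_attempts):
        response = scheduler.submit(job_id, command)
        if attempt < lost_responses:
            continue  # simulated: the request got through, the response was lost
        return response
    raise TimeoutError("no response from scheduler")


scheduler = Scheduler()
result = submit_with_retry(scheduler, "hostname", lost_responses=2)
print(result["state"], scheduler.launched)
```

	The client retries three times here, but the scheduler starts
the job only once, because every retry carries the same job id.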

	It's true that a user can achieve a similar effect by searching
for the submitted job in the scheduler queue. However, this approach is
more complex, and also suffers from the problem that the job might
have already completed and thus cannot be found that way.

	In our recently circulated ESI specification, we proposed an
optional "unique job id" field in JSDL as a way of addressing this
requirement. This notion was discussed on a BES call, and people seemed
sympathetic to the idea.

	Ian.

	At 12:20 PM 4/28/2006 -0400, Balle, Susanne wrote:

		Marvin,

		Enclosed find the remainder of my comments:

		Page 5. (top paragraph) I think I know what you mean by "with the
		ambiguity of distinguishing between scheduler crashes and network
		partitions". "Scheduler crashes" is obvious. I am assuming that by
		"network partitions" you are implying that various sub-networks are
		going to have different response times, which will have an effect
		on the time it takes to deliver a call-back message.

		Reading further along in the same paragraph, I am now not sure I
		know what you mean by "network partitions".

		Page 5. Section 3.3
		The topic of this section is clear (described in the first line of
		the paragraph) but the rest of the section is a little confusing. 

		"possibility that a client cannot directly tell whether its job
		submission request has been successful ..." --> Do we expect the
		client to re-submit the job if the submission failed, or do we
		expect users to check that their job has in fact been submitted
		and resubmit if needed? If we assume the latter, I wonder whether
		that would result in users re-launching their jobs several times
		when they do not see their job listed in some state while polling
		the job scheduler for the state of their job.

		I guess I do not understand why so much emphasis is put on the
		"At-Most-Once" or "Exactly-Once".

		Can't the client poll the job scheduler and ask the JS for a list
		of jobs queued, running, terminated, failed, etc.? It might be
		useful for the client to be able to submit jobs with a special
		keyword like JOB_SUBMITTED_BY, since that would reduce the list it
		gets back. It would be nice if the value for the keyword was a
		unique identifier, but it doesn't have to be. Most schedulers
		allow you to name programs or associate a group with them, so that
		feature could be used as a special keyword.
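
		The polling approach could be sketched roughly as follows in
		Python. This is an illustration only: the Job record and the
		jobs_for and needs_resubmit helpers are hypothetical, and the
		submitted_by field simply stands in for the JOB_SUBMITTED_BY
		keyword. The job listing is hard-coded rather than fetched from
		a real scheduler.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    state: str          # queued, running, terminated, failed, ...
    submitted_by: str   # stands in for the JOB_SUBMITTED_BY keyword


def jobs_for(listing, tag):
    """Return only the jobs that carry the given keyword value."""
    return [job for job in listing if job.submitted_by == tag]


def needs_resubmit(listing, tag, name):
    """True if no job with this tag and name is listed in any state."""
    return not any(job.name == name for job in jobs_for(listing, tag))


# Hard-coded stand-in for what a scheduler might return when polled.
listing = [
    Job("sim-a", "running", "client-17"),
    Job("sim-b", "queued", "client-17"),
    Job("render", "terminated", "client-42"),
]

print([job.name for job in jobs_for(listing, "client-17")])
print(needs_resubmit(listing, "client-17", "sim-c"))
```

		Note that a job which is no longer listed at all looks the same
		to this check as a job that was never submitted.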

		Page 6. Section 3.4
		General question: Are you taking into account that user
		applications will require different software? 

		1. For example, if my executable is compiled for a Linux, Intel
		platform, then I would like to run it on a Linux, Intel system
		and not a Linux, AMD system. 

		2. Are you assuming that the program will be compiled on the fly
		on the allocated system? Or pre-compiled and then staged?
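
		To make the first point concrete, here is a small Python sketch
		of the kind of matching I have in mind. The attribute names
		("os", "cpu"), the cluster names, and the matches helper are all
		made up for illustration; none of them come from the paper or
		from any profile.

```python
def matches(requirements, system):
    """True if the system offers every attribute the job requires."""
    return all(system.get(key) == value for key, value in requirements.items())


# A pre-compiled executable built for a Linux, Intel platform.
job = {"os": "linux", "cpu": "intel"}

# Hypothetical descriptions of the available systems.
systems = {
    "cluster-a": {"os": "linux", "cpu": "intel", "nodes": 64},
    "cluster-b": {"os": "linux", "cpu": "amd", "nodes": 128},
}

eligible = [name for name, attrs in systems.items() if matches(job, attrs)]
print(eligible)
```

		Only the Linux, Intel system is considered eligible for the job.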

		I agree that staging the data is going to be an interesting topic.

		All this is probably out-of-band for the HPC JS Profile but should
		be considered somewhere. I am sure it is; I just don't know where.

		I like the section on virtual machines and think that they will be
		used more and more in the future.

		Page 7. Extended Resource Features
		The second approach (arbitrary resource types ...) is the only one
		that makes sense to me since that approach is extensible. I
		believe that Moab is implementing this approach as well.

		Page 8. Extended Client/System Administrator Operations

		Are you assuming that System Administrators will be able to
		perform sys admin operations on somebody else's system? I don't
		think that is right.

		You mention suspend-resume. Are you thinking of suspending a job
		running across several clusters that are in different
		organizations? Or just suspending a job on a single
		cluster/server?

		Again, I am trying to figure out how this fits in with "One
		important aspect, is that the individual clusters should remain
		under the control of their local system administrator and/or of
		their local policies". 

		I believe that suspend-resume is a JS operation or an operation to
		be performed by the local sys admin, NOT by remote sys admins.

		If we are now talking about a meta-scheduler then yes, it makes
		sense. In the case of a meta-scheduler it might take over the
		individual JS and schedule jobs based on its own policies, on its
		job reservation system, etc. In this case I look at it as we have
		one deciding entity (the meta-scheduler) and several "slaves".
		Moab and Maui are the only meta-schedulers I am familiar with, and
		they do take over the scheduling decisions/node allocations/etc.
		and just submit jobs to the local job schedulers.

		This does of course assume that the local system administrators
		have agreed on a schedule when their cluster is shared within this
		greater infrastructure. This is a different approach than having
		jobs passed onto their local scheduler and run on their systems.

		This just seems to be a different approach from the one that is
		taken in this paper.
		I might be wrong. If I am, please educate me.

		Page 9. Section 3.10

		Don't forget UPC (Unified Parallel C: http://upc.nersc.gov/). This
		parallel programming paradigm is getting more and more interest
		from several communities.
		We'll need to provide support for UPC as well.

		Page 10. Section 3.13
		A meta-scheduler approach that makes sense to me is to allow
		developers to submit their job to their local cluster using their
		"favorite" scheduler commands and then have the meta-scheduler
		load-balance the work and forward the job to another
		system/cluster if needed. Moab from Cluster Resources supports
		this approach even if the clusters have different JSs. They have a
		list of supported JSs such as LSF, PBSpro, SLURM, etc. and they
		can "translate" one JS's commands into another within that
		supported set.
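
		A very rough Python sketch of what such a "translation" might
		look like is below. It is not how Moab works internally, which I
		have not seen; the JobRequest record and to_command helper are
		hypothetical. It covers only the job name and queue options of
		the bsub, qsub, and sbatch commands, and real submissions need
		far more than this (LSF job scripts, for example, are often
		passed on standard input).

```python
from dataclasses import dataclass


@dataclass
class JobRequest:
    """Scheduler-neutral description of a job (hypothetical)."""
    name: str
    queue: str
    script: str


# Submit command, job name flag, and queue/partition flag per scheduler.
FLAGS = {
    "lsf": ("bsub", "-J", "-q"),
    "pbs": ("qsub", "-N", "-q"),
    "slurm": ("sbatch", "-J", "-p"),
}


def to_command(job, scheduler):
    """Render the neutral request as an argument list for one scheduler."""
    submit, name_flag, queue_flag = FLAGS[scheduler]
    return [submit, name_flag, job.name, queue_flag, job.queue, job.script]


job = JobRequest(name="sim", queue="batch", script="run.sh")
for target in ("lsf", "pbs", "slurm"):
    print(" ".join(to_command(job, target)))
```

		The point is only that one neutral description can be rendered
		for whichever supported JS the job ends up being forwarded to.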

		Page 11. SLURM is missing.

		Let me know what you think,

		Regards

		Susanne

---------------------------------------------------------------
		Susanne M. Balle,
		Hewlett-Packard
		High Performance Computing R&D Organization
		110 Spit Brook Road
		Nashua, NH 03062

		Phone: 603-884-7732
		Fax:     603-884-0630

		Susanne.Balle@hp.com

	_______________________________________________________________
	   Ian Foster, Director, Computation Institute
	Argonne National Laboratory & University of Chicago
	Argonne: MCS/221, 9700 S. Cass Ave, Argonne, IL 60439
	Chicago: Rm 405, 5640 S. Ellis Ave, Chicago, IL 60637
	Tel: +1 630 252 4619.  Web: www.ci.uchicago.edu <http://www.ci.uchicago.edu/> .
	      Globus Alliance: www.globus.org <http://www.globus.org/> .