Hi,
My comments are in-line.
Marvin.
-----Original Message-----
From: Balle, Susanne [mailto:Susanne.Balle@hp.com]
Sent: Friday, April 28, 2006 9:21 AM
To: Marvin Theimer
Cc: ogsa-wg@ggf.org
Subject: More comments: HPC Use Cases -- Base Case and Common Cases
Marvin,
Enclosed please find the remainder of my comments:
Page 5. (top paragraph) I think I know what you mean by "with the ambiguity of distinguishing between scheduler crashes and network partitions". "Scheduler crashes" is obvious. I am assuming that by "network partitions" you are implying that various sub-networks are going to have different response times, which will have an effect on the time it takes to deliver a call-back message.
Reading further along in the same paragraph, I am now not sure I know what you mean by "network partitions".
[MT] I think Ian's email response to you does a good job of characterizing the key point about network partitions. An additional thing to keep in mind: a network partition may prevent messages from getting through between two parties for arbitrary amounts of time, even though both parties are still up and running. So it may cause various timeouts in the client and server software to occur. Note, however, that when something like a message response timeout occurs, you can't tell whether you didn't get a response back because the other party has crashed or because there is a network partition between the two of you.
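To make that ambiguity concrete, here is a minimal sketch (hypothetical code, not from the use-case document) of what the client sees; from the client's side a crashed scheduler and a partitioned network produce exactly the same timeout:

import socket

def submit_and_wait(host, port, request, timeout_s=30):
    """Send a submission request and wait for the scheduler's reply."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(request)
            return sock.recv(4096)   # a reply arrived: submission acknowledged
    except socket.timeout:
        # Ambiguous outcome: the scheduler may have crashed before (or after!)
        # accepting the job, or a partition may be delaying/dropping messages
        # while both endpoints are still up. The client cannot tell which and
        # must reconcile later, e.g. by querying job state once connectivity
        # returns.
        return None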
Page 5. Section 3.3
The topic of this section is clear (described in the first line of the paragraph) but the rest of the section is a little confusing.
"possibility that a client cannot directly tell whether its job submission request has been successful ..." --> Do we expect the client to re-submit the job if the submission failed or do we expect users to inspect that their job has in fact been submitted and resubmit if needed? I am wondering if we assume the later if that wouldn’t result in users re-launching their jobs several time if they do not see their job listed in some state when pulling the job scheduler for the state of their job?
I guess I do not understand why so much emphasis is put on the "At-Most-Once" or "Exactly-Once".
Can't the client poll the Job scheduler and ask the JS for a list of jobs queued, running, terminated, failed, etc.? It might be useful for the client to be able to submit jobs with a special keyword like JOB_SUBMITTED_BY since that would reduce the list it gets back. It would be nice if the value for the keyword was a unique identifier but doesn't have to be. Most schedulers allows you to name or associate a group to programs so that feature could be used as special keyword.
[MT] See Ian’s response to your email and my response to his email.
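As a rough illustration of the resubmission concern (hypothetical scheduler API; the submit/list_jobs names and the client_tag field are assumptions, not from the document), a client-supplied tag plus a poll before retrying is enough to keep users from launching the same job twice:

import uuid

def submit_once(scheduler, job_description):
    tag = str(uuid.uuid4())                       # client-generated identifier
    job_description = dict(job_description, client_tag=tag)
    try:
        return scheduler.submit(job_description)
    except TimeoutError:
        # Ambiguous outcome: check whether the submission actually took effect
        # before resubmitting.
        for job in scheduler.list_jobs():
            if job.get("client_tag") == tag:
                return job["id"]                  # it did get through
        return scheduler.submit(job_description)  # provably lost: retry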
Page 6. Section 3.4
General question: Are you taking into account that user applications will require different software?
[MT] Yes, that definitely needs to be dealt with and I was assuming as much. Note, however, that the simplest base case does not require it, since it is just a case with known, homogeneous hardware. But it's definitely a common use case for which we'll want to define an appropriate extension (that will be widely – if not universally – provided).
1. For example, if my executable is compiled for a Linux/Intel platform, then I would like to run it on a Linux/Intel system and not a Linux/AMD system.
[MT] Totally agree: One definitely needs to be able to describe this case.
2. Are you assuming that the program will be compiled on the fly on the allocated system, or pre-compiled and then staged?
[MT] Both can happen in various common use cases. I would argue that the pre-compiled case is more common than the JIT case – at least today, if not in the future.
I agree that staging the data is going to be an interesting topic.
All this is probably out-of-band for the HPC JS Profile but should be considered somewhere. I am sure it is; I just don't know where.
[MT] I would argue that data staging MUST be part of the HPC JS profile. You can’t implement various common cases without it.
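For what it's worth, here is a sketch of what a job description with staging directives might look like (field names and URLs are purely illustrative and not taken from any existing specification):

job = {
    "executable": "bin/simulate",
    "arguments": ["--steps", "10000"],
    "resources": {"os": "Linux", "cpu_architecture": "x86_64"},
    # Without stage-in, the pre-compiled executable and its input data never
    # reach the allocated nodes; without stage-out, the results never leave.
    "stage_in": [
        {"source": "gridftp://archive.example.org/inputs/run42.dat",
         "target": "run42.dat"},
        {"source": "http://builds.example.org/simulate-linux-x86_64",
         "target": "bin/simulate"},
    ],
    "stage_out": [
        {"source": "results.h5",
         "target": "gridftp://archive.example.org/results/run42/"},
    ],
}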
I like the section on virtual machines and think that they will be used more and more in the future.
Page 7. Extended Resource Features
The second approach (arbitrary resource types ...) is the only one that makes sense to me, since that approach is extensible. I believe that Moab is implementing this approach as well.
[MT] It's definitely a common case. But I can also imagine utility in defining various standard "sets" of supported resources, so that matchmaking becomes a more efficient process. The combination of the two approaches is also interesting: it would essentially be an optimization of matchmaking for arbitrarily defined resource types. In case you're arguing that the arbitrary extension approach should be the base case, recall that the point of the base case is to define the simplest, most minimal case from which to build up. There are definitely schedulers out there (that people will want to interoperate with) that don't support arbitrary matchmaking functionality.
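A minimal sketch of the arbitrary-resource-type approach (names are illustrative): requirements and offers are both free-form key/value sets and matchmaking is a simple subset test, so a standardized base set would just constrain which keys may appear, which is why the combined approach reads as an optimization rather than a different model.

def matches(offer, requirements):
    """True if the offered resource satisfies every stated requirement."""
    return all(offer.get(key) == value for key, value in requirements.items())

cluster = {"os": "Linux", "cpu_vendor": "Intel", "cores": 4, "memory_gb": 8}

# The Linux/Intel example from the section 3.4 comments:
matches(cluster, {"os": "Linux", "cpu_vendor": "Intel"})   # True
matches(cluster, {"os": "Linux", "cpu_vendor": "AMD"})     # False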
Page 8. Extended Client/System Administrator Operations
Are you assuming that System Administrators will be able to perform sys admin operations on somebody else's system? I don’t think that is right.
[MT] Imagine a forest of clusters that are all within a single data center or within a single admin domain and that have a meta-scheduler in front of them. I claim it makes perfect sense for a sys admin to want to administer the entire forest (I've seen several customer sites that are configured this way). In that case you want to allow the meta-scheduler to issue sys admin requests to the job schedulers that sit on the head node of each cluster. In general, you want to allow controlled delegation of operational capabilities. That doesn't violate the notion that local systems should remain under the ultimate authority of their local admins; it just means that those admins should be allowed to selectively grant authority to additional parties.
You mention suspend-resume. Are you thinking of suspending a job running across several clusters that are in different organizations? Or just suspending a job on a single cluster/server?
[MT] By far the most common case (I expect) is that sys admins will suspend/resume jobs within a single domain. But if you assume virtual organizations then it seems perfectly reasonable that a sys admin is given authority to do so throughout the entire virtual organization.
Again I am trying to figure out how this fits in with "One important aspect, is that the individual clusters should remain under the control of their local system administrator and/or of their local policies".
I believe that suspend-resume is a JS operation or an operation to be performed by the local sys admin, NOT by remote sys admins.
[MT] I have no problem with a system design that implements those kinds of sys admin operations by delegation of the actual requests, so that implementation of a request is carried out by the local infrastructure. (In fact, one might argue that that's the only way things really work in practice ...) But I will argue that there shouldn't have to be a human local sys admin in the loop if that local sys admin has decided to explicitly delegate authority to another party to act on their behalf.
If we are now talking about a meta-scheduler then yes, it makes sense. In the case of a meta-scheduler, it might take over the individual JSs and schedule jobs based on its own policies, its job reservation system, etc. In this case I look at it as having one deciding entity (the meta-scheduler) and several "slaves". Moab and Maui are the only meta-schedulers I am familiar with, and they do take over the scheduling decisions/node allocations/etc. and just submit jobs to the local job schedulers.
[MT] If you assume selective delegation (e.g. a meta-scheduler may perform/request some operations on some parts of a local system, but not all), then the notion of master and slaves is arguably not that appropriate, since the local schedulers act as slaves in some circumstances and as masters in others.
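One way to picture the controlled-delegation point (hypothetical code; the class and method names are assumptions): the local scheduler remains the authority, but its admin can grant specific operations to specific external principals, and every delegated request is still executed, and policy-checked, by the local infrastructure.

class LocalScheduler:
    def __init__(self):
        # operation -> set of principals the local admin has delegated it to
        self.delegations = {"suspend": set(), "resume": set()}

    def grant(self, operation, principal):
        self.delegations[operation].add(principal)

    def request(self, operation, principal, job_id):
        if principal not in self.delegations.get(operation, set()):
            raise PermissionError(f"{principal} may not {operation} jobs here")
        # Carried out by the local infrastructure itself, so local policy is
        # still enforced at the point of execution.
        print(f"{operation} job {job_id} on behalf of {principal}")

cluster = LocalScheduler()
cluster.grant("suspend", "meta-scheduler@example.org")
cluster.request("suspend", "meta-scheduler@example.org", job_id="1234")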
This does of course assume that the local system administrators have agreed on a schedule for when their cluster is shared within this greater infrastructure. This is a different approach from having jobs passed on to their local scheduler and run on their systems.
This just seems to be a different approach from the one that is taken in this paper.
I might be wrong. If I am please educate me.
Page 9. Section 3.10
Don’t forget UPC (Unified Parallel C: http://upc.nersc.gov/). This parallel programming paradigm is getting more and more interest from several communities.
We'll need to provide support for UPC as well.
[MT] Is UPC visible at the job scheduler level? If so, would you be willing to provide a brief description of its needs/expectations from the job scheduling system? Thanks.
Page 10. Section 3.13
A meta-scheduler approach that makes sense to me is to allow developers to submit their job to their local cluster using their "favorite" scheduler commands and then have the meta-scheduler load-balance the work and forward the job to another system/cluster if needed. Moab from Cluster Resources supports this approach even if the clusters have different JSs. They have a list of supported JSs such as LSF, PBSpro, SLURM, etc., and they can "translate" one JS's commands into another within that supported set.
[MT] I believe that that case is already mentioned in section 3.13.
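To make the translation idea concrete, a toy mapping (illustrative only; it covers just the submit-command names, and real translation of flags, queues, and resource limits is far more involved):

SUBMIT_COMMAND = {
    "LSF": "bsub",       # LSF batch submission command
    "PBSpro": "qsub",    # PBS Pro submission command
    "SLURM": "sbatch",   # SLURM batch submission command
}

def forward_job(cluster, script_path):
    """Build the submit command for whichever scheduler the target cluster runs."""
    command = SUBMIT_COMMAND[cluster["scheduler"]]
    # e.g. executed by the meta-scheduler via ssh on the cluster's head node
    return f"ssh {cluster['head_node']} {command} {script_path}"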
Page 11. SLURM is missing.
[MT] It would be great if you or someone else could create an appropriate appendix entry for it. :-)
Let me know what you think,
Regards
Susanne
---------------------------------------------------------------
Susanne M. Balle,
Hewlett-Packard
High Performance Computing R&D Organization
110 Spit Brook Road
Nashua, NH 03062
Phone: 603-884-7732
Fax: 603-884-0630
Susanne.Balle@hp.com