
Hi,

You raise good questions -- all of which are reasonable things to try to address in one or more extensions.

Concerning the high-throughput community, my impression has been that many (but by no means all) workloads consist of essentially idempotent jobs, so providing at-most-once semantics isn't that crucial: running a job more than once by accident doesn't really hurt anything. What the client wants is to get results back, one way or another, for all the work items to be done.

I don't quite understand your reference to statistical failure rates and would be interested to learn more about what you mean. It seems to me that a client will keep resubmitting jobs until they get answers back for all of the work items those jobs represent, irrespective of whether failures in job processing are random or non-random. Perhaps I'm misunderstanding the workloads you have in mind.

Thanks,
Marvin.

-----Original Message-----
From: owner-ogsa-wg@ggf.org [mailto:owner-ogsa-wg@ggf.org] On Behalf Of Karl Czajkowski
Sent: Tuesday, March 21, 2006 8:16 PM
To: Marvin Theimer
Cc: Ian Foster; Marty Humphrey; Carl Kesselman; ogsa-wg@ggf.org
Subject: Re: [ogsa-wg] Paper proposing "evolutionary vertical design efforts"

On Mar 21, Marvin Theimer modulated:
> I know that systems like LSF get used in high throughput settings where
> the service time for a job request is an issue....
> ... If my assumption is correct, then this common use case in the HPC
> world may be one that many, if not most, job schedulers would have a hard
> time supporting if they have to provide at-most-once transactional
> semantics for all job submissions.
I do not think anyone claimed that at-most-once semantics should be mandated on all requests. Certainly nobody from Globus says this... it is an optional feature of our job submission protocol, to be chosen by the client depending on their needs.

I think the question is much more about whether (or how many times) an optional at-most-once extension mechanism gets defined. Secondarily, there is the question of efficiently determining whether it (as an extension) is available in a remote service. A third interesting question is what the "cost" of the extension is versus the cost of lost jobs against an unknown remote service implementation, when setting up an extremely high-throughput run such as you describe.

The high throughput case is interesting to me, because it is precisely that user community that demanded efficient at-most-once semantics from GRAM! They are the ones who blast enough jobs through to notice statistical failure rates and the cost of recovery.

karl

--
Karl Czajkowski
karlcz@univa.com
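
To make the two client-side patterns in this exchange concrete, here is a minimal sketch of Marvin's "resubmit idempotent work items until every one has a result" loop, combined with Karl's point that at-most-once deduplication is an optional extension the client opts into only when the remote service supports it. The JobService interface and its supports_at_most_once, submit, and result operations are hypothetical placeholders for illustration; they are not the GRAM, LSF, or any OGSA-specified API.

    # Hypothetical client-side sketch; all interface names are assumptions.
    import time
    import uuid

    class JobService:
        """Stand-in for a remote job submission service (assumed interface)."""

        def supports_at_most_once(self):
            """Report whether the optional at-most-once extension is available."""
            raise NotImplementedError

        def submit(self, work_item, submission_id=None):
            """Submit a job; a client-chosen submission_id lets a dedup-capable
            service recognize and ignore accidental duplicate submissions."""
            raise NotImplementedError

        def result(self, work_item):
            """Return the result for a work item, or None if none is known yet
            (still running, lost, or never received)."""
            raise NotImplementedError

    def run_to_completion(service, work_items, poll_interval=30.0):
        """Resubmit idempotent work items until every one has a result."""
        use_dedup = service.supports_at_most_once()
        # One client-generated id per work item, reused across retries, so a
        # dedup-capable service runs each item at most once.
        ids = {item: str(uuid.uuid4()) for item in work_items}
        results = {}

        while len(results) < len(work_items):
            for item in work_items:
                if item in results:
                    continue
                outcome = service.result(item)
                if outcome is not None:
                    results[item] = outcome
                else:
                    # With idempotent jobs, resubmitting one that is merely slow
                    # wastes some cycles but does not corrupt the final answer;
                    # with the dedup extension the duplicate costs nothing at all.
                    service.submit(item, ids[item] if use_dedup else None)
            time.sleep(poll_interval)

        return results

Whether to pay for the deduplication extension or simply absorb the occasional duplicate run is then exactly the cost trade-off Karl raises for extremely high-throughput runs.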