
On Mar 21, Marvin Theimer modulated:
> Hi;
>
> Whereas I agree with you that at-most-once semantics are very desirable, I would like to point out that not all existing job schedulers implement them. I know that both LSF and CCS (the Microsoft HPC job scheduler) don’t. I’ve been trying to find out whether PBS and SGE do or don’t.
Aside from the comment Ian made that it is potentially useful to have at-most-once message semantics even if there is some potential for a local failure in the message processing (handoff from message layer to local scheduler), I believe LSF does support "hold" states where a job can be submitted and released as a two-phase interaction. Such a mechanism is sufficient to implement complete end-to-end at-most-once submission: the message engine logs the association between the client message and a local job handle before the held job is released.

Most schedulers also support job naming/annotation fields which are exposed through the job query interface. These can be used to implement a reliable correlation between message/request IDs and the local implementation job. They can likewise be used to synthesize at-most-once semantics in front of the scheduler, by determining whether a local job with the same name already exists before trying to resubmit. This behavior can be hidden in the message engine and "local adapter".
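As a rough illustration of the name-based approach combined with a hold state, here is a minimal sketch in Python. Everything in it is hypothetical: `FakeScheduler`, `LocalAdapter`, and their methods are invented for the example and do not correspond to the API of LSF or any other real scheduler, and the in-memory dictionary stands in for what would need to be a durable log.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class FakeScheduler:
    """Hypothetical stand-in for a local scheduler with job names and a hold state."""
    jobs: Dict[str, dict] = field(default_factory=dict)
    next_id: int = 1

    def find_by_name(self, name: str) -> Optional[str]:
        # Job query interface: look a job up by its name/annotation field.
        for handle, job in self.jobs.items():
            if job["name"] == name:
                return handle
        return None

    def submit_held(self, name: str, command: str) -> str:
        # Phase one: the job exists in the scheduler but cannot run yet.
        handle = f"job-{self.next_id}"
        self.next_id += 1
        self.jobs[handle] = {"name": name, "command": command, "state": "held"}
        return handle

    def release(self, handle: str) -> None:
        # Phase two: the job becomes eligible to run.
        self.jobs[handle]["state"] = "queued"


@dataclass
class LocalAdapter:
    """Hypothetical adapter sitting between the message engine and the scheduler."""
    scheduler: FakeScheduler
    log: Dict[str, str] = field(default_factory=dict)  # request ID -> job handle

    def submit_once(self, request_id: str, command: str) -> str:
        # Check the log first, then fall back to asking the scheduler by name,
        # which covers the case where the log entry was lost after a crash.
        handle = self.log.get(request_id) or self.scheduler.find_by_name(request_id)
        if handle is None:
            handle = self.scheduler.submit_held(request_id, command)
        # Record the correlation before the job is released.
        self.log[request_id] = handle
        if self.scheduler.jobs[handle]["state"] == "held":
            self.scheduler.release(handle)
        return handle


adapter = LocalAdapter(FakeScheduler())
first = adapter.submit_once("req-42", "/bin/true")
retry = adapter.submit_once("req-42", "/bin/true")  # retransmitted request
```

In this sketch the retransmitted request maps back to the job created by the first call instead of creating a second one. The client's request ID doubles as the job name, which is one possible choice; a real adapter might use an annotation field instead.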
> So, this brings up the following slightly more general question: should the simplest base case be the simplest case that does something useful, or should it be more complicated than that? I can see good arguments on both sides:
I find it a little disconcerting that this question is still being asked about job systems, because there is a history of having made and retracted this decision before. We did it in Globus with GRAM, and I think several of the other Grid projects did as well... The subset interface is not sufficient for users. A solution MUST incorporate an interoperable subset plus a robust extensibility mechanism to allow any of:

  1. incremental evolution of the core subset
  2. vendor-specific localization/extension
  3. community/site-specific localization/extension
  4. discovery of extended mode support
  5. graceful degradation in the absence of extended mode support

In my opinion, anything short of this will just add another non-interoperable interface to the hodge-podge of non-interoperable solutions that already exist.

karl

-- 
Karl Czajkowski
karlcz@univa.com