Condor has checkpointable jobs, which are the corresponding equivalent. So the answer is yes.
/Peter.
On 24.08.2010 at 19:28, Mariusz Mamoński wrote:
2010/8/24 Daniel Templeton <daniel.templeton@oracle.com>:
How broadly applicable is it? OGE supports it, and I think LSF does as well. What about Condor and Torque/PBS? Torque also
Daniel
On 08/23/10 07:12 AM, Mariusz Mamoński wrote:
Hi,
In some DRM-level fault-tolerance scenarios, a job must be marked as "rerunnable". Do we want to add this attribute to the DRMAAv2 JobTemplate?
Cheers,
On 23 August 2010 15:42, Daniel Templeton <daniel.templeton@oracle.com> wrote:
I have a customer who has the resubmission of failed jobs in a greater workflow as a critical requirement. That's not actually something that OGE itself supports, so I'm all for having it in DRMAA to plug the hole.
Daniel
On 08/23/10 02:50 AM, Peter Tröger wrote:
We already have some understanding of persistency, so the implementation effort is manageable. I am more concerned about a clear separation of live monitoring information and original submission data. For the latter, I saw no use case so far ...
Best, Peter.
On 29.07.2010 at 11:02, Andre Merzky wrote:
Our use case for having access to the original, complete job template is that the user can easily resubmit the same job, changing for example a single command-line parameter while leaving the remainder fixed. In SAGA this would look like:
saga::job::service js ("drmaa://torque.remote.net/");
saga::job::job j1 = js.get_job (jobid); // std::string
saga::job::description jd = j1.get_description ();
jd.set_attributes ("Arguments", new_args); // std::vector<std::string>
saga::job::job j2 = js.create_job (jd);
I understand that the backend may not be able to keep the original job template - in that case, a 'DoesNotExist' exception on 'get_description()' would be appropriate, IMHO. If the DRMAA implementation can cache that description somewhere, fine :-)
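The resubmission pattern above, including the case where the backend cannot return the original description, can be sketched in a self-contained form. This is only an illustration: `JobDescription`, `JobService`, `DoesNotExist`, and `resubmit_with_args` are hypothetical stand-ins defined here, not the SAGA or DRMAA API, and the in-memory cache merely models a backend that may or may not keep the template.

```cpp
#include <cassert>
#include <map>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for saga::job::description / the DRMAA job template.
struct JobDescription {
    std::string executable;
    std::vector<std::string> arguments;
    std::string working_directory;
};

// Hypothetical exception for a backend that did not keep the description.
struct DoesNotExist : std::runtime_error {
    explicit DoesNotExist(const std::string& what) : std::runtime_error(what) {}
};

// Hypothetical in-memory job service; a real backend would talk to the DRM.
class JobService {
public:
    // Submits a job and returns its id. 'keep_description' models whether
    // the backend is able to cache the original description.
    std::string create_job(const JobDescription& jd, bool keep_description = true) {
        std::string id = "job-" + std::to_string(++counter_);
        if (keep_description) {
            cache_[id] = jd;
        }
        return id;
    }

    // Returns the cached description, or throws DoesNotExist if none was kept.
    JobDescription get_description(const std::string& id) const {
        std::map<std::string, JobDescription>::const_iterator it = cache_.find(id);
        if (it == cache_.end()) {
            throw DoesNotExist("no description available for " + id);
        }
        return it->second;
    }

private:
    int counter_ = 0;
    std::map<std::string, JobDescription> cache_;
};

// Resubmits an existing job, replacing only the arguments and leaving the
// rest of the description unchanged.
std::string resubmit_with_args(JobService& js, const std::string& id,
                               std::vector<std::string> new_args) {
    JobDescription jd = js.get_description(id);  // may throw DoesNotExist
    jd.arguments = std::move(new_args);
    return js.create_job(jd);
}
```

A caller would catch `DoesNotExist` and fall back to building a fresh description by hand when the backend kept nothing.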
My $0.02, Andre.
PS: saga::job::description == drmaa::job::template
Quoting [Peter Tröger] (Jul 29 2010):
>
> From: Peter Tröger <peter@troeger.eu>
> Date: Thu, 29 Jul 2010 10:07:23 +0200
> To: Mariusz Mamoński <mamonski@man.poznan.pl>, drmaa-wg@ogf.org
> Subject: Re: [DRMAA-WG] Monitoring JobTemplate attributes for running jobs
>
> On 28.07.2010 at 23:42, Mariusz Mamoński wrote:
>
>> Hi,
>>
>> 2010/7/28 Peter Tröger <peter@troeger.eu>:
>>>
>>> Hi,
>>> Agenda item #8 was not discussed in the call today, but it is the
>>> burning issue for me at the moment. Please have a look in the
>>> "Attributes in JobInfo" tab:
>>>
>>> http://spreadsheets.google.com/ccc?key=0AqyvnBscJNqxcnJBSUs5dXRrU29EUVhGOGth...
>>>
>>> Currently, we allow access to the original JobTemplate from a JobInfo
>>> object. The idea was to get, besides the job monitoring information,
>>> also the information about what was demanded at submission time.
>>>
>>> While doing the Condor mapping, I figured out that most of the
>>> JobTemplate attributes are also monitorable for a running job. This
>>> includes things such as executable name and working directory.
>>> Normally they should be the same as in the JobTemplate, but Condor and
>>> SGE (at least) have this magic job wrapper stuff, where the admin can
>>> automatically and silently reconfigure / reinterpret everything in a
>>> JobTemplate. This might lead to the situation where the user asks for
>>> A, and silently gets B.
>>>
>>> The question: Should we drop the support for getting the JobTemplate
>>> as part of JobInfo, because the information is useless? Instead, we
>>> could add some (or maybe most) of the JobTemplate attributes as true
>>> dynamic monitoring information to JobInfo.
>>
>> In my opinion, repeating almost all attributes in this case brings
>> additional redundancy into the DRMAA API (another reason may be
>> performance - the JobTemplate attributes are more likely immutable).
>> Why not simply request the expected behavior in the spec? E.g.:
>>
>> a) the JobTemplate being part of the JobInfo struct is a reference to
>> the JobTemplate used for submission (for jobs submitted outside the
>> session it MUST be NULL)
>>
>> b) the JobTemplate reflects actual attributes of a job (without
>> obligation that all attributes must be available - e.g. in Torque the
>> actually executed command is hidden in the script)
>
> The interesting thing is that we already started to do this
> replication, for example: JobTemplate::candidateMachines vs.
> JobInfo::allocatedMachines. I still vote for finishing this
> replication, and removing the JT reference from JobInfo as compensation.
> I also have a problem with fetching live data from a structure called
> "template".
>
> Your example from Torque underlines my argumentation - we should choose
> a monitorable subset of JobTemplate and add it to the JobInfo
> structure, instead of linking the JobTemplate directly.
>
> Any other opinions?
>
> Peter.
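The proposal quoted above, replicating a monitorable subset of JobTemplate attributes in JobInfo instead of linking the template, could look roughly like the following sketch. Only `candidateMachines` and `allocatedMachines` are named in the thread; the other fields and both helper functions are illustrative assumptions, not part of any DRMAA specification.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <vector>

// What the user asked for at submission time (illustrative subset).
struct JobTemplate {
    std::string remoteCommand;     // assumed field name
    std::string workingDirectory;  // assumed field name
    std::vector<std::string> candidateMachines;
};

// What the DRM system reports for the running job. The attributes are
// plain values here, with no reference back to the JobTemplate.
struct JobInfo {
    std::string jobId;
    std::string remoteCommand;     // monitored value, may differ from the request
    std::string workingDirectory;  // monitored value, may differ from the request
    std::vector<std::string> allocatedMachines;
};

// Hypothetical helper: names the attributes where the monitored value
// differs from the request, e.g. because a job wrapper rewrote them.
std::vector<std::string> rewritten_attributes(const JobTemplate& jt, const JobInfo& info) {
    std::vector<std::string> changed;
    if (jt.remoteCommand != info.remoteCommand) {
        changed.push_back("remoteCommand");
    }
    if (jt.workingDirectory != info.workingDirectory) {
        changed.push_back("workingDirectory");
    }
    return changed;
}

// Hypothetical helper: true if every allocated machine was among the
// requested candidates.
bool allocation_within_candidates(const JobTemplate& jt, const JobInfo& info) {
    for (const std::string& machine : info.allocatedMachines) {
        if (std::find(jt.candidateMachines.begin(), jt.candidateMachines.end(), machine) ==
            jt.candidateMachines.end()) {
            return false;
        }
    }
    return true;
}
```

With both structures held separately by the application, the "asks for A, silently gets B" case becomes something the caller can detect by comparison.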
-- Nothing is ever easy.
-- drmaa-wg mailing list drmaa-wg@ogf.org http://www.ogf.org/mailman/listinfo/drmaa-wg
-- Mariusz