DRMAA-WG April 4, 2006 call
*** new phone numbers *** *** new phone numbers ***

The bi-weekly DRMAA call is scheduled for 16:00 UTC (8:00 PDT - Pacific Daylight Time / 10:00 CDT / 17:00 Central Europe). All participants should use the following information to reach the conference call:

------------------------------------
* Toll Free Dial In Number for North America: 1 800 867-8609
* Toll Free Dial In Number for Germany: 0 800 101-4546
* Int'l Access/Caller Paid Dial In Number: +49 069509594678
* ACCESS CODE: 7223898
------------------------------------

Attachments to this email:
- March 21 meeting minutes

Meeting Agenda:
A. Meeting secretary for this meeting?
B. Acceptance of the March 21, 2006 meeting minutes
C. Admin
   - third chair update
F. Open/general issues discussion
   - experience documents
   - #1125 Tracker - see the included text at the end of the agenda
   - Job suspension is different from triggering job rescheduling in Condor (see attached "Re: GridWay Experience Report" mail)
   - Status of the test suite
   - post ver 1.0 issues
   - handling exit status for bad input / output / error streams (see attached "Re: [drmaa-wg] DRMAA TEST SUITE" mail)
   - misc

Cheers, Hrabri

------------------------- Tracker #1125 proposed change ----------------------------

Currently we have:

"Evaluates into 'exited', a non-zero value if stat was returned for a job that terminated normally. A zero value can also indicate that although the job has terminated normally an exit status is not available or that it is not known whether the job terminated normally. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information. A non-zero 'exited' value indicates more detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and drmaa_wcoredump()."

It was proposed (Hrabri's adaptation of Peter's latest proposal) to change it to

"Evaluates into 'exited' a non-zero value if stat was returned for an ended job that either failed after running or finished after running (see section 2.6). A non-zero 'exited' value indicates more detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and drmaa_wcoredump() functions. A zero result for the 'exited' parameter either indicates that
1) although it is known that the job was running, more information is not available
2) it is not known whether the job was running
In both cases drmaa_wexitstatus() SHALL NOT provide exit status information."
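A minimal sketch of the one rule that both the current and the proposed wording share: a caller should read an exit status only when 'exited' is non-zero. It does not call the DRMAA library. The type `sketch_wait_result` and the function `sketch_get_exit_status()` are hypothetical stand-ins for the values a caller would obtain from drmaa_wifexited() and drmaa_wexitstatus(), and the code takes no position on which wording the tracker finally adopts.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for what a caller has learned after a wait:
 * 'exited' mirrors the output of drmaa_wifexited(), 'exit_status' the
 * output of drmaa_wexitstatus(). Not part of the DRMAA API. */
typedef struct {
    int exited;
    int exit_status;
} sketch_wait_result;

/* Returns 1 and stores the exit status in *out only when 'exited' is
 * non-zero. Returns 0 and leaves *out untouched otherwise, because in
 * that case no exit status information is to be relied upon. */
static int sketch_get_exit_status(const sketch_wait_result *r, int *out)
{
    assert(r != NULL && out != NULL);
    if (r->exited == 0) {
        return 0;
    }
    *out = r->exit_status;
    return 1;
}
```

A caller would test the return value first and treat 0 as "no exit status available", whatever happens to be stored in the `exit_status` field.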
Mea culpa! I failed to synchronize my two accounts and overlooked Peter's latest e-mail, so please consider the inlined proposal obsolete. Hrabri
-----Original Message-----
From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Hrabri Rajic
Sent: Sunday, April 02, 2006 11:43 AM
To: 'DRMAA Working Group'
Subject: [drmaa-wg] DRMAA-WG April 4, 2006 call
Meeting minutes for April 4 phone conference:

- March 21 meeting minutes accepted without changes
- Upcoming SGE 6.0U8 release will still be DRMAA 0.95 compliant
  - DRMAA 1.0 compliance with SGE 6.0U9 release (3-6 months from now) or with the CVS main trunk
- Small problem with strtok_r() in the test suite under Solaris; Dan will commit a patched version to Sourceforge CVS
- Latest text proposal for drmaa_wifexited() discussed (tracker #1125), accepted on condition that the term "ended" is removed
  - Peter adds updated text to the tracker
- Job state after resuming from suspend state
  - Rough agreement that the Condor and GridWay approach of restarting the job is something different than suspend ("rescheduling")
  - Added as post 1.0 DRMAA feature (tracker #1787)
  - Suspend feature and the corresponding state transition back to PS_RUNNING remain mandatory for DRMAA 1.0 (no test suite changes)
  - Peter informs Ruben
- Discussion about job rejection in case of invalid job template
  - Would ease the Condor implementation, since invalid input files are detected on job submission by this system
  - Agreement that early rejection of invalid jobs should always be possible (e.g. compute centre checks)
  - Proposal for text change in new tracker #1786
- Document submission to GGF on Friday
  - Pending SGE experience report (Dan)
  - Pending updated Condor experience report (Peter)
  - Pending final DRMAA spec (Hrabri)

Regards, Peter.
------------------------------------------------------------------------
Subject: Re: [drmaa-wg] DRMAA TEST SUITE
From: Peter Tröger <peter.troeger@hpi.uni-potsdam.de>
Date: Thu, 23 Mar 2006 16:00:06 -0500
To: "Ruben Santiago Montero" <rubensm@dacya.ucm.es>
CC: "DRMAA Working Group" <drmaa-wg@gridforum.org>
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right? Could you accept this proposal?
Sure. It makes sense to me as well.
There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could just reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
The spec is unclear here, since the description of the input / output / error parameters demands a particular job state - DRMAA_PS_FAILED. You can only have a job state when you have a job id. You can only have a job id when drmaa_run() was successful. I really would like to have the option of DRMAA_ERRNO_DENIED_BY_DRM in this case as well, but then we have to relax the description of the corresponding job template attributes.
Sounds like another issue for the next phone call. Hrabri?
Regards, Peter.
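
The relaxed behaviour discussed in this mail can be sketched as a small check that accepts either outcome for an invalid input / output / error path: rejection at submission, or an accepted job that ends in a failed state. This is only an illustration of the idea under discussion, not what the spec text demands. All names here (`sketch_submit_rc`, `sketch_job_state`, `sketch_check_bad_stream()`) are hypothetical; they stand in for a submission result such as DRMAA_ERRNO_DENIED_BY_DRM and a final job state such as DRMAA_PS_FAILED without using the DRMAA headers.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins, not DRMAA identifiers. */
typedef enum { SKETCH_SUBMIT_OK, SKETCH_SUBMIT_DENIED } sketch_submit_rc;
typedef enum { SKETCH_JOB_DONE, SKETCH_JOB_FAILED } sketch_job_state;
typedef enum { SKETCH_VERDICT_PASS, SKETCH_VERDICT_FAIL } sketch_verdict;

/* Relaxed check for a job template with an invalid stream path.
 * - Rejected at submission: pass. No job id exists, so there is no
 *   job state to inspect and final_state is ignored.
 * - Accepted: a job id exists, so a final state must be supplied,
 *   and the job must have ended in the failed state. */
static sketch_verdict sketch_check_bad_stream(sketch_submit_rc rc,
                                              const sketch_job_state *final_state)
{
    if (rc == SKETCH_SUBMIT_DENIED) {
        return SKETCH_VERDICT_PASS;
    }
    assert(final_state != NULL);
    return (*final_state == SKETCH_JOB_FAILED) ? SKETCH_VERDICT_PASS
                                               : SKETCH_VERDICT_FAIL;
}
```

Passing the final state by pointer makes the dependency explicit: without a successful submission there is no job id and therefore no state to pass in.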
------------------------------------------------------------------------
Subject: [drmaa-wg] Minutes for DRMAA WG con-call 03/21/2006
From: "Andreas Haas" <Andreas.Haas@Sun.COM>
Date: Tue, 21 Mar 2006 12:58:11 -0500
To: "DRMAA Working Group" <drmaa-wg@gridforum.org>
Attendees: Roger, Peter, Daniel, Hrabri and Andreas
Last meeting minutes accepted without corrections.
* Hrabri proposes to add Peter as 3rd chair for the DRMAA WG. Peter says he would be willing to do it. Result of the election is 5 votes pro and 0 votes against!
* Discussion about the ST_INPUT_FILE_FAILURE test case brought up by Ruben Santiago Montero. There is agreement that the testing procedure needs to be changed to comply with the specification, as proposed by Ruben.
* Andreas to review change in spec for tracker item 1125
------------------------------------------------------------------------
Subject: Re: GridWay Experience Report
From: "Peter Troeger" <peter.troeger@hpi.uni-potsdam.de>
Date: Thu, 23 Mar 2006 10:33:19 -0500
To: "Andreas Haas" <Andreas.Haas@Sun.COM>
CC: "Ruben Santiago Montero" <rubensm@dacya.ucm.es>, "Hrabri Rajic" <hrabri@sbcglobal.net>, Ignacio Martín Llorente <llorente@dacya.ucm.es>, "Roger Brobst" <rbrobst@cadence.com>, "Daniel Templeton" <Dan.Templeton@Sun.COM>
- State of jobs after suspension: I loved reading this, since I had exactly the same problem in the Condor DRMAA implementation. I ended up marking such jobs as "was suspended before", in order to give the right active state afterwards. If we want to change the spec according to this, we have a post 1.0 issue.
Great! I think I can do the same thing in GridWay DRMAA.
Hm ... I doubt this is a good idea. Job suspension is different from triggering job rescheduling. If implementing job suspension is a severe problem for DRM vendors, I believe that is an argument for not making it mandatory rather than for deviating from the standard.
Even though we are running out of time for spec changes, this should be a topic for the next DRMAA phone conference. Hrabri, could you put this on the agenda?
Regards, Peter.