Dear all,

the OGF folks recently managed to create a Subversion repository for us on GridForge. The DRMAA test suite is therefore no longer part of the DRMAA condor sources, but separately located here:

http://forge.gridforum.org/sf/go/projects.drmaa-wg/scm.drmaa_testsuite

The links on our home page are corrected. I also fixed the Windows support - the test suite now builds with Visual Studio 2005.

I tested everything with the latest Condor versions for Linux and Windows. Maybe somebody can do that also for SGE.

Best regards,
Peter.
On Thu, Aug 28, 2008 at 4:35 PM, Peter Tröger <peter@troeger.eu> wrote:
I tested everything with the latest Condor versions for Linux and Windows. Maybe somebody can do that also for SGE.
I'd suggest adding a simple test case for the drmaa_wifaborted() confusion I bumped into recently. The test could:

- submit a job in hold state,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifaborted() == true, drmaa_wifexited() == drmaa_wifsignaled() == drmaa_wifcoredumped() == false,
- submit a long job (e.g. /bin/sleep 3600),
- wait (polling) for it to start,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifsignaled() == true, drmaa_wifexited() == drmaa_wifaborted() == drmaa_wifcoredumped() == false,

I'm not 100% sure, but I guess SGE fails the second test...

--
Piotr Domagalski
Dear all,
I'd suggest adding a simple test case for the drmaa_wifaborted() confusion I bumped into recently. The test could:
Great idea, I did that by extending two existing test cases (ST_SUBMIT_IN_HOLD_DELETE and ST_SUBMIT_KILL_SIG). The test suite version is therefore now 1.6.0.
- submit a job in hold state,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifaborted() == true, drmaa_wifexited() == drmaa_wifsignaled() == drmaa_wifcoredumped() == false,
- submit a long job (e.g. /bin/sleep 3600),
- wait (polling) for it to start,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifsignaled() == true, drmaa_wifexited() == drmaa_wifaborted() == drmaa_wifcoredumped() == false,
wifexited() must be 0 for the first case, and !=0 for the second case. GFD.133 is (finally) very clear about that:

"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"

wifexited() should tell you if the job has an exit code, which is only possible if it ever was executed.
I'm not 100% sure, but I guess SGE fails the second test...
I tested the latest Condor for Windows and Linux. It now fails the first test, since it returns wifexited()!=0 even though the job was terminated before running. I will fix that for the next Condor release, if we agree on my interpretation of the spec.

Somebody else (Andreas?) needs to check SGE.

Thanks,
Peter.
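For reference, the expectations for the two cases under the interpretation above can be written down as a small check. This is only a sketch: `job_flags`, `hold_then_terminate_ok()` and `run_then_terminate_ok()` are hypothetical names that are not part of the DRMAA C binding or of the test suite, and the struct simply holds the values an implementation reported through drmaa_wifaborted(), drmaa_wifexited(), drmaa_wifsignaled() and drmaa_wifcoredumped(). The coredumped expectation is taken from the original proposal.

```c
#include <assert.h>   /* for callers that assert on the checks below */
#include <stdbool.h>

/* Hypothetical container for the four drmaa_wif*() results of one job. */
typedef struct {
    int aborted;     /* value reported by drmaa_wifaborted()    */
    int exited;      /* value reported by drmaa_wifexited()     */
    int signaled;    /* value reported by drmaa_wifsignaled()   */
    int coredumped;  /* value reported by drmaa_wifcoredumped() */
} job_flags;

/* Case 1: job submitted in hold state and terminated before it ever ran.
 * It never executed, so it cannot have an exit code: exited must be 0. */
bool hold_then_terminate_ok(job_flags f)
{
    return f.aborted != 0 && f.exited == 0 &&
           f.signaled == 0 && f.coredumped == 0;
}

/* Case 2: job terminated while running.
 * It failed after running, so exited must be non-zero, as must signaled. */
bool run_then_terminate_ok(job_flags f)
{
    return f.exited != 0 && f.signaled != 0 &&
           f.aborted == 0 && f.coredumped == 0;
}
```

With these checks, the Condor behaviour described above (wifexited()!=0 for a job terminated in hold state) is rejected by `hold_then_terminate_ok()`.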
On Fri, Aug 29, 2008 at 1:25 PM, Peter Tröger <peter@troeger.eu> wrote:
Great idea, I did that by extending two existing test cases (ST_SUBMIT_IN_HOLD_DELETE and ST_SUBMIT_KILL_SIG). The test suite version is therefore now 1.6.0.
That's great!
- submit a job in hold state,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifaborted() == true, drmaa_wifexited() == drmaa_wifsignaled() == drmaa_wifcoredumped() == false,
- submit a long job (e.g. /bin/sleep 3600),
- wait (polling) for it to start,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifsignaled() == true, drmaa_wifexited() == drmaa_wifaborted() == drmaa_wifcoredumped() == false,
wifexited() must be 0 for the first case, and !=0 for the second case.
Yep, my fault -- I was under the impression that a signalled POSIX process doesn't have an exit status, which is obviously not true.

Another thing. Now the following holds: wifaborted() == true if and only if wifexited() == false. Do we actually need wifaborted() then? When drmaa_wait() returns with success and wifexited() == false then we know that the process must have been aborted. Am I missing something? Or is it for this "we-have-no-idea-what-has-happened" state when aborted() == false && exited() == false?

--
Piotr Domagalski
Or is it for this "we-have-no-idea-what-has-happened" state when aborted() == false && exited() == false?
When drmaa_wait returns successfully, we *should* be able to determine if the job was aborted, signalled, or exited itself. In the latter two cases drmaa *should* be able to return either the signal or exitValue (respectively).

I'm confident there are circumstances when the DRM can determine that a job is no longer running, but cannot determine why it stopped running. There are some interesting timing issues which arise when a machine is unceremoniously power-toggled!

Yep, the "we-have-no-idea-what-happened" state.

-Roger

----Original Message----
From: "Piotr Domagalski" <piotr.domagalski@man.poznan.pl>
Sender: drmaa-wg-bounces@ogf.org
To: drmaa-wg@ogf.org
Subject: Re: [DRMAA-WG] DRMAA test suite moved
Date: Fri, 29 Aug 2008 14:29:18 +0200

On Fri, Aug 29, 2008 at 1:25 PM, Peter Tröger <peter@troeger.eu> wrote:
Great idea, I did that by extending two existing test cases (ST_SUBMIT_IN_HOLD_DELETE and ST_SUBMIT_KILL_SIG). The test suite version is therefore now 1.6.0.
That's great!
- submit a job in hold state,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifaborted() == true, drmaa_wifexited() == drmaa_wifsignaled() == drmaa_wifcoredumped() == false,
- submit a long job (e.g. /bin/sleep 3600),
- wait (polling) for it to start,
- drmaa_control(TERMINATE) and drmaa_wait(),
- assure that drmaa_wifsignaled() == true, drmaa_wifexited() == drmaa_wifaborted() == drmaa_wifcoredumped() == false,
wifexited() must be 0 for the first case, and !=0 for the second case.
Yep, my fault -- I was under the impression that a signalled POSIX process doesn't have an exit status, which is obviously not true.

Another thing. Now the following holds: wifaborted() == true if and only if wifexited() == false. Do we actually need wifaborted() then? When drmaa_wait() returns with success and wifexited() == false then we know that the process must have been aborted. Am I missing something? Or is it for this "we-have-no-idea-what-has-happened" state when aborted() == false && exited() == false?

--
Piotr Domagalski

--
drmaa-wg mailing list
drmaa-wg@ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
On Fri, Aug 29, 2008 at 5:58 PM, Roger Brobst <rogerb@cadence.com> wrote:
When drmaa_wait returns successfully, we *should* be able to determine if the job was aborted, signalled, or exited itself. In the latter two cases drmaa *should* be able to return either the signal or exitValue (respectively). I'm confident there are circumstances when the DRM can determine that a job is no longer running, but cannot determine why it stopped running. There are some interesting timing issues which arise when a machine is unceremoniously power-toggled !
Yep, the "we-have-no-idea-what-happened" state.
Thanks, that does make sense. -- Piotr Domagalski
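To make the outcome of this exchange concrete, here is a minimal sketch of how a caller might map the flag values onto the outcomes discussed above. The names `job_outcome` and `classify_outcome()` are hypothetical and not part of the DRMAA C binding; the function just takes the values obtained from drmaa_wifaborted(), drmaa_wifexited() and drmaa_wifsignaled() after a successful drmaa_wait().

```c
#include <assert.h>   /* for callers that assert on the result */

/* Hypothetical outcome categories, following the discussion in this thread. */
typedef enum {
    JOB_ABORTED,   /* job ended before it ever ran                */
    JOB_SIGNALED,  /* job ran and was terminated by a signal      */
    JOB_EXITED,    /* job ran and finished or failed on its own   */
    JOB_UNKNOWN    /* DRM knows the job is gone, but not why      */
} job_outcome;

/* Map the three flag values to one outcome.
 * aborted == 0 && exited == 0 is the "no idea what happened" state,
 * which is why wifaborted() cannot be derived from wifexited() alone. */
job_outcome classify_outcome(int aborted, int exited, int signaled)
{
    if (aborted != 0)
        return JOB_ABORTED;
    if (signaled != 0)
        return JOB_SIGNALED;
    if (exited != 0)
        return JOB_EXITED;
    return JOB_UNKNOWN;
}
```

A job on a machine that was power-toggled would, under these assumptions, come back with all three values at 0 and be classified as `JOB_UNKNOWN` rather than as aborted.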
participants (3): Peter Tröger, Piotr Domagalski, Roger Brobst