normal exit status causes drmaa_wifaborted
I don't understand why causing a simple non-zero exit status is causing drmaa_wifaborted to be set.

The easiest way for me to demo this is to change line 38 of t/08_posix_tests.t of the Schedule::DRMAAc CPAN module to be

    my $remote_cmd = "csh -c 'exit 1'";

And then running "make test TEST_VERBOSE=1", which would produce:

<SNIP>
ok 12 - drmaa_wait says jobid did not change?
# Failed test (t/08_posix_tests.t at line 83)
not ok 13 - drmaa_wait should say there is more info available in POSIX funcs
ok 15 - drmaa_wifaborted error?
# Failed test (t/08_posix_tests.t at line 90)
not ok 16 - normal job should not abort.
ok 17 - drmaa_wifexited returned 3 of 3 args
ok 18 - drmaa_wifexited error?
# Failed test (t/08_posix_tests.t at line 97)
not ok 19 - normal job should exit.
<SNIP>

I've attached test 8 to this email, in case you want to see how the calls are made in Perl.

Any ideas?

Thanks,
Tim Harsch
Tim,

Looks like something localized to the Perl binding or your configuration. I did the same test on the Java language binding, which is also based on the C binding, and it worked fine for me. Output below, program attached.

Could the problem be that you're sending the full command line as the remote command and "1" as the args, instead of "csh" as the remote command and "-c", "'exit 1'" as the args? What is the meaning of setting the args to "1"?

---
% java -cp /sge/lib/drmaa.jar:. -d64 Test
Exited: true
Aborted: false
Signaled: false
---

Daniel

Tim Harsch wrote:
-- drmaa-wg mailing list drmaa-wg@ogf.org http://www.ogf.org/mailman/listinfo/drmaa-wg
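Daniel's point about separating the remote command from its arguments can be sketched in a few lines. This is only an illustration: `CommandSplit` and `split` are hypothetical names that appear in neither binding, and the quote handling is a simplification that strips the quotes the way a shell would (Daniel's own test passes the literal string "'exit 1'" instead).

```java
import java.util.ArrayList;
import java.util.List;

class CommandSplit {
    // Hypothetical helper: split a command line into words, treating
    // single- or double-quoted runs as part of one word (quotes removed).
    static List<String> split(String line) {
        List<String> words = new ArrayList<>();
        StringBuilder cur = new StringBuilder();
        char quote = 0;          // 0 means "not inside quotes"
        boolean inWord = false;
        for (char c : line.toCharArray()) {
            if (quote != 0) {
                if (c == quote) {
                    quote = 0;   // closing quote ends the quoted run
                } else {
                    cur.append(c);
                }
            } else if (c == '\'' || c == '"') {
                quote = c;       // opening quote starts (or continues) a word
                inWord = true;
            } else if (Character.isWhitespace(c)) {
                if (inWord) {
                    words.add(cur.toString());
                    cur.setLength(0);
                    inWord = false;
                }
            } else {
                cur.append(c);
                inWord = true;
            }
        }
        if (inWord) {
            words.add(cur.toString());
        }
        return words;
    }

    public static void main(String[] args) {
        List<String> words = split("csh -c 'exit 1'");
        // The first word would go to the remote command, the rest to the args.
        System.out.println("remote command: " + words.get(0));
        System.out.println("args: " + words.subList(1, words.size()));
    }
}
```

The first word is what would be handed to the job template as the remote command, and the remaining words as the argument list, rather than passing the whole line as the command.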
Hi Daniel,

Thanks so much for your help. I'm still trying to determine the problem, but your help has gotten me further along, I think.

My apologies about the setup in that script I sent. It was a leftover from the original test case. After reading your message I altered the script to have just "csh" as the cmd, and two args, "-c" and "'exit 1'". I got the same results.

I ran your Java test, which gave me the results you show here. My next step is to mimic your Java test with a bare-minimum Perl test and see if they produce the same results. I'll get back to you soon on that... For now, thanks a bunch!

----- Original Message -----
From: "Daniel Templeton" <Dan.Templeton@Sun.COM>
To: "Tim Harsch" <harsch1@llnl.gov>
Cc: "DRMAA-WG" <drmaa-wg@gridforum.org>
Sent: Wednesday, March 28, 2007 10:48 AM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
--------------------------------------------------------------------------------
import org.ggf.drmaa.*;
public class Test {
    public static void main(String[] args) throws Exception {
        // Open a DRMAA session with an empty contact string.
        Session s = SessionFactory.getFactory().getSession();
        s.init("");

        // The remote command is the executable only; its arguments are set separately.
        JobTemplate jt = s.createJobTemplate();
        jt.setRemoteCommand("/usr/bin/csh");
        jt.setArgs(new String[] {"-c", "'exit 1'"});

        // Submit the job and block until it finishes.
        String job = s.runJob(jt);
        JobInfo ji = s.wait(job, Session.TIMEOUT_WAIT_FOREVER);

        System.out.println("Exited: " + ji.hasExited());
        System.out.println("Aborted: " + ji.wasAborted());
        System.out.println("Signaled: " + ji.hasSignaled());

        s.deleteJobTemplate(jt);
        s.exit();
    }
}
Daniel,

By what method does the Java binding bind to the C binding? (E.g., the Perl binding uses SWIG.)

I'm diving into the Perl binding now, but it's been about 4 years since I wrote it, so it's going to take me some time, I think.

PS: It's really odd. The problem showed up in code I've had in regular use for a long time. I know bugs don't just introduce themselves, but this part of the binding worked fine, and I haven't upgraded SGE or Perl or recompiled Schedule::DRMAAc, yet the problem just appeared. I'm thinking the sysadmins ran up2date on my RH4 box and a dependency library of the C binding changed. But if your Java binding is actively using it, then that would rule it out...

----- Original Message -----
From: "Daniel Templeton" <Dan.Templeton@Sun.COM>
To: "Tim Harsch" <harsch1@llnl.gov>
Cc: "DRMAA-WG" <drmaa-wg@gridforum.org>
Sent: Wednesday, March 28, 2007 10:48 AM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
It uses JNI (Java Native Interface):

http://blogs.sun.com/templedf/entry/porting_the_drmaa_java_language

Rayson

On 3/28/07, Tim Harsch <harsch1@llnl.gov> wrote:
Thanks Rayson, as always, you're a great help!

Well, I've narrowed down the problem. I was worried that Schedule::DRMAAc may not be working correctly, but now I'm not so sure... I think it may be specific to SGE. I noticed that page 137 of the User's Guide ( http://192.18.109.11/817-6117/817-6117.pdf ) lists exit code 99 as having a specific meaning w.r.t. rescheduling. It got me wondering if other exit codes have specific meanings, or are getting interpreted in some way I don't understand. So I wrote the two attached scripts, output below. As you can see: exit codes below 100 work as expected, exit code 100 returns wifaborted, and exit codes above 100 get mangled. (NOTE: I was having difficulty getting my previous method of using /bin/csh -c 'exit 100' to work as expected, and so switched to a simple Perl wrapper script [ also attached ].)

I think a valid next step would be to write this script in the Java binding and see what happens.

[harsch1@xber1 DRMAA_JavaTest]$ Test.pl
Test.pl
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1' to grid with Job ID: '85064'
Exited: 1
Exit value: 1
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 60' to grid with Job ID: '85065'
Exited: 1
Exit value: 60
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 80' to grid with Job ID: '85066'
Exited: 1
Exit value: 80
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 100' to grid with Job ID: '85067'
Exited: 0
Aborted: 1
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1000' to grid with Job ID: '85068'
Exited: 1
Exit value: 232
Aborted: 0
Signaled: 0
Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 10000' to grid with Job ID: '85069'
Exited: 1
Exit value: 16
Aborted: 0
Signaled: 0

Thanks,
Tim Harsch

----- Original Message -----
From: "Rayson Ho" <rayrayson@gmail.com>
To: "DRMAA-WG" <drmaa-wg@gridforum.org>
Sent: Wednesday, March 28, 2007 6:28 PM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
In sge_shepherd(8):

    100  Job script, prolog and epilog: When FORBID_APPERROR is not set in the configuration (see sge_conf(5)), the job gets requeued. Otherwise see "Other".

On the other hand, on Unix (including Linux), there is a limit on how large the exit value can be (and exit code 1000 is invalid because it is too large):

http://tldp.org/LDP/abs/html/exitcodes.html

Rayson

On 3/29/07, Tim Harsch <harsch1@llnl.gov> wrote:
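The "mangled" values in Tim's output are consistent with the exit-value limit Rayson mentions: on Unix-like systems only the low 8 bits of the value passed to exit reach the parent process. A minimal sketch of that arithmetic follows; `ExitStatusDemo` and `reportedStatus` are illustrative names, not part of any DRMAA binding. Exit code 100 is left out because SGE handles it specially, as discussed in the thread.

```java
class ExitStatusDemo {
    // Only the low 8 bits of an exit value are reported to the parent,
    // so the visible status is the requested value modulo 256.
    static int reportedStatus(int requested) {
        return requested & 0xFF;
    }

    public static void main(String[] args) {
        int[] requested = {1, 60, 80, 1000, 10000};
        for (int code : requested) {
            System.out.println("exit(" + code + ") -> " + reportedStatus(code));
        }
        // 1000  = 3 * 256 + 232  -> 232
        // 10000 = 39 * 256 + 16  -> 16
    }
}
```

The last two results, 232 and 16, match the exit values Tim's script reported for 1000 and 10000.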
Rayson,

Thanks again! Exit code 100 is the real culprit here. An application I wrote was being run via Schedule::DRMAAc and uses exit codes to communicate results back to the caller, since we have only a few states worth noting from the calling script. We added one for exit code 100 not long ago, and it recently got put to the test in production.

Trying values larger than 100 was naive on my part. I knew there was a limit, but in my rush I was thinking it was 64K for some reason.

Thanks so much for pointing me to both sets of documentation here. I've included them in our code as references to check before adding new exit code states. Very valuable, thanks!

----- Original Message -----
From: "Rayson Ho" <rayrayson@gmail.com>
To: "DRMAA-WG" <drmaa-wg@gridforum.org>
Sent: Thursday, March 29, 2007 10:32 AM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
Tim,

Exit code 99 is for rescheduling. Exit code 100 indicates an error that prevents the job from running, hence the aborted thing. You can disable both of them using qmaster parameters. See the sge_conf man page. I'm a bit surprised that you're having trouble with exit codes over 100. See the following:

qsub -b y -sync y exit 18
Your job 257 ("exit 18") has been submitted
Job 257 exited with exit code 18.
qsub -b y -sync y exit 101
Your job 258 ("exit 101") has been submitted
Job 258 exited with exit code 101.
qsub -b y -sync y exit 248
Your job 259 ("exit 248") has been submitted
Job 259 exited with exit code 248.

qsub and DRMAA are the same source base, so if it works for qsub, it should work for DRMAA. (qsub -sync y and DRMAA both call japi_wait().)

Daniel

Tim Harsch wrote:
Thanks Rayson, as always, you're a great help!
Well, I've narrowed down the problem. I was worried that Schedule::DRMAAc may not be working correctly, but now I'm not so sure... I think it may be specific to SGE. I noticed that on page 137 of the User's guide ( http://192.18.109.11/817-6117/817-6117.pdf ), it lists exit code 99 as having specific meaning w.r.t. rescheduling. It got me wondering if other exit codes have specific meaning, or are getting interpreted in some way I don't understand. So I wrote the two attached scripts, output below. As you can see: exit codes below 100 work as expected, exit code 100 returns wifaborted, and exit codes above 100 get mangled. (NOTE: I was having difficulty getting my previous method of using /bin/csh -c 'exit 100' to work as expected and so switched to a simple perl wrapper script [ also attached ] )
I think a valid next step would be to write this script in the Java binding and see what happens.
[harsch1@xber1 DRMAA_JavaTest]$ Test.pl Test.pl

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1' to grid with Job ID: '85064'
Exited: 1
Exit value: 1
Aborted: 0
Signaled: 0

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 60' to grid with Job ID: '85065'
Exited: 1
Exit value: 60
Aborted: 0
Signaled: 0

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 80' to grid with Job ID: '85066'
Exited: 1
Exit value: 80
Aborted: 0
Signaled: 0

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 100' to grid with Job ID: '85067'
Exited: 0
Aborted: 1
Signaled: 0

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 1000' to grid with Job ID: '85068'
Exited: 1
Exit value: 232
Aborted: 0
Signaled: 0

Sent script '/home/harsch1/tmp/DRMAA_JavaTest/exit_script.pl 10000' to grid with Job ID: '85069'
Exited: 1
Exit value: 16
Aborted: 0
Signaled: 0
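The "mangled" values in the output above are consistent with ordinary POSIX behavior rather than anything specific to the binding: only the low 8 bits of the value passed to exit() reach the parent process, so 1000 is seen as 232 and 10000 as 16. The following is a minimal sketch of that arithmetic in Java, combined with the two special codes mentioned in this thread (99 and 100). The class and method names are hypothetical, and treating 99 and 100 as special is an assumption that holds only for an SGE configuration where those codes have not been disabled.

```java
public class ExitCodeSketch {
    // Assumed SGE conventions, as described in this thread.
    // Both can reportedly be disabled through qmaster parameters.
    static final int RESCHEDULE_CODE = 99;
    static final int ERROR_CODE = 100;

    /** Status seen by the parent: only the low 8 bits of the exit value survive. */
    static int reportedStatus(int requested) {
        return requested & 0xFF;
    }

    /** Hypothetical classification of a job by the status its parent would see. */
    static String classify(int requested) {
        int status = reportedStatus(requested);
        if (status == RESCHEDULE_CODE) {
            return "rescheduled";
        }
        if (status == ERROR_CODE) {
            return "aborted";
        }
        return "exited with " + status;
    }

    public static void main(String[] args) {
        // Same exit values as in the Test.pl run.
        int[] requested = {1, 60, 80, 100, 1000, 10000};
        for (int code : requested) {
            System.out.println("exit " + code + " -> " + classify(code));
        }
    }
}
```

Under these assumptions, every value in the run is accounted for: 1, 60 and 80 pass through unchanged, 100 is reported as aborted, and the two large values are simply truncated.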
Thanks, Tim Harsch
----- Original Message -----
From: "Rayson Ho" <rayrayson@gmail.com>
To: "DRMAA-WG" <drmaa-wg@gridforum.org>
Sent: Wednesday, March 28, 2007 6:28 PM
Subject: Re: [DRMAA-WG] normal exit status causes drmaa_wifaborted
It uses JNI (Java Native Interface):
http://blogs.sun.com/templedf/entry/porting_the_drmaa_java_language
Rayson
On 3/28/07, Tim Harsch <harsch1@llnl.gov> wrote:
Daniel, by what method does the Java binding bind to the C binding? (For example, the Perl binding uses SWIG.)
I'm diving into the Perl binding now, but it's been about 4 years since I wrote it, so it's going to take me some time, I think.
PS It's really odd: the problem showed up in code I've had in regular use for a long time. I know bugs don't just introduce themselves, but this part of the binding worked fine, and I haven't upgraded SGE or Perl or recompiled Schedule::DRMAAc, yet the problem just appeared. I'm thinking the sysadmins ran up2date on my RH4 box and a library that the C binding depends on changed. But if your Java binding is actively using it, that would rule that out...
participants (3)
- Daniel Templeton
- Rayson Ho
- Tim Harsch