wifexited and wifsignalled confusion continues
Hi all!

Let me start with some background. On POSIX systems we have the following macros: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG. A process that returns from main() may return an 8-bit value (0-255). In that case, evaluating the status returned by wait() with these macros, we have:

  WIFEXITED != 0
  WEXITSTATUS = returned value (exit() or return from main())
  WIFSIGNALED = 0
  WTERMSIG = 0

If the process ends because of a signal, we get:

  WIFEXITED = 0
  WEXITSTATUS = 0
  WIFSIGNALED != 0
  WTERMSIG = signal number

Now, we can get all that information if we fork() and wait() for the process. Things behave differently when we use a shell to start new processes. If the process started in the shell returns from main(), we get the exact value (0-255). However, the problem is the convention that the shell sets the exit status of a signaled process to 128 + n, where n is the signal number. This makes it impossible to differentiate between a process that happened to return a value >= 128 and a process that was killed by a signal. Take a look at the following simple example:

  szalik@photon:/tmp$ cat foo.sh
  kill -6 $$
  szalik@photon:/tmp$ ./foo.sh
  Aborted (core dumped)
  szalik@photon:/tmp$ echo $?
  134
  szalik@photon:/tmp$ cat bar.sh
  exit 134
  szalik@photon:/tmp$ ./bar.sh
  szalik@photon:/tmp$ echo $?
  134

Now, let's come to the point. As I've just tested, it seems to me that SGE doesn't use a shell when submitting via DRMAA and is therefore able to correctly differentiate between an exit status in the range 128-255 and a process killed by a signal. In this case, drmaa_wifexited and drmaa_wifsignaled work exactly the same way as the WIFEXITED and WIFSIGNALED macros. It never happens that drmaa_wifexited = true and drmaa_wifsignaled = true at the same time. Interestingly, this actually seems to go against the spec -- see the next paragraphs.

DRMAAs for LSF and PBS are different. What we internally have there is the exit status from the shell, with all the consequences described before. So we can have this implemented in two different ways:

- we only treat exit codes 0-127 as normal process termination and have wifexited = true and wexitstatus = 0..127, whereas codes >= 128 lead to wifexited = false and wifsignaled = true, and wtermsig returns a computed signal (code - 128). The obvious problem, though, is that it makes codes >= 128 returned from the application unusable.

- when we see a code >= 128, we set wifexited = true and wifsignaled = true. wexitstatus gives us the "raw" code and wtermsig returns a computed signal (code - 128).

Now comes the specification itself. It seems to make drmaa_wif* even more ambiguous, at least for me, having in mind that we're trying to follow the POSIX model. For drmaa_wifexited we read:

  "Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"

So this means that if a process is signaled we get both wifexited = true and wifsignaled = true (this is tested in the latest test suite in ST_SUBMIT_KILL). What happens to wexitstatus and wtermsig? What is the reason the "failed after running" part was added? In UNIX, a process either exits with a status code or is signaled, in which case it doesn't have a status code at all. And again, this is not that easy if we use a shell...

-- Piotr Domagalski
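The difference between the two views of a terminated child can be reproduced with a small C sketch. It is only an illustration under stated assumptions: the helper names run_child and shell_style_status are made up for this example, and SIGKILL is used instead of signal 6 so that no core file is written. One child calls _exit(128 + SIGKILL), the other kills itself with SIGKILL; wait() tells them apart, while the shell-style 128 + n value is the same for both.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that either exits normally with
 * code `value` (kill_self == 0) or kills itself with signal `value`
 * (kill_self != 0). Stores the raw wait status in *status.
 * Returns 0 on success, -1 if fork() or waitpid() failed. */
static int run_child(int kill_self, int value, int *status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (kill_self)
            kill(getpid(), value);  /* child is terminated by the signal */
        _exit(value);               /* normal exit with code `value` */
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: collapse a wait status into the single number
 * a shell would report in $?, using the 128 + n convention. */
static int shell_style_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;  /* stopped or continued: not expected in this sketch */
}
```

Calling run_child(0, 128 + SIGKILL, &a) and run_child(1, SIGKILL, &b) should leave WIFEXITED(a) and WIFSIGNALED(b) non-zero, yet shell_style_status() returns the same number for both, which is exactly the information a DRM loses when it only sees the shell's exit status.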
Hi Piotr. This statement can be true with some implementations:
  WIFEXITED != 0
  WEXITSTATUS = returned value (exit() or return from main())
  WIFSIGNALED = 0
  WTERMSIG = 0

However, if WIFSIGNALED evaluates to zero (false), the result of WTERMSIG is undefined. Similarly, this statement can be true with some implementations:

  WIFEXITED = 0
  WEXITSTATUS = 0
  WIFSIGNALED != 0
  WTERMSIG = signal number

However, if WIFEXITED evaluates to zero (false), the result of WEXITSTATUS is undefined.

Yes, if a DRM uses a shell to start the client-specified program and the shell uses the convention of conveying that the child was terminated by exiting with 128+sigNum, then the DRM may not be able to distinguish between a child exit(137) and being terminated by sigNum=9. This is a DRM implementation issue.

Yes, since a given process cannot both exit on its own and be terminated by a signal, WIFEXITED and WIFSIGNALED should never both be true (non-zero).

Historically, a zero exit status from a Unix process meant "exited successfully". I believe the "failed after running" clause in the excerpt below is intended to mean "exited with a non-zero value".
"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"
-Roger

----Original Message----
From: "Piotr Domagalski" <piotr.domagalski@man.poznan.pl>
Sender: drmaa-wg-bounces@ogf.org
To: "DRMAA Working Group" <drmaa-wg@gridforum.org>
Subject: [DRMAA-WG] wifexited and wifsignalled confusion continues
Date: Thu, 23 Oct 2008 00:42:23 +0200
Hi Roger, On Wed, Nov 5, 2008 at 10:41 PM, Roger Brobst <rogerb@cadence.com> wrote:
However if WIFSIGNALED output zero (false), calling WTERMSIG is undefined. [...] However, If WIFEXITED output zero (false), calling WEXITSTATUS is undefined.
Yes, I totally agree with that. I just oversimplified this a bit in my listings.
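The rule quoted above (read WEXITSTATUS only when WIFEXITED is non-zero, and WTERMSIG only when WIFSIGNALED is non-zero) could be sketched in C roughly as follows. The struct term_info and the function decode_wait_status are hypothetical names chosen for this example; the -1 placeholders are an arbitrary choice for "no meaningful value".

```c
#include <stdbool.h>
#include <sys/wait.h>

/* Hypothetical container for a decoded wait() status. */
struct term_info {
    bool exited;      /* child called exit() or returned from main() */
    int  exit_status; /* meaningful only if exited is true, else -1 */
    bool signaled;    /* child was terminated by a signal */
    int  term_sig;    /* meaningful only if signaled is true, else -1 */
};

/* Decode a status obtained from wait()/waitpid(). Each value macro is
 * evaluated only after its matching WIF* predicate returned non-zero. */
static struct term_info decode_wait_status(int status)
{
    struct term_info t = { false, -1, false, -1 };

    if (WIFEXITED(status)) {
        t.exited = true;
        t.exit_status = WEXITSTATUS(status);
    } else if (WIFSIGNALED(status)) {
        t.signaled = true;
        t.term_sig = WTERMSIG(status);
    }
    return t;
}
```

With this shape, exited and signaled can never both be true for a status that comes straight from wait(), which matches the POSIX model discussed in this thread.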
Yes, if a DRM uses a shell to start the client-specified program and the shell uses the convention of conveying that the child was terminated by exiting with 128+sigNum, then the DRM may not be able to distinguish between a child exit(137) and being terminated by sigNum=9. This is a DRM implementation issue.
I was actually hoping for some discussion as to what a DRMAA implementation should look like in this case. And also (that's mainly for Peter), what the test suite should look like. For example, it currently tests exit statuses 0..255, which would obviously fail if we wanted to assume that drmaa_wifexited is true only for 0..128 and use the remaining values for signal numbers.
Yes, since a given process cannot exit itself and be terminated, WIFEXITED and WIFSIGNALED should never both be true (non-zero).
Are you talking about a DRMAA job here or just a general Unix process? Because in the former case, there seems to be a different assumption, probably because of the "failed after running" case in wifexited. The thing is that the current test suite (again, Peter?) tests whether a signalled DRMAA job was both wifsignaled and wifexited. That kind of puzzled me.
Historically, a zero exit status from a Unix process meant "exited successfully". I believe the "failed after running" clause in the below excerpt is intended to mean exited with a non-zero value
"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"
The problem is that, as far as I understood Peter's intentions in the test suite, this "failed after running" clause is interpreted differently. -- Piotr Domagalski
Hi,
Yes, if a DRM uses a shell to start the client-specified program and the shell uses the convention of conveying that the child was terminated by exiting with 128+sigNum, then the DRM may not be able to distinguish between a child exit(137) and being terminated by sigNum=9. This is a DRM implementation issue.
I was actually hoping for some discussion as to what a DRMAA implementation should look like in this case. And also (that's mainly for Peter), what the test suite should look like. For example, it currently tests exit statuses 0..255, which would obviously fail if we wanted to assume that drmaa_wifexited is true only for 0..128 and use the remaining values for signal numbers.
I am not the C binding expert, even though I am maintaining the test suite. Most test cases were originally written for SGE, and therefore could be way too specific. We already relaxed a lot of tests, in order to fit better to the spec itself. This sounds like just another case. If you guys agree on 128, we can put that in.
The thing is that current test suite (again, Peter?) tests whether a signalled DRMAA's job was both wifsignaled and wifexited. That kind of puzzled me.
This is a bug, and should have been fixed since test suite 1.6.0 (check the CHANGELOG). We had this discussion before. Please, send me a patch. Thanks for helping, Piotr ! /Peter.
On Thu, Nov 6, 2008 at 9:09 AM, Peter Tröger <peter@troeger.eu> wrote:
I am not the C binding expert, even though I am maintaining the test suite. Most test cases were originally written for SGE, and therefore could be way too specific. We already relaxed a lot of tests, in order to fit better to the spec itself. This sounds like just another case. If you guys agree on 128, we can put that in.
I don't think this has much to do with C binding. I see it as implementability of specification requirements... Therefore, in order to have more implementations pass the test suite, I would vote for limiting ST_EXIT_STATUS test to only codes <= 128. Then, it would be specific impl detail whether it supports obtaining 8 or 7 bit exit statuses. If DRMS uses shell to start the executable, it's not possible to have meaningful 8 bit exit code. It could also be worth noting in the specification document.
The thing is that current test suite (again, Peter?) tests whether a signalled DRMAA's job was both wifsignaled and wifexited. That kind of puzzled me.
This is a bug, and should have been fixed since test suite 1.6.0 (check the CHANGELOG). We had this discussion before. Please, send me a patch.
Lines 2203-2204 (part of ST_SUBMIT_KILL_SIG test) in trunk's test_drmaa.c:

    // DRMAA must tell us that the job was signalled and exited (see GFD 133)
    if (!check_term_details(stat, 0, 1, 1)) return 1;

-- Piotr Domagalski
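The two ways of mapping a shell-reported exit code onto wifexited/wifsignaled that were described at the start of this thread could be sketched as below. This is an illustration, not code from any DRMAA library: struct job_term, decode_option1 and decode_option2 are hypothetical names, and the input is assumed to be the 0-255 code a shell-based DRM hands back.

```c
#include <stdbool.h>

/* Hypothetical result type mirroring the drmaa_wif* / drmaa_w* values. */
struct job_term {
    bool exited;
    int  exit_status; /* -1 when not applicable */
    bool signaled;
    int  term_sig;    /* -1 when not applicable */
};

/* Option 1: codes 0..127 are normal exits; codes >= 128 are reported
 * only as "signaled" with signal = code - 128. Application exit codes
 * >= 128 become unusable. Note that code 128 itself would yield
 * signal 0, which is not a real signal number. */
static struct job_term decode_option1(int shell_code)
{
    struct job_term t = { false, -1, false, -1 };
    if (shell_code < 128) {
        t.exited = true;
        t.exit_status = shell_code;
    } else {
        t.signaled = true;
        t.term_sig = shell_code - 128;
    }
    return t;
}

/* Option 2: always report "exited" with the raw code; for codes >= 128
 * additionally report "signaled" with signal = code - 128. */
static struct job_term decode_option2(int shell_code)
{
    struct job_term t = { true, shell_code, false, -1 };
    if (shell_code >= 128) {
        t.signaled = true;
        t.term_sig = shell_code - 128;
    }
    return t;
}
```

For the shell code 134 from the first message, option 1 reports only a signal (6), whereas option 2 reports both an exit status of 134 and signal 6, which is the combination the quoted check_term_details(stat, 0, 1, 1) call appears to expect.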
On Sun, Nov 9, 2008 at 10:19 PM, Piotr Domagalski <piotr.domagalski@man.poznan.pl> wrote:
On Thu, Nov 6, 2008 at 9:09 AM, Peter Tröger <peter@troeger.eu> wrote:
I am not the C binding expert, even though I am maintaining the test suite. Most test cases were originally written for SGE, and therefore could be way too specific. We already relaxed a lot of tests, in order to fit better to the spec itself. This sounds like just another case. If you guys agree on 128, we can put that in.
I don't think this has much to do with C binding. I see it as implementability of specification requirements...
Therefore, in order to have more implementations pass the test suite, I would vote for limiting ST_EXIT_STATUS test to only codes <= 128. Then, it would be specific impl detail whether it supports obtaining 8 or 7 bit exit statuses. If DRMS uses shell to start the executable, it's not possible to have meaningful 8 bit exit code.
It could also be worth noting in the specification document.
Would you mind relaxing it even more, i.e. to test only codes from 0 to 125? Reading "man 1posix exit":

  RATIONALE
  As explained in other sections, certain exit status values have been reserved for special uses and should be used by applications only for those purposes:

  126   A file to be executed was found, but it was not an executable utility.
  127   A utility to be executed was not found.
  >128  A command was interrupted by a signal.

This way, we could interpret, at the DRMAA implementation level, exit codes 126 and 127 so that the job would get DRMAA_PS_FAILED and drmaa_wifaborted() = true because of a wrong executable, instead of getting an exit status of 126 or 127 and leaving the interpretation up to the user.

-- Piotr Domagalski
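As a rough sketch of the reserved ranges listed in that man page excerpt, a shell-reported code could be classified like this. The enum and the function classify_shell_code are hypothetical names for this example, and the result is only a hint: a job script can return 126 or 127 (or a value above 128) on its own, so the class does not prove what actually happened.

```c
/* Hypothetical classification of a shell exit code, following the
 * reserved values quoted from "man 1posix exit". */
enum shell_code_class {
    CODE_ORDINARY,        /* 0..125: free for application use */
    CODE_NOT_EXECUTABLE,  /* 126: file found but not an executable utility */
    CODE_NOT_FOUND,       /* 127: utility not found */
    CODE_SIGNAL,          /* >128: command interrupted by a signal */
    CODE_UNSPECIFIED      /* 128 or outside 0..255: not covered by the excerpt */
};

static enum shell_code_class classify_shell_code(int code)
{
    if (code >= 0 && code <= 125)
        return CODE_ORDINARY;
    if (code == 126)
        return CODE_NOT_EXECUTABLE;
    if (code == 127)
        return CODE_NOT_FOUND;
    if (code > 128 && code <= 255)
        return CODE_SIGNAL;
    return CODE_UNSPECIFIED;
}
```

Under this sketch a test suite limited to codes 0..125 would only ever exercise the CODE_ORDINARY branch, which is the relaxation proposed here.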
Hi Piotr.

I believe during a drmaa teleconf (over a year ago) it was agreed that the single testcase which validates a wide range of exit codes should be split into two testcases (one for below 128, the other for above). I haven't had an opportunity to dig through the archives to substantiate my recollection.

I think the suggestion to handle 126 and 127 specially deserves additional discussion ... but introduces its own issues. If the command is a shell script like:

  #!/bin/sh
  sleep 30  # or solve the world's problems
  exec /some/nonExistant/program

I would expect the shell to exit with status 126 (because /some/nonExistant/program was not found). It would be incorrect for the parent of the shell to interpret this as 'job never started' since the shell could perform any number of tasks before the failed exec.

-Roger

----Original Message----
From: "Piotr Domagalski" <piotr.domagalski@fedstage.com>
Subject: Re: [DRMAA-WG] wifexited and wifsignalled confusion continues
Date: Wed, 12 Nov 2008 13:10:19 +0100
Hi Roger, On Wed, Nov 12, 2008 at 4:26 PM, Roger Brobst <rogerb@cadence.com> wrote:
I believe during a drmaa teleconf (over a year ago) it was agreed that the single testcase which validates a wide range of exit codes should be split into two testcases (one for below 128, the other for above). I haven't had an opportunity to dig through the archives to substantiate my recollection.
It would be great to have that. I would prefer our implementation to pass the test suite smoothly, so I'd probably vote only for testing 0-128 ;-) Anyway, even that minor change of splitting it into two tests would be great.
I think the suggestion to handle 126 and 127 specially deserves additional discussion ... but introduces its own issues:
If the command is a shell script like:

  #!/bin/sh
  sleep 30  # or solve the world's problems
  exec /some/nonExistant/program
I would expect the shell to exit with status 126 (because /some/nonExistant/program was not found).
It would be incorrect for the parent of the shell to interpret this as 'job never started' since the shell could perform any number of tasks before the failed exec.
Yes, that's true. To sum up, we need to be aware of two different cases:

- The DRM doesn't use a shell to start the executable you specify in DRMAA_REMOTE_COMMAND. That's the case for SGE, for example. When you tell DRMAA to start a non-existing program, you get DRMAA_PS_FAILED + aborted = true. When you tell DRMAA to start the above script, you get DRMAA_PS_DONE + exited = true + exitstatus = 126. It's possible to completely tell these two cases apart.

- The DRM does use a shell to start the executable you specify in DRMAA_REMOTE_COMMAND. When you tell DRMAA to start a non-existing program, you internally get an exit status of 126. When you tell DRMAA to start the above script, you internally get an exit status of 126. Internally, these two cases look exactly the same to the DRMAA implementer, so she has to decide whether to leave them as they are or to *always* interpret exit codes 126 and 127 as DRMAA_PS_FAILED + aborted = true.

I tend to agree that it would be much safer to leave it as it is -- i.e. to return DRMAA_PS_DONE and exitstatus = 126/127 on systems that use a shell to start the program (LSF, in our case). The interpretation, whether the code was returned because the main program (DRMAA_REMOTE_COMMAND) was not found or because the program itself returned the code (explicitly, or because it was something like the above script), should be left to the end user.

-- Piotr Domagalski