Hi Piotr. This statement can be true with some implementations:
    WIFEXITED != 0
    WEXITSTATUS = returned value (exit() or return from main())
    WIFSIGNALED = 0
    WTERMSIG = 0
However, if WIFSIGNALED evaluates to zero (false), the result of calling WTERMSIG is undefined. Similarly, this statement can be true with some implementations:
    WIFEXITED = 0
    WEXITSTATUS = 0
    WIFSIGNALED != 0
    WTERMSIG = signal number
However, if WIFEXITED evaluates to zero (false), the result of calling WEXITSTATUS is undefined.

Yes, if a DRM uses a shell to start the client-specified program, and the shell reports a child that was terminated by a signal by itself exiting with 128+sigNum, then the DRM may not be able to distinguish between a child that called exit(137) and one that was terminated by sigNum=9 (128+9 = 137). This is a DRM implementation issue.

Yes, since a given process cannot both exit on its own and be terminated by a signal, WIFEXITED and WIFSIGNALED should never both be true (non-zero).

Historically, a zero exit status from a Unix process has meant "exited successfully". I believe the "failed after running" clause in the excerpt below is intended to mean "exited with a non-zero value".
"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"
-Roger

----Original Message----
From: "Piotr Domagalski" <piotr.domagalski@man.poznan.pl>
Sender: drmaa-wg-bounces@ogf.org
To: "DRMAA Working Group" <drmaa-wg@gridforum.org>
Subject: [DRMAA-WG] wifexited and wifsignalled confusion continues
Date: Thu, 23 Oct 2008 00:42:23 +0200

Hi all!

Let me start with some background. On POSIX systems we have the following macros: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG. A process that returns from main() may return an 8-bit value (0-255). In that case, evaluating the status returned by wait() with these macros, we have:

    WIFEXITED != 0
    WEXITSTATUS = returned value (exit() or return from main())
    WIFSIGNALED = 0
    WTERMSIG = 0

If the process ends because of a signal, we get:

    WIFEXITED = 0
    WEXITSTATUS = 0
    WIFSIGNALED != 0
    WTERMSIG = signal number

Now, we can get all that information if we fork() and wait() for the process. Things behave differently when we use a shell to start new processes. If the process started in the shell returns from main(), we get the exact value (0-255). However, the problem is the convention that the shell sets the exit status of a signaled process to 128 + n, where n is the signal number. This makes it impossible to differentiate between a process that happened to return a value >= 128 and a process that was killed by a signal. Take a look at the following simple example:

    szalik@photon:/tmp$ cat foo.sh
    kill -6 $$
    szalik@photon:/tmp$ ./foo.sh
    Aborted (core dumped)
    szalik@photon:/tmp$ echo $?
    134
    szalik@photon:/tmp$ cat bar.sh
    exit 134
    szalik@photon:/tmp$ ./bar.sh
    szalik@photon:/tmp$ echo $?
    134

Now, let's come to the point. As I've just tested, it seems to me that SGE doesn't use a shell when submitting via DRMAA and is therefore able to correctly differentiate between an exit status in the range 128-255 and a process killed by a signal. In this case, drmaa_wifexited and drmaa_wifsignaled work exactly the same way as the WIFEXITED and WIFSIGNALED macros.
It never happens that drmaa_wifexited = true and drmaa_wifsignaled = true at the same time. Interestingly, this actually seems to go against the spec -- see the next paragraphs.

The DRMAA implementations for LSF and PBS are different. What we have internally there is the exit status from the shell, with all the consequences described before. So we can implement this in two different ways:

- We treat only exit codes 0-127 as normal process termination, with wifexited = true and wexitstatus = 0..127, whereas codes >= 128 lead to wifexited = false and wifsignaled = true, and wtermsig returns a computed signal (code - 128). The obvious problem is that this makes codes >= 128 returned from the application unusable.
- When we see a code >= 128, we set wifexited = true and wifsignaled = true. wexitstatus gives us the "raw" code and wtermsig returns a computed signal (code - 128).

Now comes the specification itself. It seems to make drmaa_wif* even more ambiguous, at least for me, bearing in mind that we're trying to follow the POSIX model. For drmaa_wifexited we read:

"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"

So this means that if a process is signaled, we get both wifexited = true and wifsignaled = true (this is tested in the latest test suite in ST_SUBMIT_KILL). What happens to wexitstatus and wtermsig? What is the reason the "failed after running" part was added? In UNIX, a process either exits with a status code or is signaled, in which case it doesn't have a status code at all. And again, this is not that easy if we use a shell...

--
Piotr Domagalski
--
drmaa-wg mailing list
drmaa-wg@ogf.org
http://www.ogf.org/mailman/listinfo/drmaa-wg
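The two interpretations of a shell-reported exit code listed in the message above can be sketched roughly as follows, again in Python. This is only an illustration of the trade-off, not how any particular DRMAA library is implemented: interpret_shell_code, its both_true flag, and shell_reported_code are hypothetical names. shell_reported_code repeats the foo.sh/bar.sh experiment through sh, assuming a shell that follows the 128 + n convention.

```python
import subprocess


def interpret_shell_code(code, both_true=False):
    """Map a shell-reported exit code (0-255) to DRMAA-style answers.

    both_true=False: codes >= 128 are reported only as signals (first option).
    both_true=True:  codes >= 128 set both wifexited and wifsignaled (second option).
    """
    if not 0 <= code <= 255:
        raise ValueError("shell exit codes are in the range 0-255")
    if code < 128:
        return {"wifexited": True, "wexitstatus": code,
                "wifsignaled": False, "wtermsig": None}
    # Either way the signal number is a guess: the job may simply have
    # exited with this code itself.
    return {"wifexited": both_true,
            "wexitstatus": code if both_true else None,
            "wifsignaled": True,
            "wtermsig": code - 128}


def shell_reported_code(command):
    """Run command in an inner shell; return the code an outer shell sees in $?."""
    outer = subprocess.run(
        ["sh", "-c", 'sh -c "$1"; exit $?', "sh", command],
        stderr=subprocess.DEVNULL,
    )
    return outer.returncode


if __name__ == "__main__":
    # Mirrors foo.sh and bar.sh: with a 128 + n shell, both calls are
    # expected to report the same code.
    print(shell_reported_code("kill -6 $$"), shell_reported_code("exit 134"))
    print(interpret_shell_code(134))
    print(interpret_shell_code(134, both_true=True))
```

Under the first option a job that deliberately exits with 134 is reported as killed by signal 6; under the second the caller gets both answers and has to decide which one to believe.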