Hi all!

Let me start with some background. On POSIX systems we have the following macros: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG. A process that returns from main() may return an 8-bit value (0-255). In that case, evaluating the status returned by wait() with these macros gives:

WIFEXITED != 0
WEXITSTATUS = returned value (exit() or return from main())
WIFSIGNALED = 0
WTERMSIG = 0

If the process ends because of a signal, we get:

WIFEXITED = 0
WEXITSTATUS = 0
WIFSIGNALED != 0
WTERMSIG = signal number

We can get all of that information if we fork() and wait() for the process ourselves.

Things behave differently when we use a shell to start new processes. If the process started by the shell returns from main(), we get the exact value (0-255). The problem is the convention that the shell sets the exit status of a signaled process to 128 + n, where n is the signal number. This makes it impossible to tell a process that happened to return a value >= 128 from a process that was killed by a signal. Take a look at the following simple example:

szalik@photon:/tmp$ cat foo.sh
kill -6 $$
szalik@photon:/tmp$ ./foo.sh
Aborted (core dumped)
szalik@photon:/tmp$ echo $?
134
szalik@photon:/tmp$ cat bar.sh
exit 134
szalik@photon:/tmp$ ./bar.sh
szalik@photon:/tmp$ echo $?
134

Now, let's come to the point. As far as I can tell from my tests, SGE doesn't use a shell when submitting via DRMAA and is therefore able to correctly differentiate between an exit status in the range 128-255 and a process killed by a signal. In this case, drmaa_wifexited and drmaa_wifsignaled work exactly the same way as the WIFEXITED and WIFSIGNALED macros: it never happens that drmaa_wifexited = true and drmaa_wifsignaled = true at the same time. Interestingly, this actually seems to go against the spec (see the next paragraphs).

The DRMAA implementations for LSF and PBS are different. What we have internally there is the exit status from the shell, with all the consequences described above.
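To make the background concrete, here is a minimal C sketch, assuming a POSIX system. The helper names run_child and shell_style_code are made up for illustration and come from no DRMAA implementation. The first forks a child that either exits with a code or kills itself with a signal. The second folds a wait() status into the single number a shell would report, which is where the information is lost.

```c
#include <signal.h>
#include <sys/resource.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: fork a child that exits with `code` (when sig == 0)
 * or kills itself with signal `sig`. Stores the raw wait status in *status.
 * Returns 0 on success, -1 if fork() or waitpid() failed. */
static int run_child(int code, int sig, int *status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig != 0) {
            /* Avoid leaving a core file behind for signals such as SIGABRT. */
            struct rlimit no_core = { 0, 0 };
            setrlimit(RLIMIT_CORE, &no_core);
            signal(sig, SIG_DFL);   /* make sure the default action applies */
            raise(sig);
        }
        _exit(code);
    }
    if (waitpid(pid, status, 0) < 0)
        return -1;
    return 0;
}

/* Hypothetical helper: collapse a wait status into one number the way the
 * shell convention does (exit code, or 128 + signal number).
 * Returns -1 for statuses that are neither exited nor signaled. */
static int shell_style_code(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);
    if (WIFSIGNALED(status))
        return 128 + WTERMSIG(status);
    return -1;
}
```

With run_child(134, 0, &st) the status satisfies WIFEXITED, and with run_child(0, SIGABRT, &st) it satisfies WIFSIGNALED, so the caller of wait() can tell them apart. shell_style_code() maps the first to 134 and the second to 128 + SIGABRT, which is also 134 on systems where SIGABRT is 6, as in the transcript above.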
So we can implement this in two different ways:

- We only treat exit codes 0-127 as normal process termination, with wifexited = true and wexitstatus = 0..127, whereas codes >= 128 lead to wifexited = false and wifsignaled = true, and wtermsig returns a computed signal (code - 128). The obvious problem is that this makes codes >= 128 returned from the application unusable.

- When we see a code >= 128, we set both wifexited = true and wifsignaled = true. wexitstatus gives us the "raw" code and wtermsig returns a computed signal (code - 128).

Now comes the specification itself. It seems to make drmaa_wif* even more ambiguous, at least for me, bearing in mind that we're trying to follow the POSIX model. For drmaa_wifexited we read:

"Evaluates into 'exited' a non-zero value if stat was returned for a job that either failed after running or finished after running"

So this means that if a process is signaled we get both wifexited = true and wifsignaled = true (this is tested in the latest test suite in ST_SUBMIT_KILL). What then happens to wexitstatus and wtermsig? Why was the "failed after running" part added? In UNIX, a process either exits with a status code or is killed by a signal, in which case it doesn't have a status code at all. And again, this is not that easy if we use a shell...

--
Piotr Domagalski
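For reference, the two interpretations listed above can be sketched in C as follows. This is only an illustration under stated assumptions: struct job_status, interpret_strict and interpret_both are hypothetical names, not part of the DRMAA API or of the LSF/PBS implementations, and the input is assumed to be a shell-style exit code in the range 0-255.

```c
#include <stdbool.h>

/* Hypothetical result type; the fields mirror what the drmaa_wifexited,
 * drmaa_wexitstatus, drmaa_wifsignaled and drmaa_wtermsig calls would report. */
struct job_status {
    bool exited;
    int  exit_status;
    bool signaled;
    int  term_sig;
};

/* First option: codes 0-127 are normal exits, codes >= 128 are always
 * treated as "killed by signal (code - 128)". Application exit codes
 * >= 128 become unusable. */
static struct job_status interpret_strict(int shell_code)
{
    struct job_status s = { false, 0, false, 0 };
    if (shell_code < 128) {
        s.exited = true;
        s.exit_status = shell_code;
    } else {
        s.signaled = true;
        s.term_sig = shell_code - 128;
    }
    return s;
}

/* Second option: codes >= 128 set both flags; the raw code is kept as the
 * exit status and the signal is computed as code - 128. */
static struct job_status interpret_both(int shell_code)
{
    struct job_status s = { true, shell_code, false, 0 };
    if (shell_code >= 128) {
        s.signaled = true;
        s.term_sig = shell_code - 128;
    }
    return s;
}
```

For the shell code 134 from the earlier example, interpret_strict reports only a signal (6), while interpret_both reports an exit status of 134 and a signal of 6 at the same time. Neither can recover which of the two actually happened.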