[DRMAA-WG] wifexited and wifsignalled confusion continues

22 Oct 2008

Hi all!

Let me start with some background. On POSIX systems we have the
following macros: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG. A
process that returns from main() or calls exit() can pass back an
8-bit value (0-255). In that case, applying these macros to the
status returned by wait() gives:

WIFEXITED != 0
WEXITSTATUS = returned value (exit() or return from main())
WIFSIGNALED = 0
WTERMSIG = 0 (not meaningful in this case)

If the process ends because of a signal, we get:

WIFEXITED = 0
WEXITSTATUS = 0 (not meaningful in this case)
WIFSIGNALED != 0
WTERMSIG = signal number
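
To make the two cases concrete, here is a minimal sketch in Python,
whose os module exposes the same POSIX macros. The helper names
run_child and describe are made up for this illustration; only the
os.* calls are real library functions.

```python
import os
import signal

def run_child(action):
    """Fork, run action() in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        action()      # child: either exits or kills itself
        os._exit(0)   # not reached by the two actions used below
    _, status = os.waitpid(pid, 0)
    return status

def describe(status):
    """Decode a wait status with the POSIX macros from the os module."""
    if os.WIFEXITED(status):
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        return ("signaled", os.WTERMSIG(status))
    return ("other", status)

# Case 1: the child exits normally with status 134.
exited = describe(run_child(lambda: os._exit(134)))
# Case 2: the child is terminated by SIGABRT.
killed = describe(run_child(lambda: os.kill(os.getpid(), signal.SIGABRT)))
print(exited)
print(killed)
```

With a direct fork() and wait() the parent can tell the two cases
apart, even though the exit status in the first case is >= 128.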

Now, we can get all that information if we fork() and wait() for the
process ourselves. Things behave differently when we use a shell to
start new processes. If the process started by the shell returns from
main(), we still get the exact value (0-255). The problem is the
convention that the shell sets the exit status of a signaled process
to 128 + n, where n is the signal number. This makes it impossible to
differentiate between a process that happened to return a value >= 128
and a process that was killed by a signal. Take a look at the
following simple example:

szalik@photon:/tmp$ cat foo.sh
kill -6 $$
szalik@photon:/tmp$ ./foo.sh
Aborted (core dumped)
szalik@photon:/tmp$ echo $?
134

szalik@photon:/tmp$ cat bar.sh
exit 134
szalik@photon:/tmp$ ./bar.sh
szalik@photon:/tmp$ echo $?
134
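
The same effect can be reproduced programmatically. The following is
only a sketch: it assumes that "sh" on the machine follows the 128 + n
convention (bash and dash do), and the helper name
status_seen_through_shell is made up for this illustration.

```python
import subprocess

def status_seen_through_shell(command):
    """Run command as a child of sh and return the $? that sh reports."""
    script = command + "; echo $?"
    result = subprocess.run(["sh", "-c", script],
                            capture_output=True, text=True)
    return int(result.stdout.strip())

# An inner shell that kills itself with SIGABRT, as in foo.sh.
via_signal = status_seen_through_shell("sh -c 'kill -s ABRT $$'")
# An inner shell that simply exits with 134, as in bar.sh.
via_exit = status_seen_through_shell("sh -c 'exit 134'")
print(via_signal, via_exit)

# Waiting for the process directly keeps the distinction: Python's
# subprocess reports death by signal N as a return code of -N.
direct_signal = subprocess.run(["sh", "-c", "kill -s ABRT $$"],
                               stderr=subprocess.DEVNULL).returncode
direct_exit = subprocess.run(["sh", "-c", "exit 134"]).returncode
print(direct_signal, direct_exit)
```

Seen through the intermediate shell, both commands yield the same
number; seen directly by the parent, they differ.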

Now, let's come to the point.

As far as I can tell from my tests, SGE doesn't use a shell when
submitting via DRMAA and is therefore able to differentiate correctly
between an exit status in the range 128-255 and a process killed by a
signal. In this case, drmaa_wifexited and drmaa_wifsignaled work
exactly the same way as the WIFEXITED and WIFSIGNALED macros. It never
happens that drmaa_wifexited = true and drmaa_wifsignaled = true at
the same time. Interestingly, this actually seems to go against the
spec (see the next paragraphs).

The DRMAA implementations for LSF and PBS are different. What we have
internally there is the exit status from the shell, with all the
consequences described above. So we can implement this in two
different ways:
- we treat only exit codes 0-127 as normal process termination and
have wifexited = true and wexitstatus = 0..127, whereas codes >= 128
lead to wifexited = false and wifsignaled = true, with wtermsig
returning a computed signal (code - 128). The obvious problem is that
this makes codes >= 128 returned by the application unusable.
- when we see a code >= 128 we set both wifexited = true and
wifsignaled = true. wexitstatus gives us the "raw" code and wtermsig
returns a computed signal (code - 128).
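
The two options above can be written down as two small mapping
functions. This is only an illustrative sketch: JobStatus,
strict_mapping and permissive_mapping are names invented here, not
part of any DRMAA implementation, and None stands for "no value
available".

```python
from typing import NamedTuple, Optional

class JobStatus(NamedTuple):
    wifexited: bool
    wexitstatus: Optional[int]
    wifsignaled: bool
    wtermsig: Optional[int]

def strict_mapping(code: int) -> JobStatus:
    """Option 1: any shell exit code >= 128 is treated as a signal."""
    if code < 128:
        return JobStatus(True, code, False, None)
    return JobStatus(False, None, True, code - 128)

def permissive_mapping(code: int) -> JobStatus:
    """Option 2: codes >= 128 are reported as both exited and signaled."""
    if code < 128:
        return JobStatus(True, code, False, None)
    return JobStatus(True, code, True, code - 128)

# The ambiguous shell exit code from the foo.sh / bar.sh example.
print(strict_mapping(134))
print(permissive_mapping(134))
```

Neither function can recover what really happened to the job. The
first loses application exit codes >= 128, while the second reports a
signal number even for a job that simply called exit(134).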

Now comes the specification itself. It seems to make drmaa_wif* even
more ambiguous, at least to me, bearing in mind that we're trying to
follow the POSIX model. For drmaa_wifexited we read:

"Evaluates into 'exited' a non-zero value if stat was returned for a job
that either failed after running or finished after running"

So this means that if a process is signaled we get both wifexited =
true and wifsignaled = true (this is tested in the latest testsuite in
ST_SUBMIT_KILL). What happens to wexitstatus and wtermsig then? Why
was the "failed after running" part added? In UNIX, a process either
exits with a status code or is killed by a signal, in which case it
doesn't have a status code at all. And again, this is not that easy if
we use a shell...

-- 
Piotr Domagalski