Re: [DRMAA-WG] wifexited and wifsignalled confusion continues

5 Nov 2008

Hi Piotr,

This statement can be true with some implementations:
...
WIFEXITED != 0
 WEXITSTATUS = returned value (exit() or return from main())
 WIFSIGNALED = 0
 WTERMSIG = 0
However, if WIFSIGNALED evaluates to zero (false),
    the result of WTERMSIG is undefined.

Similarly, this statement can be true with some implementations:
...
WIFEXITED = 0
 WEXITSTATUS = 0
 WIFSIGNALED != 0
 WTERMSIG = signal number
However, if WIFEXITED evaluates to zero (false),
    the result of WEXITSTATUS is undefined.
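
A minimal sketch of that rule, written in Python only because its os
module exposes the same macros under the same names. run_child and
decode are illustrative helpers, not part of POSIX or DRMAA. Each
accessor is consulted only after its matching predicate returned true.

```python
import os
import signal


def run_child(action):
    """Fork, run action() in the child, and return the raw wait status."""
    pid = os.fork()
    if pid == 0:
        try:
            action()
        finally:
            # Only reached if action() returned or raised instead of
            # ending the child process itself.
            os._exit(255)
    _, status = os.waitpid(pid, 0)
    return status


def decode(status):
    """Return (kind, value), reading only the field that is defined."""
    if os.WIFEXITED(status):
        # WEXITSTATUS is meaningful only when WIFEXITED is true.
        return ("exited", os.WEXITSTATUS(status))
    if os.WIFSIGNALED(status):
        # WTERMSIG is meaningful only when WIFSIGNALED is true.
        return ("signaled", os.WTERMSIG(status))
    return ("other", None)
```

For example, decode(run_child(lambda: os._exit(3))) reports an exit,
while a child that sends itself SIGKILL is reported as signaled.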

Yes, if a DRM uses a shell to start the client-specified program
and the shell reports a signal-terminated child by exiting with
128+sigNum, then the DRM may not be able to distinguish between a
child that called exit(137) and one that was terminated by
sigNum=9.
This is a DRM implementation issue.

Yes, since a given process cannot both exit on its own and be
terminated by a signal, WIFEXITED and WIFSIGNALED should never both
be true (non-zero).

Historically, a zero exit status from a Unix process meant
"exited successfully".
I believe the "failed after running" clause in the excerpt
below is intended to mean "exited with a non-zero value".
...
"Evaluates into 'exited' a non-zero value if stat was returned 
 for a job that either failed after running or finished after 
 running"
-Roger

----Original Message----
From: "Piotr Domagalski" <piotr.domagalski@man.poznan.pl>
Sender: drmaa-wg-bounces@ogf.org
To: "DRMAA Working Group" <drmaa-wg@gridforum.org>
Subject: [DRMAA-WG] wifexited and wifsignalled confusion continues
Date: Thu, 23 Oct 2008 00:42:23 +0200

Hi all!

Let me start with some background. On POSIX systems we have the
following macros: WIFEXITED, WEXITSTATUS, WIFSIGNALED, WTERMSIG. A
process that returns from main() may return an 8-bit value (0-255). In
that case, applying these macros to the status returned by wait()
gives:

WIFEXITED != 0
WEXITSTATUS = returned value (exit() or return from main())
WIFSIGNALED = 0
WTERMSIG = 0

If the process ends because of a signal, we get:

WIFEXITED = 0
WEXITSTATUS = 0
WIFSIGNALED != 0
WTERMSIG = signal number

Now, we can get all that information if we fork() and wait() for the
process. Things behave differently when we use a shell to start new
processes. If the process started in the shell returns from main(), we
get the exact value (0-255). The problem, however, is the convention
that the shell sets the exit status of a signaled process to 128 + n,
where n is the signal number. This makes it impossible to differentiate
between a process that happened to return a value >= 128 and a process
that was killed by a signal. Take a look at the following simple
example:

szalik@photon:/tmp$ cat foo.sh
kill -6 $$
szalik@photon:/tmp$ ./foo.sh
Aborted (core dumped)
szalik@photon:/tmp$ echo $?
134

szalik@photon:/tmp$ cat bar.sh
exit 134
szalik@photon:/tmp$ ./bar.sh
szalik@photon:/tmp$ echo $?
134
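
The same effect can be reproduced programmatically. This is only a
sketch: it assumes that sh on the test machine follows the 128 + n
convention, it uses signal 9 instead of 6 to avoid a core dump, and
returncode is an illustrative helper name. Python's subprocess module
reports a directly signaled child as a negative number, which stands in
here for the WIFSIGNALED/WTERMSIG information.

```python
import subprocess


def returncode(argv):
    """Run argv as a direct child and return Python's returncode
    (a negative value means the child was killed by that signal)."""
    return subprocess.run(argv, stderr=subprocess.DEVNULL).returncode


# Direct children: the parent sees the real wait status.
direct_killed = returncode(["sh", "-c", "kill -9 $$"])
direct_exited = returncode(["sh", "-c", "exit 137"])

# The same two jobs, launched through a wrapper shell that passes on $?.
wrapped_killed = returncode(["sh", "-c", "sh -c 'kill -9 $$'; exit $?"])
wrapped_exited = returncode(["sh", "-c", "sh -c 'exit 137'; exit $?"])

print("direct: ", direct_killed, direct_exited)
print("wrapped:", wrapped_killed, wrapped_exited)
```

In the direct case the two outcomes stay distinguishable; behind the
wrapper shell both arrive as the same plain exit code.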

Now, let's come to the point.

As I've just tested, it seems to me that SGE doesn't use a shell when
submitting via DRMAA and is therefore able to correctly differentiate
between an exit status in the range 128-255 and a process killed by a
signal. In this case, drmaa_wifexited and drmaa_wifsignaled work
exactly the same way as the WIFEXITED and WIFSIGNALED macros. It never
happens that drmaa_wifexited = true and drmaa_wifsignaled = true at
the same time. Interestingly, this actually seems to contradict the
spec; see the next paragraphs.

The DRMAA implementations for LSF and PBS are different. What we have
internally there is the exit status from the shell, with all the
consequences described above. So we can implement this in two
different ways:
- we treat only exit codes 0-127 as normal process termination and
have wifexited = true and wexitstatus = 0..127, whereas codes >= 128
lead to wifexited = false and wifsignaled = true, with wtermsig
returning a computed signal (code - 128). The obvious problem is that
this makes codes >= 128 returned from the application unusable.
- when we see a code >= 128 we set wifexited = true and wifsignaled =
true. wexitstatus gives us the "raw" code and wtermsig returns a
computed signal (code - 128).
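
The two options can be sketched as pure functions over the shell's
exit code. All names here (JobStatus, strict_split, both_flags) are
made up for illustration and do not come from any DRMAA library, and
using None for a field that has no meaningful value is an assumption
of this sketch, not something the options above specify.

```python
from typing import NamedTuple, Optional


class JobStatus(NamedTuple):
    wifexited: bool
    wexitstatus: Optional[int]  # None: no meaningful value (assumption)
    wifsignaled: bool
    wtermsig: Optional[int]     # None: no meaningful value (assumption)


def strict_split(code: int) -> JobStatus:
    """First option: 0-127 is an exit, 128-255 is always a signal."""
    if code < 128:
        return JobStatus(True, code, False, None)
    return JobStatus(False, None, True, code - 128)


def both_flags(code: int) -> JobStatus:
    """Second option: codes >= 128 set both flags and keep the raw code."""
    if code < 128:
        return JobStatus(True, code, False, None)
    return JobStatus(True, code, True, code - 128)
```

For the shell exit code 134 from the example above, strict_split
reports only signal 6, while both_flags reports exit status 134 and
signal 6 together. Neither can tell which of the two actually happened.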

Now comes the specification itself. It seems to make drmaa_wif* even
more ambiguous, at least to me, bearing in mind that we're trying to
follow the POSIX model. For drmaa_wifexited we read:

"Evaluates into 'exited' a non-zero value if stat was returned for a job
that either failed after running or finished after running"

So this means that if a process is signaled we get both wifexited =
true and wifsignaled = true (this is tested in the latest test suite in
ST_SUBMIT_KILL). What happens to wexitstatus and wtermsig? What is
the reason the "failed after running" part was added? In UNIX, a process
either exits with a status code or is signaled, in which case it
doesn't have a status code at all. And again, this is not that easy if
we use a shell...

-- 
Piotr Domagalski
--
  drmaa-wg mailing list
  drmaa-wg@ogf.org
  http://www.ogf.org/mailman/listinfo/drmaa-wg

Roger Brobst