Hi,
Cool. We have a hard deadline at the beginning of April - would it be possible to have something from you by then?
Sure, the document will be ready next week.
Sorry, no - the text seems to be misleading. drmaa_wifexited() informs you whether or not detailed exit information is available from the DRM. A 'normal termination' therefore includes an application termination through signaling. The term "normally" is a little bit confusing here.
There is already some discussion about improving the wif_ descriptions:
https://forge.gridforum.org/tracker/?aid=1125
We should take this as a serious indication that this description needs even more improvement.
Thanks and best regards, Peter.
OK, Thank you very much for the explanation. Just to make sure I get it right. In the ST_INPUT_FILE_FAILURE test, the job is not able to run as there is no input file, and therefore there is no exit information. Is this an example of an abnormal termination?. Thanks again and best regards Ruben -- +-----------------------------------------------------------+ Dr. Ruben Santiago Montero Assistant Professor Dpto. Arquitectura de Computadores y Automatica Facultad de Informatica Universidad Complutense phone : +34 91 394 75 38 28040 Madrid fax : +34 91 394 75 27 Spain email : rubensm@dacya.ucm.es http://asds.dacya.ucm.es/ +-----------------------------------------------------------+ GridWay, The Way to Grid! http://www.gridway.org
OK, Thank you very much for the explanation. Just to make sure I get it right. In the ST_INPUT_FILE_FAILURE test, the job is not able to run as there is no input file, and therefore there is no exit information. Is this an example of an abnormal termination?.
Sorry, no. Since the job never ran, there is also no termination - normal or abnormal. The ST_INPUT_FILE_FAILURE test case checks for a non-zero result from drmaa_wifaborted(), which indicates that the job was 'ended' before it entered the running state. I would expect that drmaa_wifexited() returns a zero value for "exited" in this case, since no further exit information can be available. ( Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.) The DRMAA job status monitoring functions are a little bit confusing, sometimes even for the group members ;-) ... The best thing is to look at the code example from the C binding, which should explain most of the intended use cases for DRMAA functions. Best regards, Peter.
Hi,
Sorry, no. Since the job never ran, there is also no termination - normal or abnormal. The ST_INPUT_FILE_FAILURE test case checks for a non-zero result from drmaa_wifaborted(), which indicates that the job was 'ended' before it entered the running state. I would expect that drmaa_wifexited() returns a zero value for "exited" in this case, since no further exit information can be available.
Sorry, I do not agree. In the DRMS context, the job life cycle comprises all the job execution stages from the moment the job enters the DRM system. In this sense, whenever a job is submitted there should be a termination (whether it actually ran or not). To give you an example: if you submit a job (qsub) and then kill it (qdel), it is obvious that the job terminated abnormally (it has been killed), although the job never entered the running state. There is no relation between whether the job terminated normally and whether there is further information from the DRM. In the previous example (a job that has been killed) there may or may not be more information from the DRMS, but in any case it is clear that the job terminated abnormally. The drmaa_wifexited description should concentrate on one aspect, since there is no obvious (or general) relation between job termination and getting further information from the DRM.
( Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.)
I do not think so. Usually job preparation stages are part of the job execution, for example:
PBS: performs rcp or scp of input/output files. The existence of these files is not checked at submission (i.e. the job is queued). In the ST_*_FILE_FAILURE tests the jobs go through the following states: Q->R->E. That is, even if the input file does not exist, the job goes through the running state.
SGE: also does not check input or output file existence or permissions. In fact, in the ST_*_FILE_FAILURE tests the jobs go through the following states: qw->r->Eqw. That is, even if the input file does not exist, the job goes through the running state. In this case you can use the qalter command to redirect output paths.
Globus: includes stage-in/stage-out steps in the GRAM protocol; file existence is not checked at submission either.
Therefore I suggest removing ST_INPUT_FILE_FAILURE, ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test suite. In the DRMs above, at least, you can submit a job with output file /etc/passwd or an unusable input file; the job is queued, runs and fails. This assumption should be agreed upon, as it is not the default behavior of DRMs.
The DRMAA job status monitoring functions are a little bit confusing, sometimes even for the group members ;-) ... The best thing is to look at the code example from the C binding, which should explain most of the intended use cases for DRMAA functions.
Sure. The problem is that the code is not clear either. From the DRMAA 1.0 C bindings example:
    ...
    drmaa_wifexited(&exited, stat, NULL, 0);
    if (exited) {
        drmaa_wexitstatus(&exit_status, stat, NULL, 0);
        printf("job \"%s\" finished regularly with exit status %d\n", all_jobids[pos], exit_status);
    } else {
        drmaa_wifsignaled(&signaled, stat, NULL, 0);
        if (signaled) {
    ...
From this code it seems that a signaled job should end with a zero exited value from wifexited (as if it did not terminate normally), as opposed to your comments in the previous mails and the code in the DRMAA test suite.
Best regards, Peter.
Best Regards, Ruben
Sorry, I do not agree. In the DRMS context, job life cycle comprises all the job execution stages since the job enters the DRM system. In this sense, whenever a job is submitted there should be a termination (either it actually ran or not). I can give you an example, if you submit a job (qsub) and then you kill it (qdel), it is obvious that the job terminated abnormally (it has been killed), although the job never entered the running state.
This is one possible interpretation, I agree. The DRMAA spec is aligned to POSIX semantics here - it is only possible to have something terminated which was running (== executed) before.
There is no relation between if the job terminated normally and if there is no further information from the DRM. In the previous example (a job that has been killed) could or could not be more information from the DRMS. But in any case, it is clear that the job terminated abnormally.
drmaa_wifexited description should concentrate in one aspect since there is no obvious (or general) relation between job termination and getting further information from DRM.
You are right. The main intention of drmaa_wifexited() is to tell you if additional information about the end of the job execution is available. The final status of the job is provided by drmaa_job_ps(), and nothing else. The confusion might possibly be resolved by a slight reformulation of the first sentences in the drmaa_wif...() descriptions, in order to avoid the word "termination". This would not lead to a change of semantics. I have no good proposal - DRMAA group ?
( Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.)
I do not think so. Usually job preparation stages are part of the job execution, for example: ... Therefore I suggest removing the ST_ERROR_INPUT_FAIURE, ST_ERROR_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test suite. In the previous DRMs at least, you can submit a job with output file /etc/passwd or an unusable input file , the job is queued, runs and fails.
During the last phone call, the group went through the code. We agree with your impression that the 3 tests are currently not sufficient. The descriptions of the "input / output / error stream" job template parameters say that an invalid value should result in the job state DRMAA_PS_FAILED - and nothing more. There is no description of what that means for drmaa_wif...() calls, but the testsuite expects a particular behavior. If you look at DRMAA section 2.6, it is clearly shown that DRMAA_PS_FAILED is possible both for queued and running jobs.
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is testing the result of drmaa_synchronize(). After this change, I would expect the test cases to succeed on your system as well. In case of invalid input / output / error files, the DRMAA implementation would only be expected to report a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
BTW: Condor is one example of a system where the existence of input files is checked before the job is started. But at least your GRAM example convinced me that the opposite is also true ;-) ...
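The proposed change to the three file-failure tests can be sketched as a before/after comparison. This is a minimal model in plain C, not the test suite's actual code: `job_state`, `test_observation`, `file_failure_check_old` and `file_failure_check_new` are hypothetical names introduced for illustration, and the two state constants are local stand-ins for DRMAA_PS_DONE and DRMAA_PS_FAILED.

```c
#include <assert.h>
#include <stddef.h>

/* Local stand-ins for the two DRMAA job states that matter here.
 * Assumption: the real constants come from drmaa.h and are not reproduced. */
typedef enum { STATE_DONE, STATE_FAILED } job_state;

/* What a file-failure test could observe once the job has ended. */
typedef struct {
    job_state state;  /* final job state seen by the test */
    int aborted;      /* answer of drmaa_wifaborted(): non-zero if the job never ran */
} test_observation;

/* Behavior under discussion: the job must have failed AND never have run. */
int file_failure_check_old(const test_observation *obs)
{
    assert(obs != NULL);
    return obs->state == STATE_FAILED && obs->aborted != 0;
}

/* Proposed behavior: only the failed state is required. */
int file_failure_check_new(const test_observation *obs)
{
    assert(obs != NULL);
    return obs->state == STATE_FAILED;
}
```

Under this model, a job that is queued, runs and then fails (as described for PBS, SGE and Globus) is rejected by the old check but accepted by the new one, while a job that is stopped before running passes both.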
Sure. The problem is that the code is not clear either. From DRMAA 1.0 C bindings example: ... From this code it seems that a signaled job should end with a zero exited value from wifexited (as if it did not terminate normally), as opposed to your comments in the previous mails and the code in the DRMAA test suite.
You are right, as already said above. drmaa_wifexited() mainly indicates the availability of additional information. Regards, Peter.
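One way to read the explanation above is as a fixed order of checks on the value returned by drmaa_wait(). The sketch below models that order in plain C without linking against a DRMAA library. `job_end_info`, `job_end_kind` and `classify_job_end` are hypothetical names, and the struct fields merely stand in for the answers that drmaa_wifaborted(), drmaa_wifexited() and drmaa_wifsignaled() would give. It follows the reading from this thread (a non-zero 'exited' means that detailed information is available, even for a signaled job), which differs from the branch structure of the C binding example quoted earlier.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical model of what the drmaa_wif...() calls report for one job.
 * These fields are assumptions for illustration, not part of the DRMAA API. */
typedef struct {
    int aborted;   /* non-zero: job ended before it entered the running state */
    int exited;    /* non-zero: detailed exit information is available */
    int signaled;  /* non-zero: job ended because of a signal */
} job_end_info;

typedef enum {
    END_NEVER_RAN,      /* aborted before running; nothing more to ask */
    END_NO_DETAILS,     /* job ended, but the DRM has no further information */
    END_BY_SIGNAL,      /* details available: the job was ended by a signal */
    END_WITH_EXIT_CODE  /* details available: an exit status can be read */
} job_end_kind;

/* Apply the checks in the order discussed in the thread. */
job_end_kind classify_job_end(const job_end_info *info)
{
    assert(info != NULL);
    if (info->aborted)
        return END_NEVER_RAN;
    if (!info->exited)
        return END_NO_DETAILS;
    if (info->signaled)
        return END_BY_SIGNAL;
    return END_WITH_EXIT_CODE;
}
```

Only the last case would go on to read an exit status; in the first two cases no exit status is expected to be available.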
Hi Peter, On Tuesday 21 March 2006 21:43, you wrote:
Sorry, I do not agree. In the DRMS context, job life cycle comprises all the job execution stages since the job enters the DRM system. In this sense, whenever a job is submitted there should be a termination (either it actually ran or not). I can give you an example, if you submit a job (qsub) and then you kill it (qdel), it is obvious that the job terminated abnormally (it has been killed), although the job never entered the running state.
This is one possible interpretation, I agree. The DRMAA spec is aligned to POSIX semantics here - it is only possible to have something terminated which was running (== executed) before.
OK!!
There is no relation between if the job terminated normally and if there is no further information from the DRM. In the previous example (a job that has been killed) could or could not be more information from the DRMS. But in any case, it is clear that the job terminated abnormally.
drmaa_wifexited description should concentrate in one aspect since there is no obvious (or general) relation between job termination and getting further information from DRM.
You are right. The main intention of drmaa_wifexited() is to tell you if additional information about the job execution ending is available. The final status of the job is provided by drmaa_job_ps(), and nothing else.
OK, We will fix the drmaa_wifexited() in GridWay DRMAA according to this.
The confusion might eventually be solvable by a slight reformulation of the first sentences in the drmaa_wif...() descriptions, in order to avoid the word "termination". This would not lead to a change of semantics.
I have no good proposal - DRMAA group ?
( Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.)
I do not think so. Usually job preparation stages are part of the job execution, for example:
...
Therefore I suggest removing the ST_ERROR_INPUT_FAIURE, ST_ERROR_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test suite. In the previous DRMs at least, you can submit a job with output file /etc/passwd or an unusable input file , the job is queued, runs and fails.
During the last phone call, the group went through the code. We agree to your impression that the 3 tests are currently not sufficient. The descriptions for "input / output / error stream" job template parameters says that an invalid value should result in the job state DRMAA_PS_FAILED - and nothing more. There is no description of what that means for drmaa_wif...() calls, but the testsuite expects a particular behavior. If you look at DRMAA section 2.6, it is clearly shown that DRMAA_PS_FAILED is possible both for queued and running jobs.
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
Sure. It makes sense to me too. There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could simply reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
BTW: Condor is one example for a system where the existence of input files is checked before the job is started. But at least your GRAM example convinced me that the opposite is also true ;-) ...
Sure. The problem is that the code is not clear either. From DRMAA 1.0 C bindings example:
...
From this code it seems that a signaled job should end with a zero exited value from wifexited (as if it did not terminate normally), as opposed to your comments in the previous mails and the code in the DRMAA test suite.
You are right, as already said above. drmaa_wifexited() mainly indicates the availability of additional information.
OK
Regards, Peter.
Best Regards, Rubén
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
Sure. It makes sense to me too.
There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could just reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
The spec is unclear here, since the description of the input / output / error parameters demands a particular job state - DRMAA_PS_FAILED. You can only have a job state when you have a job id. You can only have a job id when drmaa_run_job() was successful. I really would like to have the option of DRMAA_ERRNO_DENIED_BY_DRM in this case as well, but then we have to relax the description of the corresponding job template attributes. Sounds like another issue for the next phone call. Hrabri ? Regards, Peter.
A validator could reject "invalid" jobs before they get passed to a scheduler on systems that have a job filter. I do not know if it is possible to make all the DRM systems behave equally in this case, since the point of failure could occur at different stages of job submission/execution. I will put it on the agenda. Hrabri
-----Original Message----- From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Peter Tröger Sent: Thursday, March 23, 2006 3:00 PM To: Ruben Santiago Montero Cc: DRMAA Working Group Subject: Re: [drmaa-wg] DRMAA TEST SUITE
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
Sure. It makes sense to me too.
There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could just reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
The spec is unclear here, since the description of the input / output / error parameters demands a particular job state - DRMAA_PS_FAILED. You can only have a job state when you have a job id. You can only have a job id when drmaa_run_job() was successful. I really would like to have the option of DRMAA_ERRNO_DENIED_BY_DRM in this case as well, but then we have to relax the description of the corresponding job template attributes.
Sounds like another issue for the next phone call. Hrabri ?
Regards, Peter.
Hi Ruben, Peter, It might be a good idea for the two of you to check the drmaa_wif* functions for correctness from your standpoint. Tracker 1125, https://forge.gridforum.org/tracker/?aid=1125, could explain the reasons for the many changes those routines went through. Attached is the up-to-date DRMAA spec. Thx Hrabri
-----Original Message----- From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Ruben Santiago Montero Sent: Thursday, March 23, 2006 4:55 AM To: Peter Tröger Cc: DRMAA Working Group Subject: Re: [drmaa-wg] DRMAA TEST SUITE
Hi Peter,
On Tuesday 21 March 2006 21:43, you wrote:
Sorry, I do not agree. In the DRMS context, job life cycle comprises all the job execution stages since the job enters the DRM system. In this sense, whenever a job is submitted there should be a termination (either it actually ran or not). I can give you an example, if you submit a job (qsub) and then you kill it (qdel), it is obvious that the job terminated abnormally (it has been killed), although the job never entered the running state.
This is one possible interpretation, I agree. The DRMAA spec is aligned to POSIX semantics here - it is only possible to have something terminated which was running (== executed) before.
OK!!
There is no relation between if the job terminated normally and if there is no further information from the DRM. In the previous example (a job that has been killed) could or could not be more information from the DRMS. But in any case, it is clear that the job terminated abnormally.
drmaa_wifexited description should concentrate in one aspect since there is no obvious (or general) relation between job termination and getting further information from DRM.
You are right. The main intention of drmaa_wifexited() is to tell you if additional information about the job execution ending is available. The final status of the job is provided by drmaa_job_ps(), and nothing else.
OK, We will fix the drmaa_wifexited() in GridWay DRMAA according to this.
The confusion might eventually be solvable by a slight reformulation of the first sentences in the drmaa_wif...() descriptions, in order to avoid the word "termination". This would not lead to a change of semantics.
I have no good proposal - DRMAA group ?
( Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.)
I do not think so. Usually job preparation stages are part of the job execution, for example:
...
Therefore I suggest removing the ST_ERROR_INPUT_FAIURE, ST_ERROR_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test suite. In the previous DRMs at least, you can submit a job with output file /etc/passwd or an unusable input file , the job is queued, runs and fails.
During the last phone call, the group went through the code. We agree to your impression that the 3 tests are currently not sufficient. The descriptions for "input / output / error stream" job template parameters says that an invalid value should result in the job state DRMAA_PS_FAILED - and nothing more. There is no description of what that means for drmaa_wif...() calls, but the testsuite expects a particular behavior. If you look at DRMAA section 2.6, it is clearly shown that DRMAA_PS_FAILED is possible both for queued and running jobs.
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
Sure. It makes sense to me too.
There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could just reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
BTW: Condor is one example for a system where the existence of input files is checked before the job is started. But at least your GRAM example convinced me that the opposite is also true ;-) ...
Sure. The problem is that the code is not clear either. From DRMAA 1.0 C bindings example:
...
From this code it seems that a signaled job should end with a zero exited value from wifexited (as if it did not terminate normally), as opposed to your comments in the previous mails and the code in the DRMAA test suite.
You are right, as already said above. drmaa_wifexited() mainly indicates the availability of additional information.
OK
Regards, Peter.
Best Regards, Rubén
I think both Ruben and me didn't like the statement about "normal termination":
--- snip
"Evaluates into 'exited', a non-zero value if stat was returned for a job that terminated normally. A zero value can also indicate that although the job has terminated normally an exit status is not available or that it is not known whether the job terminated normally. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information. A non-zero 'exited' value indicates more detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and drmaa_wcoredump()."
-- snip
We discussed that "normal" termination might have a completely different meaning in different DRMs. Therefore, DRMAA should only rely on its own job state transition concept, instead of using new words such as "termination". A first rough proposal for a different text:
-- snip
"Evaluates into 'exited' a non-zero value if stat was returned for an ended job that either failed (DRMAA_PS_FAILED) or finished (DRMAA_PS_DONE). More detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and drmaa_wcoredump(). A zero result for the 'exited' parameter either indicates that 1.) although the job is known to be ended more information is not available or 2.) that it is not known whether the job ended. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information."
-- snip
Just a proposal, the other fixes are fine.
Peter.
Hrabri Rajic wrote:
Hi Ruben, Peter,
It might be a good idea for the two of you to check the drmaa_wif* functions for correctness from your standpoint. Tracker 1125, https://forge.gridforum.org/tracker/?aid=1125, could explain the reasons for the many changes those routines went through.
Attached is the up to date DRMAA spec.
Thx
Hrabri
Peter,
Works for me.
Daniel
Peter Troeger wrote on 03/27/06 14:17:
I think both Ruben and me didn't like the statement about "normal termination":
--- snip
"Evaluates into 'exited', a non-zero value if stat was returned for a job that terminated normally. A zero value can also indicate that although the job has terminated normally an exit status is not available or that it is not known whether the job terminated normally. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information. A non-zero 'exited' value indicates more detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and drmaa_wcoredump()."
-- snip
We discussed that "normal" termination might have a completely different meaning in different DRMs. Therefore, DRMAA should only rely on its own job state transition concept, instead of using new words such as "termination". A first rough proposal for a different text:
-- snip
"Evaluates into 'exited' a non-zero value if stat was returned for a ended job that either failed (DRMAA_PS_FAILED) or finished (DRMAA_PS_DONE). More detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(),drmaa_wexitstatus(), and drmaa_wcoredump(). A zero result for the 'exited' parameter either indicates that 1.) although the job is known to be ended more information is not available or 2.) that it is not known whether the job ended. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information."
-- snip
Just a proposal, the other fixes are fine.
Peter.
Hrabri Rajic wrote:
Hi Ruben, Peter,
It might be a good idea for the two of you to check the drmaa_wif* functions for correctness from your standpoint. Tracker 1125, https://forge.gridforum.org/tracker/?aid=1125 could explain the reasons for many changes those routines went through.
Attached is the up to date DRMAA spec.
Thx
Hrabri
-----Original Message-----
From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Ruben Santiago Montero
Sent: Thursday, March 23, 2006 4:55 AM
To: Peter Tröger
Cc: DRMAA Working Group
Subject: Re: [drmaa-wg] DRMAA TEST SUITE
Hi Peter,
On Tuesday 21 March 2006 21:43, you wrote:
Sorry, I do not agree. In the DRMS context, the job life cycle comprises all the job execution stages from the moment the job enters the DRM system. In this sense, whenever a job is submitted there should be a termination (whether it actually ran or not). I can give you an example: if you submit a job (qsub) and then you kill it (qdel), it is obvious that the job terminated abnormally (it has been killed), although the job never entered the running state.
This is one possible interpretation, I agree. The DRMAA spec is aligned to POSIX semantics here - it is only possible to have something terminated which was running (== executed) before.
OK!!
There is no relation between whether the job terminated normally and whether further information is available from the DRM. In the previous example (a job that has been killed) there may or may not be more information from the DRMS. But in any case, it is clear that the job terminated abnormally. The drmaa_wifexited description should concentrate on one aspect, since there is no obvious (or general) relation between job termination and getting further information from the DRM.
You are right. The main intention of drmaa_wifexited() is to tell you if additional information about the job execution ending is available. The final status of the job is provided by drmaa_job_ps(), and nothing else.
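The reading described above can be sketched from the caller's side. The following is only a minimal model under stated assumptions: `wait_info`, `wait_outcome` and `classify_wait` are hypothetical names, not part of the DRMAA C binding, and the fields simply stand in for the values that drmaa_wifexited(), drmaa_wifsignaled() and drmaa_wexitstatus() would have reported. Treating the exit status as readable only for non-signaled jobs is likewise an assumption borrowed from POSIX.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical snapshot of what the drmaa_wif...() calls reported for one
   job. The struct and its field names are illustrative, not DRMAA types. */
typedef struct {
    int exited;      /* non-zero: detailed exit information is available */
    int signaled;    /* non-zero: the job ended because of a signal */
    int exit_status; /* assumed meaningful only if exited != 0 and signaled == 0 */
} wait_info;

typedef enum {
    WAIT_NO_DETAILS,  /* nothing more to learn from the status value */
    WAIT_SIGNALED,    /* details available: the job was ended by a signal */
    WAIT_EXIT_STATUS  /* details available: exit_status can be read */
} wait_outcome;

/* Decide which kind of detail a caller may look at. This says nothing
   about DONE vs. FAILED; per the thread, that comes from drmaa_job_ps(). */
wait_outcome classify_wait(const wait_info *w)
{
    assert(w != NULL);
    if (!w->exited)
        return WAIT_NO_DETAILS;
    return w->signaled ? WAIT_SIGNALED : WAIT_EXIT_STATUS;
}
```

Note that a signaled job still has a non-zero `exited` value in this model, which matches the statement that 'exited' indicates availability of information rather than a "normal" ending.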
OK, We will fix the drmaa_wifexited() in GridWay DRMAA according to this.
The confusion might be solvable by a slight reformulation of the first sentences in the drmaa_wif...() descriptions, in order to avoid the word "termination". This would not lead to a change of semantics.
I have no good proposal - DRMAA group ?
(Note: The testsuite assumes here that unusable input files are detected by the DRM before the job starts. This seems to be realistic, since file staging operations are usually not part of the job execution.)
I do not think so. Usually job preparation stages are part of the job execution, for example:
...
Therefore I suggest removing ST_INPUT_FILE_FAILURE, ST_OUTPUT_FILE_FAILURE and ST_ERROR_FILE_FAILURE from the official test suite. In the previous DRMs at least, you can submit a job with output file /etc/passwd or an unusable input file; the job is queued, runs and fails.
During the last phone call, the group went through the code. We agree with your impression that the 3 tests are currently not sufficient. The descriptions for the "input / output / error stream" job template parameters say that an invalid value should result in the job state DRMAA_PS_FAILED - and nothing more. There is no description of what that means for drmaa_wif...() calls, but the testsuite expects a particular behavior. If you look at DRMAA section 2.6, it is clearly shown that DRMAA_PS_FAILED is possible both for queued and running jobs.
Our proposal is to remove the call of drmaa_wifaborted() for ST_INPUT_FILE_FAILURE / ST_ERROR_FILE_FAILURE / ST_OUTPUT_FILE_FAILURE. The drmaa_wait() call does not hurt (since all submitted jobs must be waitable), but the crucial part is the testing for the result of drmaa_synchronize(). After this change, I would expect the test cases to be successful also on your system. In case of malicious input / output / error files, the DRMAA implementation would only be expected to state a job failure. This should work for all GridWay-supported systems, right ? Could you accept this proposal ?
Sure. It makes sense to me also.
There is also a validator in the state diagram (Section 2.6). I am just wondering if a DRMAA implementation could just reject the jobs in these tests at submission with a DRMAA_ERRNO_DENIED_BY_DRM.
-- ****************************************************** * Daniel Templeton UMPK18 x83749 * * Staff Engineer, Sun N1 Grid Engine * ****************************************************** * "What's the sense in never thinkin' 'bout the tomb * * When you're much too busy returning to the womb?" * * -They Might Be Giants * ******************************************************
I totally agree with Peter.
Regards,
Ruben
The "terminated normally" terminology was borrowed from (and attributed to) the POSIX spec for wait3(). Although I'm not enamored with the terminology, I would be opposed to changing the semantics based upon:
[...] "normal" termination might have a completely different meaning in different DRM's
As I understand the proposed text for drmaa_wifexited,
"Evaluates into 'exited' a non-zero value if stat was returned for a ended job that either failed (DRMAA_PS_FAILED) or finished (DRMAA_PS_DONE). [...]"
if a job went directly from the "Queued" state to the "Failed" state (without entering the "Running" state), drmaa_wifexited would output non-zero?
I'd be opposed to ~that~!
It occurred to me that the "ended job" terminology might have been intended to disallow this situation ... but I discarded that thought since "ended job" is not in the job state transition diagram.
-Roger
Arrggh - correct. PS_FAILED could mean both things. What about this text, does it still reflect the original POSIX idea (and good English):
"Evaluates into 'exited' a non-zero value if stat was returned for an ended job that either failed after running or finished after running (see section 2.6). More detailed diagnosis can be provided by means of drmaa_wifsignaled(), drmaa_wtermsig(), drmaa_wexitstatus(), and drmaa_wcoredump(). A zero result for the 'exited' parameter either indicates that 1.) although it is known that the job was running, more information is not available or 2.) that it is not known whether the job was running. In both cases drmaa_wexitstatus() SHALL NOT provide exit status information."
Regards, Peter.
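One way to check the revised wording against Roger's objection is to write it down as a small decision function. This is a sketch of one possible reading, not text from the DRMAA specification: `job_end` and `model_exited` are hypothetical names, and mapping the Queued-to-Failed case to zero is an assumption based on the objection raised in this thread.

```c
#include <assert.h>

/* Hypothetical model of how a job ended; not a DRMAA type. */
typedef enum {
    END_UNKNOWN,           /* not known whether the job was ever running */
    END_FAILED_BEFORE_RUN, /* Queued -> Failed without entering Running */
    END_FAILED_AFTER_RUN,  /* failed after running */
    END_DONE_AFTER_RUN     /* finished after running */
} job_end;

/* Value the 'exited' output would take under this reading of the proposed
   text. info_available is 1 if the DRM can supply more details, else 0. */
int model_exited(job_end end, int info_available)
{
    assert(info_available == 0 || info_available == 1);
    switch (end) {
    case END_FAILED_AFTER_RUN:
    case END_DONE_AFTER_RUN:
        /* The job is known to have run: non-zero only if details exist. */
        return info_available;
    default:
        /* Never ran, or unknown: no exit information may be offered. */
        return 0;
    }
}
```

Under this reading a job that fails straight out of the queue yields zero, so drmaa_wexitstatus() would not be consulted for it.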
participants (6)
- Daniel Templeton
- Hrabri Rajic
- Peter Troeger
- Peter Tröger
- Roger Brobst
- Ruben Santiago Montero