RE: [drmaa-wg] Synchronizing Against Waited Jobs
-----Original Message-----
From: owner-drmaa-wg@ggf.org [mailto:owner-drmaa-wg@ggf.org] On Behalf Of Daniel Templeton
Sent: Wednesday, October 19, 2005 11:37 AM
To: DRMAA Working Group
Subject: [drmaa-wg] Synchronizing Against Waited Jobs
We have found a bug in the SGE DRMAA implementation, (I know! It's shocking!) but Andreas and I can't agree on what the fix should be. The issue is that in the current implementation, synchronizing against jobs that did not come from the current session returns DRMAA_ERRNO_SUCCESS. The part about which we disagree is what should happen when synchronizing against jobs that are from the current session, but that have already ended and have already had drmaa_wait() (or drmaa_synchronize() with dispose=true) called against them.
My stance is that one can extrapolate from the drmaa_wait() function that there is no difference between jobs which don't exist (at all or in the current session) and jobs whose exit information has been disposed (via drmaa_wait() or drmaa_synchronize()). Therefore, calling drmaa_synchronize() on jobs which have already had drmaa_wait() called against them should return DRMAA_ERRNO_INVALID_JOB.
Andreas holds that it can be inferred from the lack of the above statement in the spec, that drmaa_synchronize() handles such jobs differently from drmaa_wait(). Because drmaa_synchronize() does not need the jobs' exit information to succeed, it should be able to operate on jobs whose exit information has already been disposed. Therefore, calling drmaa_synchronize() on jobs which have already had drmaa_wait() called against them should return DRMAA_ERRNO_SUCCESS.
I can agree that Andreas' position makes theoretical sense, but I believe it runs contrary to the stated goal of minimizing the requirements on the implementing DRMS. In order to implement a drmaa_synchronize() that can distinguish between jobs that have been disposed and jobs that never existed, the DRMAA implementation must keep a list of the ids of every job that has ever been submitted in the current session, and with every drmaa_synchronize() call, the list must be searched to validate the synchronize id list. And for what? DRMAA_JOB_IDS_ALL covers every case I can think of where the behavior Andreas described would be useful. To me, it sounds like a lot of extra work for the DRMAA implementation with no tangible benefit.
What Andreas and I can agree on is that if we decide he is right, we will close the bug as "won't fix" because the fix will be worse than the bug. In any case, we should probably have a tracker item to make the final decision explicit in the spec.
What say you, oh, wise ones?
Daniel
--
***************************************************
* Daniel Templeton ERGB01 x60220 *
* Staff Engineer, Sun N1 Grid Engine *
***************************************************
* "So let the sunshine in. Face it with a grin. *
* Smilers never lose, and frowners never win." *
* -Let the Sunshine In, Pebbles Flintstone *
***************************************************
Rajic, Hrabri wrote:
My wig is in dry cleaning. Nevertheless, here is my short take on this.
If an implementation has the job_ids handy, it could conveniently make a good determination of which jobs are invalid (do not exist) and throw DRMAA_ERRNO_INVALID_JOB. IMHO, it is not a big deal if the routine gives imprecise diagnostics if it is forced to do memory garbage collection earlier. The term "quality of implementation" comes to mind, but that quality could come at the expense of being a memory hog, which in turn could lead to paging - quite dubious. See, implementations might handle jobs that did not come from the current session differently, so we could not be precise here either.
The important thing for the user is to synchronize, i.e. to block the program from continuing if there are running remote jobs.
Dispose = true helps get rid of the rusage info to free DRMAA implementations of heavy memory requirements when it matters, so keeping all the past job_ids for providing precise exit errors runs contrary to the goal of lessening memory requirements in the same routine.
My 2 pfennigs,
Hrabri
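The bookkeeping trade-off discussed above can be made concrete with a small toy model. This is only a sketch in Python, not the SGE implementation and not a real DRMAA binding: the class names, the `tracked_ids()` helper and the assumption that every job ends immediately with status 0 are all hypothetical, chosen to keep the example short.

```python
class InvalidJobError(Exception):
    """Stands in for DRMAA_ERRNO_INVALID_JOB in this toy model."""


class ForgetfulSession:
    """Drops every trace of a job once its exit information is disposed."""

    def __init__(self):
        self._exit_info = {}  # job id -> exit status, kept until waited on
        self._next_id = 1

    def submit(self):
        job_id = str(self._next_id)
        self._next_id += 1
        self._exit_info[job_id] = 0  # toy job: ends at once with status 0
        return job_id

    def wait(self, job_id):
        # Reaps the job: returns its exit status and disposes the record.
        if job_id not in self._exit_info:
            raise InvalidJobError(job_id)
        return self._exit_info.pop(job_id)

    def known(self, job_id):
        return job_id in self._exit_info

    def synchronize(self, job_ids):
        # Blocking is omitted; only the id validation is modelled.
        for job_id in job_ids:
            if not self.known(job_id):
                raise InvalidJobError(job_id)

    def tracked_ids(self):
        return len(self._exit_info)


class RememberingSession(ForgetfulSession):
    """Keeps the id of every job ever submitted, so reaped jobs still validate."""

    def __init__(self):
        super().__init__()
        self._ever_submitted = set()  # grows for the lifetime of the session

    def submit(self):
        job_id = super().submit()
        self._ever_submitted.add(job_id)
        return job_id

    def known(self, job_id):
        return job_id in self._ever_submitted

    def tracked_ids(self):
        return len(self._ever_submitted)
```

In this model, a `ForgetfulSession` treats a waited job exactly like one that never existed, while a `RememberingSession` accepts it in `synchronize()` at the price of a set that is never pruned.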
Peter Troeger wrote on 10/21/05 10:21:
I support the argumentation of Hrabri. DRMAA introduced "dispose=true" in the interface, so resource consumption seems to be an issue. If a job was subject to drmaa_wait(), and the data was disposed, nothing should be left in memory about this job. IMHO the job becomes completely unknown to the library after this point.
BTW, this also holds for the current Condor DRMAA implementation. It also follows from the behavior of the underlying Condor system. Once a job has finished, only the log files can tell you what happened. The Condor DRMAA library uses such a log file for each job, and if you execute drmaa_wait(dispose=true), the log file and in-memory structures for the job are removed. Calling drmaa_synchronize() after this results in DRMAA_ERRNO_INVALID_JOB.
Things might be clearer if we had an explicit drmaa_dispose_job() function.
Regards, Peter.
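To illustrate what an explicit drmaa_dispose_job() might change, here is a minimal Python sketch. It is purely hypothetical: no such function is defined in the spec discussed in this thread, and the class, the `record_finished()` test hook and the exact semantics are assumptions made for the example. The idea shown is that waiting and synchronizing only read the job record, and only the explicit dispose call makes the job unknown.

```python
class InvalidJobError(Exception):
    """Stands in for DRMAA_ERRNO_INVALID_JOB in this toy model."""


class ExplicitDisposeSession:
    """Toy session in which only dispose_job() forgets a job (hypothetical API)."""

    def __init__(self):
        self._exit_info = {}  # job id -> exit status of a finished job

    def record_finished(self, job_id, status):
        # Test hook standing in for the DRMS reporting that a job ended.
        self._exit_info[job_id] = status

    def _require_known(self, job_id):
        if job_id not in self._exit_info:
            raise InvalidJobError(job_id)

    def wait(self, job_id):
        # Returns the exit status but keeps the record.
        self._require_known(job_id)
        return self._exit_info[job_id]

    def synchronize(self, job_ids):
        # Blocking is omitted; only the id validation is modelled.
        for job_id in job_ids:
            self._require_known(job_id)

    def dispose_job(self, job_id):
        # The only call that makes the job unknown to the library.
        self._require_known(job_id)
        del self._exit_info[job_id]
```

With this split, the question of what synchronize should do after a wait no longer arises in the model: the job stays valid until the caller disposes it.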
Daniel Templeton wrote:
So, the only person who hasn't weighed in is Roger. Care to offer an opinion?
Daniel
Roger Brobst wrote on 10/24/05 17:15:
My opinion ...
Because a DRMAA implementation is not required to retain information about jobs which have been reaped, drmaa_synchronize should not be required to distinguish between nonexistent and reaped jobs.
A drmaa_synchronize implementation should return DRMAA_ERRNO_INVALID_JOB if a provided jobID is unrecognized.
If a drmaa_synchronize implementation successfully validates a jobID for a reaped job, it may return DRMAA_ERRNO_SUCCESS.
-Roger
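Roger's rule leaves room for implementations that remember some reaped jobs and for ones that remember none. The following Python sketch shows one way that latitude could look; it is a toy model under stated assumptions, not any existing DRMAA library. The class name, the `remember_reaped` parameter and the `add_job()`/`reap()` hooks are hypothetical, and the return values are plain strings that merely echo the error names used in this thread.

```python
from collections import deque

SUCCESS = "DRMAA_ERRNO_SUCCESS"
INVALID_JOB = "DRMAA_ERRNO_INVALID_JOB"


class BoundedMemorySession:
    """Remembers at most `remember_reaped` ids of jobs that were already reaped."""

    def __init__(self, remember_reaped=0):
        self._live = set()
        # A deque with maxlen silently evicts the oldest id when full;
        # maxlen=0 means reaped ids are never remembered at all.
        self._reaped = deque(maxlen=remember_reaped)

    def add_job(self, job_id):
        self._live.add(job_id)

    def reap(self, job_id):
        # Models drmaa_wait() or a disposing synchronize on a finished job.
        self._live.discard(job_id)
        self._reaped.append(job_id)

    def synchronize(self, job_ids):
        # Blocking on live jobs is omitted; only the id validation is modelled.
        for job_id in job_ids:
            if job_id not in self._live and job_id not in self._reaped:
                return INVALID_JOB  # unrecognized id
        return SUCCESS  # every id was validated, reaped or not
```

Both `BoundedMemorySession(0)` and `BoundedMemorySession(2)` follow the rule as worded above: an id the session cannot recognize yields the invalid-job result, and a reaped id yields success only when the session still happens to be able to validate it.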
Daniel Templeton wrote:
That sounds like a majority to me. I will submit a tracker using Roger's lovely summary of the correct behavior.
Daniel
participants (4)
- Daniel Templeton
- Peter Troeger
- Rajic, Hrabri
- Roger Brobst