Hi!

The DRMAA test suite (from the Condor-ext project) tests for this:

- drmaa_control() with no jobs in the session but with the SESSION_ALL argument succeeds,
- drmaa_wait()/drmaa_synchronize() with no jobs in the session but with the SESSION_ANY/SESSION_ALL argument returns an INVALID_JOB error.

As far as I remember, there is nothing about this in the specification, and I see a small discrepancy: SGE's implementation fails the first test.

-- Piotr Domagalski
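The two behaviours the test suite checks can be restated as a toy model. This is only an illustrative Python sketch, assuming a hypothetical in-memory `ToySession` class (not a real DRMAA binding) and an `InvalidJobError` exception standing in for the INVALID_JOB error. Passing SESSION_ALL directly to `synchronize()` is a simplification made for this sketch.

```python
SESSION_ANY = "SESSION_ANY"
SESSION_ALL = "SESSION_ALL"


class InvalidJobError(Exception):
    """Stands in for the INVALID_JOB error in this toy model."""


class ToySession:
    """Hypothetical in-memory session; not a real DRMAA binding."""

    def __init__(self):
        self.jobs = {}  # job id -> last state set by submit() or control()

    def submit(self, job_id):
        self.jobs[job_id] = "running"

    def control(self, job_id, action):
        # SESSION_ALL applies the action to every job in the session.
        # With an empty session there is nothing to do, so the call succeeds.
        targets = list(self.jobs) if job_id == SESSION_ALL else [job_id]
        for target in targets:
            if target not in self.jobs:
                raise InvalidJobError(target)
            self.jobs[target] = action
        return len(targets)

    def wait(self, job_id):
        # SESSION_ANY with an empty session can never be satisfied.
        if job_id == SESSION_ANY:
            if not self.jobs:
                raise InvalidJobError("no jobs in session")
            job_id = next(iter(self.jobs))
        if job_id not in self.jobs:
            raise InvalidJobError(job_id)
        del self.jobs[job_id]  # a reaped job leaves the session
        return job_id

    def synchronize(self, job_ids):
        # SESSION_ALL with an empty session is reported the same way.
        if job_ids == SESSION_ALL:
            if not self.jobs:
                raise InvalidJobError("no jobs in session")
            job_ids = list(self.jobs)
        for job_id in job_ids:
            self.wait(job_id)
```

In this model, `ToySession().control(SESSION_ALL, "suspend")` returns without error, while `wait(SESSION_ANY)` and `synchronize(SESSION_ALL)` on the same empty session raise `InvalidJobError`.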
Piotr,

The first one is addressed by tracker 1683:
https://forge.gridforum.org/sf/go/artf2796?nav=1

The second is addressed by tracker 1400:
https://forge.gridforum.org/sf/go/artf2798?nav=1

Peter,

Looking through the IDL spec, it says that drmaa_wait(ANY) will only work on jobs submitted up to the time of the drmaa_wait() call. I don't like that. For drmaa_synchronize(ALL), it makes sense, because otherwise the call would block indefinitely in an active system. With drmaa_wait(), however, that change prevents a very useful use case. Say I want to write a thread that waits for jobs to end and places their finish information in a data structure for other threads to read. With that caveat applied, if I submit one very long-running job before drmaa_wait() gets called, the hundreds of really short jobs that I submit after the drmaa_wait() call have to wait for the long-running job to end so that the next call to drmaa_wait() can see them. That's bad, and I don't see where it makes anything better. What problem does limiting drmaa_wait() to previously submitted jobs solve?

Hrabri,

The 0.35 IDL spec is *still* mislinked from drmaa.org. The right link is:
https://forge.gridforum.org/sf/docman/do/downloadDocument/projects.drmaa-wg/...

Daniel
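The long-job scenario Daniel describes above can be made concrete with a small, thread-free simulation. This is a sketch under stated assumptions: `reap_times` is a hypothetical helper, jobs are reduced to (submit time, finish time) pairs, and a single waiter calls wait(ANY) in a loop. The `snapshot` flag selects between the call-time reading of the IDL text and the behaviour Daniel prefers.

```python
def reap_times(jobs, snapshot):
    """Simulate one waiter looping on wait(ANY).

    jobs maps a job name to (submit_time, finish_time).
    Returns a dict mapping each job name to the time it is reaped.
    """
    pending = dict(jobs)
    reaped = {}
    now = 0.0
    while pending:
        if snapshot:
            # Call-time semantics: only jobs already submitted are visible.
            visible = {n: j for n, j in pending.items() if j[0] <= now}
        else:
            # Live semantics: jobs submitted while wait() blocks count too.
            visible = pending
        if not visible:
            # Nothing to wait for yet; retry once the next job is submitted.
            now = min(submit for submit, _ in pending.values())
            continue
        # wait(ANY) returns for whichever visible job finishes first.
        name = min(visible, key=lambda n: visible[n][1])
        now = max(now, visible[name][1])
        reaped[name] = now
        del pending[name]
    return reaped


# One long job submitted first, then short jobs submitted after wait() starts.
JOBS = {"long": (0, 100), "short1": (1, 2), "short2": (2, 3)}
LIVE = reap_times(JOBS, snapshot=False)
SNAPSHOT = reap_times(JOBS, snapshot=True)
```

With these made-up numbers, the live variant reaps `short1` at time 2, whereas the snapshot variant cannot report it until the long job ends at time 100.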
Hrabri,
The 0.35 IDL spec is *still* mislinked from drmaa.org. The right link is:
https://forge.gridforum.org/sf/docman/do/downloadDocument/projects.drmaa-wg/...
As everybody agreed to have the (updated) main page as a wiki page, I prepared a first version:
http://www.drmaa.org/wiki/index.php/home

I cleaned up the content a little and updated all relevant links. Please make your changes (it's a wiki!) and inform the list when you are done. When everybody is happy, we can switch to the new start page.

Regards,
Peter.
Just a reminder: I created a wiki page that could act as the new drmaa.org start page (see below). I need your comments and additions.

Thanks,
Peter.

Begin forwarded message:

From: Peter Troeger <peter.troeger@hpi.uni-potsdam.de>
Date: 23 June 2006 11:23:44 CEST
To: Daniel Templeton <Dan.Templeton@Sun.COM>
Cc: drmaa-wg@gridforum.org
Subject: Re: [drmaa-wg] DRMAA test suite
Hrabri,
The 0.35 IDL spec is *still* mislinked from drmaa.org. The right link is:
https://forge.gridforum.org/sf/docman/do/downloadDocument/projects.drmaa-wg/docman.root.ggf_14/doc5555/2
As everybody agreed to have the (updated) main page as a wiki page, I prepared a first version:
http://www.drmaa.org/wiki/index.php/home
I cleaned up the content a little and updated all relevant links. Please make your changes (it's a wiki!) and inform the list when you are done. When everybody is happy, we can switch to the new start page.
Regards, Peter.
Looking through the IDL spec, it says that drmaa_wait(ANY) will only work on jobs submitted up to the time of the drmaa_wait() call. I don't like that. For drmaa_synchronize(ALL), it makes sense, because otherwise the call would block indefinitely in an active system. With drmaa_wait(), however, that change prevents a very useful use case. Say I want to write a thread that waits for jobs to end and places their finish information in a data structure for other threads to read. With that caveat applied, if I submit one very long-running job before drmaa_wait() gets called, the hundreds of really short jobs that I submit after the drmaa_wait() call have to wait for the long-running job to end so that the next call to drmaa_wait() can see them. That's bad, and I don't see where it makes anything better. What problem does limiting drmaa_wait() to previously submitted jobs solve?
We had so much discussion around the drmaa_wait semantics that I am not sure what the exact reason was. To me, it seems like the same argument as with drmaa_synchronize. The drmaa_wait() call relies on some current state of all the jobs in the session. I know that I submitted 3 jobs so far, and now I want to wait for all of them. If we allow other threads to extend the session while drmaa_wait() is running, you need to clarify the point of synchronization within the running drmaa_wait() call. It's harder to implement.

In your particular example, my expectation would be that the second thread also calls drmaa_wait() in parallel. In this case, our modified text from the latest DRMAA doc can be applied:

-- snip
In a multithreaded environment, only the active thread gets the status of the finished or failed job in that case, while the rest of the threads continue waiting. If there are no more running or completed jobs the routine SHOULD return DRMAA_ERRNO_INVALID_JOB error.
-- snip

We can summarize that drmaa_wait(SESSION_ANY) is always a bad idea when multiple threads are submitting jobs. In order to get a consistent picture, it seems appropriate to define the function call as a "synchronization point", where the session state "at this time" acts as input to the method.

Peter.
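The quoted spec text (each finished job goes to exactly one waiting thread, and waiters get an error once nothing is left) can be illustrated with a condition variable. This is a minimal sketch, assuming a hypothetical `SharedSession` class and an `InvalidJobError` exception in place of DRMAA_ERRNO_INVALID_JOB; it does not reproduce any real implementation.

```python
import threading


class InvalidJobError(Exception):
    """Stands in for DRMAA_ERRNO_INVALID_JOB in this sketch."""


class SharedSession:
    """Hypothetical session shared by several threads calling wait(ANY)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._running = set()
        self._finished = []  # (job_id, exit_status) not yet reaped

    def submit(self, job_id):
        with self._cond:
            self._running.add(job_id)

    def finish(self, job_id, exit_status):
        # Simulates the resource manager reporting that a job ended.
        with self._cond:
            self._running.discard(job_id)
            self._finished.append((job_id, exit_status))
            self._cond.notify_all()

    def wait_any(self):
        with self._cond:
            while True:
                if self._finished:
                    # Only the thread that pops the entry sees this job.
                    return self._finished.pop(0)
                if not self._running:
                    raise InvalidJobError("no running or completed jobs")
                self._cond.wait()  # the other threads keep waiting


def demo(n_threads=3):
    """Run n_threads waiters against two jobs and collect the outcomes."""
    session = SharedSession()
    for job_id in ("a", "b"):
        session.submit(job_id)
    outcomes = []
    outcomes_lock = threading.Lock()

    def worker():
        try:
            result = session.wait_any()
        except InvalidJobError:
            result = "INVALID_JOB"
        with outcomes_lock:
            outcomes.append(result)

    threads = [threading.Thread(target=worker) for _ in range(n_threads)]
    for thread in threads:
        thread.start()
    session.finish("a", 0)
    session.finish("b", 0)
    for thread in threads:
        thread.join()
    return outcomes
```

With three waiters and two jobs, two threads each receive one job's result and the third receives the error, whichever order the threads happen to run in.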
Peter,

I know we discussed this at length already, but I remember the discussion being about synchronize(). Sorry to beat dead horses, but this one seems like a real handicap to me. Basically, you're saying that wait(), a very important element of the API, isn't useful in an MT environment. I've never seen a DRMAA app, other than the most trivial example code, that doesn't call wait(). The logical conclusion, then, is that DRMAA isn't useful in an MT environment.

I think you misunderstood my example. Say I want to build an event mechanism around jobs finishing. The most sensible way to do that is to dedicate a thread to doing wait() calls. That thread would then put the JobInfo objects in some common data cache and send out an event notifying the other threads that a new JobInfo object is available. (Or it could send the job exit info in the event; same difference.) The proposed semantics for wait() effectively prevent this use case. I would basically have to have my wait thread set a short time limit on the wait() call so that it can keep the wait() context current, effectively turning it into polling instead of a blocking procedure.

As for being more difficult to implement, I completely disagree. In the SGE implementation, wait(ANY) just waits for a job finish event from the qmaster. To implement the proposed semantics, I'd have to take a snapshot of the current job list and compare incoming events to that list.

For synchronize(), keeping the call-time context makes perfect sense. I can see where one could strictly argue that if it's good for synchronize(), it's good for wait(), but we need to be a little pragmatic. Again, I ask the question, "what problem does it solve?" The behavior of the proposed semantics can be duplicated in several ways using the current semantics, but I don't see how to reasonably get the current behavior from the proposed semantics.

Also, keep in mind that you're changing the behavior of a core routine in a way that affects some basic use cases. Not good. The up-side is that we're talking about the IDL spec, and not the DRMAA 1.0 spec, so we still have room to make changes. And, in case it wasn't clear from my tirade, the SGE DRMAA implementations do not limit the context of wait(), so the proposed semantics will be a change (for the worse) for SGE users.

Daniel
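The event mechanism Daniel describes, with one thread dedicated to wait() calls that fills a shared cache and notifies the other threads, might look roughly like this. It is a sketch only: `FakeSession`, `wait_any`, `run_waiter` and the `None` stop sentinel are all hypothetical names invented for illustration, and the session is simulated with a queue rather than a real DRMAA library.

```python
import queue
import threading


class FakeSession:
    """Hypothetical stand-in for a DRMAA session, backed by a queue."""

    def __init__(self):
        self._finished = queue.Queue()

    def finish(self, job_id, exit_status):
        # Simulates a job ending, no matter when it was submitted.
        self._finished.put((job_id, exit_status))

    def close(self):
        # Sentinel used only by this sketch to stop the waiter thread.
        self._finished.put(None)

    def wait_any(self):
        # Blocks until any job finishes, including jobs submitted after
        # the call started (the behaviour Daniel argues for).
        return self._finished.get()


def run_waiter(session, cache, on_finished):
    """Body of the dedicated thread: reap jobs and notify listeners."""
    while True:
        info = session.wait_any()
        if info is None:
            break
        job_id, exit_status = info
        cache[job_id] = exit_status  # the common data cache
        on_finished(job_id)  # the event sent to the other threads


def demo():
    session = FakeSession()
    cache, events = {}, []
    waiter = threading.Thread(
        target=run_waiter, args=(session, cache, events.append)
    )
    waiter.start()
    session.finish("job-1", 0)
    session.finish("job-2", 1)
    session.close()
    waiter.join()
    return cache, events
```

Because the waiter blocks inside `wait_any()` with no time limit, it needs no polling loop. Under call-time semantics the same thread would have to re-enter wait() periodically to notice newly submitted jobs.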
I know we discussed this at length already, but I remember the discussion being about synchronize(). Sorry to beat dead horses, but this one seems like a real handicap to me. Basically, you're saying that wait(), a very important element of the API, isn't useful in an MT environment. I've never seen a DRMAA app, other than the most trivial example code, that doesn't call wait(). The logical conclusion, then, is that DRMAA isn't useful in an MT environment.
Full stop. Reading my own text again, I realize that I mixed up the ANY and ALL semantics of wait() and synchronize(). We are talking about wait(ANY), which returns one result, and synchronize(ALL), which depends on all jobs in the session. My argument was that the session can change during the wait/sync operation, which is bad for the sync(ALL) operation but not a huge problem for the wait(ANY) case. Sorry for wasting your time ...
For synchronize(), keeping the call-time context makes perfect sense. I can see where one could strictly argue that if it's good for synchronize(), it's good for wait(), but we need to be a little pragmatic. Again, I ask the question, "what problem
What I really wanted to ensure is that the operation with the SESSION_ALL argument has call-time context. I understand that you completely agree with this, so we are done.
The up-side is that we're talking about the IDL spec, and not the DRMAA 1.0 spec, so we have room to make changes still. And, in case it wasn't clear from my tirade, the SGE DRMAA implementations do not limit the context of wait(), so the proposed semantics will be a change (for the worse) for SGE users.
I fear that the misleading text in the IDL spec resulted from the same mix-up. I need a vacation ;-)

Sorry,
Peter.

P.S.: Who will hold the pencil for the upcoming work on the IDL spec? Currently, the latest version is on my hard disk.
participants (4)
- Andreas.Haas@Sun.COM
- Daniel Templeton
- Peter Troeger
- Piotr Domagalski