Interesting case, Mariusz.

It could be an LSF bug or an implementation difficulty (maybe they don't check suspended jobs against limits, because suspended jobs do not need resources). It would be clearer if you could construct a case where the job has a runtime of N seconds: suspend it immediately after it starts, then resume it after N seconds. The question is then whether the resumed job runs for another N seconds or is deleted immediately. Using the sleep binary itself could also be problematic since, AFAIK, it sets a timer and suspends itself.

Cheers,
Daniel

On 13.04.2011 at 17:59, Mariusz Mamoński wrote:
2011/4/6 Peter Tröger <peter@troeger.eu>:
Participants: Daniel Gruber, Mariusz Mamonski, Andre Merzcy, Peter Tröger.

Organizational aspects:
- Oracle bridge is no longer available for us
- Skype conference call worked fine, we continue like this
- Daniel will check for possibilities with Univa
- If US participants are still missing next week, we will move to a more Europe-friendly time slot

DRMAAv2 Draft 2:
- Decision to remove last sentence in line 101
- Boolean UNSET mapping should also be part of the language binding
  - Example from Andre: Struct might map to dictionary, which can just leave out keys in case of UNSET
- Discussion about throwing out IRIX / TRUE64, not accepted since this enumeration was already heavily discussed
- Line 182, add CRAY: rejected, we are not aware of any relevant DRM system available on CRAY; it's also not an operating system
- Line 198: Question about POWER, turned out that POWER is a subset of the PPC instruction set architecture, so the current solution is fine
- Section 4.2: Discussion about adding GPU support
  - There are no good standards for GPU instruction set architectures, so having abstract GPU type definitions would be hard
  - Current DRM system support is also mostly based on targeting some Linux host with specialized resource demand formulations
  - This is solved way better with job categories
- Line 246: Comparison of wall clock time definitions in several DRM systems
  - Weak agreement on defining it as time in RUNNING state plus time in SUSPENDED state (ok for Condor, Grid Engine)
  - Mariusz still tries to find an example where SUSPENDED state is not included

Found! ;-) Platform LSF. I did the following experiment:
1. submitted job with WALLCLOCK time limit 1 min:
$ bsub -W 00:01 sleep 600 # 10 min sleep
Job <114> is submitted to default queue <medium_priority>.

... the job gets killed on reaching the wallclock time limit ...
$ bjobs -l 114
...
Wed Apr 13 14:56:55: Completed <exit>; TERM_RUNLIMIT: job killed after reaching LSF run time limit.
2. submitted job with WALLCLOCK time limit 1 min:
$ date
Wed Apr 13 14:32:52 BST 2011
$ bsub -W 00:01 sleep 600
Job <113> is submitted to default queue <medium_priority>.

$ bstop 113
Job <113> is being stopped
... after some time...
$ date
Wed Apr 13 14:55:16 BST 2011
$ bjobs
JOBID USER    STAT  QUEUE      FROM_HOST EXEC_HOST JOB_NAME  SUBMIT_TIME
113   mpiuser USUSP medium_pri x7500     ex-9-0    sleep 600 Apr 13 14:33

$ bresume 113
Job <113> is being resumed
The job finished immediately (sleep counts the time during which the process was suspended).
$ bjobs -l 113
...
Wed Apr 13 14:33:09: Started on <ex-9-0>, Execution Home </home/mpiuser>, Execution CWD </home/mpiuser>;
Wed Apr 13 14:55:35: Done successfully. The CPU time used is 0.0 seconds.
As you can see, the job spent more than 12 min in the SUSPENDED + RUNNING states, well above the wallclock time limit of 1 min, and was still not killed.
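The two wall clock definitions being compared here can be made concrete with a small sketch. It is only an illustration, not anything LSF or DRMAA provides: the function `wallclock_seconds` and the state names are hypothetical, and the suspend and resume instants in the timeline are assumed values, since the logs above only show the start time (14:33:09) and the end time (14:55:35).

```python
from datetime import datetime


def wallclock_seconds(transitions, include_suspended):
    """Sum the time a job spent in RUNNING (and optionally SUSPENDED).

    transitions: chronological list of (timestamp, state) pairs; each state
    lasts until the next timestamp. The last entry only marks the job's end.
    """
    counted = {"RUNNING"}
    if include_suspended:
        counted.add("SUSPENDED")
    total = 0.0
    for (start, state), (end, _next_state) in zip(transitions, transitions[1:]):
        if state in counted:
            total += (end - start).total_seconds()
    return total


def t(hms):
    """Parse an HH:MM:SS time of day (the date is irrelevant here)."""
    return datetime.strptime(hms, "%H:%M:%S")


# Timeline loosely modelled on job 113.
timeline = [
    (t("14:33:09"), "RUNNING"),    # start time, taken from bjobs -l
    (t("14:33:20"), "SUSPENDED"),  # ASSUMED instant of bstop
    (t("14:55:30"), "RUNNING"),    # ASSUMED instant of bresume
    (t("14:55:35"), "DONE"),       # end time, taken from bjobs -l
]

limit_seconds = 60  # bsub -W 00:01
running_only = wallclock_seconds(timeline, include_suspended=False)
with_suspended = wallclock_seconds(timeline, include_suspended=True)

print("RUNNING only:        ", running_only, "s")
print("RUNNING + SUSPENDED: ", with_suspended, "s")
print("limit exceeded (RUNNING only):       ", running_only > limit_seconds)
print("limit exceeded (RUNNING + SUSPENDED):", with_suspended > limit_seconds)
```

Under these assumed timestamps, only the RUNNING + SUSPENDED definition exceeds the 1 min limit, which is consistent with the job not having been killed if LSF counts RUNNING time alone.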
  - Final decision next week, especially on whether inclusion of SUSPENDED is marked as "MAY" or "MUST"
- Line 249, question by Daniel Katz: Yes, this is a standard feature, e.g. for advance reservation support. Add note in the rationale section.
- Line 272: Remove first sentence, since this violates the "opaque concept" statement in the next sentence.
- Line 277: New proposal by Mariusz - replace "maxWallclockTime" with a generic dictionary for queue attributes
  - Would allow to report DRM-specific properties of a queue, in the same opaque sense as the queue name
  - Only helpful for portal case, should not be the base for programmatic decisions
  - No clear decision, deferred to next week

The next conference call with Skype will take place in one week (Apr 13th, 19:00 UTC).

Best regards,
Peter.
On 04.04.2011 at 00:28, Peter Tröger wrote:
Dear all,
the next DRMAA conf call is scheduled for Apr 6th, 19:00 UTC. The phone conference line is sponsored by Oracle:

Phone number (toll-free from US): +001-866-545-5227
Access code: 5988285

The conference bridge MAY no longer work (Dan?); in this case, we will organize something based on Skype.

Preliminary meeting agenda:

1. Meeting secretary for this meeting?
2. Latest updates from the participants
3. Solving the remaining issues in DRMAAv2 Draft 2 (see attachment)

The attached draft update already incorporates the comments from Andre Merzcy and Daniel S. Katz. Thanks for their input!

Best regards,
Peter.
<drmaav2_draft2_annotated.pdf> -- drmaa-wg mailing list drmaa-wg@ogf.org http://www.ogf.org/mailman/listinfo/drmaa-wg
-- Mariusz