Dear all,
after a very productive face-to-face meeting in Potsdam, we ended up with the new draft 6 of the DRMAAv2 spec. Please find attached the document. I would like to thank Mariusz, Daniel G. and Andre Merzky for investing their time and effort.
The good news is that we were able to clarify all pending functional issues. We are now in a sanity check phase, where the text itself gets some proof-reading to find inconsistencies.
Since at least three group members are now reading and editing, I will drop the call for this week. If no serious (I mean *really* serious) things are found, we will wrap up in a couple of days and perform the official "last call" for comments on the list.
Besides that, we started some initial debate on the C binding. Please understand that this discussion will go public only after the IDL spec has been submitted, in order to avoid redundant efforts.
Best regards, Peter.
Hi, 2011/6/21 Peter Tröger <peter@troeger.eu>:
Result of my proof-reading (most of them fortunately minor ;-)
line 19: "The scope is limited to job submission, job control, and retrival.." -> "The scope is limited to job submission, job control, reservation management and retrival..."
line 100: act as execution -> act as an execution
line 206: mention JobTemplate
line 255: if possible I would add the following statement; it does not change anything but draws the reader's attention to an important concept of DRM systems. "It is worth to mention that the WALLCLOCK_TIME in most of the DRM systems is not only a resource limit but also a key job attribute taken into account in the scheduling process"
line 291: line runs past the margin
line 364: missing "\"
line 382: Maybe we should clarify that the value should eventually be normalized, e.g.: "The load value MUST be always within the <0;1> range (inclusive). The value 0 should indicate that machine is idling, while the 1 that all computing units are used"
line 481: as the JobSubState is an opaque object, passing "sub-state is not suported by the impl.." may simply lead to a SEG FAULT ;-) so filtering using sub-state should be permitted if one knows which implementation is used.
line 513: "The accumulated CPU time" -> "The accumulated, over all job's processes, CPU time" (just a proposition)
line 686: expressed by the expressed by -> expressed by
line 762: "being allowed on one machine" -> "being allowed to run" (@see maxSlots)
line 863: a Uns.. -> an Unsup...
line 878: missing space after "support"
line 890: missing space after "reservation."
line 895: missing space after "machines."
line 1069: Should we state that it is enough that session names must be unique for tuple (DRMS,user)?
line 1097: Should we explicitly mention when one can call destroySession? If yes, I would propose "only for not opened session".
line 1183: sessionName can also be generated by the implementation...
line 1374: what about job objects returned in the monitoring session? Which session should be referred to then?
line 1384: maybe we should warn here that this operation might not be atomic.
footnote 39: "start and time" -> "start and end time"
line 1837: poznan -> poznan.pl
Cheers, -- Mariusz
Hi Mariusz, some comments inlined :-) Cheers, Andre. 2011/6/23 Mariusz Mamoński <mamonski@man.poznan.pl>:
"The load value MUST be always within the <0;1> range (inclusive). The value 0 should indicate that machine is idling, while the 1 that all computing units are used"
Sounds sensible to me, although I have often seen load values >1, mostly indicating that a machine is overloaded. You may want to change the MUST into a SHOULD thus?
line 1069: Should we state that is enough that session names must be unique for tuple (DRMS,user)
line 1097: Should we explicitly mention when one can call the destroySession ? If yes i would propose "only for not opened session".
These two items together imply that it is an error if I open a session in one application instance, and destroy it in another instance which runs at the same time. Which instance will show the error? Both? How is synchronization done? The fundamental problem seems to be that the spec introduces stateful sessions which do not (necessarily) have any state management in the backend. If your library itself is maintaining the state, you will introduce race conditions. Cheers, Andre. -- Nothing is ever easy...
2011/6/23 Andre Merzky <andre@merzky.net>:
Hi Mariusz,
some comments inlined :-)
Cheers, Andre.
2011/6/23 Mariusz Mamoński <mamonski@man.poznan.pl>:
"The load value MUST be always within the <0;1> range (inclusive). The value 0 should indicate that machine is idling, while the 1 that all computing units are used"
Sounds sensible to me, although I have often seen load values >1, mostly indicating that a machine is overloaded. You may want to change the MUST into a SHOULD thus?
I basically wanted to avoid the situation where this value is "number of cores specific" ;-)
line 1069: Should we state that is enough that session names must be unique for tuple (DRMS,user)
line 1097: Should we explicitly mention when one can call the destroySession ? If yes i would propose "only for not opened session".
These two items together imply that it is an error if I open a session in one application instance, and destroy it in another instance which runs at the same time. Which instance will show the error? Both? How is synchronization done?
I think opening the same session **concurrently** in two applications falls into "invalid usage".
The fundamental problem seems to be that the spec introduces stateful sessions which do not (necessarily) have any state management in the backend. If your library itself is maintaining the state, you will introduce race conditions.
Cheers, Andre.
-- Nothing is ever easy...
-- Mariusz
2011/6/23 Mariusz Mamoński <mamonski@man.poznan.pl>:
2011/6/23 Andre Merzky <andre@merzky.net>:
Hi Mariusz,
line 1069: Should we state that is enough that session names must be unique for tuple (DRMS,user)
line 1097: Should we explicitly mention when one can call the destroySession ? If yes i would propose "only for not opened session".
These two items together imply that it is an error if I open a session in one application instance, and destroy it in another instance which runs at the same time. Which instance will show the error? Both? How is synchronization done?
I think opening the same session **concurrently** in two applications falls into "invalid usage".
Then that needs to be documented in the spec. FWIW, this will be very hard on the end user. For example, tool developers who build tools upon DRMAA have no control over how the tools are used, and how instances are synchronized. This will be particularly difficult as sessions are supposed to be persistent, and thus are *supposed* to be used (i.e. opened) in different application instances. I don't see a better solution - just saying. I guess in the end this will only really work if the DRM system can support the session state's persistence... Cheers, Andre.
-- Nothing is ever easy...
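The "invalid usage" case discussed above (the same session opened concurrently by two application instances) could at least be detected by a library that keeps session state itself. The following is only a minimal sketch under stated assumptions, not anything taken from the DRMAAv2 draft: the names `open_session`, `close_session`, `lock_path` and `SessionBusyError` are hypothetical, the state directory is assumed to be on a local file system, and the lock is keyed by the (DRMS, user, session name) tuple mentioned for line 1069.

```python
import os
import tempfile


class SessionBusyError(RuntimeError):
    """Raised when the named session is already open in another instance."""


def lock_path(state_dir, drms, user, session_name):
    # Hypothetical naming scheme: one lock file per (DRMS, user, session name).
    return os.path.join(state_dir, f"{drms}.{user}.{session_name}.lock")


def open_session(state_dir, drms, user, session_name):
    """Take an exclusive lock for the session, or fail immediately."""
    path = lock_path(state_dir, drms, user, session_name)
    try:
        # O_CREAT | O_EXCL fails if the file already exists, so only one
        # caller can create the lock file.
        fd = os.open(path, os.O_CREAT | os.O_EXCL | os.O_WRONLY)
    except FileExistsError:
        raise SessionBusyError(f"session {session_name!r} is already open") from None
    os.write(fd, str(os.getpid()).encode())  # record the owner for debugging
    os.close(fd)
    return path


def close_session(path):
    """Release the lock so that a later run can reopen the session."""
    os.remove(path)


if __name__ == "__main__":
    state_dir = tempfile.mkdtemp()
    lock = open_session(state_dir, "some-drms", "alice", "nightly")
    try:
        open_session(state_dir, "some-drms", "alice", "nightly")
    except SessionBusyError as err:
        print("second open rejected:", err)
    close_session(lock)
```

A lock file like this only answers "which instance will show the error" (the second opener); it does not survive crashes cleanly, since a killed process leaves a stale lock behind. That limitation is one reason why state management in the DRM system itself, as raised above, is the more robust option.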
2011/6/24 Andre Merzky <andre@merzky.net>:
2011/6/23 Mariusz Mamoński <mamonski@man.poznan.pl>:
2011/6/23 Andre Merzky <andre@merzky.net>:
Hi Mariusz,
line 1069: Should we state that is enough that session names must be unique for tuple (DRMS,user)
line 1097: Should we explicitly mention when one can call the destroySession ? If yes i would propose "only for not opened session".
These two items together imply that it is an error if I open a session in one application instance, and destroy it in another instance which runs at the same time. Which instance will show the error? Both? How is synchronization done?
I think opening the same session **concurrently** in two applications falls into "invalid usage".
Then that needs to be documented in the spec.
FWIW, this will be very hard on the end user. For example, tool developers who build tools upon DRMAA have no control over how the tools are used, and how instances are synchronized. This will be particularly difficult as sessions are supposed to be persistent, and thus are *supposed* to be used (i.e. opened) in different application instances.
This is still possible, but sequentially, not concurrently, and I think it serves most of the use cases. I guess it typically would be the same application but a different run. I think one of the ideas behind introducing the restartable session concept in DRMAA 2.0 was that in DRMAA 1.0 you had to (in theory) keep the application running as long as you had some job in the system.
I don't see a better solution - just saying. I guess at the end this will only really work if the DRM system can support the session state's persistence...
Cheers, Andre.
-- Mariusz
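The sequential use case described above (same application, different run) can be illustrated with a small sketch. It assumes the library, not the DRM system, persists the session state; `save_session` and `load_session` are hypothetical helpers and the JSON layout is invented for illustration, not part of any DRMAA binding.

```python
import json
import os
import tempfile


def save_session(path, session_name, job_ids):
    """Write the session state so that a later run can pick it up."""
    tmp = path + ".tmp"
    with open(tmp, "w", encoding="utf-8") as f:
        json.dump({"name": session_name, "jobs": list(job_ids)}, f)
    # Replace the old state in one step so readers never see a partial file.
    os.replace(tmp, path)


def load_session(path):
    """Return (session_name, job_ids) as saved by an earlier run."""
    with open(path, encoding="utf-8") as f:
        state = json.load(f)
    return state["name"], state["jobs"]


if __name__ == "__main__":
    state_file = os.path.join(tempfile.mkdtemp(), "nightly.json")
    # First run: submit jobs, remember their ids, then exit.
    save_session(state_file, "nightly", ["101", "102"])
    # Later run: reopen the session and continue monitoring the same jobs.
    print(load_session(state_file))
```

This covers only runs that follow one another; it does nothing to coordinate two instances that are active at the same time.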
Hi again, 2011/6/24 Mariusz Mamoński <mamonski@man.poznan.pl>:
2011/6/24 Andre Merzky <andre@merzky.net>:
FWIW, this will be very hard on the end user. For example, tool developers who build tools upon DRMAA have no control over how the tools are used, and how instances are synchronized. This will be particularly difficult as sessions are supposed to be persistent, and thus are *supposed* to be used (i.e. opened) in different application instances.
This is still possible, but sequentially, not concurrently, and I think it serves most of the use cases. I guess it typically would be the same application but a different run. I think one of the ideas behind introducing the restartable session concept in DRMAA 2.0 was that in DRMAA 1.0 you had to (in theory) keep the application running as long as you had some job in the system.
Yes, I agree that this is the most interesting use case. Best, Andre. -- Nothing is ever easy...
"Load-value normalizing" : On 23.06.2011 at 23:48, Andre Merzky wrote:
Hi Mariusz,
some comments inlined :-)
Cheers, Andre.
2011/6/23 Mariusz Mamoński <mamonski@man.poznan.pl>:
"The load value MUST be always within the <0;1> range (inclusive). The value 0 should indicate that machine is idling, while the 1 that all computing units are used"
Sounds sensible to me, although I have often seen load values >1, mostly indicating that a machine is overloaded. You may want to change the MUST into a SHOULD thus?
I disagree! We agreed that the value "is similar to the uptime" command. Load values can indeed be bigger than 1 because they measure the number of "runnable" processes on average. There is no need to artificially normalize the value because the maximum is unknown. We should take whatever the DRM is reporting to us, and this is similar to the uptime command (and, by the way, also depends on the number of cores). This is what we agreed on. Daniel
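To make the two positions concrete, here is a small sketch contrasting the raw, uptime-style load value with the two normalizations that were floated. The helper names `load_per_core` and `clamp_unit` are hypothetical; `os.getloadavg()` and `os.cpu_count()` are standard-library calls, and `os.getloadavg()` is only available on Unix-like systems.

```python
import os


def load_per_core(load, cores):
    """Divide an uptime-style load value by the core count (no clamping)."""
    if cores < 1:
        raise ValueError("cores must be >= 1")
    return load / cores


def clamp_unit(value):
    """Force a value into [0, 1]; this discards overload information."""
    return max(0.0, min(1.0, value))


if __name__ == "__main__":
    one_minute, _, _ = os.getloadavg()  # same figure that uptime reports
    cores = os.cpu_count() or 1
    per_core = load_per_core(one_minute, cores)
    print("raw load:        ", one_minute)
    print("per-core load:   ", per_core)
    print("clamped to [0,1]:", clamp_unit(per_core))
```

With a load of 6.0 on a 4-core machine, the per-core value is 1.5 and the clamped value is 1.0, so a strict <0;1> range would hide the fact that the machine is overloaded.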
2011/6/24 Daniel Gruber <dgruber@univa.com>:
"Load-value normalizing" :
OK, you convinced me. Let's leave this as it is.
-- Mariusz
participants (4)
- Andre Merzky
- Daniel Gruber
- Mariusz Mamoński
- Peter Tröger