Additional Input into BES w.r.t ESI document

All,

From the OMII-Europe project (which is engaging with GGF in implementing grid standards) here are some comments from one of the teams that will be looking to implement the BES specification.

Please respond to the following comments. The attached document has comments on the proposed ESI state model and information/resource model.

Steven

About the Job Factory Interface:

* The list of the defined operations does not cover administrative operations like “rejectJobSubmissions(Policy)” and “allowJobSubmission (Policy)” useful for disabling/enabling new job submissions based on the policy defined by the CE administrator (e.g. if the CE has to be shutdown for maintenance; disable new submissions if the number of active jobs is > 3000; etc). Do you plan to provide it?

* The current proposed JobFactory interface allows users to create new jobs AND (optionally) subscribe for notifications. We believe that the job management service should be better decoupled from the notification service, as they provide different functionalities. We suggest that Figure 1 be extended with a new box (“SubscriptionFactory”) which exposes an interface for creating, modifying and removing notification requests. In this way, notification management can be decoupled from job management, allowing a greater degree of flexibility. For example, it would be possible for users to subscribe to notifications after a job has been created (with the current proposed interface, this would not be possible). Moreover, it would allow users to submit to the notification service requests for “cumulative” subscriptions (i.e., in order to receive notifications related to all jobs submitted by the same user, or by members of the same Virtual Organization).

* We propose to include a “JobAssess” operation on the JobFactory Interface. This operation provides the user with an estimate of the start running time, e.g. taking into account the current state of the Computing Element, the number of running/queued jobs and other parameters.

* Related with the previous issue, it would be interesting to include an additional operation for estimating the Quality of Service (QoS) for the service instance. The exact meaning of QoS in general depends on user requirements (users may assume different weights for different parameters).

* This is probably a “cosmetic” adjustment, but we believe that the name of the “Release” operation makes sense only for reactivating jobs from “Held” states; the transition from “Start Pending” and “Staging In” could probably be called something like “Activate”.

* The proposed interface does not provide mechanisms for handling capabilities. With this we refer to the possibility for a user to authorize other user(s) to perform certain operations on his/her jobs. For example, a user may want to allow another user to monitor his/her jobs, or to interrupt and abort jobs and so on. Perhaps this functionality is not strictly related to job management, but is rather a security issue (which, according to the draft, is still to be discussed). We may keep it for future discussions.

About the staging of files:

* While it is clear that users might explicitly “push” files from their storage space to the Grid while a job is in “Start Pending” state, it is unclear how users might explicitly “pull” by hand resulting files from the Grid after a job completed, and before everything gets cleaned up.

About the Job Interface:

* The description of the states of Fig. 2 should be expanded with more details, including details on state transitions (basically, we suggest to put a complete description of the states in section 3.2.1).

* Section 5.1: The meaning of the “Log” property on Table 3 is not clear: what does it mean?

* From Table 3 we see a JobState property which represents the current state of the job. We think that it would be useful to provide the user with the history of all job status changes with the associated timestamp. We also support the need for exitCode and failureReason attributes (see Section 8.2, issue 9) to describe the job return code and job failure reason respectively.

* It may be useful to provide an additional property (we may call it “CommandList”) representing the list of all commands issued for a given job.

About the Application Interface:

* The status of this interface is unclear: is this section going to be discussed? Are you going to consider the problem of user interaction with running jobs?

--
Moreno Marzolla
INFN Sezione di Padova, via Marzolo 8, 35100 PADOVA, Italy
EMail: moreno.marzolla@pd.infn.it
Phone: +39 049 8277047
WWW  : http://www.pd.infn.it/~marzolla
Fax  : +39 049 8756233

On Thu, 2006-05-11 at 08:51 +0100, Steven Newhouse wrote:
All,
From the OMII-Europe project (which is engaging with GGF in implementing grid standards) here are some comments from one of the teams that will be looking to implement the BES specification.
Please respond to the following comments. The attached document has comments on the proposed ESI state model and information/resource model.
Steven
About the Job Factory Interface: * The list of the defined operations does not cover administrative operations like “rejectJobSubmissions(Policy)” and “allowJobSubmission (Policy)” useful for disabling/enabling new job submissions based on the policy defined by the CE administrator (e.g. if the CE has to be shutdown for maintenance; disable new submissions if the number of active jobs is > 3000; etc). Do you plan to provide it?
That sounds like, at the very least, a separate management interface. I think the job factory interface should focus on very basic job creation semantics. This also gives an implementer the flexibility to determine how they want to regulate job creation: do they want a public interface, or do they want some hidden, non-web-service controls?
* The current proposed JobFactory interface allows users to create new jobs AND (optionally) subscribe for notifications. We believe that the job management service should be better decoupled from the notification service, as they provide different functionalities. We suggest that Figure 1 be extended with a new box (“SubscriptionFactory”) which exposes an interface for creating, modifying and removing notification requests. In this way, notification management can be decoupled from job management, allowing a greater degree of flexibility. For example, it would be possible for users to subscribe to notifications after a job has been created (with the current proposed interface, this would not be possible). Moreover, it would allow users to submit to the notification service requests for “cumulative” subscriptions (i.e., in order to receive notifications related to all jobs submitted by the same user, or by members of the same Virtual Organization).
In principle I agree that this would be cleaner, but in practice it's nice to reduce message round trips to speed up job submission. And there's nothing at all preventing someone from subscribing later just because there's this one shortcut that is allowed. In practice you also have to be careful separating the two, since a subscription after submission causes a race condition between notifications and when the client is set up to receive them. You end up missing notifications frequently, which is especially bad if the client depends on all the notifications to know what to do. Furthermore, if you don't have the subscribe-on-create feature, you end up needing a two-phase commit model as well, which adds yet another message round trip to the job submission.
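The race mentioned above can be illustrated with a small, deterministic simulation. This is only a sketch under stated assumptions: `ToyJobService`, `create_job`, `subscribe` and `advance` are hypothetical names invented for illustration, not operations from the ESI or BES drafts, and the state list is borrowed loosely from the states named in this thread.

```python
from typing import Callable, Dict, List, Optional

# Hypothetical state sequence, loosely following the states named in this thread.
STATES = ["Start Pending", "Staging In", "Executing", "Staging Out", "Cleaning Up", "Done"]

Listener = Callable[[str, str], None]


class ToyJobService:
    """In-memory stand-in for a job factory that emits state-change notifications."""

    def __init__(self) -> None:
        self._index: Dict[str, int] = {}
        self._listeners: Dict[str, List[Listener]] = {}
        self._next_id = 0

    def create_job(self, listener: Optional[Listener] = None) -> str:
        """Create a job; optionally subscribe in the same call (subscribe-on-create)."""
        job_id = f"job-{self._next_id}"
        self._next_id += 1
        self._index[job_id] = 0
        self._listeners[job_id] = [listener] if listener else []
        return job_id

    def subscribe(self, job_id: str, listener: Listener) -> None:
        """Separate subscription call, possibly issued after the job has moved on."""
        self._listeners[job_id].append(listener)

    def advance(self, job_id: str) -> None:
        """Move the job to its next state and notify the current subscribers only."""
        self._index[job_id] += 1
        state = STATES[self._index[job_id]]
        for listener in self._listeners[job_id]:
            listener(job_id, state)


service = ToyJobService()

# Case 1: subscribe-on-create, so no transition can slip by unobserved.
seen_on_create: List[str] = []
job_a = service.create_job(lambda _job, state: seen_on_create.append(state))
service.advance(job_a)
service.advance(job_a)

# Case 2: the job advances once before the separate subscribe call arrives.
seen_late: List[str] = []
job_b = service.create_job()
service.advance(job_b)  # "Staging In" is emitted with nobody listening
service.subscribe(job_b, lambda _job, state: seen_late.append(state))
service.advance(job_b)

print(seen_on_create)  # ['Staging In', 'Executing']
print(seen_late)       # ['Executing']
```

In the second case the client never learns about "Staging In", which is the kind of missed notification described above; a real service would of course interleave these calls over the network rather than in a fixed order.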
* We propose to include a “JobAssess” operation on the JobFactory Interface. This operation provides the user with an estimate of the start running time, e.g. taking into account the current state of the Computing Element, the number of running/queued jobs and other parameters.
This sounds like a reservation or agreement interface that should be separate from the basic factory interface.
* Related with the previous issue, it would be interesting to include an additional operation for estimating the Quality of Service (QoS) for the service instance. The exact meaning of QoS in general depends on user requirements (users may assume different weights for different parameters).
Same answer as above.
* This is probably a “cosmetic” adjustment, but we believe that the name of the “Release” operation makes sense only for reactivating jobs from “Held” states; the transition from “Start Pending” and “Staging In” could probably be called something like “Activate”.
I don't think the ESI document says that the Release operation does this. It's only for releasing holds. Perhaps you are assuming an implied hold during "Start Pending" that isn't actually there? Is there a specific quote from the document that you think says this?
* The proposed interface does not provide mechanisms for handling capabilities. With this we refer to the possibility for a user to authorize other user(s) to perform certain operations on his/her jobs. For example, a user may want to allow another user to monitor his/her jobs, or to interrupt and abort jobs and so on. Perhaps this functionality is not strictly related to job management, but is rather a security issue (which, according to the draft, is still to be discussed). We may keep it for future discussions.
Very interesting point, though I think this could be addressed with JSDL extensions unless you want to be able to adjust permissions after the job has been submitted. In that case I think this is yet another separate interface that we could propose later and implementers could decide for themselves whether to allow this on their specific service.
About the staging of files: * While it is clear that users might explicitly “push” files from their storage space to the Grid while a job is in “Start Pending” state, it is unclear how users might explicitly “pull” by hand resulting files from the Grid after a job completed, and before everything gets cleaned up.
We've run into this issue with WS-GRAM. One idea I proposed elsewhere is to have an extension to JSDL that specifies files you are interested in monitoring, and have RPs in the job interface that list URLs for those files so that you can, say, use GridFTP to pull them down. I'm on the fence whether this is appropriate for the ESI job interface or whether this should be a separate file monitoring interface. If you're wondering why I suggest RPs if you already know the file you want, this is because at least in WS-GRAM we allow for mapping of files to a GridFTP server that may not necessarily have the same file system view. In other words, the URL path part may not agree entirely with the path specified in JSDL.
About the Job Interface: * The description of the states of Fig. 2 should be expanded with more details, including details on state transitions (basically, we suggest to put a complete description of the states in section 3.2.1).
I'll leave this up to Ian to address.
* Section 5.1: The meaning of the “Log” property on Table 3 is not clear: what does it mean?
I asked the same question. I believe it will be cleared up in a later version of the spec.
* From Table 3 we see a JobState property which represents the current state of the job. We think that it would be useful to provide the user with the history of all job status changes with the associated timestamp. We also support the need for exitCode and failureReason attributes (see Section 8.2, issue 9) to describe the job return code and job failure reason respectively.
A state history is an interesting idea. That could clear up some of the issues I raised with not having subscribe-on-create. Also, good point about the exit code and failure reason. I don't know if the authors intended for this to be encompassed in the StateType (JobState RP), but if not I agree that these are definitely needed.
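A state history of this kind might be sketched as follows. The names `StateChange`, `JobRecord`, `record` and `since` are hypothetical and do not come from the ESI document; `exit_code` and `failure_reason` stand in for the exitCode and failureReason attributes mentioned above, and the timestamps and failure text are made up purely for illustration.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass(frozen=True)
class StateChange:
    """One entry in the history: the state entered and when it was entered."""
    state: str
    timestamp: datetime


@dataclass
class JobRecord:
    """Hypothetical job record that keeps every state change, not just the latest."""
    history: List[StateChange] = field(default_factory=list)
    exit_code: Optional[int] = None
    failure_reason: Optional[str] = None

    def record(self, state: str, when: Optional[datetime] = None) -> None:
        """Append a state change, defaulting the timestamp to the current UTC time."""
        self.history.append(StateChange(state, when or datetime.now(timezone.utc)))

    @property
    def job_state(self) -> Optional[str]:
        """Current state, i.e. what a single JobState property would expose."""
        return self.history[-1].state if self.history else None

    def since(self, index: int) -> List[StateChange]:
        """Return the changes a client has not yet seen, given how many it has seen."""
        return self.history[index:]


# Illustrative values only.
job = JobRecord()
job.record("Start Pending", datetime(2006, 5, 11, 8, 0, tzinfo=timezone.utc))
job.record("Staging In", datetime(2006, 5, 11, 8, 1, tzinfo=timezone.utc))
job.record("Executing", datetime(2006, 5, 11, 8, 5, tzinfo=timezone.utc))
job.exit_code = 1
job.failure_reason = "illustrative failure text"

# A client that subscribed late and saw only the first change can catch up.
missed = [change.state for change in job.since(1)]
print(job.job_state, missed)
```

A client that polls `since(n)` can reconstruct transitions it missed, which is why a history could soften the absence of subscribe-on-create.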
* It may be useful to provide an additional property (we may call it “CommandList”) representing the list of all commands issued for a given job.
What do you mean by "command"? Typically there is only one executable, so if that's what you mean I don't quite follow you.
Peter
About the Application Interface: * The status of this interface is unclear: is this section going to be discussed? Are you going to consider the problem of user interaction with running jobs?

Dear Peter, thank you very much for your feedback. Just some small comments on some of your points (we agree with the rest) [...]
* This is probably a “cosmetic” adjustment, but we believe that the name of the “Release” operation makes sense only for reactivating jobs from “Held” states; the transition from “Start Pending” and “Staging In” could probably be called something like “Activate”.
I don't think the ESI document says that the Release operation does this. It's only for releasing holds. Perhaps you are assuming an implied hold during "Start Pending" that isn't actually there? Is there a specific quote from the document that you think says this?
From section 3.2.1: "The job will remain in Pending until it receives a Release event from the Job Release operation, which will move it to Running". (I think that the "Pending" state mentioned here is the state labeled as "Start Pending" on Figure 2). From this we were assuming an implicit hold. If this is not the case, is the transition Pending -> Staging In instantaneous? I think that it would be useful to provide a complete description of states and transitions in section 3.2.1 of the document (that is, expanding Table 1). [...]
About the staging of files: * While it is clear that users might explicitly “push” files from their storage space to the Grid while a job is in “Start Pending” state, it is unclear how users might explicitly “pull” by hand resulting files from the Grid after a job completed, and before everything gets cleaned up.
We've run into this issue with WS-GRAM. One idea I proposed elsewhere is to have an extension to JSDL that specifies files you are interested in monitoring, and have RPs in the job interface that list URLs for those files so that you can, say, use GridFTP to pull them down. I'm on the fence whether this is appropriate for the ESI job interface or whether this should be a separate file monitoring interface. If you're wondering why I suggest RPs if you already know the file you want, this is because at least in WS-GRAM we allow for mapping of files to a GridFTP server that may not necessarily have the same file system view. In other words, the URL path part may not agree entirely with the path specified in JSDL.
What we had in mind was simply to make the transition(s) from states "Executing" -> "Staging Out" -> "Cleaning Up" not instantaneous, so that files do not get immediately purged. Anyway, according to the current JSDL specification, it is possible to specify DataStaging elements with the "DeleteOnTermination" sub-element set to False, meaning that a given file should NOT be deleted after the job terminates. In this case, even if the transitions "Executing" -> "Staging Out" -> "Cleaning Up" are instantaneous, the user might later access files for which the "DeleteOnTermination" flag was set to false. [...]
* It may be useful to provide an additional property (we may call it “CommandList”) representing the list of all commands issued for a given job.
What do you mean by "command"? Typically there is only one executable, so if that's what you mean I don't quite follow you.
Well, we were using the term "command" to indicate the Job Interface Operations described in Section 5.2. Basically, for administrative purposes it could be useful if the sequence of operations performed on each job is logged and could be accessed in some way. But probably this operation is not strictly related to job submission and should be considered in a separate interface.
Regards,
Moreno.
--
Moreno Marzolla
INFN Sezione di Padova, via Marzolo 8, 35100 PADOVA, Italy
EMail: moreno.marzolla@pd.infn.it
Phone: +39 049 8277047
WWW  : http://www.pd.infn.it/~marzolla
Fax  : +39 049 8756233

On May 19, Moreno Marzolla modulated: ...
From section 3.2.1: "The job will remain in Pending until it receives a Release event from the Job Release operation, which will move it to Running". (I think that the "Pending" state mentioned here is the state labeled as "Start Pending" on Figure 2). From this we were assuming an implicit hold. If this is not the case, is the transition Pending -> Staging In instantaneous? I think that it would be useful to provide a complete description of states and transitions in section 3.2.1 of the document (that is, expanding Table 1).
I would hope this is simply a text versioning issue, and that any of the hold states would be explicitly enabled/disabled by job document content so that the transitions can be instant or blocked on client "release" actions.
What we had in mind was simply to make the transition(s) from states "Executing" -> "Staging Out" -> "Cleaning Up" not instantaneous, so that files do not get immediately purged.
Yes, we do the same thing as you are thinking in WS-GRAM through the use of a hold state between the staging out and cleaning up steps. The other issue Peter is highlighting is something we had to address in WS-GRAM: the namespaces of the local job filesystem and the data access interface may not be identical, so some sort of mapping needs to be exported. In our case, standard GridFTP is the data access (monitoring) interface, so we had to expose the mapping for the client to convert a job output file URI into a GridFTP URL. With a web-service based file access interface, this mapping could probably be wrapped up so that the data access remains relative to the job's output file namespace?
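The kind of mapping described above might look like the following sketch, which converts a job-local path into a GridFTP URL by longest-prefix match. It is not how WS-GRAM exposes or applies its mapping; `to_gridftp_url`, the mapping table, the host name and the paths are all hypothetical.

```python
from posixpath import normpath
from typing import Dict
from urllib.parse import quote


def to_gridftp_url(job_path: str, mappings: Dict[str, str]) -> str:
    """Translate a job-local absolute path into a URL using the longest matching prefix.

    `mappings` maps local directory prefixes to URL prefixes on the data server.
    """
    path = normpath(job_path)
    # Try longer prefixes first so that the most specific mapping wins.
    for prefix in sorted(mappings, key=len, reverse=True):
        base = prefix.rstrip("/")
        if path == base or path.startswith(base + "/"):
            suffix = path[len(base):]
            return mappings[prefix].rstrip("/") + quote(suffix)
    raise LookupError(f"no mapping exported for {job_path!r}")


# Hypothetical exported mapping: the server sees the scratch area under /export/jobs.
exported = {
    "/scratch/jobs": "gsiftp://gridftp.example.org/export/jobs",
    "/home": "gsiftp://gridftp.example.org/users",
}

url = to_gridftp_url("/scratch/jobs/42/stdout.txt", exported)
print(url)  # gsiftp://gridftp.example.org/export/jobs/42/stdout.txt
```

A web-service file access interface could hide this table entirely and accept paths relative to the job's own output namespace, which is the alternative raised above.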
Anyway, according to the current JSDL specification, it is possible to specify DataStaging elements with the "DeleteOnTermination" sub-element set to False, meaning that a given file should NOT be deleted after the job terminates. In this case, even if the transitions "Executing" -> "Staging Out" -> "Cleaning Up" are instantaneous, the user might later access files for which the "DeleteOnTermination" flag was set to false.
Personally, I think that is insufficient in JSDL. You want to be able to pause for safe interlock of 3rd-party data monitoring, but still be able to clean up without having either of:
-- client obligation to remove files
-- unbounded lifetime of temporary job files after job exits
There are jobs where cleanup is not desired, but I do not think that should be considered the same as this output monitoring and synchronization...
karl
--
Karl Czajkowski karlcz@univa.com

* It may be useful to provide an additional property (we may call it “CommandList”) representing the list of all commands issued for a given job.
What do you mean by "command"? Typically there is only one executable, so if that's what you mean I don't quite follow you.
Well, we were using the term "command" to indicate the Job Interface Operations described in Section 5.2. Basically, for administrative purposes it could be useful if the sequence of operations performed on each job is logged and could be accessed in some way. But probably this operation is not strictly related to job submission and should be considered in a separate interface.
Ok. I personally can't come up with a use case, but that doesn't mean one doesn't exist. Care to elaborate why this is useful? Do you not trust that the service will do what you tell it to do?
Peter

Peter G Lane wrote:
Well, we were using the term "command" to indicate the Job Interface Operations described in Section 5.2. Basically, for administrative purposes it could be useful if the sequence of operations performed on each job is logged and could be accessed in some way. But probably this operation is not strictly related to job submission and should be considered in a separate interface.
Ok. I personally can't come up with a use case, but that doesn't mean one doesn't exist. Care to elaborate why this is useful? Do you not trust that the service will do what you tell it to do?
Probably it is not a very frequent use case; anyway, if users are allowed to authorize other users to issue commands (i.e., suspend, resume, cancel operations) on their own jobs, then it may be useful to inspect the list of commands which have been issued to a given job, as they may have not been given by the job owner. This feature may be useful for tracking down problems (for debugging purposes).
Perhaps, rather than making such "job command list" property mandatory, we may use the "Extensibility" element which is cited in the ESI document on section 5.1?
Moreno.
--
Moreno Marzolla
INFN Sezione di Padova, via Marzolo 8, 35100 PADOVA, Italy
EMail: moreno.marzolla@pd.infn.it
Phone: +39 049 8277047
WWW  : http://www.pd.infn.it/~marzolla
Fax  : +39 049 8756233
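To make the use case concrete, a per-job command list might be sketched as below. Everything here is hypothetical: `AuditedJob`, `IssuedCommand`, `issue` and `issued_by_others` are illustrative names, the user names are invented, and the delegation check is a stand-in for whatever authorization mechanism is eventually adopted.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class IssuedCommand:
    operation: str  # e.g. "suspend", "resume", "cancel"
    issuer: str     # identity of whoever invoked the operation


@dataclass
class AuditedJob:
    """Hypothetical job that records every operation issued against it."""
    owner: str
    delegates: Set[str] = field(default_factory=set)
    command_list: List[IssuedCommand] = field(default_factory=list)

    def issue(self, operation: str, issuer: str) -> None:
        """Record an operation if the issuer is the owner or an authorized delegate."""
        if issuer != self.owner and issuer not in self.delegates:
            raise PermissionError(f"{issuer} may not {operation} this job")
        self.command_list.append(IssuedCommand(operation, issuer))

    def issued_by_others(self) -> List[IssuedCommand]:
        """Commands that did not come from the job owner."""
        return [cmd for cmd in self.command_list if cmd.issuer != self.owner]


job = AuditedJob(owner="alice", delegates={"bob"})
job.issue("suspend", "alice")
job.issue("cancel", "bob")

foreign = [(cmd.operation, cmd.issuer) for cmd in job.issued_by_others()]
print(foreign)  # [('cancel', 'bob')]
```

The owner can then see that the cancel came from a delegate rather than from their own client, which is the debugging scenario described above.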

On Wed, 2006-05-24 at 12:21 +0200, Moreno Marzolla wrote:
Peter G Lane wrote:
Well, we were using the term "command" to indicate the Job Interface Operations described in Section 5.2. Basically, for administrative purposes it could be useful if the sequence of operations performed on each job is logged and could be accessed in some way. But probably this operation is not strictly related to job submission and should be considered in a separate interface.
Ok. I personally can't come up with a use case, but that doesn't mean one doesn't exist. Care to elaborate why this is useful? Do you not trust that the service will do what you tell it to do?
Probably it is not a very frequent use case; anyway, if users are allowed to authorize other users to issue commands (i.e., suspend, resume, cancel operations) on their own jobs, then it may be useful to inspect the list of commands which have been issued to a given job, as they may have not been given by the job owner. This feature may be useful for tracking down problems (for debugging purposes).
Perhaps, rather than making such "job command list" property mandatory, we may use the "Extensibility" element which is cited in the ESI document on section 5.1?
Yeah, perhaps that's a better idea. Auditing seems to me a very different problem than the one that BES is trying to solve. GRAM, for example, is implementing auditing that logs straight to a database. I think we'd want the flexibility to do this our own way.
Peter
Moreno.
participants (4)
- Karl Czajkowski
- Moreno Marzolla
- Peter G Lane
- Steven Newhouse