Re: [Pgi-wg] OGF PGI - AGU Execution Service Strawman Rendering

4 Dec 2009


      On Mon, 26 Oct 2009, Etienne URBAH wrote:
...
Aleksandr, Balazs, Morris, Luigi, Johannes and all,
Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI :
Aleksandr KONSTANTINOV and I had a telephone talk on Friday 23 
October at 16h, and we raised the question of whether we want to tie 
state changes tightly to operations, or whether an operation may 
aggregate multiple state changes.
I am NOT an expert on Web Services.  I can imagine 3 ways to implement 
message transfers (between Job Submitter and Execution Service) 
according to the 'Single Job State Model' :
If Execution Service and Job Submitter both implement notifications
-------------------------------------------------------------------
This asynchronous method is the most efficient, but is NOT mandatory.
-  On job submission :
   - The Job Submitter sends a 'CreateActivity' request containing 2 
parameters :
     - The vector of Job descriptions,
     - The URL for notification.
   - The Execution Service immediately sends back a 'CreateActivity' 
response containing Jobids (or error messages).
-  The Job Submitter waits for notifications.
-  Whenever the Job Submitter receives from the Execution Service a 
'Hold' notification (containing for example the location for manual file 
staging) :
   - He/she performs the appropriate work (for example manual file 
staging),
   - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
   - The Execution Service immediately sends back a 
'ChangeActivityStatus' response describing acceptance or refusal.
-  As soon as the Job is 'Failed' or 'Finished with success or error', 
the Job Submitter receives from the Execution Service the appropriate 
notification.
-  Then, the Job Submitter may send a 'WipeActivity' request to purge 
the Job.
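The notification-based exchange above can be sketched as follows. This is only an illustrative in-memory model, NOT a proposed rendering: the class 'ExecutionService', the 'simulate' hook, the placeholder state 'Resumed', the 'resume' action name and the example staging location are all assumptions, and a Python callback stands in for the notification URL.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A callback stands in for the notification URL of the mail (assumption).
Notify = Callable[[str, str, dict], None]  # (jobid, state, info)

TERMINAL = ("Failed", "Finished with success or error")


@dataclass
class ExecutionService:
    """Hypothetical in-memory stand-in for an Execution Service."""
    states: Dict[str, str] = field(default_factory=dict)
    listeners: Dict[str, Notify] = field(default_factory=dict)
    counter: int = 0

    def create_activity(self, descriptions: List[str], notify: Notify) -> List[str]:
        """Return one Jobid per Job description immediately."""
        jobids = []
        for _ in descriptions:
            self.counter += 1
            jobid = f"job-{self.counter}"
            self.states[jobid] = "Submitted"
            self.listeners[jobid] = notify
            jobids.append(jobid)
        return jobids

    def simulate(self, jobid: str, state: str, info: dict) -> None:
        """Illustration hook: move a Job to `state` and push a notification."""
        self.states[jobid] = state
        self.listeners[jobid](jobid, state, info)

    def change_activity_status(self, jobid: str, action: str) -> bool:
        """Accept 'resume' only for a Job currently in 'Hold'."""
        if action == "resume" and self.states.get(jobid) == "Hold":
            self.states[jobid] = "Resumed"  # placeholder state name
            return True
        return False

    def wipe_activity(self, jobid: str) -> bool:
        """Purge a Job, but only once it has reached a terminal state."""
        if self.states.get(jobid) in TERMINAL:
            del self.states[jobid], self.listeners[jobid]
            return True
        return False


# Walk through the sequence described in the mail.
service = ExecutionService()
events = []
(jobid,) = service.create_activity(
    ["job description"], lambda j, s, i: events.append((s, i)))
service.simulate(jobid, "Hold", {"staging_location": "gsiftp://example.org/in"})
accepted = service.change_activity_status(jobid, "resume")  # after manual staging
service.simulate(jobid, "Finished with success or error", {})
wiped = service.wipe_activity(jobid)
```

In this sketch the Job Submitter never asks for the Job status: it only reacts to the 'Hold' notification and to the final notification.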
If Execution Service or Job Submitter does NOT implement notifications
----------------------------------------------------------------------
Then the Job Submitter has to poll the Job status.
-  On job submission :
   - The Job Submitter sends a 'CreateActivity' request containing only 
1 parameter :  The vector of Job descriptions,
   - The Execution Service immediately sends back a 'CreateActivity' 
response containing Jobids (or error messages).
-  From time to time :
   - The Job Submitter sends a 'GetActivityStatus' request,
   - The Execution Service immediately sends back a 'GetActivityStatus' 
response describing the Job status and appropriate additional 
information (for example the location for manual file staging).
-  Whenever necessary (for example the Job status has just become 'Hold') :
   - The Job Submitter performs the appropriate work (for example manual 
file staging),
   - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
   - The Execution Service immediately sends back a 
'ChangeActivityStatus' response describing acceptance or refusal.
-  When the Job status has become 'Failed' or 'Finished with success or 
error', the Job Submitter may send a 'WipeActivity' request to purge the 
Job.
This method provides consistency with the 'Single Job State Model', but 
requires repetitive 'GetActivityStatus' requests.
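A minimal sketch of this polling loop, under stated assumptions: the function 'poll_until_terminal', its parameters and the 'resume' action name are hypothetical, and the three callables stand in for the 'GetActivityStatus' request, the 'ChangeActivityStatus' request and the manual work of the Job Submitter.

```python
import time
from typing import Callable, Tuple

TERMINAL = {"Failed", "Finished with success or error"}


def poll_until_terminal(
    get_activity_status: Callable[[], Tuple[str, dict]],
    change_activity_status: Callable[[str], bool],
    handle_hold: Callable[[dict], None],
    interval_s: float = 0.0,
    max_polls: int = 100,
) -> Tuple[str, int]:
    """Poll the Job status until a terminal state; return (state, polls)."""
    for polls in range(1, max_polls + 1):
        state, info = get_activity_status()
        if state in TERMINAL:
            return state, polls
        if state == "Hold":
            handle_hold(info)  # for example manual file staging
            if not change_activity_status("resume"):
                raise RuntimeError("ChangeActivityStatus was refused")
        time.sleep(interval_s)
    raise TimeoutError(f"no terminal state after {max_polls} polls")
```

The returned poll count makes the cost of this method visible: every state the Job passes through may cost one or more 'GetActivityStatus' round trips.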
Method minimizing 'GetActivityStatus' requests without notifications
--------------------------------------------------------------------
As far as I have understood from Aleksandr's explanations :
-  On job submission, the Job Submitter sends a 'CreateActivity' request 
containing only 1 parameter :  The vector of Job descriptions.
-  The Execution Service sends back a 'CreateActivity' response 
containing, for each Job :
   - Its Jobid (or error message),
   - If necessary, the location for file stage-in.
- If manual file stage-in is necessary :
   - The Job Submitter :
     - performs the manual file stage-in,
     - sends a 'ChangeActivityStatus' request (for example to resume Job 
processing).
   - The Execution Service sends back a 'ChangeActivityStatus' response 
describing acceptance or refusal.
-  From time to time :
   - The Job Submitter sends a 'GetActivityStatus' request,
   - The Execution Service immediately sends back a 'GetActivityStatus' 
response describing the Job status and appropriate additional 
information (for example the location for manual file stage-out).
-  Whenever necessary (for example the Job status has just become 
'Post-processing:Hold:Manual-Stage-Out') :
   - The Job Submitter performs the appropriate work (for example manual 
file stage-out),
   - Then he/she sends a 'ChangeActivityStatus' request (for example to 
resume Job processing),
   - The Execution Service sends back a 'ChangeActivityStatus' response 
describing acceptance or refusal (for example the Job status has become 
'Failed' or 'Finished with success or error').
-  When the Job status has become 'Failed' or 'Finished with success or 
error', the Job Submitter may send a 'WipeActivity' request to purge the 
Job.
This method minimizes 'GetActivityStatus' requests, but :
-  The time between the 'CreateActivity' request and the 
'CreateActivity' response (containing the location for file stage-in) 
can be very long (for example if the Job must stay a long time in the 
'Submitted' state waiting for computing and/or storage resources).
-  Repetitive 'GetActivityStatus' requests are still necessary for the 
Job Submitter to learn that a Job has reached the 
'Post-processing:Hold:Manual-Stage-Out' state (or the 'Finished with 
success or error' state if no manual stage-out is necessary).
So, I can NOT guarantee the consistency of this method with the 'Single 
Job State Model'.
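To make the third method concrete, here is a hedged sketch of how a Job Submitter could act on a 'CreateActivity' response that already carries the stage-in location. The record layout 'CreateActivityResult' and the function 'next_actions' are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class CreateActivityResult:
    """Per-Job entry of the 'CreateActivity' response (hypothetical layout)."""
    jobid: Optional[str] = None
    error: Optional[str] = None
    stage_in_location: Optional[str] = None


def next_actions(results: List[CreateActivityResult]) -> List[Tuple[Optional[str], str]]:
    """Return, for each Job, what the Job Submitter does right after submission."""
    actions = []
    for r in results:
        if r.jobid is None:
            actions.append((None, f"report error: {r.error}"))
        elif r.stage_in_location is not None:
            # No 'GetActivityStatus' is needed to learn the stage-in location.
            actions.append(
                (r.jobid, f"stage in to {r.stage_in_location}, then ChangeActivityStatus"))
        else:
            actions.append((r.jobid, "poll GetActivityStatus"))
    return actions
```

The sketch also shows where the difficulty lies: the response cannot be built before the stage-in location is known, so the 'CreateActivity' response is delayed until then.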
Please study the above 3 methods carefully, make up your mind, and send 
comments or remarks, so that we can together improve the design of the 
messages, and achieve consensus.
Besides, I will probably NOT be able to attend the PGI telephone 
conferences on 30 October and 06 November 2009.
Best regards.
-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                      Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah@lal.in2p3.fr
-----------------------------------------------------
On Fri, 16 Oct 2009, Etienne URBAH wrote:
...
Balazs, Morris, Luigi, Johannes and all,
Concerning the 'AGU Execution Service Strawman Rendering' of OGF PGI 
and the telephone conference of last week on 09 October 2009 :
-  Many thanks to Morris for having given detailed explanations on 
chapter 2.1 'CreateActivity Operation'.
   I now much better understand what is described inside an 'operation'.
-  Many thanks to Johannes for the Report and for the Action list.
Consistency between the CreateActivity operation and the State Model
--------------------------------------------------------------------
Inside chapter 2.1 'CreateActivity operation', I found discrepancies 
between the current description of the 'CreateActivity' operation and 
the PGI Single Job State Model :
-  Inside the PGI Single Job State Model, the Execution Service :
   - Allocates a Jobid (or an EPR) to the Job and sends it back to the 
Submitter at the end of the 'Submitted' state, BEFORE any storage 
allocation could be performed,
   - Notifies the Submitter of the allocated storage resources for 
stage-in only inside the 'Pre-processing:Hold' state.
-  The current description of the 'CreateActivity' operation encompasses 
both the 'Submitted' and 'Pre-processing' states, and states that 
the response can contain information about storage resources for 
stage-in.
   In fact :
   - The 'CreateActivity' operation should be limited to the 
'Submitted' state, and the response can only be a vector of 
Jobids (or EPRs).  Information about storage resources for stage-in 
can only be given later, through a 'GetActivityInfo' request or a 
notification to the submitter.
   - In order to permit notification, the 'CreateActivity' operation 
should allow a 'Notification EPR' as an additional optional input 
parameter.
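The two proposed changes can be sketched as message shapes. This is an assumption-laden illustration, not the Strawman schema: the type names 'CreateActivityRequest', 'Accepted', 'Rejected' and the helper 'stage_in_channel' are hypothetical, and plain strings stand in for Job descriptions and EPRs.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class CreateActivityRequest:
    job_descriptions: Sequence[str]
    notification_epr: Optional[str] = None  # additional OPTIONAL input parameter


@dataclass(frozen=True)
class Accepted:
    jobid: str  # Jobid (or EPR); deliberately NO stage-in information here


@dataclass(frozen=True)
class Rejected:
    error: str


def stage_in_channel(request: CreateActivityRequest) -> str:
    """Say how the submitter later learns the storage resources for stage-in."""
    return "notification" if request.notification_epr else "GetActivityInfo"
```

Keeping stage-in information out of the response type is what limits the operation to the 'Submitted' state in this sketch.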
I have updated the document with changes highlighted at 
http://forge.gridforum.org/sf/go/doc15628?nav=1
Hold substate inside the 'Submitted' state ?
--------------------------------------------
See mail below.
Best regards.
On Thu, 13 Aug 2009, Etienne URBAH wrote:
...
Balazs, Morris and all,
Concerning the last OGF PGI telephone conference on 05 August 2009 :
Meeting notes
-------------
I see NO meeting notes about this telephone conference at 
http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/discu...
So I am working with my own (fragmentary) notes.
For all future OGF PGI telephone conferences, is it possible that a 
secretary or a chair takes meeting notes, then writes them down in an 
understandable form, and publishes them at the above mentioned page ?
Creation of a 'Submitted:Hold' substate ?
-----------------------------------------
First, as general rules, I consider that :
-  In order to AVOID keeping (potentially large) grid resources while 
NOT computing, grid Jobs should be designed to be processed 
completely automatically, with NO provision for 'Hold' substates,
-  A grid Job needing many 'Hold' substates can NOT be handled by an 
automatic Submitter, but should be submitted by a human grid User as 
an 'Interactive Job', as described for example at 
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION000844000000...
Someone asked for the creation of a 'Hold' substate inside the 
'Submitted' state, like inside other states.
This 'Submitted:Hold' substate would make sense only if the Job 
Submitter could perform an operation on this substate.
In order to request such an operation, the Job Submitter needs the 
Jobid (or Job EPR).
This Jobid (or Job EPR) is guaranteed to be allocated by the 
Execution Service only at the END of the 'Submitted' state, but NOT 
before.
Therefore, I consider that the 'Submitted' state can NOT contain a 
'Hold' substate.
If anyone thinks otherwise, can he/she please present a convincing 
Use Case ?
Clarifications about the 'Finished with Success or Error' state
---------------------------------------------------------------
Someone asked that the 'Error' case of the 'Finished with Success or 
Error' state should be moved to the 'Failed' state.
In fact, inside the current Job State Model, a Job reaches the 
'Finished with Success or Error' state if and only if it successively 
reached the end of the following states, without failure or cancellation 
at the JOB level :
-  'Pre-processing'
-  'Delegated', whatever the Application result :
   - Success = Application return code equal     to   zero
   - Error   = Application return code different from zero
-  'Post-processing'
Inside the 'Finished with Success or Error' state :
-  Success means 'Application return code was equal     to   zero',
-  Error   means 'Application return code was different from zero'.
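The rule above can be written down as a small function. This is a minimal sketch, assuming that each of the three states is summarized by a boolean meaning 'ended without failure or cancellation at the JOB level'; the names 'STAGES' and 'finished_case' are hypothetical.

```python
from typing import Mapping, Optional

STAGES = ("Pre-processing", "Delegated", "Post-processing")


def finished_case(completed: Mapping[str, bool], app_return_code: int) -> Optional[str]:
    """Return 'Success' or 'Error' if the Job reaches the
    'Finished with Success or Error' state, else None."""
    # Any failure or cancellation at the JOB level keeps the Job out of this state.
    if not all(completed.get(stage, False) for stage in STAGES):
        return None
    # Inside the state, only the Application return code distinguishes the cases.
    return "Success" if app_return_code == 0 else "Error"
```

For example, a Job whose three states all completed and whose Application returned 3 gives 'Error', while the same return code with a failed 'Post-processing' gives None, because the Job never reaches the state.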
I copied this behavior from the Job State Model of 'gLite', where the 
'Done' state contains both the 'Success' and 'Exit Code !=0' cases, 
as can be seen in the 'bookkeeping information' at 
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION000841000000...
I consider this design, and the strong separation between 
the 'Failed' and 'Finished with Success or Error' states, as fully 
justified by the following reasons :
-  Whenever a Job reaches the 'Failed' state, the grid Execution 
Service detected an unrecoverable inconsistency at the JOB level.
   Therefore, the Job output sandbox and the post-processed 
Application output files can potentially be NOT consistent and NOT 
even accessible by the Job Submitter.
   In order to investigate the Job failure, the grid User then needs 
some grid knowledge (and often experience and expertise) to retrieve 
and interpret :
   - the Job failure code and message,
   - the Job logging and bookkeeping, in comparison with the Job 
description.
   This 'grid level' investigation can sometimes prove that the cause 
of the Job failure came from the Application, but is ALWAYS necessary.
-  Whenever a Job reaches the 'Finished with Success or Error' state, 
the grid Execution Service could create the Job output sandbox, and 
perform post-processing on Application output files, WITHOUT 
detecting any unrecoverable inconsistency at the JOB level.
   Therefore, the Job output sandbox, and the post-processed 
Application output files, can be supposed to be consistent and easily 
accessible by the Job Submitter.
   On a non-zero return code of the Application, the grid User :
   - first has to look (WITHOUT needing any grid knowledge) at the 
Job output sandbox and at the post-processed Application output files 
for an Application problem,
   - before, if necessary, using grid knowledge (and often experience 
and expertise) to provide any evidence that the Application error was 
caused by a faulty Job description, the Batch system, or the grid 
Execution Service.
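The two investigation paths above can be contrasted in a short sketch. The function name 'investigation_steps' and the wording of the returned steps are only an illustration of the reasoning, not part of any specification.

```python
from typing import List, Optional


def investigation_steps(state: str, app_return_code: Optional[int] = None) -> List[str]:
    """Return the ordered steps a grid User follows for a final Job state."""
    if state == "Failed":
        # Grid-level investigation is ALWAYS necessary here.
        return [
            "retrieve the Job failure code and message",
            "compare the Job logging and bookkeeping with the Job description",
        ]
    if state == "Finished with Success or Error":
        if app_return_code is None:
            raise ValueError("the Application return code is required")
        if app_return_code == 0:
            return []  # Success: nothing to investigate
        return [
            # First step needs NO grid knowledge.
            "inspect the Job output sandbox and post-processed output files",
            # Second step only if the first one is not conclusive.
            "if necessary, check the Job description, Batch system and Execution Service",
        ]
    raise ValueError(f"not a final state: {state!r}")
```

The ordering of the returned lists carries the argument: for 'Failed' the first step already requires grid knowledge, whereas for a non-zero return code it is only the optional last step.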
As a summary, I consider that the 'Error' case of the 'Finished with 
Success or Error' state should be kept as it is, and NOT be moved to 
the 'Failed' state.
If anyone thinks otherwise, can he/she please present convincing 
reasons ?
Strawman Rendering
------------------
I will work on the ODT version of 'Strawman Rendering' at 
http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
-  include the above clarifications on states,
-  include the 'Types of grid Jobs' section of my 'PGI Execution 
Service Overview' document,
-  check consistency, and present the relationships between the 
operations described in chapter 2 'Interface: Execution Port-Type' 
and the different states of the different types of grid Jobs.
Joining +9900827049931906 (plus perhaps Skype typing) on Friday 14 
August 2009 at 16h CET.
Best regards.