Simple Job Management

14 Dec 2006

      Dear SAGA members,

Based on the API document currently in the public comment phase, we have 
implemented a very simple version of the Job Management API. Basically, 
we stripped out the underlying SAGA model and went directly to the Job 
Management API. See the attached UML diagram for details. The reason is 
that we did not have enough time to implement the full SAGA core to 
support the API, and our focus is on the NAREGI Super Scheduler (SS).

Considerations and simplifications:
- We do not support the suspended state at this time (see below for 
details).
- The wait method name is reserved in Java for monitor synchronization 
(Object.wait()) and cannot be used in the same way; we did not implement 
a real wait method since we are not interested in synchronizing jobs at 
the moment. The closest Java equivalent might be Thread.join().
- Metrics handling has not been added. It might come with a future 
incarnation of the package.
- Job_self is not supported, although we could pretend it is the same as 
job in Java.
- So far the factory does not include all the methods of job_service. 
They might come in later incarnations, at which point we would rename 
the factory to a service.
- Checkpoint and migrate are not supported for now. For two of our job 
models they make no sense, and the SS does not seem to support them yet.
- Signal is not supported. Internally some implementations have it, but 
again the SS does not.
- Many attributes are not supported at the moment.
- The session and security model is ignored (the SS has its own model; 
the other implementations do not care).
- A job description can take strings, collections and string arrays as 
arguments. Other formats are allowed if the caller knows how to 
manipulate them. Properties that are known to use string arrays have 
direct accessors to facilitate their access. All properties can be 
stored as a single string to allow serialization.
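
To illustrate the last item, a minimal sketch of such a job description 
might look like the following. The class name SimpleJobDescription, the 
method names, and the comma separator are assumptions made for this 
illustration; they are not our actual classes and not part of the SAGA 
document.

```java
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch: every property is stored as one string so that the
// whole description stays easy to serialize.
class SimpleJobDescription {
    // Assumed separator; values that contain a comma would need escaping.
    private static final String SEPARATOR = ",";

    private final Map<String, String> properties = new LinkedHashMap<>();

    // Plain string value.
    void set(String name, String value) {
        properties.put(name, value);
    }

    // String array value, flattened into a single string.
    void set(String name, String[] values) {
        properties.put(name, String.join(SEPARATOR, values));
    }

    // Collection value, flattened the same way.
    void set(String name, Collection<String> values) {
        properties.put(name, String.join(SEPARATOR, values));
    }

    // Raw single-string form of a property, or null if it is not set.
    String get(String name) {
        return properties.get(name);
    }

    // Direct accessor for a property that is known to hold a string array.
    String[] getArguments() {
        String raw = properties.get("Arguments");
        if (raw == null || raw.isEmpty()) {
            return new String[0];
        }
        return raw.split(SEPARATOR);
    }
}
```

A caller would set "Arguments" with a String[] and read it back either as 
the raw string through get() or as an array through getArguments().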

Since we are working in Java we decided to go purely pattern oriented. So 
an application has access to the factory, the job pattern and a job 
description. The concrete Job stub implementations are not supposed to 
be exposed (but are accessible since the classes are public).
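
As a rough sketch of this pattern-oriented layout, an application would 
only deal with a job interface and a factory, never with a concrete stub. 
All names below (SimpleJob, LocalJobStub, SimpleJobFactory, the "local" 
back-end key) are hypothetical and only serve the illustration; the 
description is reduced to a plain map.

```java
import java.util.Map;

// The job pattern: the only type an application is expected to program against.
interface SimpleJob {
    void run();

    String getState();
}

// Hypothetical concrete stub; applications should not need to reference it.
class LocalJobStub implements SimpleJob {
    private final Map<String, String> description;
    private String state = "New";

    LocalJobStub(Map<String, String> description) {
        this.description = description;
    }

    @Override
    public void run() {
        // A real stub would launch description.get("Executable") here.
        state = "Running";
    }

    @Override
    public String getState() {
        return state;
    }
}

// The factory hides which concrete stub is instantiated.
final class SimpleJobFactory {
    private SimpleJobFactory() {
    }

    // The back-end key is an assumed convention for this sketch.
    static SimpleJob createJob(String backend, Map<String, String> description) {
        if ("local".equals(backend)) {
            return new LocalJobStub(description);
        }
        throw new IllegalArgumentException("Unsupported backend: " + backend);
    }
}
```

Hooking a different implementation in later would then only touch the 
factory, not the application code.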

The design is made so that we can later hook the SAGA core classes in 
below the current API without breaking (too much of) the code 
(assuming we hold on to our design pattern approach).

We have three concrete implementations of Jobs: Local, SSH and Super 
Scheduler.
- The local job uses the Java Process object and handles a job on the 
same machine as the JVM. This job type does not support suspended mode 
at all. This is a fully synchronous job since all actions are taken on 
the spot, unless you submit the job to a queuing system.
- The SSH job is a remote job incarnation. The job can run on any machine 
that has the SSH daemon running. This can be a synchronous job, unless 
you submit the job to a queuing system. This job type does not support 
suspended mode for the moment, and only POSIX systems can be used to 
launch the job.
- The Super Scheduler job is NAREGI specific and uses NAREGI’s middleware. 
This is an asynchronous job. Suspended mode cannot be handled directly 
even though the state exists in the SS, so this is still pending. This 
job internally produces WSDL documents; the necessary methods are 
private, however.
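
For the local case, a synchronous job on top of the Java process API 
could be sketched as below. This is a simplified illustration under our 
own assumptions, not the real class: the name LocalProcessJob and the 
state strings are hypothetical, and output is simply inherited from the 
JVM instead of being exposed as streams.

```java
import java.io.IOException;
import java.util.List;

// Hypothetical sketch of a local job that runs on the same machine as the JVM.
class LocalProcessJob {
    private final List<String> command;
    private String state = "New";
    private int exitCode = -1;

    LocalProcessJob(List<String> command) {
        this.command = command;
    }

    // Starts the process and blocks until it exits, so the call is fully
    // synchronous. Process.waitFor() plays the role that Thread.join()
    // plays for threads.
    void run() {
        state = "Running";
        try {
            // inheritIO() avoids blocking on a full, unread output pipe.
            Process process = new ProcessBuilder(command).inheritIO().start();
            exitCode = process.waitFor();
            state = (exitCode == 0) ? "Done" : "Failed";
        } catch (IOException e) {
            // The command could not be started at all.
            state = "Failed";
        } catch (InterruptedException e) {
            // Preserve the interrupt for the caller and give up on the job.
            Thread.currentThread().interrupt();
            state = "Failed";
        }
    }

    String getState() {
        return state;
    }

    int getExitCode() {
        return exitCode;
    }
}
```

There is no suspended state here because java.lang.Process offers no 
portable way to pause a child process.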

General comments and questions, which might be some meat for the public 
comments as well:

We stumbled over the state machine of the API. The "Unknown" and "New" 
states are unclear to us. In our opinion, when you create a job, either 
with the factory or directly with the constructor of the specific 
incarnation, it enters the "New" state. In our implementation the 
"Unknown" state is reserved for the very short time during which the 
object is being instantiated, and we switch directly to "New" once the 
constructor has finished. The principle in OO programming is to have a 
stable object once you have finished constructing it and before calling 
any method; if the constructor is not enough to produce a stable object, 
you need a factory. So when you get an object back it should be in a 
stable state, and thus the "Unknown" state is superfluous in our opinion.
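
The argument can be made concrete with a small sketch. The enum and class 
names are hypothetical and only a subset of states is listed; the point 
is that no caller can ever observe the UNKNOWN value.

```java
// Subset of states for illustration; the enum itself is an assumption.
enum JobState { UNKNOWN, NEW, RUNNING, DONE }

class StatefulJob {
    // Holds UNKNOWN only while the constructor is still executing.
    private JobState state = JobState.UNKNOWN;
    private final String executable;

    StatefulJob(String executable) {
        if (executable == null || executable.isEmpty()) {
            // Fail early instead of handing out a half-initialized object.
            throw new IllegalArgumentException("executable is required");
        }
        this.executable = executable;
        // Construction is complete, so the first state a caller sees is NEW.
        this.state = JobState.NEW;
    }

    JobState getState() {
        return state;
    }

    String getExecutable() {
        return executable;
    }
}
```

Either the constructor throws and no object exists, or the object is 
returned in the NEW state.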

Some metrics or attributes of the Job seem useless since they come 
directly from the descriptor, for example "ExecutionHosts", 
"WorkingDirectory" or "CPUTimeLimit". That is, unless you consider that 
these values might differ from the job description, or that, if the job 
description does not mention them, the back-end can assign them to the 
job. In either case the API documentation should clarify this.

The run_job method of the service will not follow the API contract if 
implemented: only one value can be returned from a method in Java. Also, 
the streams are already available through the Job pattern.
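
One way this could look in Java is to return only the job and let the 
caller fetch the streams from it. The sketch below is an assumption on 
our side, not the API contract: StreamedJob, StreamedJobService and the 
runJob signature are hypothetical, and in-memory streams stand in for 
the real process streams.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical job type that exposes its streams through accessors.
class StreamedJob {
    private final OutputStream stdin = new ByteArrayOutputStream();
    private final InputStream stdout;
    private final InputStream stderr = new ByteArrayInputStream(new byte[0]);

    StreamedJob(String fakeOutput) {
        // In-memory stand-in for real process output, for illustration only.
        this.stdout = new ByteArrayInputStream(
                fakeOutput.getBytes(StandardCharsets.UTF_8));
    }

    OutputStream getStdin() {
        return stdin;
    }

    InputStream getStdout() {
        return stdout;
    }

    InputStream getStderr() {
        return stderr;
    }
}

class StreamedJobService {
    // Returns a single object; the streams are reached through the job
    // instead of through additional output parameters.
    static StreamedJob runJob(String commandLine) {
        return new StreamedJob("ran: " + commandLine + "\n");
    }
}
```

The caller gets everything the out parameters would have carried, at the 
cost of deviating from the method signature in the API document.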

In section 3.8.8 (Examples) of the document, the example at lines 16 and 
17 is wrong (or the method is overloaded). There should be no string 
argument; the host should be set in the descriptor.

-- 

Best regards,
Pascal Kleijer

----------------------------------------------------------------
   HPC Marketing Promotion Division, NEC Corporation
   1-10, Nisshin-cho, Fuchu, Tokyo, 183-8501, Japan.
   Tel: +81-(0)42/333.6389       Fax: +81-(0)42/333.6382
