Dear Andre,

Many thanks for your reply. I think the SAGA mailing list may be the best place to discuss this issue.

Currently the SAGA specification has minimal support for workflows. However, this is not ideal for users whose applications exhibit some kind of workflow. Most compute-intensive applications (especially high-throughput applications) require workflow functionality in the SAGA specification, for the following reasons:

1. It is not an easy task to submit the individual tasks of a workflow as single jobs. There may be hundreds of tasks in a workflow, running on dozens of datasets, which would require a user to generate hundreds of jobs. Workflow tools facilitate this: users create a workflow and submit it as a single entity, and a resource management system interprets the workflow and distributes it across distributed resources.

If there is no workflow support in the SAGA specification, the creation and submission of jobs (workflows, bulk jobs, etc.) may be a nightmare for users.

2. Performance: If jobs are submitted one by one (as is the case in SAGA), the associated latencies may be very high. For example, if a particular neuroimaging workflow has 50 tasks and these run on 50 datasets, we require 2,500 jobs to do the required processing. If a single job takes 2 minutes (it could be 45 minutes with some middleware) before it gets a slot for execution, and submissions are serialized, we face a minimum delay of around 83 hours (2,500 x 2 minutes) before all the jobs have even started execution. This is not tolerable where users are concerned about performance and throughput.

3. Management and monitoring: Once users submit jobs, they will need to monitor their outcomes. There may be job failures, unresolved dependencies, or jobs that fail to produce the required output because of other errors. Managing and monitoring hundreds of jobs is not trivial.

On the other hand, if a workflow is submitted as a single entity, the user manages just one job, and the underlying resource management system (such as the gLite WMS or Grid Gateway) may split the workflow into tasks, provide monitoring information to the user, and present a single result.

 
There may be a number of other scenarios where workflow support in SAGA is needed. I am making the case here that SAGA could gain wide-scale adoption in a number of scientific and business communities if workflow support were made available.

There may be a number of ways to support workflows in SAGA. Ideally, the SAGA API would generate a JDL/JSDL description from a workflow, which can then be interpreted by the underlying Grid middleware.
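
For concreteness, a DAG of this kind expressed in gLite WMS JDL might look roughly as follows. This is only a sketch written from memory: the node names, executables and arguments are made up, and the exact attribute spellings should be checked against the WMS JDL documentation.

[
  Type = "dag";
  nodes = [
    preprocess = [
      description = [
        Executable = "/usr/local/bin/preprocess.sh";
        Arguments  = "subject01";
        StdOutput  = "preprocess.out";
        StdError   = "preprocess.err";
      ];
    ];
    analyse = [
      description = [
        Executable = "/usr/local/bin/analyse.sh";
        Arguments  = "subject01";
        StdOutput  = "analyse.out";
        StdError   = "analyse.err";
      ];
    ];
  ];
  // analyse starts only after preprocess has completed
  dependencies = { { preprocess, analyse } };
]

A description like this is handed to the WMS in a single submission, so the submission overhead is paid once for the whole workflow rather than once per task.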

On the three approaches you outlined:

1. Large jobs can execute whole workflows, but we would be forcing the entire workflow to be scheduled on a single site. This may create more problems, since we would be asking the job to transfer all of its datasets to a single site. It may also overload some sites and would virtually minimize the role of meta-schedulers, which were created for multi-site scheduling.

2. Pilot jobs may help with performance optimization, but it is not clear how they can execute a whole workflow, especially one that is distributed over more than one site. If a workflow is distributed, the pilot jobs have no mechanism by which they can communicate and coordinate across sites.

3. A DAG enactor could be interesting, but how it would coordinate with the underlying Grid resources remains to be investigated.

We may discuss the possibility of a workflow adapter that is abstract in nature, so that the resulting job descriptions can be executed in different enactment environments. SAGA could generate generic workflow descriptions (JSDL/JDL), and the adapters could be extended to support the enactment functionality. Alternatively, SAGA could provide a partial enactment engine, although how that would be executed needs open debate. Yet another scenario would be to translate workflow descriptions into calls on the existing SAGA API, which are then dispatched automatically to the underlying adapters; a rough sketch of this last option follows below.
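
To illustrate only that last option, something along the following lines could be attempted today on top of the standard JavaSAGA job API (class and method names are quoted from memory and should be checked against the JavaSAGA javadoc; the DAG, node names, executable and service URL are invented for the example). Crucially, every node still goes through a separate submission, so the per-job latency problem described above is not solved by this route.

import java.util.*;
import org.ogf.saga.job.Job;
import org.ogf.saga.job.JobDescription;
import org.ogf.saga.job.JobFactory;
import org.ogf.saga.job.JobService;
import org.ogf.saga.task.State;
import org.ogf.saga.url.URLFactory;

public class NaiveDagDispatch {
    public static void main(String[] args) throws Exception {
        // node -> nodes it depends on, listed in a valid topological order
        Map<String, List<String>> dag = new LinkedHashMap<String, List<String>>();
        dag.put("preprocess", Collections.<String>emptyList());
        dag.put("analyse", Arrays.asList("preprocess"));

        // "any://" leaves adaptor selection to the SAGA engine (placeholder host)
        JobService js = JobFactory.createJobService(
                URLFactory.createURL("any://example.org"));

        Map<String, Job> submitted = new HashMap<String, Job>();
        for (Map.Entry<String, List<String>> node : dag.entrySet()) {
            // block until every parent of this node has finished successfully
            for (String parent : node.getValue()) {
                Job p = submitted.get(parent);
                p.waitFor();
                if (p.getState() != State.DONE)
                    throw new RuntimeException("node " + parent + " failed");
            }
            JobDescription jd = JobFactory.createJobDescription();
            jd.setAttribute(JobDescription.EXECUTABLE, "/bin/echo");
            jd.setVectorAttribute(JobDescription.ARGUMENTS,
                                  new String[] { node.getKey() });
            Job job = js.createJob(jd);
            job.run();   // one submission (and one latency penalty) per node
            submitted.put(node.getKey(), job);
        }
    }
}

This keeps the adapters untouched, but it turns the client into the enactment engine and pays the submission latency for every single node, which is exactly why a description-level hand-over (JDL/JSDL) to the middleware looks preferable for large workflows.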

We do not want to break the standard by suggesting an adapter to support workflows; however, we need some mechanism by which SAGA implementations can support DAGs and other workflows.

Best Regards

Ashiq Anjum

 



From: Andre Merzky <andre@merzky.net>
Date: 8 October 2009 19:10:02 GMT+01:00
To: Irfan Habib <irfan.habib@cern.ch>
Cc: Andre Merzky <andre@merzky.net>, <gat-devel@cct.lsu.edu>
Bcc: Andre Merzky <andre@merzky.net>
Subject: Re: [Gat-devel] Proposal for changes in the glite-adaptor
Reply-To: Andre Merzky <andre@merzky.net>

Hi Irfan,

BTW: nice to hear from you again! :-)  Greetings to the other guys
in Bristol!

Quoting [Irfan Habib] (Oct 08 2009):

Dear Andre,

I completely understand your motivation; however, to me the
glite-adaptor is not equal to the other adaptors, because it does not
work at the "same level" as the other adaptors that are part of the
SAGA and JavaGAT packages. All the other adaptors work at the site
level, where the user selects a specific site gateway to submit a job
to (SGE, Condor and GridSAM are all middleware for site-based
clusters). If the glite-adaptor submitted jobs to a glite-CE rather
than the glite-WMS, we could consider creating our own workflow
enactor on top of SAGA, and that would be a SAGA-compliant solution.

However, the glite-adaptor interacts with a Grid-level service
(glite-WMS), and hence a single job submission incurs a significant
amount of overhead. The test Grid environment I have access to incurs
a submission overhead of 2-4 minutes, and the production EGEE
environment ranges from 2 to 45 minutes. If I were to enact a
workflow of mutually dependent tasks on the Grid, all the accumulated
job latencies would severely impact the workflow turnaround time.
To address this issue, glite-WMS enables a user to define a workflow
in a JDL: you can author a JDL which includes a DAG workflow and
submit it to glite-WMS. This sends a single request to glite-WMS, so
submission overheads are incurred only once rather than repeatedly
for large workflows. Of course, users may want to test the workflow
on a local SGE or Condor cluster; this can easily be supported
through the use of adaptors and a SAGA-based workflow enactor.
Our motivation is purely performance oriented.

FWIW, there are other backends which expose similar latencies.  Our
EC2 adaptor for example submits jobs to a Virtual Machine instance -
if that instance needs to be newly created, submission has a latency
of up to 5 minutes.  We also try to enact workflows on EC2 (actually
Teragrid + EC2), so we face quite similar problems as you do, I
expect.

We have three different solutions for the problem:

(i) We try to use large jobs, for which the startup time becomes
relatively small.  That is of course not always possible.

(ii) We built a PilotJob infrastructure on top of SAGA, which
requires one pilot job per requested resource to be submitted once
(with the latency penalty, but you can of course overlap that
bootstrapping across all resources). That pilot job then acts as a
kind of private gateway, which executes the real workflow nodes for
us.

I am not sure if our pilot job implementation is of any use for you,
as it's implemented in Python on top of C++, so probably well outside
of your tool chain; but let me know if you want to check it out, and
I'll put you in contact with the developer.

(iii) Not so much a solution as an approach: we started to implement
a DAG enactor in SAGA, which basically acts as a service and so
replaces your glite-WMS. The DAG is then specified in an additional
Workflow Extension to SAGA (which is yet to be defined). As the DAG
enactor can access resources from within the system (in a
pilot-job-like manner), it avoids most submission-related latencies.

We do have a prototype for that - again in C++, and most likely not
functional enough to be of any use to you. But if you (or others on
this list) are interested in pursuing that route, in particular with
respect to defining a SAGA extension package, please let me know,
and I'll keep you posted.


[...] and that would be a SAGA-compliant solution.

Actually, it would not. I absolutely understand your issues, and
agree that you need a solution to be able to use JavaSAGA/JavaGAT
sensibly, but exposing backend details at the SAGA API level breaks
SAGA compliance, by definition - sorry...


To the GAT list admins: please let us know if this discussion is off
topic for the list, and we'll move it elsewhere...

Cheers, and thanks, Andre.


Best Regards,
Irfan Habib


On 8 Oct 2009, at 16:11, Andre Merzky wrote:

Hi Irfan,

FWIW, I agree with Max here: if you can't express your job with the
means SAGA provides, then that may be a deficiency in SAGA (or GAT) -
but adding backend-specific methods or properties at the API level
defeats the purpose of a generic API...

Out of curiosity: what parts of the JDS specifically do you have
trouble with?

Cheers, Andre.



Quoting [Max Berger] (Oct 08 2009):

Hi Irfan Habib,

two things:

- If you write and submit the patch (or send it to me), I'll review
and apply it. (I'm the current 'maintainer' of the gLite adaptor).

BUT

this would defeat the purpose of the JavaGAT layer. There are
already too many gLite-specific settings to use, and this would mean
adding even more peculiarities.

So, what I'd prefer is a solution which uses JavaGAT and maps the job
structure there to the actual job structure. The features you've
requested are called "CoScheduleJob" in JavaGAT:

http://www.cs.vu.nl/ibis/javadoc/javagat/org/gridlab/gat/resources/CoScheduleJob.html

Please consider this alternative. It would make portability to other
Grid environments much easier in the future.


Max

Irfan Habib wrote:
Hi,

I have been able to submit jobs through JavaSAGA with the
glite-adaptor. However, the kinds of jobs that the adaptor currently
supports are fairly simple atomic executables. For our project we
require more complex JDL jobs (JDL DAG workflows, JDL parametric
jobs, etc.).
One way forward for us is to extend the glite-adaptor to accept JDL
jobs from users. For instance, one way to implement this would be to
introduce another attribute in the Preferences context. If that
attribute is defined, the glite-adaptor would, instead of generating
a JDL from scratch, use the JDL that has been passed in, for instance
via JobDescription.EXECUTABLE.
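
To make the proposal concrete, client code might then look roughly
as follows. The preference name ("glite.useNativeJdl"), the
"preferences" context type string and the service URL are purely
illustrative placeholders, and the pass-through behaviour is what is
being proposed here, not something the adaptor does today (API names
quoted from memory, to be checked against the JavaSAGA javadoc):

import org.ogf.saga.context.Context;
import org.ogf.saga.context.ContextFactory;
import org.ogf.saga.job.Job;
import org.ogf.saga.job.JobDescription;
import org.ogf.saga.job.JobFactory;
import org.ogf.saga.job.JobService;
import org.ogf.saga.session.Session;
import org.ogf.saga.session.SessionFactory;
import org.ogf.saga.url.URLFactory;

public class RawJdlSubmit {
    public static void main(String[] args) throws Exception {
        Session session = SessionFactory.createSession(true);

        // proposed switch: ask the glite-adaptor to use the JDL verbatim
        Context prefs = ContextFactory.createContext("preferences");
        prefs.setAttribute("glite.useNativeJdl", "true");  // invented name
        session.addContext(prefs);

        // the complete JDL (e.g. a DAG) is handed over unmodified;
        // abbreviated here for readability
        String jdl = "[ Type = \"dag\"; /* nodes and dependencies */ ]";

        JobDescription jd = JobFactory.createJobDescription();
        jd.setAttribute(JobDescription.EXECUTABLE, jdl);

        JobService js = JobFactory.createJobService(
                session, URLFactory.createURL("glite://wms.example.org"));
        Job job = js.createJob(jd);
        job.run();
        job.waitFor();
    }
}
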
Such changes, in our opinion, add to the capabilities of the
gLite-adaptor and enable it to be used for more complex,
gLite-specific jobs.

Would such changes be acceptable to the JavaGAT glite-adaptor
developers? Can such changes be included in the trunk?

Best Regards,
Irfan


On 7 Oct 2009, at 22:53, Ceriel Jacobs wrote:

Irfan Habib wrote:
Hi,

Well, according to the svn log the changes have been committed;
however, the directory attribute is still being set.
I hope to have fixed it now. The attribute was still being set
because the WORKINGDIRECTORY entry of JobDescription has a default
value, which I forgot about. SAVE_STATE should also be fixed now.

Best wishes,

Ceriel