
Balazs, Morris, Luigi, Johannes and all PGI members,

Concerning OGF PGI Use Cases and Scope :

UML diagrams
------------
I have updated, inside
http://forge.gridforum.org/sf/docman/do/listDocuments/projects.pgi-wg/docman...
the source file and the pictures of UML Class and Collaboration Diagrams
(designed with the ArgoUML tool) :

- showing the INTERFACES of Distributed Data Processing which are necessary
  or useful to standardize,

- making it possible to assess the impact of SCOPE and architecture on the
  list of INTERFACES which absolutely must be standardized to permit
  minimum interoperability.

For each interface, I have added the known relevant standard(s) inside
square brackets.

Inside the Collaboration Diagrams, arrows in RED depict relationships NOT
using interfaces. These relationships hinder interoperability and should be
avoided when possible.

Terminology / Vocabulary
------------------------
- INSTRUMENT : Anything generating large sets of scientific data (HEP
  experiment, medical imaging, observation instrument, ...), as taken into
  account by the DORII project

- PAYLOAD : Any Grid or Cloud entity directly useful to a scientist (Data,
  Activity, Instrument, ...)

- SUPPORT : Any other Grid or Cloud entity (Security, Info, Application,
  License, VM image, Log, Accounting, ...)

Security, Info, Log and Accounting are absolutely needed to operate a
Production Grid.

Files
-----
http://forge.gridforum.org/sf/go/doc15977?nav=1  ArgoUML source file
http://forge.gridforum.org/sf/go/doc15978?nav=1  Class Diagram
http://forge.gridforum.org/sf/go/doc15979?nav=1  Abstract Collab. Diagram
http://forge.gridforum.org/sf/go/doc15980?nav=1  Detailed Collab. Diagram
http://forge.gridforum.org/sf/go/doc15981?nav=1  Scope with BES + JSDL
http://forge.gridforum.org/sf/go/doc15982?nav=1  Interoperability with a
                                                 Monolithic Execution Service

Standardization priorities
--------------------------
The collaboration diagrams show in particular that :

- The interfaces for Activity Management are only connected to the
  Instrument Manager, the Activity Managers and the Computing Resources.
  So it is easy to build a gateway bridging Activity Management between two
  completely different Grid or Cloud infrastructures (the EDGeS 3G bridge
  is in full operation between gLite- and BOINC-powered infrastructures).
  Therefore, Activity Management alone does NOT require urgent
  standardization.

- The interfaces for Security, Info, Log and Accounting are directly
  connected to most functionalities. Therefore, interoperability between
  Production Grids absolutely requires urgent standardization of these
  interfaces.

Criticism, remarks and suggestions are welcome.

Best regards.

-----------------------------------------------------
Etienne URBAH
LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200   91898 ORSAY   France
Tel:   +33 1 64 46 84 87
Skype: etienne.urbah
Mob:   +33 6 22 30 53 27
mailto:urbah@lal.in2p3.fr
-----------------------------------------------------

On Fri, 14/05/2010 18:03, Etienne URBAH wrote:
Balazs, Morris, Luigi, Johannes and all PGI members,
Many thanks to Oxana, Aleksandr and Andrew for their contributions to OGF PGI :
We finally come to what we all should have begun with, that is, the USE CASES and the SCOPE.
+-------------+
|  USE CASES  |
+-------------+

I suggest that :
- We all provide clear descriptions of Use Cases inside http://forge.gridforum.org/sf/docman/do/listDocuments/projects.pgi-wg/docman...
In particular, Use Cases for the five LHC categories described by Oxana are welcome. Could the already published GROMACS Use Case be included inside one LHC category ?
- Each Use Case (in particular the GROMACS Use Case) must clearly indicate
  whether the required file stagings should be performed :
  - automatically by the Execution Service (from locations specified inside
    the Job Description), or
  - manually by the Submitter of the Activity, which then requires the
    Activity to be in a 'Hold' state and its 'session directory' (or the
    like) to be published.
  (A small sketch contrasting these two staging modes follows this list.)
- We describe these Use Cases with a graphical UML tool, such as ArgoUML (which runs on MS-Windows, Mac OS X and Linux), and publish the corresponding source files inside GridForge (an open collaborative process).
- Simple example of a Use Case : The Submitter of an Activity MAY be a scientist using a Scientific Grid Portal that permits submission of predefined Applications. Therefore, this Activity Submitter MAY have very little knowledge of Grids, and MAY have very little knowledge of the Application executed as Payload on the Computing Resource.
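
As a purely illustrative sketch (the names below are assumptions, not from
JSDL or any PGI document), the two staging modes could be modeled like
this :

    # Hypothetical sketch contrasting the two file staging modes a Use
    # Case must choose between.  StagingMode, JobDescription and the
    # field names are illustrative assumptions.
    from dataclasses import dataclass, field
    from enum import Enum

    class StagingMode(Enum):
        # The Execution Service stages files itself, from locations
        # specified inside the Job Description.
        AUTOMATIC = "automatic"
        # The Submitter stages files by hand; the Activity must then sit
        # in a 'Hold' state with its session directory published.
        MANUAL = "manual"

    @dataclass
    class JobDescription:
        executable: str
        staging: StagingMode
        # Only meaningful for AUTOMATIC staging: stage-in/stage-out URLs.
        input_urls: list[str] = field(default_factory=list)
        output_urls: list[str] = field(default_factory=list)

    def needs_hold_state(job: JobDescription) -> bool:
        """Manual staging forces a 'Hold' state and a published session
        directory."""
        return job.staging is StagingMode.MANUAL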
+---------+
|  SCOPE  |
+---------+

Execution of Activities or Jobs is only a small part of the 'Distributed
Data Processing' functionalities that a Production Grid is required to
provide.
In particular, we MUST clearly understand the difference in 'Quality of
Service' required by :

- Transient entities (such as Activities or Jobs), which MAY fail at any
  time for any reason,

- Persistent entities (such as Grid resource descriptors, Security
  descriptors, Data sets, Accounting records, Log records, ...), which
  SHOULD be securely kept.
I will publish very soon the source file and the pictures of UML Collaboration Diagrams (designed with the above mentioned ArgoUML tool) showing :
- INTERFACES of Distributed Data Processing which are necessary or useful to standardize
- The impact of ARCHITECTURE on the list of INTERFACES which absolutely must be standardized to permit minimum interoperability.
+------------------------+
|  FOUNDATION STANDARDS  |
+------------------------+

In order to ease mutual understanding and general agreement, we need
Foundation Standards as a sound basis.
Requirement NF6 (162) : JSPG (Security Policies)
-------------------------------------------------
Without agreement on AUTHN and AUTHZ, there is simply NO interoperability.
Requirement IS1 (1) : GLUE model
---------------------------------
- At the Amsterdam meeting, someone said about GLUE : 'Information model
  does not concretely say anything'.

- I have the totally opposite opinion : I strongly suggest using the GLUE
  model (currently GLUE 2.0) as one of these Foundation Standards, and
  describing as many concepts as possible using GLUE entities.
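
As an illustration of describing concepts with GLUE entities, here is a
minimal sketch. The entity names (UserDomain, ComputingEndpoint,
ComputingService) are taken from the GLUE 2.0 specification; the attribute
subset shown is an illustrative assumption, not the full schema :

    # Sketch only: entity names follow GLUE 2.0; attributes are a small
    # illustrative subset of the real schema.
    from dataclasses import dataclass

    @dataclass
    class UserDomain:           # GLUE 2.0 : a set of users, e.g. a VO
        id: str
        name: str

    @dataclass
    class ComputingEndpoint:    # GLUE 2.0 : network location of one
        id: str                 # interface of a computing service
        url: str
        interface_name: str     # e.g. "BES", or a stack-specific value

    @dataclass
    class ComputingService:     # GLUE 2.0 : an abstracted computing
        id: str                 # capability exposing endpoints
        endpoints: list[ComputingEndpoint]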
+----------------------------+
|  TERMINOLOGY / VOCABULARY  |
+----------------------------+

For clarity, I suggest the following definitions :
Client
------
Holder of credentials belonging to a member of a GLUE UserDomain. For
example, a Client MAY submit an Activity to an Execution Service, or query
the Status of an Activity.
Activity = Job (full synonyms)
------------------------------
Remote processing which a Client describes in a 'Job Description', which
the Client then submits to an Execution Service.
Payload
-------
Anything (Application, Script, Pilot Job, ...) executed by a Computing
Resource on request of the Activity. The Payload MAY be completely unaware
that it is executed inside a Grid Activity.
Simple Activity
---------------
Simple Job Description containing only ONE local job executed by only ONE
batch system, WITHOUT 'Hold' states NOR manual staging.

- This is a suggested restrictive evolution of requirement JM5 (55).

- An Activity requiring 'Hold' states and/or manual staging is then called
  an 'Interactive Activity'.
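
A minimal sketch (field and function names are illustrative assumptions,
not from any requirement document) of the classification rule implied by
this definition :

    # Illustrative only: field and function names are assumptions.
    from dataclasses import dataclass

    @dataclass
    class ActivityDescription:
        local_jobs: int            # local jobs in the Job Description
        batch_systems: int         # batch systems involved
        uses_hold_state: bool      # Activity pauses in a 'Hold' state
        uses_manual_staging: bool  # Submitter stages files by hand

    def is_simple(a: ActivityDescription) -> bool:
        """Simple Activity: ONE local job on ONE batch system, without
        'Hold' states or manual staging; anything else is Interactive."""
        return (a.local_jobs == 1
                and a.batch_systems == 1
                and not a.uses_hold_state
                and not a.uses_manual_staging)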
Criticism, remarks and suggestions are welcome.
Best regards.
-----------------------------------------------------
Etienne URBAH
LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200   91898 ORSAY   France
Tel:   +33 1 64 46 84 87
Skype: etienne.urbah
Mob:   +33 6 22 30 53 27
mailto:urbah@lal.in2p3.fr
-----------------------------------------------------
On 13/05/2010 21:00, Oxana Smirnova wrote:
Hi Andrew, all,
Allow me to start from the very beginning, to explain the "typical" workload Aleksandr referred to.
Both ARC and gLite "grew" from the requirements of the High Energy Physics community, more specifically, those of the LHC experiments. I'll come back later to what this means in practice.
The basic difference between gLite and ARC starting requirements is that gLite is designed for resources owned or controlled to a large extent by their users, while ARC is designed to support resources shared between different users and controlled by fairly independent resource providers.
The immediate difference is policies: while the gLite community is largely expected to comply with policies devised by the LHC Grid (WLCG) Joint Security and Policy Group, the ARC community has no single set of policies. ARC sites that contribute to LHC computing generally tend to respect the WLCG policies, but not too closely, giving priority to the local policies of resource owners. Needless to say, this introduces extra complexity into the requirements and reduces the share of simple use cases.
Now, the LHC experiments are huge communities, between 500 and 3000 members each. All well-educated, computer-savvy people who never hesitate to come up with their own brilliant solutions. Even within one experiment the divergence is huge, and there are sites that support 4 experiments. This adds to the complexity: not only do we have diverging or contradictory requirements from resource owners, we also have all sorts of diverging requirements from users. The least common denominator is 1: meaning "ping", because even "hello world" is ambiguous - where do you send the output, to a file or to standard output? If it is standard output, do you write it to a log? Who said "hello world", the individual user or the whole experiment? Do we have a right to log individual activities at all? And so on.
There are several basic kinds of typical jobs run by LHC experiments; in general they can be separated into 5 categories:
1. Monte Carlo generation (no input, small resource consumption, small
   output)
2. Detector simulation (one input file, moderate resource consumption,
   moderate output)
3. Signal reconstruction (multiple input files, moderate resource
   consumption, large output)
4. Data merging (very many input files - like 400, large resource
   consumption, large output)
5. Data analysis (huge number of input files, not necessarily known in
   advance, small resource consumption, small output)
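
For reference, the same five categories transcribed into a small data
structure (a sketch; the field names are assumptions, and the coarse
small/moderate/large labels simply restate the list above):

    # Transcription of the five categories; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class JobProfile:
        name: str
        inputs: str   # input files: how many, known in advance or not
        cpu: str      # coarse resource consumption
        output: str   # coarse output volume

    LHC_JOB_CATEGORIES = [
        JobProfile("Monte Carlo generation", "none", "small", "small"),
        JobProfile("Detector simulation", "one file", "moderate",
                   "moderate"),
        JobProfile("Signal reconstruction", "multiple files", "moderate",
                   "large"),
        JobProfile("Data merging", "very many (~400) files", "large",
                   "large"),
        JobProfile("Data analysis", "huge, not known in advance", "small",
                   "small"),
    ]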
The job can have a number of states, e.g.:
1. Job is defined (may require authorisation)
2. Job matched to a site (requires authorisation, detailed site
   information, maybe data availability information)
3. Data are made available to a job (authorisation, probably delegation of
   data staging rights to a staging service)
4. Job processes data (CPU, I/O, network access to external databases
   requiring authorisation)
5. Job places data at [multiple] pre-defined destinations (authorisation,
   contacting external databases, probably delegation to a staging service)
6. Job is finished
7. Job has failed
8. Job is terminated (requires authorisation)
9. Job is resubmitted (authorisation, information)
Each state may have a number of sub-states, depending on the experiment-specific framework.
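
Read as a state machine, the life cycle above might look like the
following sketch; the state names and transitions are a paraphrase of the
numbered list, not any experiment's actual model:

    # Paraphrase of the numbered list as a transition table; real
    # experiment frameworks add sub-states to each of these states.
    from enum import Enum, auto

    class JobState(Enum):
        DEFINED = auto()       # 1. defined
        MATCHED = auto()       # 2. matched to a site
        STAGING_IN = auto()    # 3. data made available
        RUNNING = auto()       # 4. processing data
        STAGING_OUT = auto()   # 5. placing data at destinations
        FINISHED = auto()      # 6.
        FAILED = auto()        # 7.
        TERMINATED = auto()    # 8.

    # FAILED feeds back into DEFINED to model resubmission (state 9).
    TRANSITIONS = {
        JobState.DEFINED:     {JobState.MATCHED},
        JobState.MATCHED:     {JobState.STAGING_IN, JobState.FAILED},
        JobState.STAGING_IN:  {JobState.RUNNING, JobState.FAILED},
        JobState.RUNNING:     {JobState.STAGING_OUT, JobState.FAILED,
                               JobState.TERMINATED},
        JobState.STAGING_OUT: {JobState.FINISHED, JobState.FAILED},
        JobState.FAILED:      {JobState.DEFINED},
        JobState.TERMINATED:  set(),
        JobState.FINISHED:    set(),
    }

    def can_move(src: JobState, dst: JobState) -> bool:
        return dst in TRANSITIONS[src]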
Authorisation may be per Virtual Organisation (file reading), per Role and/or Group within a VO (job types 1-4), per person (job type 5), or even per service (some frameworks accept services as VO members).
Delegation of rights in general is very much needed, because of the large number of auxiliary services, distributed environment and general complexity of the workflow. No-one really knows how to achieve the goals without delegation.
Each experiment has its own framework. Most such frameworks circumvent Grid services because those are too generic. This means that jobs are highly non-trivial, as they attempt to re-implement Grid services such as information gathering and publishing, data movement and registration, resubmission, etc.; they also tend to tweak authorisation by executing payloads of users not authorised by Grid means. This complicates the job state picture even further.
If PGI's outcome makes any of the above-mentioned jobs impossible, most key ARC and gLite customers will not use the PGI specs, and those specs will have only academic value. That was not the PGI goal, as the "P" stands for "Production".
Cheers, Oxana
On 13.05.2010 15:20, Andrew Grimshaw wrote:
Aleksandr, Referring to your sentence/paragraph
"Such "simple" job is very far from being "typical". At least in NorduGrid world AFAIK."
Could you elaborate? I see in my work basically two "types" of jobs that dominate - sets of HTC "parameter space" jobs, and true parallel MPI jobs. In both cases the "job" is basically a command line - either an mpiexec/mpirun of an application with parameters, or a script with parameters. The job has known inputs and outputs, or a directory tree location where it needs to run. The job runs to completion, or it fails; in either case there are output files and result codes. Sometimes the job is a workflow, but when you pick that apart it turns into jobs that have inputs and outputs, along with a workflow engine orchestrating it all.
What is a typical job that you see? When I say "typical" I mean one that covers 80% of the jobs.
A
-----Original Message-----
From: Aleksandr Konstantinov [mailto:aleksandr.konstantinov@fys.uio.no]
Sent: Sunday, May 02, 2010 3:36 PM
To: pgi-wg@ogf.org
Cc: Andrew Grimshaw; 'Oxana Smirnova'; 'Etienne URBAH'; 'David SNELLING'; lodygens@lal.in2p3.fr
Subject: Re: [Pgi-wg] OGF PGI Requirements - Flexibility and clarity versus Rigidity and confusion
Hello,
I agree that the problem is too difficult to solve. One should take into account that the task was initially different. Originally, AFAIR, it was an attempt by a few grid projects to make a common interface suitable for them. Later those were forced into OGF, and the problem escalated to almost unsolvable.
On Saturday, 01 May 2010 15:36, Andrew Grimshaw wrote:
Oxana, Well said.
I would add that I fear we may be trying to solve too many problems the first time around - to "boil the ocean". Completely solving the whole problem is a daunting task indeed, as there are so many issues.
I personally believe we will make more progress if we solve the minimum problem first, e.g., securely run a simple job from infrastructure/sw-stack A on infrastructure/sw-stack B.
This problem is already solved, and it was done in a few ways:

1. Client stacks supporting multiple service stacks
2. BES + GSI
3. Other combinations currently in use

None is fully suitable for real production. So unless the task of PGI is considered purely theoretical, this approach would amount to one more delay.
"Infrastructure/sw-stack A" means a set of resources (e.g., true parallel-Jugene, clusters, sets of desktops) running a middleware stack (e.g., Unicore 6 or Arc) configured a particular way. In the European context this might mean an NGI such as D-Grid with Unicore 6 running a job on NorduGrid running Arc. (Please forgive me if I have the particulars of the NGIs wrong.)
"Simple job" means a job that is typical, not special. This is not to say that its resource requirements are simple, it may have very particular requirements (cores per socket, interconnect, memory), rather I mean that the job processing required is simple: run w/o staging, simple staging,
Such "simple" job is very far from being "typical". At least in NorduGrid world AFAIK.
perhaps client interaction with the session directory pre, post, and during execution. Try to avoid complex job state models that will be hard to agree on, and difficult to implement in some environments.
"Securely" means sufficient authentication information required at B is provided to B in a form it will accept from a policy perspective. Further, that we try as much as possible to avoid a delegation definition that extends inwards beyond the outer boundary of a particular infrastructure/sw-stack.
I'm lost. Is it the delegation or the definition which extends?
(The last sentence is a bit awkward; I personally think that we will need two models of authentication and delegation - a legacy transport-layer mechanism, and a message-layer mechanism based on SAML - and that inside a software stack we cannot expect sw-stacks to change their internal delegation mechanisms.)
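
As a purely illustrative sketch of how the two models could coexist in one
request (none of these names come from a PGI document):

    # Illustrative only: the legacy transport-layer model and a SAML
    # message-layer model carried side by side in one request envelope.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TransportLayerCreds:
        """Legacy model: identity proven at the transport (TLS) layer,
        e.g. an X.509 proxy certificate chain."""
        proxy_chain_pem: str

    @dataclass
    class MessageLayerCreds:
        """Message-layer model: a signed SAML assertion carried inside
        the message, independent of the transport."""
        saml_assertion_xml: str

    @dataclass
    class RequestSecurity:
        transport: Optional[TransportLayerCreds] = None
        message: Optional[MessageLayerCreds] = None

        def acceptable_to(self, wants_message_layer: bool) -> bool:
            """A receiving stack accepts whichever model its policy
            supports; interoperability needs at least one match."""
            if wants_message_layer:
                return self.message is not None
            return self.transport is not None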
I believe authentication/delegation is the most critical item: if we cannot get the authentication/delegation issues solved, the rest is moot with respect to a PRODUCTION environment. We may be able to do demos and stunts while punting on authentication/delegation, but we will not integrate production systems.
Wasn't delegation voted down during the last review?
A.K.