
Balazs, Morris, Luigi, Johannes and all PGI members,

Concerning OGF PGI Use Cases and Scope :

UML diagrams
------------
I have updated, inside
http://forge.gridforum.org/sf/docman/do/listDocuments/projects.pgi-wg/docman...
the source file and the pictures of UML Class and Collaboration Diagrams
(designed with the ArgoUML tool) :

- showing the INTERFACES of Distributed Data Processing which are necessary
  or useful to standardize,

- making it possible to assess the impact of SCOPE and architecture on the
  list of INTERFACES which absolutely must be standardized to permit
  minimum interoperability.

For each interface, I have added the known relevant standard(s) inside
square brackets.

Inside the Collaboration Diagrams, arrows in RED depict relationships NOT
using interfaces. These relationships hinder interoperability and should be
avoided when possible.

Terminology / Vocabulary
------------------------
- INSTRUMENT : Anything generating large sets of scientific data (HEP
  experiment, medical imaging, observation instrument, ...), as taken into
  account by the DORII project

- PAYLOAD : Any Grid or Cloud entity directly useful to a scientist (Data,
  Activity, Instrument, ...)

- SUPPORT : Any other Grid or Cloud entity (Security, Info, Application,
  License, VM image, Log, Accounting, ...)

Security, Info, Log and Accounting are absolutely needed to operate a
Production Grid.

Files
-----
http://forge.gridforum.org/sf/go/doc15977?nav=1  ArgoUML source file
http://forge.gridforum.org/sf/go/doc15978?nav=1  Class Diagram
http://forge.gridforum.org/sf/go/doc15979?nav=1  Abstract Collab. Diagram
http://forge.gridforum.org/sf/go/doc15980?nav=1  Detailed Collab. Diagram
http://forge.gridforum.org/sf/go/doc15981?nav=1  Scope with BES + JSDL
http://forge.gridforum.org/sf/go/doc15982?nav=1  Interoperability with a
                                                 Monolithic Execution Service

Standardization priorities
--------------------------
The collaboration diagrams show in particular that :

- The interfaces for Activity Management are only connected to the
  Instrument Manager, the Activity Managers and the Computing Resources.
  So it is easy to build a gateway bridging Activity Management between two
  completely different Grid or Cloud infrastructures (the EDGeS 3G bridge
  is in full operation between gLite- and BOINC-powered infrastructures).
  Therefore, Activity Management alone does NOT require urgent
  standardization.

- The interfaces for Security, Info, Log and Accounting are directly
  connected to most functionalities. Therefore, interoperability between
  Production Grids absolutely requires urgent standardization of these
  interfaces.

Criticism, remarks and suggestions are welcome.

Best regards.

-----------------------------------------------------
Etienne URBAH
LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200   91898 ORSAY   France
Tel:   +33 1 64 46 84 87
Skype: etienne.urbah
Mob:   +33 6 22 30 53 27
mailto:urbah@lal.in2p3.fr
-----------------------------------------------------

On Fri, 14/05/2010 18:03, Etienne URBAH wrote:
Balazs, Morris, Luigi, Johannes and all PGI members,
Many thanks to Oxana, Aleksandr and Andrew for their contributions to OGF PGI :
We finally come to what we all should have begun with, that is, the USE CASES and the SCOPE.
+-------------+
|  USE CASES  |
+-------------+

I suggest that :
- We all provide clear descriptions of Use Cases inside http://forge.gridforum.org/sf/docman/do/listDocuments/projects.pgi-wg/docman...
In particular, Use Cases for the five LHC categories described by Oxana are welcome. Could the already published GROMACS Use Case be included inside one LHC category ?
- Each Use Case (in particular the GROMACS Use Case) must clearly indicate
  whether the required file stagings should be performed :
  - automatically by the Execution Service (from locations specified inside
    the Job Description), or
  - manually by the Submitter of the Activity, which then requires the
    Activity to be in a 'Hold' state and its 'session directory' (or the
    like) to be published.
  (A small sketch contrasting these two staging modes follows this list.)
- We describe these Use Cases with a graphical UML tool, such as ArgoUML (which runs on MS-Windows, Mac OS X and Linux), and publish the corresponding source files inside GridForge (an open collaborative process).
- Simple example of a Use Case : The Submitter of an Activity MAY be a scientist using a Scientific Grid Portal that permits submission of predefined Applications. Therefore, this Activity Submitter MAY have very little knowledge of Grids, and MAY have very little knowledge of the Application executed as Payload on the Computing Resource.
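
As a purely illustrative sketch (the names below are assumptions, not from
JSDL or any PGI document), the two staging modes could be modeled like
this :

    # Hypothetical sketch contrasting the two file staging modes a Use
    # Case must choose between.  StagingMode, JobDescription and the
    # field names are illustrative assumptions.
    from dataclasses import dataclass, field
    from enum import Enum

    class StagingMode(Enum):
        # The Execution Service stages files itself, from locations
        # specified inside the Job Description.
        AUTOMATIC = "automatic"
        # The Submitter stages files by hand; the Activity must then sit
        # in a 'Hold' state with its session directory published.
        MANUAL = "manual"

    @dataclass
    class JobDescription:
        executable: str
        staging: StagingMode
        # Only meaningful for AUTOMATIC staging: stage-in/stage-out URLs.
        input_urls: list[str] = field(default_factory=list)
        output_urls: list[str] = field(default_factory=list)

    def needs_hold_state(job: JobDescription) -> bool:
        """Manual staging forces a 'Hold' state and a published session
        directory."""
        return job.staging is StagingMode.MANUAL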
+---------+
|  SCOPE  |
+---------+

Execution of Activities or Jobs is only a small part of the 'Distributed
Data Processing' functionalities that a Production Grid is required to
provide.
In particular, we MUST clearly understand the difference in 'Quality of
Service' required by :

- Transient entities (such as Activities or Jobs), which MAY fail at any
  time for any reason,

- Persistent entities (such as Grid resource descriptors, Security
  descriptors, Data sets, Accounting records, Log records, ...), which
  SHOULD be securely kept.
I will publish very soon the source file and the pictures of UML Collaboration Diagrams (designed with the above mentioned ArgoUML tool) showing :
- INTERFACES of Distributed Data Processing which are necessary or useful to standardize
- The impact of ARCHITECTURE on the list of INTERFACES which absolutely must be standardized to permit minimum interoperability.
+------------------------+
|  FOUNDATION STANDARDS  |
+------------------------+

In order to ease mutual understanding and general agreement, we need
Foundation Standards as a sound basis.
Requirement NF6 (162) : JSPG (Security Policies)
-------------------------------------------------
Without agreement on AUTHN and AUTHZ, there is simply NO interoperability.
Requirement IS1 (1) : GLUE model
---------------------------------
- At the Amsterdam meeting, someone said about GLUE : 'Information model
  does not concretely say anything'.

- I have the totally opposite opinion : I strongly suggest using the GLUE
  model (currently GLUE 2.0) as one of these Foundation Standards, and
  describing as many concepts as possible using GLUE entities.
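
As an illustration of describing concepts with GLUE entities, here is a
minimal sketch. The entity names (UserDomain, ComputingEndpoint,
ComputingService) are taken from the GLUE 2.0 specification; the attribute
subset shown is an illustrative assumption, not the full schema :

    # Sketch only: entity names follow GLUE 2.0; attributes are a small
    # illustrative subset of the real schema.
    from dataclasses import dataclass

    @dataclass
    class UserDomain:           # GLUE 2.0 : a set of users, e.g. a VO
        id: str
        name: str

    @dataclass
    class ComputingEndpoint:    # GLUE 2.0 : network location of one
        id: str                 # interface of a computing service
        url: str
        interface_name: str     # e.g. "BES", or a stack-specific value

    @dataclass
    class ComputingService:     # GLUE 2.0 : an abstracted computing
        id: str                 # capability exposing endpoints
        endpoints: list[ComputingEndpoint]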
+----------------------------+
|  TERMINOLOGY / VOCABULARY  |
+----------------------------+

For clarity, I suggest the following definitions :
Client
------
Holder of credentials belonging to a member of a GLUE UserDomain. For
example, a Client MAY submit an Activity to an Execution Service, or query
the Status of an Activity.
Activity = Job (full synonyms)
------------------------------
Remote processing which a Client describes in a 'Job Description', which
the Client then submits to an Execution Service.
Payload
-------
Anything (Application, Script, Pilot Job, ...) executed by a Computing
Resource on request of the Activity. The Payload MAY be completely unaware
that it is executed inside a Grid Activity.
Simple Activity
---------------
Simple Job Description containing only ONE local job executed by only ONE
batch system, WITHOUT 'Hold' states NOR manual staging.

- This is a suggested restrictive evolution of requirement JM5 (55).

- An Activity requiring 'Hold' states and/or manual staging is then called
  an 'Interactive Activity'.
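
A minimal sketch (field and function names are illustrative assumptions,
not from any requirement document) of the classification rule implied by
this definition :

    # Illustrative only: field and function names are assumptions.
    from dataclasses import dataclass

    @dataclass
    class ActivityDescription:
        local_jobs: int            # local jobs in the Job Description
        batch_systems: int         # batch systems involved
        uses_hold_state: bool      # Activity pauses in a 'Hold' state
        uses_manual_staging: bool  # Submitter stages files by hand

    def is_simple(a: ActivityDescription) -> bool:
        """Simple Activity: ONE local job on ONE batch system, without
        'Hold' states or manual staging; anything else is Interactive."""
        return (a.local_jobs == 1
                and a.batch_systems == 1
                and not a.uses_hold_state
                and not a.uses_manual_staging)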
Criticism, remarks and suggestions are welcome.
Best regards.
-----------------------------------------------------
Etienne URBAH
LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200   91898 ORSAY   France
Tel:   +33 1 64 46 84 87
Skype: etienne.urbah
Mob:   +33 6 22 30 53 27
mailto:urbah@lal.in2p3.fr
-----------------------------------------------------
On 13/05/2010 21:00, Oxana Smirnova wrote:
Hi Andrew, all,
Allow me to start from the very beginning, to explain the "typical" workload Aleksandr referred to.
Both ARC and gLite "grew" from the requirements of the High Energy Physics community, more specifically, those of the LHC experiments. I'll come back later to what this means in practice.
The basic difference between gLite and ARC starting requirements is that gLite is designed for resources owned or controlled to a large extent by their users, while ARC is designed to support resources shared between different users and controlled by fairly independent resource providers.
The immediate difference is policies: while the gLite community is largely expected to comply with policies devised by the LHC Grid (WLCG) Joint Security and Policy Group, the ARC community has no single set of policies. ARC sites that contribute to LHC computing generally tend to respect the WLCG policies, but not too closely, giving priority to the local policies of resource owners. Needless to say, this introduces extra complexity into the requirements and reduces the share of simple use cases.
Now, the LHC experiments are huge communities, between 500 and 3000 members each. All well-educated, computer-savvy people who never hesitate to come up with their own brilliant solutions. Even within one experiment the divergence is huge, and there are sites that support 4 experiments. This adds to the complexity: not only do we have diverging or contradictory requirements from resource owners, we also have all sorts of diverging requirements from users. The least common denominator is 1: meaning "ping", because even "hello world" is ambiguous - where do you send the output, to a file or to standard output? If it is standard output, do you write it to a log? Who said "hello world", the individual user or the whole experiment? Do we have a right to log individual activities at all? And so on.
There are several basic kinds of typical jobs run by LHC experiments; in general they can be separated into 5 categories:
1. Monte Carlo generation (no input, small resource consumption, small
   output)
2. Detector simulation (one input file, moderate resource consumption,
   moderate output)
3. Signal reconstruction (multiple input files, moderate resource
   consumption, large output)
4. Data merging (very many input files - like 400, large resource
   consumption, large output)
5. Data analysis (huge number of input files, not necessarily known in
   advance, small resource consumption, small output)
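
For reference, the same five categories transcribed into a small data
structure (a sketch; the field names are assumptions, and the coarse
small/moderate/large labels simply restate the list above):

    # Transcription of the five categories; illustrative only.
    from dataclasses import dataclass

    @dataclass
    class JobProfile:
        name: str
        inputs: str   # input files: how many, known in advance or not
        cpu: str      # coarse resource consumption
        output: str   # coarse output volume

    LHC_JOB_CATEGORIES = [
        JobProfile("Monte Carlo generation", "none", "small", "small"),
        JobProfile("Detector simulation", "one file", "moderate",
                   "moderate"),
        JobProfile("Signal reconstruction", "multiple files", "moderate",
                   "large"),
        JobProfile("Data merging", "very many (~400) files", "large",
                   "large"),
        JobProfile("Data analysis", "huge, not known in advance", "small",
                   "small"),
    ]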
The job can have a number of states, e.g.:
1. Job is defined (may require authorisation)
2. Job matched to a site (requires authorisation, detailed site
   information, maybe data availability information)
3. Data are made available to a job (authorisation, probably delegation of
   data staging rights to a staging service)
4. Job processes data (CPU, I/O, network access to external databases
   requiring authorisation)
5. Job places data at [multiple] pre-defined destinations (authorisation,
   contacting external databases, probably delegation to a staging service)
6. Job is finished
7. Job has failed
8. Job is terminated (requires authorisation)
9. Job is resubmitted (authorisation, information)
Each state may have a number of sub-states, depending on the experiment-specific framework.
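
Read as a state machine, the life cycle above might look like the
following sketch; the state names and transitions are a paraphrase of the
numbered list, not any experiment's actual model:

    # Paraphrase of the numbered list as a transition table; real
    # experiment frameworks add sub-states to each of these states.
    from enum import Enum, auto

    class JobState(Enum):
        DEFINED = auto()       # 1. defined
        MATCHED = auto()       # 2. matched to a site
        STAGING_IN = auto()    # 3. data made available
        RUNNING = auto()       # 4. processing data
        STAGING_OUT = auto()   # 5. placing data at destinations
        FINISHED = auto()      # 6.
        FAILED = auto()        # 7.
        TERMINATED = auto()    # 8.

    # FAILED feeds back into DEFINED to model resubmission (state 9).
    TRANSITIONS = {
        JobState.DEFINED:     {JobState.MATCHED},
        JobState.MATCHED:     {JobState.STAGING_IN, JobState.FAILED},
        JobState.STAGING_IN:  {JobState.RUNNING, JobState.FAILED},
        JobState.RUNNING:     {JobState.STAGING_OUT, JobState.FAILED,
                               JobState.TERMINATED},
        JobState.STAGING_OUT: {JobState.FINISHED, JobState.FAILED},
        JobState.FAILED:      {JobState.DEFINED},
        JobState.TERMINATED:  set(),
        JobState.FINISHED:    set(),
    }

    def can_move(src: JobState, dst: JobState) -> bool:
        return dst in TRANSITIONS[src]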
Authorisation may be per Virtual Organisation (file reading), per Role and/or Group within a VO (job types 1-4), per person (job type 5), or even per service (some frameworks accept services as VO members).
Delegation of rights in general is very much needed, because of the large number of auxiliary services, distributed environment and general complexity of the workflow. No-one really knows how to achieve the goals without delegation.
Each experiment has its own framework. Most such frameworks circumvent Grid services because those are too generic. This means that jobs are highly non-trivial, as they attempt to re-implement Grid services such as information gathering and publishing, data movement and registration, resubmission, etc.; they also tend to tweak authorisation by executing payloads of users not authorised by Grid means. This complicates the job state picture even further.
If PGI's outcome makes any of the above-mentioned jobs impossible, most key ARC and gLite customers will not use the PGI specs, and those specs will have only academic value. That was not the PGI goal, as the "P" stands for "Production".
Cheers, Oxana
On 13.05.2010 15:20, Andrew Grimshaw wrote:
Aleksandr, Referring to your sentence/paragraph
"Such "simple" job is very far from being "typical". At least in NorduGrid world AFAIK."
Could you elaborate? I see in my work basically two "types" of jobs that dominate - sets of HTC "parameter space" jobs, and true parallel MPI jobs. In both cases the "job" is basically a command line - either an mpiexec/mpirun of an application with parameters, or a script with parameters. The job has known inputs and outputs, or a directory tree location where it needs to run. The job runs to completion, or it fails; in either case there are output files and result codes. Sometimes the job is a workflow, but when you pick that apart it turns into jobs that have inputs and outputs, along with a workflow engine orchestrating it all.
What is a typical job that you see? When I say "typical" I mean one that covers 80% of the jobs.
A
-----Original Message-----
From: Aleksandr Konstantinov [mailto:aleksandr.konstantinov@fys.uio.no]
Sent: Sunday, May 02, 2010 3:36 PM
To: pgi-wg@ogf.org
Cc: Andrew Grimshaw; 'Oxana Smirnova'; 'Etienne URBAH'; 'David SNELLING'; lodygens@lal.in2p3.fr
Subject: Re: [Pgi-wg] OGF PGI Requirements - Flexibility and clarity versus Rigidity and confusion
Hello,
I agree that the problem is too difficult to solve. One should take into account that the task was initially different. Originally, AFAIR, it was an attempt by a few grid projects to make a common interface suitable for them. Later those were forced into OGF, and the problem escalated to almost unsolvable.
On Saturday, 01 May 2010 15:36, Andrew Grimshaw wrote:
Oxana, Well said.
I would add that I fear we may be trying to solve too many problems the first time around - to "boil the ocean". Completely solving the whole problem is a daunting task indeed, as there are so many issues.
I personally believe we will make more progress if we solve the minimum problem first, e.g., securely run a simple job from infrastructure/sw-stack A on infrastructure/sw-stack B.
This problem is already solved, and it was done in a few ways:

1. Client stacks supporting multiple service stacks
2. BES + GSI
3. Other combinations currently in use

None is fully suitable for real production. So unless the task of PGI is considered purely theoretical, this approach would amount to one more delay.
"Infrastructure/sw-stack A" means a set of resources (e.g., true parallel-Jugene, clusters, sets of desktops) running a middleware stack (e.g., Unicore 6 or Arc) configured a particular way. In the European context this might mean an NGI such as D-Grid with Unicore 6 running a job on NorduGrid running Arc. (Please forgive me if I have the particulars of the NGIs wrong.)
"Simple job" means a job that is typical, not special. This is not to say that its resource requirements are simple, it may have very particular requirements (cores per socket, interconnect, memory), rather I mean that the job processing required is simple: run w/o staging, simple staging,
Such "simple" job is very far from being "typical". At least in NorduGrid world AFAIK.
perhaps client interaction with the session directory pre, post, and during execution. Try to avoid complex job state models that will be hard to agree on, and difficult to implement in some environments.
"Securely" means sufficient authentication information required at B is provided to B in a form it will accept from a policy perspective. Further, that we try as much as possible to avoid a delegation definition that extends inwards beyond the outer boundary of a particular infrastructure/sw-stack.
I'm lost. Is it the delegation or the definition which extends?
(The last sentence is a bit awkward; I personally think that we will need two models of authentication and delegation - a legacy transport-layer mechanism, and a message-layer mechanism based on SAML - and that inside a software stack we cannot expect sw-stacks to change their internal delegation mechanisms.)
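
As a purely illustrative sketch of how the two models could coexist in one
request (none of these names come from a PGI document):

    # Illustrative only: the legacy transport-layer model and a SAML
    # message-layer model carried side by side in one request envelope.
    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class TransportLayerCreds:
        """Legacy model: identity proven at the transport (TLS) layer,
        e.g. an X.509 proxy certificate chain."""
        proxy_chain_pem: str

    @dataclass
    class MessageLayerCreds:
        """Message-layer model: a signed SAML assertion carried inside
        the message, independent of the transport."""
        saml_assertion_xml: str

    @dataclass
    class RequestSecurity:
        transport: Optional[TransportLayerCreds] = None
        message: Optional[MessageLayerCreds] = None

        def acceptable_to(self, wants_message_layer: bool) -> bool:
            """A receiving stack accepts whichever model its policy
            supports; interoperability needs at least one match."""
            if wants_message_layer:
                return self.message is not None
            return self.transport is not None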
I believe authentication/delegation is the most critical item: if we cannot get the authentication/delegation issues solved, the rest is moot with respect to a PRODUCTION environment. We may be able to do demos and stunts while punting on authentication/delegation, but we will not integrate production systems.
Wasn't delegation voted down during the last review?
A.K.