Balazs, Morris and all,
Concerning the last OGF PGI telephone conference on 05 August 2009:
Meeting notes
-------------
I see NO meeting notes about this telephone conference at
http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/disc…
So I am working with my own (fragmentary) notes.
For all future OGF PGI telephone conferences, would it be possible for a
secretary or a chair to take meeting notes, write them up in an
understandable form, and publish them at the above-mentioned page?
Creation of a 'Submitted:Hold' substate?
----------------------------------------
First, as general rules, I consider that:
- In order to AVOID keeping (potentially large) grid resources while
NOT computing, grid Jobs should be designed to be processed completely
automatically, with NO provision for 'Hold' substates,
- A grid Job needing many 'Hold' substates can NOT be handled by an
automatic Submitter, but should be submitted by a human grid User as an
'Interactive Job', as described for example at
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000…
Someone asked for the creation of a 'Hold' substate inside the
'Submitted' state, similar to the 'Hold' substates of other states.
This 'Submitted:Hold' substate would make sense only if the Job
Submitter could perform an operation on this substate.
In order to request such an operation, the Job Submitter needs the Jobid
(or Job EPR).
This Jobid (or Job EPR) is guaranteed to be allocated by the Execution
Service only at the END of the 'Submitted' state, but NOT before.
Therefore, I consider that the 'Submitted' state can NOT contain a
'Hold' substate.
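The argument can be illustrated with a toy model. It assumes that the job creation call returns the Jobid only once the 'Submitted' state is over, and that the next state is 'Pre-processing'. The class name, the method names and the ':Hold' suffix are hypothetical illustrations, not operations from the strawman interface.

```python
import itertools


class ToyExecutionService:
    """Hypothetical model: a Job only becomes addressable once 'Submitted' ends."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.states = {}  # Jobid -> current state name

    def create_activity(self, job_description):
        # The whole body of this call stands for the 'Submitted' state: the
        # Submitter holds no Jobid yet, so it cannot target the Job at all.
        jobid = f"job-{next(self._counter)}"
        # The Jobid is allocated and handed back only at the END of 'Submitted'.
        self.states[jobid] = "Pre-processing"
        return jobid

    def hold(self, jobid):
        # Every operation needs a Jobid, so the earliest state a Hold request
        # can affect is the one that follows 'Submitted'.
        if jobid not in self.states:
            raise KeyError(f"unknown Jobid {jobid!r}")
        self.states[jobid] += ":Hold"
        return self.states[jobid]
```

For example, `svc.hold(svc.create_activity("<JobDefinition/>"))` can only ever yield a Hold substate of 'Pre-processing', never of 'Submitted'.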
If anyone thinks otherwise, could he/she please present a convincing Use
Case?
Clarifications about the 'Finished with Success or Error' state
---------------------------------------------------------------
Someone asked that the 'Error' case of the 'Finished with Success or
Error' state should be moved to the 'Failed' state.
In fact, inside the current Job State Model, a Job reaches the 'Finished
with Success or Error' state if and only if it has successively reached the
end of the following states, without failure or cancellation at the JOB level:
- 'Pre-processing'
- 'Delegated', whatever the Application result:
  - Success = Application return code equal to zero
  - Error = Application return code different from zero
- 'Post-processing'
Inside the 'Finished with Success or Error' state:
- Success means 'Application return code was equal to zero',
- Error means 'Application return code was different from zero'.
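The rule above can be written as a small function. This is a sketch under stated assumptions: the `JobRecord` fields and the labels 'Failed', 'Finished:Success' and 'Finished:Error' are hypothetical, and cancellation is not modelled.

```python
from dataclasses import dataclass
from typing import Optional

# Stages a Job must complete, in order, to reach 'Finished with Success or Error'.
STAGES = ("Pre-processing", "Delegated", "Post-processing")


@dataclass
class JobRecord:
    # Hypothetical record: the stage where an unrecoverable JOB-level failure
    # occurred (None if every stage completed) and the Application return code
    # reported at the end of 'Delegated' (None if never reached).
    failed_stage: Optional[str]
    return_code: Optional[int]


def terminal_state(job: JobRecord) -> str:
    if job.failed_stage is not None:
        if job.failed_stage not in STAGES:
            raise ValueError(f"unknown stage {job.failed_stage!r}")
        # A JOB-level failure wins, whatever the Application returned.
        return "Failed"
    if job.return_code is None:
        raise ValueError("a completed 'Delegated' stage must report a return code")
    # All stages completed: the Application result only selects the substate.
    return "Finished:Success" if job.return_code == 0 else "Finished:Error"
```

Note that a failure during 'Post-processing' leads to 'Failed' even when the Application returned zero, because the failure is at the JOB level.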
I copied this behavior from the Job State Model of 'gLite', where the
'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as
can be seen in the 'bookkeeping information' at
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000…
I consider this design, and the strong separation between the
'Failed' and 'Finished with Success or Error' states, as fully justified
for the following reasons:
- Whenever a Job reaches the 'Failed' state, the grid Execution Service
  has detected an unrecoverable inconsistency at the JOB level.
  Therefore, the Job output sandbox and the post-processed Application
  output files may be NOT consistent, and may NOT even be accessible
  by the Job Submitter.
  In order to investigate the Job failure, the grid User then needs
  some grid knowledge (and often experience and expertise) to retrieve and
  interpret:
  - the Job failure code and message,
  - the Job logging and bookkeeping, in comparison with the Job
    description.
  This 'grid level' investigation may sometimes show that the Job failure
  was caused by the Application, but it is ALWAYS necessary.
- Whenever a Job reaches the 'Finished with Success or Error' state,
  the grid Execution Service was able to create the Job output sandbox and
  to perform post-processing on the Application output files, WITHOUT
  detecting any unrecoverable inconsistency at the JOB level.
  Therefore, the Job output sandbox and the post-processed Application
  output files can be assumed to be consistent and easily accessible by
  the Job Submitter.
  On a non-zero return code of the Application, the grid User:
  - first has to look (WITHOUT needing any grid knowledge) at the Job
    output sandbox and at the post-processed Application output files for
    an Application problem,
  - before, if necessary, using grid knowledge (and often experience
    and expertise) to provide evidence that the Application error was
    caused by a faulty Job description, the Batch system, or the grid
    Execution Service.
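To make the two investigation paths concrete, here is a small sketch that maps a terminal state to the order in which the grid User would inspect things, following the reasoning above. The state labels 'Failed', 'Finished:Success' and 'Finished:Error' and the function name are hypothetical; they are not part of the Job State Model text.

```python
from typing import List


def investigation_order(state: str) -> List[str]:
    """Return what the grid User should inspect, in order (hypothetical labels)."""
    if state == "Failed":
        # Outputs may be inconsistent or unreachable: start at the grid level.
        return [
            "Job failure code and message",
            "Job logging and bookkeeping, compared with the Job description",
        ]
    if state == "Finished:Error":
        # Outputs are assumed consistent: start with the Application itself.
        return [
            "Job output sandbox",
            "post-processed Application output files",
            "only if needed: Job description, Batch system, Execution Service",
        ]
    if state == "Finished:Success":
        return []
    raise ValueError(f"not a terminal state: {state!r}")
```

The asymmetry is the point of the design: only the 'Failed' path starts with grid-level artefacts.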
In summary, I consider that the 'Error' case of the 'Finished with
Success or Error' state should be kept as it is, and NOT be moved to the
'Failed' state.
If anyone thinks otherwise, could he/she please present convincing reasons?
Strawman Rendering
------------------
I will work on the ODT version of 'Strawman Rendering' at
http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to:
- include the above precisions on states,
- include the 'Types of grid Jobs' section of my 'PGI Execution Service
Overview' document,
- check consistency, and present the relationships between the
operations described in chapter 2 'Interface: Execution Port-Type' and
the different states of the different types of grid Jobs.
I will join via +9900827049931906 (and perhaps Skype text chat) on Friday
14 August 2009 at 16:00 CET.
Best regards.
-----------------------------------------------------
Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200 91898 ORSAY France
Tel: +33 1 64 46 84 87 Skype: etienne.urbah
Mob: +33 6 22 30 53 27 mailto:urbah@lal.in2p3.fr
-----------------------------------------------------
Dear all,
this mail starts the thread about the lease operation.
Please have a look at the document INFN-TC-09-3.pdf (pp. 12, 15),
available at http://www.lnf.infn.it/sis/preprint/detail.php?id=5147,
which describes the concept of the lease mechanism in CREAM.
best regards,
Luigi
Dear all,
please see the notes from today's meeting below.
Best,
Johannes
Participants:
Luigi, David, Johannes, Morris, Etienne
1) Look at the minutes from last time.
2) Discussion:
createActivity action
createMultipleActivity
David: ordinary JSDL? - means not going to a new JSDL
-> meet in the middle
David:
SRM, SRB
ftp, gridftp?
Morris:
ftp, gridftp still in scope
SRB partly the same approach
document: initial version not covering data staging
overall overview out for two months, only a few comments from Etienne
no crucial comments
need for reference implementation
PGI profile on top of BES 2.0
proceed with the reference implementation and get out with the PGI
writings in parallel to the implementation
parts from Andrew, parts from other standards
Go to document:
Execution port type:
Any questions about the execution port type interface?
Continue with 2nd chapter:
Action for Balazs and Morris:
GLUE 2 tunings, execution port type
Action for Balazs and Morris:
fault mechanism
createActivity: discussion of possible faults
parameter sweep
no agreement on the transaction approach, not wanted currently
browsing all EPRs
Morris:
postpone transaction
go into the vector operation
Luigi:
transaction is not important at the moment
all jobs are single jobs, not interested in grouping jobs at the moment
no reason to change transaction
Etienne:
out of scope
Morris:
it is not only grouping - it is a bit more
would not call out of scope - not every partner is represented today
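One way to read "postpone transaction, go into the vector operation" is a vector createActivity without transaction semantics, e.g. for a parameter sweep. The sketch below assumes that reading; `create_activities`, `submit_one` and `SubmissionFault` are hypothetical names, not operations from the strawman document.

```python
class SubmissionFault(Exception):
    """Hypothetical per-item fault raised by the single-job submit function."""


def create_activities(job_descriptions, submit_one):
    """Non-transactional vector submission (sketch).

    Each description is submitted on its own; a fault on one item is
    recorded and does NOT roll back or block the other items.
    """
    results = []
    for description in job_descriptions:
        try:
            # On success, submit_one is assumed to return the new Job's EPR.
            results.append(("ok", submit_one(description)))
        except SubmissionFault as fault:
            results.append(("fault", str(fault)))
    return results
```

A transactional variant would instead undo the already-created Jobs as soon as any item fails; that is the behaviour the call decided to postpone.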
2)
createActivity operation
stresses changeability of BES, OGSA-BES2.0
how much BES has to be tweaked to move to our approach
in the reference implementation epr
is there some problem with this?
no
data staging
how could data staging work?
extend a little bit
partly covered by PGI service
extend the usage of credentials
Action for Morris: security aspects for credentials in data staging
typical way of doing
JSDL
not invent something new
provide more protocols
hierarchical approach to provide more functionality
data push from a client perspective
users from ARC, gLite mentioned this
partly related to the state model
Action for Etienne:
how much of the createActivity operation is covered by the state model?
(client initiated data staging)
other side effect:
hold points
continue operation
postpone and continue operation
keep out of the question now
focus on different aspects
figure of the process published (Morris)
Action for Morris:
createActivity figure of process should be delivered
Action for Morris: ask Shahbaz dataurl createActivity
Requests:
fairly easy
AGU JSDL working title
quite a long discussion about what activities are valid
look behind the response
JSDL element is dropped - no error
other implementations return an error if dropped.
have a look at Job description document validation
XML valid or not?
next level: check document according to the schema
are the semantics the same as we understand them?
important to check the semantic
services do not really support some parts of the description
the information about what kinds of data staging are provided should be
available somewhere
agreed to have this list; how can this be nailed down?
Action to all: what is meant by service capabilities?
link to GLUE2? influence?
final discussion was about what is understood by matchmaking
never ignore things
if we take this JSDL doc - in the past simply ignored it
leading to the point -> never a fault -> user thought everything is fine
consensus: not ignore things any more
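The validation levels and the "never ignore things" consensus can be sketched with Python's standard library. Assumptions: the JSDL 1.0 namespace is used (the AGU JSDL under discussion may differ), the set of supported elements stands in for the service capabilities and is hypothetical, and schema validation is omitted because it needs a validating parser that the standard library does not provide.

```python
import xml.etree.ElementTree as ET

JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"

# Hypothetical service capabilities: qualified names of the elements this
# Execution Service actually honours.
SUPPORTED = {f"{{{JSDL_NS}}}{name}" for name in (
    "JobDefinition", "JobDescription", "JobIdentification", "JobName", "Application")}


class UnsupportedElementFault(Exception):
    """Raised instead of silently dropping an element the service cannot honour."""


def local_name(tag):
    # ElementTree tags look like "{namespace}LocalName".
    return tag.rsplit("}", 1)[-1]


def check_job_description(document):
    # Level 1: is the XML well-formed? ET.ParseError propagates to the caller.
    root = ET.fromstring(document)
    # Level 2 (schema validation) is omitted in this sketch.
    # Level 3: semantic check against the service capabilities.
    unsupported = sorted({local_name(el.tag) for el in root.iter()
                          if el.tag not in SUPPORTED})
    if unsupported:
        # Never ignore: report every unsupported element back to the user.
        raise UnsupportedElementFault(", ".join(unsupported))
    return root
```

Elements from any other namespace (e.g. an extension the service does not know) are rejected rather than dropped, so the user never believes a request was honoured when it was not.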
submit state:
Action for Etienne and Luigi: have a look at submit state
email with answer
email discussion will go on.
Question to Luigi:
last tbd in createactivity: lease?
Luigi:
each job has an attribute for basically the time to live of the job
renewed by the client as long as the client and the service can talk
when lease expires the job is removed from both sides
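The lease as described here (a per-job time to live, renewed by the client, with the job removed on expiry) can be sketched as an in-memory table. This is a hypothetical illustration, not CREAM's actual interface: the class and method names are made up, only the service side is modelled, and, consistent with the answer given below, the lease is independent of any proxy lifetime, so proxy renewal is not modelled.

```python
import time


class LeaseTable:
    """Hypothetical service-side lease table: Jobid -> absolute expiry time."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock  # injectable so the behaviour can be tested
        self._expiry = {}

    def register(self, jobid, ttl):
        self._expiry[jobid] = self._clock() + ttl

    def renew(self, jobid, ttl):
        # The client extends the lease for as long as it can reach the service.
        now = self._clock()
        expiry = self._expiry.get(jobid)
        if expiry is None or expiry <= now:
            self._expiry.pop(jobid, None)
            raise KeyError(f"lease of {jobid!r} has expired")
        self._expiry[jobid] = now + ttl

    def purge_expired(self):
        # Jobs whose lease ran out are removed; returns their Jobids.
        now = self._clock()
        expired = [jobid for jobid, expiry in self._expiry.items() if expiry <= now]
        for jobid in expired:
            del self._expiry[jobid]
        return expired

    def __contains__(self, jobid):
        return jobid in self._expiry
```

If the client stops renewing (e.g. it crashed or lost connectivity), the next purge removes the job, which is the behaviour described above.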
Action for Luigi: email:
Subject createActivity operation, lease
start email thread
AOB?
Etienne: lease?
what is the relationship with the lifetime of the proxy?
Luigi:
not related, but depends on the implementation
proxy can be renewed and lease can also be renewed
no true relationship between lease and proxy
-> include in email.
Actions for Johannes:
action list also on email
remind people of their tasks before the next meeting
David:
push Morris comments on BES
--
_ _ _ _ _ _ Johannes Watzl
|\/| |\ | |\/| Institut für Informatik / Dept. of CS
| | | \| | | Ludwig-Maximilians-Universität München
======= TEAM ======= Oettingenstr. 67, 80538 Munich, Germany
Room D0.5, Phone +49-89-2180-9162
Munich Network Management Team Email: watzl(a)nm.ifi.lmu.de
Münchner Netz-Management Team http://www.nm.ifi.lmu.de/~watzl
Dear all,
please see below the actions defined in the meeting today
Best,
Johannes
- Action for Balazs and Morris:
GLUE 2 tunings, execution port type
Action for Balazs and Morris:
fault mechanism
in the reference implementation epr
- Action for Morris: security aspects for credentials in data staging
- Action for Etienne:
how much of the createActivity operation is covered by the state model?
(client initiated data staging)
- Action for Morris:
createActivity figure of process should be delivered
- Action for Morris: ask Shahbaz dataurl createActivity
- Action to all: what is meant by service capabilities?
link to GLUE2? influence?
- Action for Etienne and Luigi: have a look at submit state
email with answer
email discussion will go on.
- Action for Luigi: email:
Subject createActivity operation, lease
start email thread
Dear All,
The next call is scheduled according to our usual timetable:
16:00 CET, Friday 9th October
Proposed agenda:
Continue the discussion of the AGU strawman document, which was started last
Friday (btw, Johannes, please send out the meeting notes).
In particular the CreateActivity and ChangeStatusActivity operations will be
discussed. The latter will pull in the state model as well.
The document is available from the GridForge; v0.36, which contains minor
editorial fixes, has just been uploaded:
AGU Strawman_rendering
http://forge.gridforum.org/sf/docman/do/downloadDocument/projects.pgi-wg/do…
Call details (the usual hidefconferencing facility):
Friday, 9th October at 16:00 CET (duration: 1 hour)
via Skype call to +9900827049931906 (free of charge), or via ordinary phone
numbers (local rates) with the 9931906 conference number:
Austria 0820 401 15470
Belgium 0703 57 134
France 0826 109 071
Germany +49 (0) 180 500 9527
Switzerland 0848 560 397
Regards,
Balazs
--
Balázs Kónya
NorduGrid Collaboration
http://www.nordugrid.org
Lund University balazs.konya(a)hep.lu.se
High Energy Physics phone: +46 46 222 8049
BOX 118, S - 221 00 LUND, Sweden fax: +46 46 222 4015
Dear All,
This is a reminder about the Friday 2nd October PGI call, where we are going to
start the discussion of v0.35 of the Strawman rendering document:
http://forge.ogf.org/sf/go/doc15628?nav=1
Call details (the usual hidefconferencing facility):
Friday, 2nd October at 16:00 CET (duration: 1 hour)
via Skype call to +9900827049931906 (free of charge), or via ordinary phone
numbers (local rates) with the 9931906 conference number:
Austria 0820 401 15470
Belgium 0703 57 134
France 0826 109 071
Germany +49 (0) 180 500 9527
Switzerland 0848 560 397
Regards,
Balazs Konya
Morris Riedel wrote on 2009-09-10:
> Hi PGI team,
>
> we agreed in one of the last telcons to have a rather long comment period on
> the PGI input documents mentioned below in my first e-mail from 31 of July.
>
> We also discussed starting the discussions in mid to late September;
> however, since we have not yet received any comments, we are extending the
> group-internal comment period until the start of October. But then it's
> really time to proceed with our efforts.
>
> Hence, we should have the discussions about these PGI input
> documents on...
>
> 2nd October 16:00 CET
>
> Please plan accordingly and we encourage folks to provide extensive comments
> in advance so that the telcons can run more effectively.
>
> Take care, Morris & Balázs
>
> -----Original Message-----
> From: pgi-wg-bounces(a)ogf.org [mailto:pgi-wg-bounces@ogf.org] On Behalf Of Morris Riedel
> Sent: Friday, July 31, 2009 12:30 PM
> To: pgi-wg(a)ogf.org
> Subject: [Pgi-wg] Update of some PGI input documents
>
> Hi PGI team,
>
> some team members of the middlewares ARC, gLite, and UNICORE (AGU) found a
> bit of time to update their input documents to the PGI process:
>
> (1) To ensure an open process, we put the updated version into the PGI OGF
> space:
>
> A) strawman_rendering (http://forge.ogf.org/sf/go/doc15628?nav=1)
>
> Note that the document still has some limitations, such as: a) the job state
> model is not fully integrated and thus contradicts the rest of the rendering
> document, since we also have some inputs for its discussion in the next
> weeks; b) some formatting glitches, e.g. it would be nice to have numbered
> section headers, and the table of contents contains sections down to level 5;
> c) agreements exist only between ARC, gLite and UNICORE, while a broader
> consensus with the other interested members of PGI (i.e. GENESIS, EDGES) is
> missing and remains to be discussed; d) there are still some open questions
> that should be discussed in the broader PGI community.
>
> B) strawman_functionality (http://forge.ogf.org/sf/go/doc15736?nav=1)
>
> Note that this document only got a minor update.
>
> C) agu_jsdl.xml (http://forge.ogf.org/sf/go/doc15737?nav=1)
>
>
> Focused on needed functionality - not existing JSDL with extensions.
>
>
> (2) Since we don't want to interrupt the currently ongoing fruitful state
> model discussions, we expect to continue with them and would suggest that
> members of PGI collect feedback about these updated documents over the next
> weeks/vacation period.
>
> Once the state model has become reasonably stable, we can work on the
> feedback on the updated documents together and come to agreements between us all.
>
> (3) Since we don't want an emerging specification that is not actually
> implemented, we are about to start a reference implementation of it that
> might be demonstrated at OGF or other conferences, including the changes
> resulting from the open discussion process among the broader PGI community.
>
> With kind regards,
> Morris Riedel, Balazs Konya, Moreno Marzolla