Balazs, Morris and all,
Concerning the last OGF PGI telephone conference on 05 August 2009 :
Meeting notes
-------------
I see NO meeting notes about this telephone conference at
http://forge.gridforum.org/sf/discussion/do/listTopics/projects.pgi-wg/disc…
So I am working with my own (fragmentary) notes.
For all future OGF PGI telephone conferences, would it be possible for a
secretary or a chair to take meeting notes, write them up in an
understandable form, and publish them at the above-mentioned page?
Creation of a 'Submitted:Hold' substate ?
-----------------------------------------
First, as general rules, I consider that :
- In order to AVOID keeping (potentially large) grid resources while
NOT computing, grid Jobs should be designed to be processed completely
automatically, with NO provision for 'Hold' substates,
- A grid Job needing many 'Hold' substates can NOT be handled by an
automatic Submitter, but should be submitted by a human grid User as an
'Interactive Job', as described for example at
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084400000…
Someone asked for the creation of a 'Hold' substate inside the
'Submitted' state, like inside other states.
This 'Submitted:Hold' substate would make sense only if the Job
Submitter could perform an operation on this substate.
In order to request such an operation, the Job Submitter needs the Jobid
(or Job EPR).
This Jobid (or Job EPR) is guaranteed to be allocated by the Execution
Service only at the END of the 'Submitted' state, but NOT before.
Therefore, I consider that the 'Submitted' state can NOT contain a
'Hold' substate.
If anyone thinks otherwise, can he/she please present a convincing Use
Case ?
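To illustrate the argument above, here is a minimal Python sketch. It is only an illustration under stated assumptions: the names ExecutionService, submit, accept and hold are hypothetical and are not taken from any PGI document, and the 'Pre-processing:Hold' substate name is assumed by analogy with the 'Hold' substates of other states. It shows that an operation addressed by Jobid cannot reach a Job that is still in the 'Submitted' state.

```python
import uuid
from dataclasses import dataclass
from typing import Dict, Optional


class NoJobIdError(Exception):
    """Raised when an operation is requested without a usable Jobid."""


@dataclass
class Job:
    state: str = "Submitted"
    job_id: Optional[str] = None  # not guaranteed to exist during 'Submitted'


class ExecutionService:
    """Hypothetical stand-in for a grid Execution Service."""

    def __init__(self) -> None:
        self.jobs: Dict[str, Job] = {}

    def submit(self) -> Job:
        # The Job enters 'Submitted' without any Jobid.
        return Job()

    def accept(self, job: Job) -> str:
        # END of 'Submitted': the Jobid is allocated and the Job moves on.
        job.job_id = str(uuid.uuid4())
        job.state = "Pre-processing"
        self.jobs[job.job_id] = job
        return job.job_id

    def hold(self, job_id: Optional[str]) -> None:
        # Operations are addressed by Jobid, so a Job without one is unreachable.
        if job_id is None or job_id not in self.jobs:
            raise NoJobIdError("no Jobid: the Job cannot be addressed")
        job = self.jobs[job_id]
        job.state = job.state + ":Hold"  # assumed substate naming
```

In this sketch, calling hold() before accept() always raises NoJobIdError, which is the reason given above for rejecting a 'Submitted:Hold' substate.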
Clarifications on the 'Finished with Success or Error' state
------------------------------------------------------------
Someone asked that the 'Error' case of the 'Finished with Success or
Error' state should be moved to the 'Failed' state.
In fact, inside the current Job State Model, a Job reaches the 'Finished
with Success or Error' state if and only if it successively reached the
end of the following states, without failure or cancellation at the JOB level:
- 'Pre-processing'
- 'Delegated', whatever the Application result:
    - Success = Application return code equal to zero
    - Error   = Application return code different from zero
- 'Post-processing'
Inside the 'Finished with Success or Error' state:
- Success means 'Application return code was equal to zero',
- Error means 'Application return code was different from zero'.
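The rule above can be sketched as a small Python function. This is a hedged illustration, not part of the Job State Model: the function name final_state and its parameters are hypothetical, and 'Cancelled' is an assumed name for the state reached on cancellation.

```python
from typing import Optional, Tuple


def final_state(job_level_failure: bool,
                cancelled: bool,
                app_return_code: int) -> Tuple[str, Optional[str]]:
    """Map the outcome of a Job to (final state, Application outcome)."""
    if job_level_failure:
        # Unrecoverable inconsistency detected at the JOB level.
        return ("Failed", None)
    if cancelled:
        # Assumed state name, used here only for illustration.
        return ("Cancelled", None)
    # All states completed at the JOB level: only the Application
    # return code distinguishes Success from Error.
    outcome = "Success" if app_return_code == 0 else "Error"
    return ("Finished with Success or Error", outcome)
```

Note that a non-zero Application return code never leads to 'Failed' in this sketch: only a problem at the JOB level does.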
I copied this behavior from the Job State Model of 'gLite', where the
'Done' state contains both the 'Success' and 'Exit Code !=0' cases, as
can be seen in the 'bookkeeping information' at
https://edms.cern.ch/file/722398//gLite-3-UserGuide.html#SECTION00084100000…
I consider this design, and the strong separation between the
'Failed' and 'Finished with Success or Error' states, as fully justified
by the following reasons:
- Whenever a Job reaches the 'Failed' state, the grid Execution Service
has detected an unrecoverable inconsistency at the JOB level.
Therefore, the Job output sandbox and the post-processed Application
output files may NOT be consistent, and may NOT even be accessible
by the Job Submitter.
In order to investigate the Job failure, the grid User then needs
some grid knowledge (and often experience and expertise) to retrieve and
interpret :
- the Job failure code and message,
- the Job logging and bookkeeping, in comparison with the Job
description.
This 'grid level' investigation can sometimes prove that the cause
of the Job failure came from the Application, but is ALWAYS necessary.
- Whenever a Job reaches the 'Finished with Success or Error' state,
the grid Execution Service was able to create the Job output sandbox, and
to perform post-processing on Application output files, WITHOUT detecting
any unrecoverable inconsistency at the JOB level.
Therefore, the Job output sandbox, and the post-processed
Application output files, can be assumed to be consistent and easily
accessible by the Job Submitter.
On a non-zero return code of the Application, the grid User :
- first has to look (WITHOUT needing any grid knowledge) at the Job
output sandbox and at the post-processed Application output files for an
Application problem,
- before, if necessary, using grid knowledge (and often experience
and expertise) to provide any evidence that the Application error was
caused by a faulty Job description, the Batch system, or the grid
Execution Service.
In summary, I consider that the 'Error' case of the 'Finished with
Success or Error' state should be kept as it is, and NOT be moved to the
'Failed' state.
If anyone thinks otherwise, can he/she please present convincing reasons ?
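The two investigation paths described above can be sketched as follows. This is only an illustration: the function name investigation_steps and the wording of the returned items are assumptions made for this example, and simply restate the steps listed above in order.

```python
from typing import List

FAILED = "Failed"
FINISHED = "Finished with Success or Error"


def investigation_steps(state: str, app_return_code: int = 0) -> List[str]:
    """Return, in order, what the grid User examines (illustrative only)."""
    if state == FAILED:
        # A 'grid level' investigation is ALWAYS necessary for a failed Job.
        return [
            "Job failure code and message",
            "Job logging and bookkeeping, compared with the Job description",
        ]
    if state == FINISHED and app_return_code != 0:
        # Application level first, grid level only if necessary.
        return [
            "Job output sandbox and post-processed Application output files",
            "if necessary: Job description, Batch system, grid Execution Service",
        ]
    # Success: nothing to investigate.
    return []
```

The difference between the two states shows in the first item of each list: grid information for 'Failed', Application output for an 'Error' inside 'Finished with Success or Error'.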
Strawman Rendering
------------------
I will work on the ODT version of 'Strawman Rendering' at
http://forge.gridforum.org/sf/go/doc15628?nav=1 in order to :
- include the above clarifications on states,
- include the 'Types of grid Jobs' section of my 'PGI Execution Service
Overview' document,
- check consistency, and present the relationships between the
operations described in chapter 2 'Interface: Execution Port-Type' and
the different states of the different types of grid Jobs.
I will join +9900827049931906 (plus perhaps Skype chat) on Friday 14
August 2009 at 16:00 CET.
Best regards.
-----------------------------------------------------
Etienne URBAH LAL, Univ Paris-Sud, IN2P3/CNRS
Bat 200 91898 ORSAY France
Tel: +33 1 64 46 84 87 Skype: etienne.urbah
Mob: +33 6 22 30 53 27 mailto:urbah@lal.in2p3.fr
-----------------------------------------------------
Dear all,
The next PGI teleconference will be held on Wednesday, 5th August
at 16:00 CET (duration: 1 hour)
Call-in details as follows:
via Skype: call +9900827049931906 (free of charge)
via ordinary phone numbers (local rates), with the 9931906 conference number:
Austria 0820 401 15470
Belgium 0703 57 134
France 0826 109 071
Germany +49 (0) 180 500 9527
Switzerland 0848 560 397
Agenda:
1) State model
a) recent updates
b) state model integration with the updated strawman rendering document
(the announcement on the update of the rendering document
http://forge.ogf.org/sf/go/doc15628 is copied below)
2) AOB
a) changing the teleconferencing environment?
b) PGI call schedule: Wednesdays or Fridays?, summer break?
Feel free to propose additional topics for discussion.
-----------------------
> Hi PGI team,
>
> some team members of the middlewares ARC, gLite, and UNICORE (AGU) found a
> bit of time to update their input documents to the PGI process:
>
> (1)
> To ensure an open process, we put the updated version into the PGI OGF
> space:
>
> A) strawman_rendering (http://forge.ogf.org/sf/go/doc15628?nav=1)
>
> Note that the document still has some limitations, such as:
> a) the job state model is not fully integrated and thus contradicts the rest
> of the rendering document, since we also have some inputs for the discussion
> of it in the next weeks
> b) some formatting glitches: e.g. it would be nice to have numbered section
> headers; the table of contents contains sections down to level 5.
> c) agreements only between ARC, gLite and UNICORE while a broader consensus
> with the other interested members of PGI is missing (i.e. GENESIS, EDGES)
> and to be discussed
> d) still some open questions that should be discussed in the broader PGI
> community
>
> B) strawman_functionality (http://forge.ogf.org/sf/go/doc15736?nav=1)
>
> Note that this document only got a minor update.
>
> C) agu_jsdl.xml (http://forge.ogf.org/sf/go/doc15737?nav=1)
>
> Focused on needed functionality - not existing JSDL with extensions.
>
> (2)
> Since we don't want to interrupt the currently ongoing fruitful state model
> discussions, we expect to continue with them and would suggest that members
> of PGI collect feedback about these updated documents over the next
> weeks/vacation period.
>
> Once the state model is reasonably stable, we can work on the feedback
> of the updated documents together and come to agreements between us all.
>
>
> (3)
> Since we don't like an emerging specification that is not really implemented
> we are about to start a reference implementation of it that might be
> demonstrated at OGF or other conferences, including its changes following
> from the open discussion process among the broader PGI community.
>
> With kind regards,
> Morris Riedel
> Balazs Konya
> Moreno Marzolla
bye,
Balazs
--
Balázs Kónya
NorduGrid Collaboration
http://www.nordugrid.org
Lund University balazs.konya(a)hep.lu.se
High Energy Physics phone: +46 46 222 8049
BOX 118, S - 221 00 LUND, Sweden fax: +46 46 222 4015
Morris Riedel wrote:
> Hi PGI team,
>
> I was informed that the participants of the last telcon agreed on next
> meeting to be on Friday 16:00 CET, so there is no telcon today!
Actually, it was decided that we return to the Friday 16:00 (CET) schedule
starting from this week. The regular PGI calls will be on Fridays and not on
Wednesdays.
bye,
Balazs
Hi PGI team,
I was informed that the participants of the last telcon agreed on next
meeting to be on Friday 16:00 CET, so there is no telcon today!
Take care,
Morris
------------------------------------------------------------
Morris Riedel
SW - Engineer
Distributed Systems and Grid Computing Division
Jülich Supercomputing Centre (JSC)
Forschungszentrum Juelich
Wilhelm-Johnen-Str. 1
D - 52425 Juelich
Germany
Email: m.riedel(a)fz-juelich.de
Info: http://www.fz-juelich.de/jsc/JSCPeople/riedel
Phone: +49 2461 61 - 3651
Fax: +49 2461 61 - 6656
Skype: MorrisRiedel
"We work to better ourselves, and the rest of humanity"
Registered office: Jülich
Registered in the commercial register of the District Court of Düren, No. HR B 3498
Chair of the Supervisory Board: MinDirig'in Bärbel Brumme-Bothe
Board of Directors: Prof. Dr. Achim Bachem (Chairman),
Dr. Ulrich Krafft (Deputy Chairman)
Dear all,
the next PGI teleconference will be held tomorrow, July 15th,
at 16:00 CET (duration: 1 hour).
Call-in details as follows:
via Skype: call +9900827049931906 (free of charge)
via ordinary phone numbers (local rates), with the 9931906 conference number:
Austria 0820 401 15470
Belgium 0703 57 134
France 0826 109 071
Germany +49 (0) 180 500 9527
Switzerland 0848 560 397
The agenda is similar to that of the last call (in particular, tomorrow I would
like us to spend a few minutes on a status update of the security
discussion):
1) State model
2) Status update on security
3) AOB
Feel free to propose additional topics for discussion.
Moreno.
--
Moreno Marzolla
INFN Sezione di Padova, via Marzolo 8, 35131 PADOVA, Italy
EMail: moreno.marzolla(a)pd.infn.it Phone: +39 049 8277103
WWW : http://www.dsi.unive.it/~marzolla Fax : +39 049 8756233