Re: [Pgi-wg] OGF PGI Session 2 and 3 at OGF 33 on Wednesday 21 September - Draft Minutes

21 Sep 2011

      Steve,

Concerning OGF PGI Session 2 and 3 at OGF 33 on Wednesday 21 September :

Thank you very much for your 'Draft Minutes' :  They are readable, 
understandable and quite accurate.

I suggest the following improvement for the beginning of Session 3 :

Can you replace  "If cancelled -> [end], it's purged." by
"[end] means purged.  Add failed -> pending (automatic resubmission if 
requested inside JSDL)."

Thank you in advance.

Best regards

-----------------------------------------------------
Etienne URBAH         LAL, Univ Paris-Sud, IN2P3/CNRS
                       Bat 200   91898 ORSAY    France
Tel: +33 1 64 46 84 87      Skype: etienne.urbah
Mob: +33 6 22 30 53 27      mailto:urbah@lal.in2p3.fr
-----------------------------------------------------

On Wed, 21/09/2011 15:28, Steve Crouch wrote:
...
14:00-13:30 PGI Session 3
-------------------------
Chair: Andrew Grimshaw
Minutes: Steve Crouch
Attendance:
Name shortcuts:
   AG - Andrew Grimshaw
   EU - Etienne Urbah
   DK - Daniel Katz
   OS - Oxana Smirnov
   JPN - JP Navarro
   KS - Katsushige Saga
   SC - Steve Crouch
   DM - David Meredith
New actions:
Minutes
Discussion of state model - how can we accommodate additional states?
EU: doc number is 16306 under OGSA-BES WG on GridForge.
[Page 31, fig 6]
EU: take same states as original BES, add in transitions. Add purge
finished -> [end], add failure transition pending -> failed.
Next line to be replaced :
If cancelled -> [end], it's purged.
[end] means purged.  Add failed -> pending (automatic resubmission if 
requested inside JSDL).
AG: this is change in underlying state model.
EU: could drop it, it exists on one system. If it doesn't fit, we drop it.
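A minimal sketch of the transitions EU lists above, written as a transition table. The baseline set in BASE is an assumption about the original BES state model, not copied from the specification, and the helper names are hypothetical.

```python
# Illustrative sketch only. BASE is an assumed baseline, not the BES spec text.
BASE = {
    "pending": {"running", "cancelled"},
    "running": {"finished", "failed", "cancelled"},
    "finished": set(),
    "failed": set(),
    "cancelled": set(),
}

# Transitions proposed by EU in the session. "[end]" means purged.
PROPOSED = {
    ("finished", "[end]"),  # purge
    ("pending", "failed"),  # failure before the job starts running
    ("failed", "pending"),  # automatic resubmission if requested inside JSDL
}


def extended_model(base, extra):
    """Return a copy of base with the extra (source, target) transitions added."""
    model = {state: set(targets) for state, targets in base.items()}
    for source, target in extra:
        model.setdefault(source, set()).add(target)
        model.setdefault(target, set())
    return model


def can_move(model, source, target):
    """True if the model allows a direct transition from source to target."""
    return target in model.get(source, set())


MODEL = extended_model(BASE, PROPOSED)
```

Keeping the proposed transitions in a separate set makes it easy to drop failed -> pending again if it does not fit, as EU suggests.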
EU: next diagram for held states [fig 7, pg 34]. Based on original BES,
only added some transitions.
AG: if specified to go into suspended state initially in JSDL, it starts
in suspend until told to resume ('proceeding'). Transitions from
suspended/proceeding to both cancelled and failed. Still need to add a
transition. Automatic resubmission perhaps not put in.
AG: other one is extended job state [referring to session slide 3] from
RENKEI.
KS: EU covered this in more detail yesterday. (Discussion of incoming
queues for pre-processing).
JPN: other terms meaning same thing: pre-processing:pending,
pre-processing:complete.
AG: are we trying to go to a different state model? We've handled this as
specified in the spec, which has substates dealing with staging in and
out. e.g. staging-in is pre-processing state. Also running:executing is
a substate. Didn't we want to stay with the basic state model and extend
via substates? This state model is extended.
EU: not compatible with basic state model?
KS: mapping exists from this to basic state model substates [next slide].
AG: good profile some of these substates - we need to know what these
substates actually mean. Are these proposed JSDL extensions?
KS: this implementation is just for testing. We don't care deeply about
extensions.
AG: you name a substate where you want it to stop?
KS: yes.
AG: thought something similar to EU's idea. In original state model, two
places for helds (need pre and post): in running (as EU did), in pending,
and terminated held/finished held failed/held. But this is subsumed into
EU's.
    Do we want to give advice for how to order additional substates? Do
we recommend these in proceeding or suspended? If we use substates as in
data staging, execution in running where to put the helds?
SM: better to have a named state for data staging pre/post. Confusing
for client otherwise.
AG: optional client initiated staging hold. Letting client do whatever
they want to do. Don't want to hold for all purposes.
Mike: you know it's held for pre-processing.
AG: what does delegated refer to? Delegated queue for queued, delegated
running for running.
SM: want to have pre and post for delegated.
EU: names not well chosen.
AG: suggest change names for some of these states.
SM: delegated - incoming is queued.
AG: delegated should be processing. Delegated:incoming should be
delegated:queued-in. Processing:outgoing (renamed) not visible to
outside, used as transitory state.
Mike: no error around it, it's mandatory right?
AG: if somebody were to subscribe to notification about this, ...
AG: processing:running, processing:hold.
    If no queueing system, just executes - can it go directly to running.
SM: debugging problematic. Held not implemented for all middlewares.
AG: not sure I like this (confusion with states of running jobs for
held/running for executing jobs).
SM: have you seen EMI execution service? Can take some of these from there.
AG: keep same 5 states.
SM: have attributes for substates.
AG: need to agree on these attributes. To avoid having to guess on
substate meanings.
SM: many reasons for including more info in substates e.g. errors -
storage is down, etc.
EU: doc - 10th March on pgi list. PGI execution specification.
[Discussion about pg 20 and state definitions - pre-processing and
processing]
AG: mapping of these states to our discussed states.
SM discusses how this mapping works.
AG: referring to current state model...
    Agreed yesterday not to change state model otherwise, have to change
spec. Profile substates, don't have to.
SM: too simple to represent all things. Problems with users to
distinguish clearly with states.
AG: use [substates] and no pushback from users.
Mike: but we'd have to change the spec.
AG: want to expedite the process. If it can be done in context of the
existing state model, it vastly simplifies things unless it's too ugly.
SM having multiple states.
AG: how do we feel about this.
SM: many substates of failed - how do we identify that?
SC: this would break state model, new model, should use existing spec
where possible.
SM: make concept smoother - define running, focus on pre/main/post
processing.
AG: we went around the state discussions before, with substates you can
model everything.
    Doing a whole new spec would take longer.
SM: one more possible point - model from EU/KS they have this community
requirement.
AG: NAREGI state model - all these collapse into running.
SM: limiting, should find a better way.
AG: can't go backwards and change the document.
SM: staging as substate - what does this mean?
Mike: it may need to change anyway. A cleaner way would be to redo this.
SM: substates of failed, terminate...
AG: legitimate to do this in model.
MM: problem we found was no additional transition to failure
state. It can occur at any moment. We introduced artificial transition
to failure state.
    Afraid about too much substating of running state - other substates
have more informative meaning e.g. running: data-staging.
AG: if we do a new version of the spec base it on existing spec, and
just change the bits we need e.g. FactoryAttributes, vector operations,
but keep it simple, otherwise it will escalate - genie out of the bottle.
    SM is right - various people have substated this in very similar ways
in implementations - substates in running. This is fine, can even have k
layers deep of substates. What are transitions introduced? To error states?
AG: too much of a decision to reach today. If we redo the spec, we
timebox it. If not, we will profile. Otherwise it will get into a
problem of arguing about stuff as in PGI.
JPN: use cases for these requirements?
AG: held states got us here. They wanted client-initiated held states
for data staging, then can say 'go'. Then be able to stop it afterwards.
No requirement for us, since data staging is automated, or done manually.
JPN: requirement for all BES to support these states?
AG: specified in JSDL, if you don't support it, you throw a fault.
EU: not in favour of manual staging in complicated state model. Have
documented it in documentation so people can understand these
complexities. Advocate existing BES states, just adding few transitions
that are missing e.g. pending -> failed.
AG: pending -> failed could be done as an addendum e.g. it was
forgotten, ok. Adding new states is different.
SM: have given this requirement to PGI and EMI.
EU: have a requirement to users. They should think about workflow model
before trying to send jobs.
SM: our users are expecting this held, manual staging. We shouldn't
perhaps introduce hold mechanism, just implement simple flags. But
doesn't have to be supported.
EU: do we have to implement new state model.
SM: no. Different issue.
EU: related to post/pre processing.
AG: hold and manual can be substated away. SM wants to restate
misleading substates.
SM: e.g. running:queued.
EU: just rename.
DM: haven't heard a convincing argument for these pre/post processing
substates...
AG: it's ugly, a bit of ugly at a time. Term for this: technical debt.
The easy path, some point in future clean it up, not today - fine with
me! Not me who will have to implement.
    SM from UNICORE. PGI and EMI had this as their requirements. PGI/WG
had this as a requirement driven from gLite and ARC. But now UNICORE?
SM: yes.
AG: how functionality different from telling them to copy data in first?
SM: job starts, get EPR & storage service reference, use this to stage
data into. Can upload files/directory into this, once done, start job
physically.
AG: 3 choices
   1) We stay on profile only path, incorporate reqs fully later on
   2) Bite bullet and do the spec rewrites
   3) Addendum - additional profile to stick changes in only
AG: addendum to add state ok, changing state space wouldn't be an
addendum as such.
    Other than UNICORE, others want to profile.
    UNICORE - do it properly, rewrite.
Need input from others - gLite, CREAM, SAGA, NorduGrid, etc. Need to get
this approach right.
SM: first priority data staging profiling ok, state model - secondary.
Thirdly (personal) BES - meaning of substates from user's point of view.
    Rename running as processing.
AG: addendum?
Mike: can we do all this in profiles?
AG: don't think vectors can be done this way. Is it important? Not sure
- may slide on important scale. Those that wanted it have walked away
from the table.
EU: in EDGI, we're implementing handling of vectors. Tell users if you
want vector, you explain in separate file internally and generate
internally as many jobs as necessary using existing client interfaces.
AG: support in our client implementation by generating multiple internal
JSDLs from a single one, returning single EPR. If you send list, we'll
accept and start them all.
    Param sweep spec never says what to return.
SM: rename states.
AG: perhaps not a bad idea. Notion of addendums to specs; very easily done.
    a) transition from pending to failed.
    b) rename running to processing.
AG: can have conditional - support both running and processing (but they
mean the same thing).
    On to the substate model. Assuming processing, not running. Perhaps
not enough time. Back and forth transitions are problematic, not sure
it's way to go. Alternative way would be to have hold states within
processing; not clear how existing imps that break down running into
substates would handle this. Look at RENKEI's mapping.
EU: see previous drawing.
KS: we implement pre-processing/post-processing only for data staging.
Maybe original PGI specification state model comes from strawman. Also
doc says pre-post used for data staging.
AG: think they are.
KS: implement this state model. In our system we have workflow system,
with this system, we can transfer data directory from and to computer
resources. We implemented these states.
EU: pre/post processing not only for manual but automatic data staging.
AG: right.
    Not strictly a requirement to support suspend/resume.
AG: consensus to take running renamed as processing, change from
running:stage-in/running:stage-out to new processing states.
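A minimal sketch of how a client might treat "running" and "processing" as the same state while keeping the substate part of a name such as running:stage-in. The alias table and the helper name split_state are assumptions for illustration, not part of BES.

```python
# Assumption: "processing" is the new name and "running" is kept as an alias.
ALIASES = {"running": "processing"}


def split_state(name):
    """Split 'state:substate' and map an aliased main state to its new name."""
    main, _, sub = name.partition(":")
    main = ALIASES.get(main, main)
    return main, (sub or None)
```

With this, split_state("running:stage-in") and split_state("processing:stage-in") give the same pair, so implementations using either name can be compared.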
SM: how to do this resume in service?
AG: discussed yesterday, new port type.
On Wed, 21/09/2011 12:30, Steve Crouch wrote:
...
11:00-12:30 PGI Session 2
-------------------------
Chair: Andrew Grimshaw
Minutes: Steve Crouch
Attendance: ~16
Name shortcuts:
  AG - Andrew Grimshaw
  EU - Etienne Urbah
  DK - Daniel Katz
  OS - Oxana Smirnov
  JPN - JP Navarro
  KS - Katsushige Saga
  SC - Steve Crouch
New actions:
[BS] Write-up proposal for including benchmarking requirements in JSDL
[EU/OS] Write-up proposal for using benchmarks as measurement units in
terms of the resource requirements
[SC/AG] Another file staging profile++
Minutes:
Summary of yesterday's session on agreement of moving forward with
existing specs, profiling/updating individual ones and moving towards an
overall profile which brings these together. See AG's session
presentation for more details.
BES++
[See AG's session presentation for more details. List of proposals for
last session and open questions.]
JSDL
Which XML rendering to use for GLUE2 embedded in JSDL, hold before/after
execution. Any others?
AM: rumour of final version of XML rendering from GLUE group is
imminent. [Is it flat or hierarchical?] Can use sub-elements from GLUE2
and include smaller parts.
AG: representative from GLUE?
AM: JP Navarro good candidate.
AG: JSDL's session poorly attended.
ActivityManagement Port Type
i.e. suspend(), resume(), get_status(), terminate()
AG: Idea being that you could have hold states prior to data staging.
With this port type, should have other sensible management activities e.g.
terminate().
EU: a 'purge' exists in PGI requirements.
AG: may get rid of info about the job.
'Activity Integration Profile'
AG: profiling for this management port type. e.g. RNS 1.1, OGSA-ByteIO
1.0 (yesterday's session).
JPN: ad-hoc GLUE2 session - need understanding of how JSDL wants to use
GLUE, to address use cases. 1) allow GLUE2 subelement rendering in other
schemas e.g. JSDL; 2) understanding PGI use cases. GLUE2 describes
configuration, not matching rules. JSDL may want a, b, c but ranges
cannot be described in GLUE2, it's static description.
AG: 1) have BES resources use GLUE2 to describe itself; 2) specify job
requirements. In JSDL, can specify ranges?
AM: how is this matched with the two elements?
AG: shortcoming of JSDL - we need to do OR's, but more really guarded
statements.
   We originally want players in JSDL/BES/GLUE to present in these sessions.
   Do we want to go depth-first? i.e. JSDL first. BES appears
straightforward, EU mentioned one way yesterday. Decision on substate
model? Then can profile out.
EU: can KS provide links to their state model?
KS: will provide this this afternoon.
OS: user likes to describe job assuming resources offer different
configurations, but discovery and matching already done, but when it
gets to resource, it's a single item. Do we need logic instructions?
AG: JSDL with user submits and what ends up in BES can be different.
Important to have consensus on how this is described in JSDL, even if a
broker exists in the middle dealing with this translation.
   User wants to run app, don't care where. May be diff staging thing to
do. With ORs? Better with guarded statements with predicates. Can decide
based on these. This is better.
   Guarded statements presented in CSP, incorporated into lots of
languages in 80's.
OS: implementors - don't know how to implement discovery with logic,
through workflows, or deterministic model. Either end of execution
service, or in client.
AG: request things that have been modified in JSDL from groups e.g. UNICORE.
OS: limit JSDL to what arrives at BES. User tasks can describe whatever
they like.
AG: not a big fan. Would like to take JSDL docs and throw them to
European BES'. Not necessary to define these things, but will hurt
interop, won't be able to use others' JSDLs and vice versa.
OS: broker?
AG: EMS can take in documents and transform them along the way.
EU: Oxana - do you see BES as a site service, or a machine service? ...
AG: would like as symmetric as possible. A site or machine service, but
can be entire set of sites or machines. I like recursive layer not being
part of design.
AG: [on guarded statements]
Select
  (condition expression): action
  (condition expression): action
  ...
Separate our resource req section, but repeating groups of that. For
each one have job description, staging elements. On diff machines, want
to execute diff apps and move diff data. i.e. diff actions dependent on
conditions. Too much of a ++ or BES? We want to have small steps really,
instead have smaller pieces. Leave until later?
AG: postpone for now, not much enthusiasm for tackling this.
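A minimal sketch of the Select form AG wrote up, with each alternative holding a condition and an action. All names here (select, the resource keys, the application and file names) are hypothetical and are not JSDL or GLUE2 elements. As a simplification, the first true condition wins; guarded commands in CSP may choose any alternative whose guard holds.

```python
def select(alternatives, resource):
    """Return the action of the first alternative whose guard holds, else None."""
    for guard, action in alternatives:
        if guard(resource):
            return action
    return None


# Each entry is (condition expression, action). The action bundles what to
# run and what to stage, since different machines may need different data.
ALTERNATIVES = [
    (lambda r: r.get("os") == "linux" and r.get("cores", 0) >= 8,
     {"app": "solver-mpi", "stage_in": ["input-large.dat"]}),
    (lambda r: r.get("os") == "linux",
     {"app": "solver-serial", "stage_in": ["input-small.dat"]}),
]
```

For example, select(ALTERNATIVES, {"os": "linux", "cores": 2}) picks the serial alternative, and a resource matching no guard yields None.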
EU: inside JSDL, have POSIXApplication. Would like to deprecate it. In
same element, it describes subelements on environment, and others
belonging to resources. e.g. input/output/error/working directory -
really belong in env of execution. Already sep blocks for env and
resources in JSDL.
Mike: need to be careful.
BS: +1 on deprecating limits.
Mike: places where we specify OS, GLUE for this. We want to still allow
JSDL to specify this, and profile to specify this as GLUE2 attributes.
EU: but conflict of attributes ...
Mike: it complicates the logic.
AG: deprecate - not something to count on in future. Take it out of
follow-on spec. If moving forward, can take these out, just discourage
use in future.
   Go thru JSDL issues and put them on table. Maybe not make decision
yet, but just to list them. BES things fairly well understood. Just JSDL
for now.
   For BES++, what should a BES return on a param sweep [added to slides].
EU: param sweep is dynamic creation of jobs. Not possible to return
fixed list of EPRs since created ... ?
AG: cardinality of set known at request time.
EU: for bulk requests yes, but param sweep number is much higher.
AG: need to get to what a BES should return.
EU: jobs created one after the other in dynamic way. Supposed that there
are too many to list them.
AG: our imp generates EPRs from param sweep at the time.
Mike: timeout situation?
AG: would like to push this back.
EU: propose to replace param sweep with vector or bulk.
AG: disagree strongly.
EU: param sweep is different than vector/bulk. Vector/bulk are finite
lists = finite list of EPRs. Param sweep, not finite, but algorithm to
dynamically create these.
AG: no conditional statements in algorithm, deterministic. Size can be
worked out. Open to discussing param sweep on call, but esoteric for
now, perhaps not useful.
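A minimal sketch of AG's point that a deterministic sweep has a size that can be worked out at request time. The sweep below is hypothetical and uses plain Python lists, not the JSDL parameter sweep syntax.

```python
from itertools import product
from math import prod

# Hypothetical sweep: each parameter has a finite list of values.
SWEEP = {
    "temperature": [280, 290, 300],
    "pressure": [1.0, 2.0],
    "seed": list(range(5)),
}


def sweep_size(sweep):
    """Number of jobs, known without creating any of them."""
    return prod(len(values) for values in sweep.values())


def expand(sweep):
    """Lazily yield one parameter assignment per job."""
    names = list(sweep)
    for combo in product(*(sweep[name] for name in names)):
        yield dict(zip(names, combo))
```

The count is available up front, while expand creates assignments one after the other, which is close to the dynamic creation EU describes.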
   Another JSDL thing (Mark Morgan originally) found resource factory
attributes not enough, sometimes want to match on a property, exec host
have requirement that job has attribute. e.g. Kraken, Crays require
statically linked binaries. Can't send dynamically linked program. App
needs to say this about itself. So would like to have matching token
which has these descriptions for matching on.
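A minimal sketch of token matching in this direction, where the host states what a job must declare about itself. The token strings and the function name are hypothetical.

```python
def host_accepts(job_tokens, host_required):
    """A host accepts a job only if the job declares every token it requires."""
    return set(host_required) <= set(job_tokens)


# Hypothetical example: a host that only runs statically linked binaries.
STATIC_ONLY_HOST = ["statically-linked"]
```

Tokens are opaque strings here, so new ones can be introduced without changing the matching rule.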
EU: solved by GLUE?
AG: in JSDL ability to describe wanting something, but not predefined
e.g. supply as a token.
EU: GLUE env entity.
AG: can have strings? Corresponding element in the resource description?
JPN: app env subelement of other GLUE2 elements, including resource
descriptions.
AG: could we do - host has BLAST 3.4 and app needs BLAST 3.4 for this.
Similarly, relationship for specifying apps as statically linked.
JPN: statically linked - haven't heard of this. Perhaps turn this into
GLUE2 requirements - not currently advertised but could be.
EU: in GLUE - app name, app version, state of app, can add static desc.
in 'Other info'.
AG: just a string? e.g. accepts VISA? [Yes] we could include this as we go.
BS: difficult - our main thing is having extensible resource model, to
have something that allows resources defined by us (not in GLUE/JSDL
days) to stay extensible for things we don't know yet. Extension -
key/value pairs.
AG: can add key/value pairs to GLUE2? JPN?
JPN: extension elements all over GLUE - the way to do it. Don't know if
formatted as key/value pairs, perhaps strings.
AG: hinge upon things expressed in JSDL. With GLUE, you must have what
BES must support, and features for matching.
   We'll be guinea pigs.
EU: good to use benchmarks as measurement units.
BS: scalable requirements ... e.g. 2000 CPU hours as benchmark value.
AG: normalisation of CPU hours based on benchmark?
BS: impossible or very very hard. Benchmark very approximate.
AG: scaling factor depends on app.
OS: benchmarks may be related to GLUE2, but not JSDL - a valid unit?
EU: don't like normalisation or scaling factor. Don't use scalable time.
Want to specify 2000 SPECints; it means something.
EG: IEEE computer (Freund) - how do we do scaling based on affinities.
Very app dependent. you suggest we just use specint family?
OS: any.
AG: here's a table - this resource has this factor.
EU: in GLUE2, can specify.
AG: can you point to a benchmark and use it?
DK: can go into long list, but best to think of a couple of parameters.
AG: really want a performance estimator. Problem is either use
simplistic methods, or complicated models. Fine with what ever is desired.
DK: needs to be something app env will provide. More complicated, less
likely.
AM: allowing users to submit benchmarks, nothing could match it.
AG: [removing normalisation and benchmarks on slides] - want to specify
benchs in terms of what we want. It makes sense.
EU: yes.
AG: need to come up with extensible list of benchmarks that each have
float/int associated with them.
EU: already in GLUE.
AG: BES endpoint manager advertises that they are e.g. 27 in this
benchmark. Request says at least 30.
   Do we say time I want is based on this?
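A minimal sketch of the advertise/request comparison in AG's example (advertised 27, request at least 30). Benchmark names are free-form strings here and the function name is hypothetical; treating an unadvertised benchmark as a non-match is an assumption.

```python
def meets_benchmarks(advertised, minimums):
    """True if every requested benchmark is advertised at or above its minimum."""
    return all(
        name in advertised and advertised[name] >= minimum
        for name, minimum in minimums.items()
    )
```

With advertised = {"specint": 27} and a request of {"specint": 30}, the endpoint does not match.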
DK: 1) using benchmarks nice, need to say what these are - painful, poss
worth it; 2) what do you do with it?
AG: break this up into separate standalone profile in JSDL e.g. JSDL
Benchmark Profile?
OS: depends on application.
DK: no, depend on something generic.
AG: on benchmark. If extensible, have BLAST benchmark.
EU: GLUE bogomips, specint, ... it's extensible.
AG: these sorts of benchmarks typically function of machine.
EU: open enumeration equivalent of string.
AG: this machine, running this problem size, ... could be all over the
map for tightly coupled systems. Suggest ask those that want it to
provide proposal - what would it look like for JSDL and for BES Factory
Attributes for how it would be characterised. At least 2 groups
interested - no reason not to profile it, but not as a MUST for resource
providers.
DK: do res provs have obligation to provide some benchmarks?
AG: not mandatory, but profile in terms of how it _can_ be specified.
DK: interesting to know how it would be used.
AG: yes, people who care about this write this up. e.g. profile, we have
use cases, in JSDL do this as a profile and not mandatory in JSDL.
EU: go to PGI reqs Wiki and JD20.
BS: ideally have a schema for this, but prob not necessary. Have diff
models for providing resources and requesting resources. Often not the
same. ...
AG: action item to write this up?
Action: [BS] Write-up proposal for extensible resource model for systems
we don't know yet - key/value pairs
AG: leave guarded statements for now, but have agreed on many things
[see session slides].
   Want names to some of these things.
Action: [EU/OS] Write-up proposal for using benchmarks as measurement
units in terms of the resource requirements
EU: should be able to describe diff data sources in sequential-try mode
e.g. first fails, second fails, use third (it works).
   Output file semantics not clear, don't discuss.
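A minimal sketch of the sequential-try mode EU describes for input data sources. The fetch callable and the source names are hypothetical, and treating OSError as "source failed" is an assumption.

```python
def fetch_first_available(sources, fetch):
    """Try each source in order and return (source, data) for the first success."""
    errors = []
    for source in sources:
        try:
            return source, fetch(source)
        except OSError as exc:
            # Remember the failure and fall through to the next source.
            errors.append((source, str(exc)))
    raise OSError(f"all {len(sources)} sources failed: {errors}")
```

The list of sources is still plain data; only the consumer decides to walk it in order.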
AG: JSDL is declarative.
EU: order doesn't matter.
AG: shortcoming of JSDL from users - can't do wildcard staging,
including differentiating between files and folders.
   Right now we have File Staging extensions. Do a profile for
additional file staging protocols.
Action: [SC/AG] Another file staging profile++
AG: related to wild card staging, may want to do -r recursive,
inclusion/exclusion patterns.
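A minimal sketch of inclusion/exclusion patterns for wildcard staging, using shell-style patterns from Python's fnmatch module. The function name and the paths are hypothetical. In fnmatch, "*" also matches "/", so a pattern such as results/* already reaches into subfolders.

```python
from fnmatch import fnmatchcase


def select_for_staging(paths, include, exclude=()):
    """Keep paths matching any include pattern and no exclude pattern."""
    return [
        path for path in paths
        if any(fnmatchcase(path, pattern) for pattern in include)
        and not any(fnmatchcase(path, pattern) for pattern in exclude)
    ]
```

For example, including results/* and excluding *.tmp keeps data files at any depth under results and drops temporary files.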
   How to prioritise - last 9 mins, get consensus on this. High, medium,
low groups.
[See AG's session slides for prioritisation list]