Hi;
> I think with processor types we just grabbed a snapshot of the CIM model
> and went with that; updating to use a later version of that would not
> cause great difficulty (though the reverse problem might then exist, in
> that it might become more difficult to say that any kind of x86 arch is
> OK for a particular job).
> However, I believe we would assume the following interpretation of
> processor requirements: if specified, that's what they want for all
> processors associated with the job. If they didn't specify, they didn't
> care and anything is therefore good enough.
Agreed. Also, one possibility is to explicitly specify some of
the commonly occurring “semi-bound” scenarios, such as “any
x86” architecture. I’m not familiar enough with the CIM world
to know if they can provide us with guidance on how to solve the problem in
general.
> Sounds fairly reasonable, though the abstract filesystem stuff has real
> uses in that it makes it much easier to write a job request that deals
> with things like varying locations of home directories and scratch
> space. The alternative is to assume that temporary files are always
> written to somewhere like /tmp, immediately stuffing interop even
> between Unix-based HPC centres (we don't write large files to /tmp here
> because that's not a cluster-wide resource and is therefore not very
> useful) let alone with any Windows-based service.
> But it is entirely reasonable to support mount points and sources by
> saying things like "if it doesn't match my current configuration, I'll
> fault". That is most certainly a legal interpretation of how to process
> a JSDL document. This is probably an issue that ought to be covered in
> the primer, when we finally write it. :-)
If we narrow the definitions of mountpoint and mountsource enough and
precisely describe their semantics then we might arrive at something that could
be fairly widely used. I’m thinking of things like saying that you
can’t navigate “out” of a file system via “cd ..”,
etc. This is definitely something to explore.
Since the HPC profile base case treats data staging as being
out-of-scope, the base interface profile will exclude these; but that can be
done independently of anything else. (And, of course, the data staging
extension to the HPC profile will need to deal with this subject in any case,
even if it’s ignored in the base case.)
> Strictly this is outside the scope of JSDL, where we've stuck firmly to
> the niche of describing user requests and not the things with which
> those requests may be satisfied. However, I do have some ideas on this. :-)
The HPC profile (and BES) have
to deal with the issue of describing available resources. So, one way or
the other, the subject will get addressed this summer. As much as
possible, I’d like to avoid duplicating the work done in JSDL for that –
if for no other reason than that users will likely be unhappy if they have to
learn two different ways of describing what they will perceive as being
variations of the core concept, namely resource description – both required
and available.
> Maybe other approaches would be better, but the matter of resource
> description is politically tricky for this WG since it gets into space
> claimed by others.
Any advice on this subject would be greatly appreciated. As I
said above, I have to deal with this subject one way or the other and would
prefer to do so with the minimum of feather-ruffling (while still making
progress that results in a usable HPC profile by the end of the summer).
> Good point. I suppose our response to this should be contingent on
> whether "context location" (i.e. working directory) can be defined for
> all currently conceived-of job types. I don't know how to answer this
> yet. It's certainly possible for many of the things we've identified,
> but all?
If you allow for the notion of file systems and mount points to be in
the core spec then I would argue that you are implicitly buying into systems
that also support the notion of current working directory (some jobs may of
course not use it).
> We don't specify. Portable applications don't change directory at all in
> my experience; it's too full of strange behaviour as the meaning of all
> relative paths changes...
I would argue that not specifying allowed/disallowed behaviors is a
bad approach when interoperability is at issue. (I’m talking about
disallowing “cd ..” out the top, not disallowing change-directory
within the subtree specified by a file system element.)
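To make the distinction concrete, here is a minimal Python sketch of one possible containment rule. It is only an illustration under stated assumptions: `stays_inside` is a hypothetical helper, the paths are invented, the working directory is assumed to be given relative to the file system root, and the check is purely lexical (symbolic links are ignored). Nothing like it is defined by JSDL 1.0.

```python
import posixpath


def stays_inside(fs_root, cwd, target):
    """Return True if `target`, resolved from `cwd`, is still under `fs_root`.

    fs_root -- absolute path that a file system element maps to (assumed)
    cwd     -- current directory, relative to fs_root (assumed)
    target  -- path the job wants to change to, e.g. ".." or "data/run1"
    """
    root = posixpath.normpath(fs_root)
    resolved = posixpath.normpath(posixpath.join(root, cwd, target))
    # Moving around inside the subtree is fine; leaving it is not.
    return resolved == root or resolved.startswith(root.rstrip("/") + "/")


# Hypothetical SCRATCH root; these paths are for illustration only.
print(stays_inside("/scratch/job42", "out", ".."))       # back up to the root
print(stays_inside("/scratch/job42", "out", "data/x"))   # deeper in the subtree
print(stays_inside("/scratch/job42", ".", "../other"))   # out the top
```

Under this rule the first two moves are allowed and the third is rejected, which is the kind of behavior a normative definition could pin down.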
> Fair points, and I'd usually assume that the root FS was not writable.
> It probably is fairly Unix-specific. But it does make life much easier
> for integrating with legacy job systems which can handle the other FS
> types by translation into the root and adding a prefix to the paths.
> FWIW, I wouldn't use ROOT in my jobs. :-)
Again, from an interop point-of-view, this seems dangerous.
> It might be a good idea to codify some best practice on this in the HPC
> profile.
Agreed.
Regarding your reactions to my straw man proposal, it seems like you pretty
much agree with everything except the following:
· You’re not convinced how universal the posix extension elements for things
like command-line arguments and working directory are. My response is that I
think they are at least as universal as the data staging elements.
· You don’t want to move the data staging section out of the core
specification. For the HPC profile base case, data staging elements will be
prohibited since they are out-of-scope for the base case. The HPC profile
extension for data staging will allow the JSDL data staging elements. Whether
or not these should be in a separate JSDL extension or whether they can be
generalized to cover a wide(r) range of systems is a topic for future
discussion.
· You’re leery of tackling the resource description problem. Understandable,
although the HPC profile working group will have to and will be seeking
guidance from the JSDL and other communities on how to do so.
Is that a fair characterization of your position?
Thanks,
Marvin.
-----Original Message-----
From: Donal K. Fellows [mailto:donal.k.fellows@manchester.ac.uk]
Sent: Friday, June 09, 2006 2:45 AM
To: Marvin Theimer
Cc: JSDL Working Group; ogsa-bes-wg@ggf.org;
Subject: Re: [jsdl-wg] Questions and potential changes to JSDL, as seen from
HPC Profile point-of-view
Marvin Theimer wrote:
> Coming from the point-of-view of the HPC Profile working group, I have
> several questions about JSDL, as well as some straw man thoughts about
> how JSDL should/could relate to the HPC Profile specification that I’m
> involved with. Some of my questions lead me to restrictions on JSDL
> that an HPC profile specification might make. Other questions lead to
> potential changes that might be made as part of creating future versions
> of JSDL. (I’m well aware that JSDL 1.0 was meant as a starting point
> rather than the final word on job submission descriptions and so please
> interpret my questions as being an attempt at constructive suggestions
> rather than a criticism of a very fine first step by the JSDL working
> group.)
I'm going to work through these things as I read through them, so the
answers (well, my answers) might be a little disjointed. :-)
> At a high level, there are several general questions that came up when
> reading the JSDL 1.0 specification:
>
> · Can JSDL documents describe jobs other than Linux/Unix/Posix
> jobs? For example, things like mount points and mount sources do not
> map in a completely straight-forward manner to how file systems are
> provided in the Windows world.
Most certainly. The intent is that ultimately JSDL jobs should be able
to describe pretty much any request for an atomic activity, and the
POSIXApplication stuff was just a seed so that at least one common case
would be handled by the initial specification. Work is ongoing with an
extension to that to support parallel (mainly MPI, but also some other
architectures too) jobs, and we've had in mind other kinds of jobs for a
while (including SQL jobs, Web-service invocation jobs, and JVM jobs,
but obviously not limited to those).
On the matter of mount points, the interpretation of a mount source is
not that the mount source should be mounted at the mount point, but
rather that the job should fail if the mount is not present. Now, a JSDL
consumer might react to that failure by trying to perform the mount, but
it is not required. (The meaning of the name of the mount source is not
defined IIRC, though it probably ought to be URI-like, meaning that SMB
mounts would work fine under Windows with suitable munging.)
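As a rough illustration of that reading, here is a minimal Python sketch. It is not taken from the JSDL specification: `check_mounts`, `MountError`, the optional `try_mount` callback, and the example URIs are all hypothetical. A required mount that is absent makes the job fail, unless the consumer chooses to attempt the mount itself.

```python
class MountError(Exception):
    """Raised when a required mount is not available (hypothetical name)."""


def check_mounts(required, present, try_mount=None):
    """Fail unless every required (mount point, mount source) pair exists.

    required  -- dict of mount point -> mount source requested by the job
    present   -- dict of mount point -> mount source currently mounted
    try_mount -- optional callback(point, source) -> bool; a consumer MAY
                 use it to attempt the mount, but nothing requires it to
    """
    for point, source in required.items():
        if present.get(point) == source:
            continue  # already mounted as requested
        if try_mount is not None and try_mount(point, source):
            continue  # the consumer chose to perform the mount itself
        raise MountError(f"{source} is not mounted at {point}")


# Assumed example data: the host has one mount, and the job asks for it.
present = {"/data": "nfs://fileserver/export/data"}
check_mounts({"/data": "nfs://fileserver/export/data"}, present)
```

The call at the end returns without error because the requested pair is already present; asking for anything else raises `MountError` unless a `try_mount` callback succeeds.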
We'd hope that most jobs would not actually specify the mount point, but
would instead use the facilities provided by the JSDL abstract file
system processing semantics to adapt to whatever was available.
> · Is JSDL expressive enough to describe all the needs of a job?
> For example, it is unclear how one would specify a requirement for
> something like a particular instruction set variation of the IA86
> architecture (e.g. the SSE3 version of the Pentium) or how one would
> specify that AMD processors are required rather than Intel ones (because
> the optimized libraries and the optimizations generated by the compiler
> used will differ for each). For another example, it is unclear how one
> would specify that all the compute nodes used for something like an MPI
> job should have the same hardware.
I think with processor types we just grabbed a snapshot of the CIM model
and went with that; updating to use a later version of that would not
cause great difficulty (though the reverse problem might then exist, in
that it might become more difficult to say that any kind of x86 arch is
OK for a particular job).
However, I believe we would assume the following interpretation of
processor requirements: if specified, that's what they want for all
processors associated with the job. If they didn't specify, they didn't
care and anything is therefore good enough.
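A minimal Python sketch of that interpretation follows. The names `arch_ok`, `nodes_ok`, and `ARCH_FAMILIES` are hypothetical, and the grouping of architecture names into an "x86" family is an assumption made for illustration rather than something taken from JSDL or CIM.

```python
# Assumed, illustrative grouping of architecture names; not from JSDL or CIM.
ARCH_FAMILIES = {"x86": {"x86", "x86_32", "x86_64"}}


def arch_ok(requested, actual):
    """Return True if a processor's architecture satisfies the request."""
    if requested is None:
        return True  # nothing specified: anything is good enough
    # A family name accepts any member; any other name must match exactly.
    return actual in ARCH_FAMILIES.get(requested, {requested})


def nodes_ok(requested, node_archs):
    """A stated requirement applies to every processor used by the job."""
    return all(arch_ok(requested, arch) for arch in node_archs)


print(nodes_ok(None, ["sparc", "x86_64"]))        # no requirement stated
print(nodes_ok("x86", ["x86_32", "x86_64"]))      # any kind of x86 is OK
print(nodes_ok("x86_64", ["x86_64", "x86_32"]))   # one processor does not match
```

The family table is where the "any kind of x86" problem shows up: a finer-grained enumeration only keeps that request easy to express if something like this grouping is also defined.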
> · How will JSDL’s normative set of enumeration values for things
> like processor architecture and operating system be kept up-to-date and
> relevant? Also, how should things like operating system version get
> specified in a normative manner that will enable interoperability among
> multiple clients and job scheduling services? For example, things like
> Linux and Windows versions are constantly being introduced, each with
> potentially significant differences in capabilities that a job might
> depend on. Without a normative way of specifying these constantly
> evolving version sets it will be difficult, if not impossible, to create
> interoperable job submission clients and job scheduling services
> (including meta-scheduling services where multiple schedulers must
> interoperate with each other).
I don't know. :-) Maybe we should say that additional things as defined
in some other model (e.g. CIM) SHOULD be accepted? (As I said above, we
just took a snapshot of that model; updating isn't really a big deal.)
> · Although JSDL specifies a means of including additional
> non-normative elements and attributes in a document, non-normative
> extensions make interoperability difficult. This implies the need for
> normative extensions to JSDL beyond the Posix extension currently
> described in the 1.0 specification. Are there plans to define
> additional extension profiles to address the above questions surrounding
> expressive power and normative descriptions of things like current OS
> types and versions?
We do not currently have *specific* plans to do this, but that does not
mean we cannot have such specific plans in fairly short order. :-)
> · If one accepts the need for a variety of extension profiles
> then this raises the question of what should be in the base case. For
> example, it could be argued that data staging – with its attendant
> aspects such as mount points and mount sources – should be defined in an
> extension rather than in the core specification that will need to cover
> a variety of systems beyond just Linux/Unix/Posix. Similarly, one might
> argue that the base case should focus on what’s /functionally/ necessary
> to execute a job correctly and should leave things that are
> “optimization hints”, such as CPU speed and network bandwidth
> specifications, to extension profiles.
Sounds fairly reasonable, though the abstract filesystem stuff has real
uses in that it makes it much easier to write a job request that deals
with things like varying locations of home directories and scratch
space. The alternative is to assume that temporary files are always
written to somewhere like /tmp, immediately stuffing interop even
between Unix-based HPC centres (we don't write large files to /tmp here
because that's not a cluster-wide resource and is therefore not very
useful) let alone with any Windows-based service.
But it is entirely reasonable to support mount points and sources by
saying things like "if it doesn't match my current configuration, I'll
fault". That is most certainly a legal interpretation of how to process
a JSDL document. This is probably an issue that ought to be covered in
the primer, when we finally write it. :-)
> · How are concepts such as IndividualCPUSpeed and
> IndividualNetworkBandwidth intended to be defined and used in practice?
> I understand the concept of specifying things like the amount of
> physical memory or disk space that a job will require in order to be
> able to run. However, CPU speed and network bandwidth don’t represent
> functional requirements for a job – meaning that a job will correctly
> run and produce the same results irrespective of the CPU speed and
> network bandwidth available to it. Also, the current definitions seem
> fuzzy: the megahertz number for a CPU does not tell you how fast a given
> compute node will be able to execute various kinds of jobs, given all
> the various hardware factors that can affect the performance of a
> processor (consider the presence/absence of floating point support, the
> memory caching architecture, etc.). Similarly, is network bandwidth
> meant to represent the theoretical maximum of a compute node’s network
> interface card? Is it expected to take into account the performance of
> the switch that the compute node is attached to? Since switch
> performance is partially a function of the pattern of (aggregate)
> traffic going through it, the network bandwidth that a job such as an
> MPI application can expect to receive will depend on the /type/ of
> communications patterns employed by the application. How should this
> aspect of network bandwidth be reflected – if at all – in the network
> bandwidth values that a job requests and that compute nodes advertise?
CPU speed is a fairly meaningless value really, since it is at best only
a poor approximant to application performance (which is what people are
really interested in) though app-perf is not portable in any sensible
way as you can't extrapolate from the performance of one application to
that of another. But it's probably the best we've got (we could do FLOPS
or MIPS instead I suppose, but I suspect neither is much better).
Network bandwidth is worse, because it is only meaningful when defined
with respect to a defined pair of endpoints (or, more particularly here,
w.r.t. a defined remote endpoint, since the other one is defined by
where the job is submitted to). What's worse is that latency isn't
defined at all, and that's at least as important for complex apps. In
short, I think we didn't get the network bandwidth right. :-\
However, the general policy of accepting quality-of-service requirements
on resources is one I agree with, since they really do matter and they
are constraints on whether a particular resource is fit for the user's
purpose.
> · JSDL is intended for describing the requirements of a job being
> submitted for execution. To enable matchmaking between submitted jobs
> and available computational resources there must also be a way of
> describing existing/available resources. While much of JSDL can be used
> for this purpose, it is also clear that various extensions are
> necessary. For example, to describe a compute cluster requires that one
> be able to specify the resources for each compute node in the cluster
> (which may be a heterogeneous lot). Similarly, to describe a compute
> node with multiple network interfaces would require an extension to the
> current model, which assumes that only a single instance of such things
> can exist. This raises the question of whether something other than
> JSDL is intended to be used for describing available computational
> resources or whether there are intentions to extend JSDL to enable it to
> describe such resources.
Strictly this is outside the scope of JSDL, where we've stuck firmly to
the niche of describing user requests and not the things with which
those requests may be satisfied. However, I do have some ideas on this. :-)
JSDL terms can indeed be used for resource description, and this is
because you can interpret them as saying something like "this is the
maximal set of processors I will allocate to any job you submit".
The UniGrids project has looked at several ways to do such resource
descriptions based on JSDL. The simplest model we've found was to say
that each target system service (BES-analog) supports a single unified
homogeneous resource description, and that where we have a heterogeneous
cluster we describe that as multiple services, each with a smaller claim
on the range of resources allocated to it. This allows for a simple
resource model and matching rules, but it covers the 90% case neatly.
Let me flesh that out with an example. Suppose we have a cluster of
machines, four from Intel (with 2GB memory each) and four from AMD (two
with 1GB, two with 4GB). This induces 5 services, with resource claims
as follows:
* 2 AMD processors, 4GB
* 4 AMD processors, 1GB
* 4 Intel processors, 2GB
* 6 x86 processors, 2GB
* 8 x86 processors, 1GB
It should be noted that these separate services would actually be pretty
cheap in our implementation, since we can host them in the same
container at a cost of a few extra objects. :-)
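One way to see where those five claims could come from is the Python sketch below. It is a reconstruction under assumptions, not a description of the UniGrids implementation: it assumes one processor per machine, that an AMD or Intel machine also counts as a generic "x86" processor, and that a claim is dropped when another claim can serve every request it could. All function names are hypothetical.

```python
from itertools import product

# Inventory from the example: (vendor, memory in GB), one processor each (assumed).
MACHINES = [("Intel", 2)] * 4 + [("AMD", 1)] * 2 + [("AMD", 4)] * 2


def satisfies(vendor, arch):
    """Assumed hierarchy: every vendor's processor also counts as 'x86'."""
    return arch == "x86" or arch == vendor


def candidate_claims(machines):
    """Count the machines meeting each (architecture, minimum memory) pair."""
    archs = sorted({vendor for vendor, _ in machines}) + ["x86"]
    mems = sorted({mem for _, mem in machines})
    claims = set()
    for arch, mem in product(archs, mems):
        count = sum(1 for v, m in machines if satisfies(v, arch) and m >= mem)
        if count:
            claims.add((arch, count, mem))
    return claims


def dominates(a, b):
    """True if claim a can serve every request that claim b can."""
    (arch_a, n_a, mem_a), (arch_b, n_b, mem_b) = a, b
    return a != b and satisfies(arch_a, arch_b) and n_a >= n_b and mem_a >= mem_b


def service_claims(machines):
    """Keep only the claims that no other claim makes redundant."""
    claims = candidate_claims(machines)
    return sorted(c for c in claims if not any(dominates(o, c) for o in claims))


for arch, count, mem in service_claims(MACHINES):
    print(f"{count} {arch} processors, {mem}GB")
```

Under those assumptions the sketch yields the same five claims as the list above; for instance, a "2 x86 processors, 4GB" claim is dropped because the "2 AMD processors, 4GB" service already covers it.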
Maybe other approaches would be better, but the matter of resource
description is politically tricky for this WG since it gets into space
claimed by others.
> · The current specification stipulates that conformant
> implementations must be able to parse all the elements and attributes
> defined in the spec, but doesn’t require that any of them be supplied.
> Thus, a scheduling service that does nothing could claim to be compliant
> as long as it can correctly parse JSDL documents. For interoperability
> purposes, I would argue that the spec should define a minimum set of
> elements that any compliant service must be able to supply. Otherwise
> clients will not be able to make any assumptions about what they can
> specify in a JSDL document and, in particular, client applications that
> programmatically submit job submission requests will not be possible
> since they can’t assume that any valid JSDL document will actually be
> acceptable by any given job submission service.
I'd argue that this profiling of JSDL should be done by BES or
yourselves (the HPC profile). This is because there are other cases
(e.g. as synchronization points in workflow processing) where null jobs
are actually useful.
> · I have a number of questions about data staging:
I have one major observation: the data staging stuff is known to be a
long way off perfect.
> · Although the notions of working directory and environment
> variables are defined in the posix extension, they are implicitly
> assumed in the data staging section of the core specification. This
> implies to me that either (a) data staging is made an extension or (b)
> these concepts are made a normative, required part of the core
> specification.
Good point. I suppose our response to this should be contingent on
whether "context location" (i.e. working directory) can be defined for
all currently conceived-of job types. I don't know how to answer this
yet. It's certainly possible for many of the things we've identified,
but all?
> · Recursive directory copying can be specified, but is not
> required to be supplied by any job submission service. This makes it
> difficult to write applications that programmatically define their data
> staging needs since they cannot in the current design determine whether
> any given job submission service implements recursive directory
> copying. In practice this may mean that programmatically generated job
> submissions will only ever use lists of individual files to stage.
It means that only _interoperable_ ones will do that, but I think there
are already implementations of directory staging out there and clients
that are generating jobs that use it. I may be wrong though. :-)
> · The current definitions of the well-known file systems seem
> imprecise to me. In particular:
>
> · What are the navigation rules associated with each? Can you cd
> out of the subtree that each represents? ROOT almost certainly does not
> allow that. Is there an assumption that one can cd out of HOME or TMP
> or SCRATCH? Hopefully not, since that would make these file systems
> even more Unix/Linux-centric, plus one would now need to specify what
> clients can expect to see when they do so.
We don't specify. Portable applications don't change directory at all in
my experience; it's too full of strange behaviour as the meaning of all
relative paths changes...
> · What is ROOT intended to be used for? Are there assumptions
> about what resides under root? Are there assumptions about what an
> application can read/write under the ROOT subtree? (ROOT also seems
> like the most Unix-specific of the 4 file system types defined.)
Fair points, and I'd usually assume that the root FS was not writable.
It probably is fairly Unix-specific. But it does make life much easier
for integrating with legacy job systems which can handle the other FS
types by translation into the root and adding a prefix to the paths.
FWIW, I wouldn't use ROOT in my jobs. :-)
> · What are the sharing/consistency semantics of each file system
> in situations where a job is a multi-node application running on
> something like a cluster? Is HOME visible to all compute nodes in a
> data-consistent manner? I’m guessing that TMP would be assumed to be
> strictly local to each compute node, so that things like MPI
> applications would need to be cognizant that they are writing multiple
> files to multiple separate storage systems when they write to a file in
> TMP – and furthermore that data staging of such files after a job has
> run will result in multiple files that all map to the same target file.
I've been assuming that (or at least configuring our local systems so
that) TMP was node-local and SCRATCH was cluster-wide.
> · Can other users write over or delete your data in TMP and/or
> SCRATCH? Is data in these file systems visible to other users or does
> each job get its own private TMP and SCRATCH?
I'd assume that other users never can overwrite your data and wouldn't
make any assumptions at all about the level of isolation of either TMP
or SCRATCH with respect to other jobs owned by the same user. But that
would make an excellent topic to be included in any system policy
statement. (Another policy might be that your job submission has to be
digitally signed and the signer's certificate has to be signed in turn
by a particular CA.)
It might be a good idea to codify some best practice on this in the HPC
profile.
> · How long does data in SCRATCH stay around? Without some
> normative definition – or at least a normative lower bound – on data
> lifetime clients will have to assume that the data can vanish
> arbitrarily and things like multi-job workflows will be very difficult
> to write if they try to take advantage of SCRATCH space to avoid
> unnecessary data staging actions to/from a computing facility.
Again, that's something that is a site policy (I think we've locally got
a "one month after last use, with some fairly coarse granularity"
policy). However, grid systems bring something to the table here in that
by describing jobs as resources in their own right (with definite known
lifespans) it should be possible to design systems that make better
decisions over when a piece of temporary data has become unreferenced
and may be deleted.
Profiling some best practice here seems sensible.
> · From an interoperability and programmatic submission
> point-of-view, it is important to know which transports any given job
> submission service can be expected to support. This seems like another
> area where a normative minimal set that all job submission services must
> implement needs to be defined.
Agreed, but this is something that we basically punted on. (Also, the
notion of what is a source or destination for a staging action turns out
to be messy sometimes.)
> Given these questions, as well as the mandate for the HPC profile to
> define a simple base interface (that can cover the HPC use case of
> submitting jobs to a compute cluster), I would like to present the
> following straw man proposal for feedback from this community:
>
> · Restructure the JSDL specification as a small core
> specification that must be universally implemented – i.e. not just
> parsable, but also suppliable by all compliant job submission services –
> and a number of optional extension profiles.
Sounds sensible.
> · Declare concepts such as executable path, command-line
> arguments, environment variables, and working directory to be generic
> and include them in the core JSDL specification rather than the posix
> extension. This may enable the core specification to support things
> like Windows-based jobs (TBD). The goal here is to define a core JSDL
> specification that in-and-of-itself could enable job submission to a
> fairly wide range of execution subsystems, including both the
> Unix/Linux/Posix world and the Windows world.
Again, it's not quite clear to me that all those concepts are meaningful
in all job types (as opposed to those that are clearly just a way to
execute some binary with a bunch of arguments).
> · Move data staging to an extension.
I'm not sure about this.
> · Create precise definitions of the various concepts introduced
> in the data staging extension, including normative requirements about
> whether or not one can change directory up and out of a file system’s
> root directory, etc.
Good idea.
> · Define which transports are expected to be implemented by all
> compliant services.
Very good idea.
> · Move the various enumeration types – e.g. for CPU architecture
> and OS – to separate specification documents so that they can evolve
> without requiring corresponding and constant revision of the core JSDL
> specification.
Excellent idea. :-)
> · Define extension profiles (eventually, not right away) that
> enable richer description of hardware and software requirements, such as
> details of the CPU architecture or OS capabilities. As part of this,
> move optimization hints, such as CPU speed and network bandwidth
> elements out of the JSDL core and into a separate extension profile.
Sounds pretty sensible to me.
> · Embrace the issue of how to specify available resources at an
> execution subsystem. Start by defining a base case that allows the
> description of compute clusters by creating a compound JSDL document
> that consists of an outer element that ties together a sequence of
> individual JSDL elements, each of which describes a single compute node
> of a compute cluster. Define an explicit notion of extension profiles
> that could define other ways of describing computational resources
> beyond just an array of simple JSDL descriptions.
Interesting. Probably a good topic for discussion going forward.
> Now, as presented above, my straw man proposal looks like suggestions
> for changes that might go into a JSDL-1.1 or JSDL-2.0 specification. In
> the near-term, the HPC profile working group will be exploring what can
> be done with just JSDL-1.0 and restrictions to that specification. The
> restrictions would correspond to disallowing those parts of the JSDL-1.0
> specification that the above proposal advocates moving to extension
> profiles. It will also explore whether a restricted version of the
> posix extension could be used to cover most common Windows cases.
Sounds like a reasonable plan to me.
Donal.