Hi;

> I think with processor types we just grabbed a snapshot of the CIM model and went with that; updating to use a later version of that would not cause great difficulty (though the reverse problem might then exist, in that it might become more difficult to say that any kind of x86 arch is OK for a particular job).

> However, I believe we would assume the following interpretation of processor requirements: if specified, that's what they want for all processors associated with the job. If they didn't specify, they didn't care and anything is therefore good enough.

Agreed. Also, one possibility is to explicitly specify some of the commonly occurring “semi-bound” scenarios, such as “any x86” architecture. I’m not familiar enough with the CIM world to know if they can provide us with guidance on how to solve the problem in general.

> Sounds fairly reasonable, though the abstract filesystem stuff has real uses in that it makes it much easier to write a job request that deals with things like varying locations of home directories and scratch space. The alternative is to assume that temporary files are always written to somewhere like /tmp, immediately stuffing interop even between Unix-based HPC centres (we don't write large files to /tmp here because that's not a cluster-wide resource and is therefore not very useful) let alone with any Windows-based service.

> But it is entirely reasonable to support mount points and sources by saying things like "if it doesn't match my current configuration, I'll fault". That is most certainly a legal interpretation of how to process a JSDL document. This is probably an issue that ought to be covered in the primer, when we finally write it. :-)

If we narrow the definitions of mountpoint and mountsource enough and precisely describe their semantics then we might arrive at something that could be fairly widely used. I’m thinking of things like saying that you can’t navigate “out” of a file system via “cd ..”, etc. This is definitely something to explore.
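
To make the "no navigating out" idea concrete, here is a minimal sketch of what such a rule could look like on the service side. It is purely illustrative and not taken from the JSDL specification: the function name `resolve_in_filesystem`, the example root `/scratch/job42`, and the choice to reject (rather than clamp) escaping paths are all assumptions. It only normalises path strings and does not deal with symbolic links.

```python
import posixpath


def resolve_in_filesystem(fs_root: str, relative_path: str) -> str:
    """Resolve relative_path inside fs_root, refusing to leave the subtree.

    Hypothetical helper: fs_root stands for the directory a file system
    element (HOME, TMP, SCRATCH, ...) has been mapped to at this site.
    """
    root = posixpath.normpath(fs_root)
    # join() discards root if relative_path is absolute, so absolute paths
    # outside the subtree are caught by the containment test below.
    candidate = posixpath.normpath(posixpath.join(root, relative_path))
    inside = candidate == root or candidate.startswith(root.rstrip("/") + "/")
    if not inside:
        raise ValueError(f"{relative_path!r} escapes file system root {root!r}")
    return candidate
```

Under this rule a path such as `out/../data.bin` is still fine, because it stays inside the subtree, while `../other` is refused. Moving around within the subtree is unaffected.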

Since the HPC profile base case treats data staging as being out-of-scope, the base interface profile will exclude these; but that can be done independently of anything else. (And, of course, the data staging extension to the HPC profile will need to deal with this subject in any case, even if it’s ignored in the base case.)

> Strictly this is outside the scope of JSDL, where we've stuck firmly to the niche of describing user requests and not the things with which those requests may be satisfied. However, I do have some ideas on this. :-)

The HPC profile (and BES) have to deal with the issue of describing available resources. So, one way or the other, the subject will get addressed this summer. As much as possible, I’d like to avoid duplicating the work done in JSDL for that – if for no other reason than that users will likely be unhappy if they have to learn two different ways of describing what they will perceive as being variations of the core concept, namely resource description – both required and available.

> Maybe other approaches would be better, but the matter of resource description is politically tricky for this WG since it gets into space claimed by others.

Any advice on this subject would be greatly appreciated. As I said above, I have to deal with this subject one way or the other and would prefer to do so with the minimum of feather-ruffling (while still making progress that results in a usable HPC profile by the end of the summer).

> Good point. I suppose our response to this should be contingent on whether "context location" (i.e. working directory) can be defined for all currently conceived-of job types. I don't know how to answer this yet. It's certainly possible for many of the things we've identified, but all?

If you allow for the notion of file systems and mount points to be in the core spec then I would argue that you are implicitly buying into systems that also support the notion of current working directory (some jobs may of course not use it).

> We don't specify. Portable applications don't change directory at all in my experience; it's too full of strange behaviour as the meaning of all relative paths change...

I would argue that not specifying allowed/disallowed behaviors is a bad approach when interoperability is at issue. (I’m talking about disallowing “cd ..” out the top, not disallowing change-directory within the subtree specified by a file system element.)

> Fair points, and I'd usually assume that the root FS was not writable. It probably is fairly Unix-specific. But it does make life much easier for integrating with legacy job systems which can handle the other FS types by translation into the root and adding a prefix to the paths.

> FWIW, I wouldn't use ROOT in my jobs. :-)

Again, from an interop point-of-view, this seems dangerous.

> It might be a good idea to codify some best practice on this in the HPC profile.

Agreed.

Regarding your reactions to my straw man proposal, it seems like you pretty much agree with everything except the following:

· You’re not convinced how universal the posix extension elements for things like command-line arguments and working directory are. My response is that I think they are at least as universal as the data staging elements.

· You don’t want to move the data staging section out of the core specification. For the HPC profile base case, data staging elements will be prohibited since they are out-of-scope for the base case. The HPC profile extension for data staging will allow the JSDL data staging elements. Whether or not these should be in a separate JSDL extension or whether they can be generalized to cover a wide(r) range of systems is a topic for future discussion.

· You’re leery of tackling the resource description problem. Understandable, although the HPC profile working group will have to tackle it and will be seeking guidance from the JSDL and other communities on how to do so.

Is that a fair characterization of your position?

Thanks,

Marvin.

-----Original Message-----
From: Donal K. Fellows [mailto:donal.k.fellows@manchester.ac.uk]
Sent: Friday, June 09, 2006 2:45 AM
To: Marvin Theimer
Cc: JSDL Working Group; ogsa-bes-wg@ggf.org; Ed Lassettre; Ming Xu (WINDOWS)
Subject: Re: [jsdl-wg] Questions and potential changes to JSDL, as seen from HPC Profile point-of-view

Marvin Theimer wrote:

> Coming from the point-of-view of the HPC Profile working group, I have

> several questions about JSDL, as well as some straw man thoughts about

> how JSDL should/could relate to the HPC Profile specification that I’m

> involved with. Some of my questions lead me to restrictions on JSDL

> that an HPC profile specification might make. Other questions lead to

> potential changes that might be made as part of creating future versions

> of JSDL. (I’m well aware that JSDL 1.0 was meant as a starting point

> rather than the final word on job submission descriptions and so please

> interpret my questions as being an attempt at constructive suggestions

> rather than a criticism of a very fine first step by the JSDL working

> group.)

I'm going to work through these things as I read through them, so the

answers (well, my answers) might be a little disjointed. :-)

> At a high level, there are several general questions that came up when

> reading the JSDL 1.0 specification:

> · Can JSDL documents describe jobs other than Linux/Unix/Posix

> jobs? For example, things like mount points and mount sources do not

> map in a completely straight-forward manner to how file systems are

> provided in the Windows world.

Most certainly. The intent is that ultimately JSDL jobs should be able

to describe pretty much any request for an atomic activity, and the

POSIXApplication stuff was just a seed so that at least one common case

would be handled by the initial specification. Work is ongoing with an

extension to that to support parallel (mainly MPI, but also some other

architectures too) jobs, and we've had in mind other kinds of jobs for a

while (including SQL jobs, Web-service invocation jobs, and JVM jobs,

but obviously not limited to those).

On the matter of mount points, the interpretation of a mount source is

not that the mount source should be mounted at the mount point, but

rather that the job should fail if the mount is not present. Now, a JSDL

consumer might react to that failure by trying to perform the mount, but

it is not required. (The meaning of the name of the mount source is not

defined IIRC, though it probably ought to be URI-like, meaning that SMB

mounts would work fine under windows with suitable munging.)

We'd hope that most jobs would not actually specify the mount point, but

would instead use the facilities provided by the JSDL abstract file

system processing semantics to adapt to whatever was available.
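
As a rough illustration of the mount-source interpretation described above (fail if the mount is absent, with an optional attempt to perform it), a consumer's check might look like the sketch below. Everything in it is hypothetical: the names `MountFault` and `check_mount`, the dictionary of current mounts, and the URI-like source strings are assumptions for the example, not anything defined by JSDL.

```python
from typing import Callable, Dict, Optional


class MountFault(Exception):
    """Raised when a requested mount is not present (hypothetical name)."""


def check_mount(
    mount_point: str,
    mount_source: str,
    current_mounts: Dict[str, str],
    try_mount: Optional[Callable[[str, str], bool]] = None,
) -> None:
    """Accept the request only if mount_source is mounted at mount_point.

    current_mounts maps mount points to sources as seen by this consumer.
    try_mount is an optional callback; a consumer is allowed, but not
    required, to attempt the mount itself before faulting.
    """
    if current_mounts.get(mount_point) == mount_source:
        return
    if try_mount is not None and try_mount(mount_point, mount_source):
        return
    raise MountFault(f"{mount_source} is not mounted at {mount_point}")
```

A consumer that never mounts anything simply omits `try_mount`, which gives the "if it doesn't match my current configuration, I'll fault" behaviour.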

> · Is JSDL expressive enough to describe all the needs of a job?

> For example, it is unclear how one would specify a requirement for

> something like a particular instruction set variation of the IA86

> architecture (e.g. the SSE3 version of the Pentium) or how one would

> specify that AMD processors are required rather than Intel ones (because

> the optimized libraries and the optimizations generated by the compiler

> used will differ for each). For another example, it is unclear how one

> would specify that all the compute nodes used for something like an MPI

> job should have the same hardware.

I think with processor types we just grabbed a snapshot of the CIM model

and went with that; updating to use a later version of that would not

cause great difficulty (though the reverse problem might then exist, in

that it might become more difficult to say that any kind of x86 arch is

OK for a particular job).

However, I believe we would assume the following interpretation of

processor requirements: if specified, that's what they want for all

processors associated with the job. If they didn't specify, they didn't

care and anything is therefore good enough.
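
Spelled out as a small sketch, that interpretation amounts to the following. The function name `processors_acceptable` and the architecture labels in the usage lines are made up for illustration and are not the JSDL enumeration values.

```python
from typing import Iterable, Optional


def processors_acceptable(required_arch: Optional[str],
                          node_archs: Iterable[str]) -> bool:
    """Apply the 'all processors or don't care' reading of the requirement.

    required_arch is None when the job document does not state a
    processor architecture; node_archs lists the architecture of every
    processor that would be associated with the job.
    """
    if required_arch is None:
        # Nothing specified: the submitter did not care, so anything will do.
        return True
    # Something specified: it must hold for every processor of the job.
    return all(arch == required_arch for arch in node_archs)


# Illustrative labels only.
print(processors_acceptable(None, ["arch_a", "arch_b"]))
print(processors_acceptable("arch_a", ["arch_a", "arch_b"]))
```

Note that exact matching like this is what makes "any kind of x86" awkward to express once the enumeration becomes more fine-grained.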

> · How will JSDL’s normative set of enumeration values for things

> like processor architecture and operating system be kept up-to-date and

> relevant? Also, how should things like operating system version get

> specified in a normative manner that will enable interoperability among

> multiple clients and job scheduling services? For example, things like

> Linux and Windows versions are constantly being introduced, each with

> potentially significant differences in capabilities that a job might

> depend on. Without a normative way of specifying these constantly

> evolving version sets it will be difficult, if not impossible, to create

> interoperable job submission clients and job scheduling services

> (including meta-scheduling services where multiple schedulers must

> interoperate with each other).

I don't know. :-) Maybe we should say that additional things as defined

in some other model (e.g. CIM) SHOULD be accepted? (As I said above, we

just took a snapshot of that model; updating isn't really a big deal.)

> · Although JSDL specifies a means of including additional

> non-normative elements and attributes in a document, non-normative

> extensions make interoperability difficult. This implies the need for

> normative extensions to JSDL beyond the Posix extension currently

> described in the 1.0 specification. Are there plans to define

> additional extension profiles to address the above questions surrounding

> expressive power and normative descriptions of things like current OS

> types and versions?

We do not currently have *specific* plans to do this, but that does not

mean we cannot have such specific plans in fairly short order. :-)

> · If one accepts the need for a variety of extension profiles

> then this raises the question of what should be in the base case. For

> example, it could be argued that data staging – with its attendant

> aspects such as mount points and mount sources – should be defined in an

> extension rather than in the core specification that will need to cover

> a variety of systems beyond just Linux/Unix/Posix. Similarly, one might

> argue that the base case should focus on what’s /functionally/ necessary

> to execute a job correctly and should leave things that are

> “optimization hints”, such as CPU speed and network bandwidth

> specifications, to extension profiles.

Sounds fairly reasonable, though the abstract filesystem stuff has real

uses in that it makes it much easier to write a job request that deals

with things like varying locations of home directories and scratch

space. The alternative is to assume that temporary files are always

written to somewhere like /tmp, immediately stuffing interop even

between Unix-based HPC centres (we don't write large files to /tmp here

because that's not a cluster-wide resource and is therefore not very

useful) let alone with any Windows-based service.

But it is entirely reasonable to support mount points and sources by

saying things like "if it doesn't match my current configuration, I'll

fault". That is most certainly a legal interpretation of how to process

a JSDL document. This is probably an issue that ought to be covered in

the primer, when we finally write it. :-)

> · How are concepts such as IndividualCPUSpeed and

> IndividualNetworkBandwidth intended to be defined and used in practice?

> I understand the concept of specifying things like the amount of

> physical memory or disk space that a job will require in order to be

> able to run. However, CPU speed and network bandwidth don’t represent

> functional requirements for a job – meaning that a job will correctly

> run and produce the same results irrespective of the CPU speed and

> network bandwidth available to it. Also, the current definitions seem

> fuzzy: the megahertz number for a CPU does not tell you how fast a given

> compute node will be able to execute various kinds of jobs, given all

> the various hardware factors that can affect the performance of a

> processor (consider the presence/absence of floating point support, the

> memory caching architecture, etc.). Similarly, is network bandwidth

> meant to represent the theoretical maximum of a compute node’s network

> interface card? Is it expected to take into account the performance of

> the switch that the compute node is attached to? Since switch

> performance is partially a function of the pattern of (aggregate)

> traffic going through it, the network bandwidth that a job such as an

> MPI application can expect to receive will depend on the /type/ of

> communications patterns employed by the application. How should this

> aspect of network bandwidth be reflected – if at all – in the network

> bandwidth values that a job requests and that compute nodes advertise?

CPU speed is a fairly meaningless value really, since it is at best only

a poor approximant to application performance (which is what people are

really interested in) though app-perf is not portable in any sensible

way as you can't extrapolate from the performance of one application to

that of another. But it's probably the best we've got (we could do FLOPS

or MIPS instead I suppose, but I suspect neither is much better).

Network bandwidth is worse, because it is only meaningful when defined

with respect to a defined pair of endpoints (or, more particularly here,

w.r.t. a defined remote endpoint, since the other one is defined by

where the job is submitted to). What's worse is that latency isn't

defined at all, and that's at least as important for complex apps. In

short, I think we didn't get the network bandwidth right. :-\

However, the general policy of accepting quality-of-service requirements

on resources is one I agree with, since they really do matter and they

are constraints on whether a particular resource is fit for the user's

purpose.

> · JSDL is intended for describing the requirements of a job being

> submitted for execution. To enable matchmaking between submitted jobs

> and available computational resources there must also be a way of

> describing existing/available resources. While much of JSDL can be used

> for this purpose, it is also clear that various extensions are

> necessary. For example, to describe a compute cluster requires that one

> be able to specify the resources for each compute node in the cluster

> (which may be a heterogeneous lot). Similarly, to describe a compute

> node with multiple network interfaces would require an extension to the

> current model, which assumes that only a single instance of such things

> can exist. This raises the question of whether something other than

> JSDL is intended to be used for describing available computational

> resources or whether there are intentions to extend JSDL to enable it to

> describe such resources.

Strictly this is outside the scope of JSDL, where we've stuck firmly to

the niche of describing user requests and not the things with which

those requests may be satisfied. However, I do have some ideas on this. :-)

JSDL terms can indeed be used for resource description, and this is

because you can interpret them as saying something like "this is the

maximal set of processors I will allocate to any job you submit".

The UniGrids project has looked at several ways to do such resource

descriptions based over JSDL. The simplest model we've found was to say

that each target system service (BES-analog) supports a single unified

homogeneous resource description, and that where we have a heterogeneous

cluster we describe that as multiple services, each claiming a smaller

range of the resources allocated to it. This keeps the resource model

and matching rules simple, while still covering the 90% case neatly.

Let me flesh that out with an example. Suppose we have a cluster of

machines, four from Intel (with 2GB memory each) and four from AMD (two

with 1GB, two with 4GB). This induces 5 services, with resource claims

as follows:

* 2 AMD processors, 4GB

* 4 AMD processors, 1GB

* 4 Intel processors, 2GB

* 6 x86 processors, 2GB

* 8 x86 processors, 1GB

It should be noted that these separate services would actually be pretty

cheap in our implementation, since we can host them in the same

container at a cost of a few extra objects. :-)
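
For anyone who wants to check the arithmetic, the five claims above can be reproduced with the short sketch below. It rests on two assumptions that are a reading of the example rather than anything UniGrids-specific: a memory figure in a claim means "at least this much per node", and a generic x86 claim is dropped when it would cover exactly the same machines as a vendor-specific one. The names `Node` and `derive_claims` are invented for the sketch.

```python
from typing import List, NamedTuple, Tuple


class Node(NamedTuple):
    vendor: str
    memory_gb: int


def derive_claims(nodes: List[Node]) -> List[Tuple[int, str, int]]:
    """Return (processor count, class label, minimum memory in GB) claims."""
    claims = []
    seen = set()  # node sets that already have a claim
    vendors = sorted({n.vendor for n in nodes})
    # Vendor-specific classes first, then the catch-all "x86" class.
    classes = [(v, lambda n, v=v: n.vendor == v) for v in vendors]
    classes.append(("x86", lambda n: True))
    for label, is_member in classes:
        pool = [i for i, n in enumerate(nodes) if is_member(n)]
        for threshold in sorted({nodes[i].memory_gb for i in pool}):
            covered = frozenset(
                i for i in pool if nodes[i].memory_gb >= threshold
            )
            if covered in seen:
                continue  # a more specific claim already describes these
            seen.add(covered)
            claims.append((len(covered), label, threshold))
    return claims


cluster = [Node("Intel", 2)] * 4 + [Node("AMD", 1)] * 2 + [Node("AMD", 4)] * 2
for count, label, memory in derive_claims(cluster):
    print(f"{count} {label} processors, {memory}GB")
```

The x86 claim at 4GB is the one that gets dropped, since it would describe the same two AMD machines as the "2 AMD processors, 4GB" service.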

Maybe other approaches would be better, but the matter of resource

description is politically tricky for this WG since it gets into space

claimed by others.

> · The current specification stipulates that conformant

> implementations must be able to parse all the elements and attributes

> defined in the spec, but doesn’t require that any of them be supplied.

> Thus, a scheduling service that does nothing could claim to be compliant

> as long as it can correctly parse JSDL documents. For interoperability

> purposes, I would argue that the spec should define a minimum set of

> elements that any compliant service must be able to supply. Otherwise

> clients will not be able to make any assumptions about what they can

> specify in a JSDL document and, in particular, client applications that

> programmatically submit job submission requests will not be possible

> since they can’t assume that any valid JSDL document will actually be

> acceptable by any given job submission service.

I'd argue that this profiling of JSDL should be done by BES or

yourselves (the HPC profile). This is because there are other cases

(e.g. as synchronization points in workflow processing) where null jobs

are actually useful.

> · I have a number of questions about data staging:

I have one major observation: the data staging stuff is known to be a

long way off perfect.

> · Although the notions of working directory and environment

> variables are defined in the posix extension, they are implicitly

> assumed in the data staging section of the core specification. This

> implies to me that either (a) data staging is made an extension or (b)

> these concepts are made a normative, required part of the core

> specification.

Good point. I suppose our response to this should be contingent on

whether "context location" (i.e. working directory) can be defined for

all currently conceived-of job types. I don't know how to answer this

yet. It's certainly possible for many of the things we've identified,

but all?

> · Recursive directory copying can be specified, but is not

> required to be supplied by any job submission service. This makes it

> difficult to write applications that programmatically define their data

> staging needs since they cannot in the current design determine whether

> any given job submission service implements recursive directory

> copying. In practice this may mean that programmatically generated job

> submissions will only ever use lists of individual files to stage.

It means that only _interoperable_ ones will do that, but I think there

are already implementations of directory staging out there and clients

that are generating jobs that use it. I may be wrong though. :-)

> · The current definitions of the well-known file systems seem

> imprecise to me. In particular:

> · What are the navigation rules associated with each? Can you cd

> out of the subtree that each represents? ROOT almost certainly does not

> allow that. Is there an assumption that one can cd out of HOME or TMP

> or SCRATCH? Hopefully not, since that would make these file systems

> even more Unix/Linux-centric, plus one would now need to specify what

> clients can expect to see when they do so.

We don't specify. Portable applications don't change directory at all in

my experience; it's too full of strange behaviour as the meaning of all

relative paths change...

> · What is ROOT intended to be used for? Are there assumptions

> about what resides under root? Are there assumptions about what an

> application can read/write under the ROOT subtree? (ROOT also seems

> like the most Unix-specific of the 4 file system types defined.)

Fair points, and I'd usually assume that the root FS was not writable.

It probably is fairly Unix-specific. But it does make life much easier

for integrating with legacy job systems which can handle the other FS

types by translation into the root and adding a prefix to the paths.

FWIW, I wouldn't use ROOT in my jobs. :-)

> · What are the sharing/consistency semantics of each file system

> in situations where a job is a multi-node application running on

> something like a cluster? Is HOME visible to all compute nodes in a

> data-consistent manner? I’m guessing that TMP would be assumed to be

> strictly local to each compute node, so that things like MPI

> applications would need to be cognizant that they are writing multiple

> files to multiple separate storage systems when they write to a file in

> TMP – and furthermore that data staging of such files after a job has

> run will result in multiple files that all map to the same target file.

I've been assuming that (or at least configuring our local systems so

that) TMP was node-local and SCRATCH was cluster-wide.

> · Can other users write over or delete your data in TMP and/or

> SCRATCH? Is data in these file systems visible to other users or does

> each job get its own private TMP and SCRATCH?

I'd assume that other users never can overwrite your data and wouldn't

make any assumptions at all about the level of isolation of either TMP

or SCRATCH with respect to other jobs owned by the same user. But that

would make an excellent topic to be included in any system policy

statement. (Another policy might be that your job submission has to be

digitally signed and the signer's certificate has to be signed in turn

by a particular CA.)

It might be a good idea to codify some best practice on this in the HPC

profile.

> · How long does data in SCRATCH stay around? Without some

> normative definition – or at least a normative lower bound – on data

> lifetime clients will have to assume that the data can vanish

> arbitrarily and things like multi-job workflows will be very difficult

> to write if they try to take advantage of SCRATCH space to avoid

> unnecessary data staging actions to/from a computing facility.

Again, that's something that is a site policy (I think we've locally got

a "one month after last use, with some fairly coarse granularity"

policy). However, grid systems bring something to the table here in that

by describing jobs as resources in their own right (with definite known

lifespans) it should be possible to design systems that make better

decisions over when a piece of temporary data has become unreferenced

and may be deleted.

Profiling some best practice here seems sensible.

> · From an interoperability and programmatic submission

> point-of-view, it is important to know which transports any given job

> submission service can be expected to support. This seems like another

> area where a normative minimal set that all job submission services must

> implement needs to be defined.

Agreed, but this is something that we basically punted on. (Also, the

notion of what is a source or destination for a staging action turns out

to be messy sometimes. Alas.)

> Given these questions, as well as the mandate for the HPC profile to

> define a simple base interface (that can cover the HPC use case of

> submitting jobs to a compute cluster), I would like to present the

> following straw man proposal for feedback from this community:

> · Restructure the JSDL specification as a small core

> specification that must be universally implemented – i.e. not just

> parsable, but also suppliable by all compliant job submission services –

> and a number of optional extension profiles.

Sounds sensible.

> · Declare concepts such as executable path, command-line

> arguments, environment variables, and working directory to be generic

> and include them in the core JSDL specification rather than the posix

> extension. This may enable the core specification to support things

> like Windows-based jobs (TBD). The goal here is to define a core JSDL

> specification that in-and-of-itself could enable job submission to a

> fairly wide range of execution subsystems, including both the

> Unix/Linux/Posix world and the Windows world.

Again, it's not quite clear to me that all those concepts are meaningful

in all job types (as opposed to those that are clearly just a way to

execute some binary with a bunch of arguments).

> · Move data staging to an extension.

I'm not sure about this.

> · Create precise definitions of the various concepts introduced

> in the data staging extension, including normative requirements about

> whether or not one can change directory up and out of a file system’s

> root directory, etc.

Good idea.

> · Define which transports are expected to be implemented by all

> compliant services.

Very good idea.

> · Move the various enumeration types – e.g. for CPU architecture

> and OS – to separate specification documents so that they can evolve

> without requiring corresponding and constant revision of the core JSDL

> specification.

Excellent idea. :-)

> · Define extension profiles (eventually, not right away) that

> enable richer description of hardware and software requirements, such as

> details of the CPU architecture or OS capabilities. As part of this,

> move optimization hints, such as CPU speed and network bandwidth

> elements out of the JSDL core and into a separate extension profile.

Sounds pretty sensible to me.

> · Embrace the issue of how to specify available resources at an

> execution subsystem. Start by defining a base case that allows the

> description of compute clusters by creating a compound JSDL document

> that consists of an outer element that ties together a sequence of

> individual JSDL elements, each of which describes a single compute node

> of a compute cluster. Define an explicit notion of extension profiles

> that could define other ways of describing computational resources

> beyond just an array of simple JSDL descriptions.

Interesting. Probably a good topic for discussion going forward.

> Now, as presented above, my straw man proposal looks like suggestions

> for changes that might go into a JSDL-1.1 or JSDL-2.0 specification. In

> the near-term, the HPC profile working group will be exploring what can

> be done with just JSDL-1.0 and restrictions to that specification. The

> restrictions would correspond to disallowing those parts of the JSDL-1.0

> specification that the above proposal advocates moving to extension

> profiles. It will also explore whether a restricted version of the

> posix extension could be used to cover most common Windows cases.

Sounds like a reasonable plan to me.

Donal.