Hi,

Coming from the point of view of the HPC Profile working group, I have several questions about JSDL, as well as some straw man thoughts about how JSDL should or could relate to the HPC Profile specification that I’m involved with.  Some of my questions suggest restrictions on JSDL that an HPC profile specification might make; others suggest potential changes that might be made in future versions of JSDL.  (I’m well aware that JSDL 1.0 was meant as a starting point rather than the final word on job submission descriptions, so please interpret my questions as constructive suggestions rather than criticism of a very fine first step by the JSDL working group.)

 

At a high level, there are several general questions that came up when reading the JSDL 1.0 specification:

·        Can JSDL documents describe jobs other than Linux/Unix/POSIX jobs?  For example, things like mount points and mount sources do not map in a straightforward manner to how file systems are presented in the Windows world.

·        Is JSDL expressive enough to describe all the needs of a job?  For example, it is unclear how one would specify a requirement for a particular instruction-set variation of the x86 architecture (e.g. an SSE3-capable Pentium), or how one would specify that AMD processors are required rather than Intel ones (because the optimized libraries, and the optimizations generated by the compiler used, will differ for each).  For another example, it is unclear how one would specify that all the compute nodes used for something like an MPI job should have identical hardware.
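
To illustrate the granularity at issue (element names here reflect my reading of the JSDL 1.0 schema), a CPU requirement bottoms out at coarse architecture families; there is no normative place to hang an SSE3 or vendor constraint:

    <jsdl:Resources xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
      <jsdl:CPUArchitecture>
        <!-- the enumeration offers only coarse families such as
             sparc, x86, x86_64, ia64; nothing finer is expressible -->
        <jsdl:CPUArchitectureName>x86</jsdl:CPUArchitectureName>
      </jsdl:CPUArchitecture>
    </jsdl:Resources>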

·        How will JSDL’s normative set of enumeration values for things like processor architecture and operating system be kept up-to-date and relevant?  Also, how should things like operating system version be specified in a normative manner that will enable interoperability among multiple clients and job scheduling services?  New Linux and Windows versions, for example, are constantly being introduced, each with potentially significant differences in capabilities that a job might depend on.  Without a normative way of specifying these constantly evolving version sets, it will be difficult, if not impossible, to create interoperable job submission clients and job scheduling services (including meta-scheduling services, where multiple schedulers must interoperate with each other).
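
For example (again, element names per my reading of the 1.0 schema), the OS name is drawn from a fixed enumeration while the version is an unconstrained string, so two implementations have no normative basis for agreeing on what a given version string means:

    <jsdl:OperatingSystem>
      <jsdl:OperatingSystemType>
        <jsdl:OperatingSystemName>LINUX</jsdl:OperatingSystemName>
      </jsdl:OperatingSystemType>
      <!-- free-form string; no normative vocabulary for versions -->
      <jsdl:OperatingSystemVersion>2.6.5</jsdl:OperatingSystemVersion>
    </jsdl:OperatingSystem>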

·        Although JSDL specifies a means of including additional non-normative elements and attributes in a document, non-normative extensions make interoperability difficult.  This implies the need for normative extensions to JSDL beyond the POSIX extension currently described in the 1.0 specification.  Are there plans to define additional extension profiles to address the above questions about expressive power and about normative descriptions of things like current OS types and versions?

·        If one accepts the need for a variety of extension profiles, then this raises the question of what should be in the base case.  For example, it could be argued that data staging (with its attendant aspects such as mount points and mount sources) should be defined in an extension rather than in the core specification, which will need to cover a variety of systems beyond just Linux/Unix/POSIX.  Similarly, one might argue that the base case should focus on what’s functionally necessary to execute a job correctly and should leave “optimization hints”, such as CPU speed and network bandwidth specifications, to extension profiles.

·        How are concepts such as IndividualCPUSpeed and IndividualNetworkBandwidth intended to be defined and used in practice?  I understand the concept of specifying things like the amount of physical memory or disk space that a job requires in order to run.  However, CPU speed and network bandwidth don’t represent functional requirements for a job: a job will run correctly and produce the same results irrespective of the CPU speed and network bandwidth available to it.  Also, the current definitions seem fuzzy.  The megahertz number for a CPU does not tell you how fast a given compute node will execute various kinds of jobs, given all the hardware factors that can affect the performance of a processor (consider the presence or absence of floating point support, the memory caching architecture, etc.).  Similarly, is network bandwidth meant to represent the theoretical maximum of a compute node’s network interface card?  Is it expected to take into account the performance of the switch that the compute node is attached to?  Since switch performance is partially a function of the pattern of (aggregate) traffic going through it, the network bandwidth that a job such as an MPI application can expect to receive will depend on the communication patterns employed by the application.  How should this aspect of network bandwidth be reflected, if at all, in the bandwidth values that a job requests and that compute nodes advertise?
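
For reference, here is roughly what such requirements look like today, if I’m reading the spec correctly (CPU speed in hertz, bandwidth in bits per second, both using the generic range-value structure); note that the elements themselves say nothing about what the numbers are supposed to measure:

    <jsdl:IndividualCPUSpeed>
      <!-- at least 2 GHz; but 2 GHz of what microarchitecture? -->
      <jsdl:LowerBoundedRange>2.0E9</jsdl:LowerBoundedRange>
    </jsdl:IndividualCPUSpeed>
    <jsdl:IndividualNetworkBandwidth>
      <!-- at least 1 Gb/s; NIC maximum? sustained throughput at the switch? -->
      <jsdl:LowerBoundedRange>1.0E9</jsdl:LowerBoundedRange>
    </jsdl:IndividualNetworkBandwidth>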

·        JSDL is intended for describing the requirements of a job being submitted for execution.  To enable matchmaking between submitted jobs and available computational resources, there must also be a way of describing existing/available resources.  While much of JSDL can be used for this purpose, it is also clear that various extensions are necessary.  For example, describing a compute cluster requires that one be able to specify the resources of each compute node in the cluster (which may be a heterogeneous lot).  Similarly, describing a compute node with multiple network interfaces would require an extension to the current model, which assumes that only a single instance of such things can exist.  This raises the question of whether something other than JSDL is intended to be used for describing available computational resources, or whether there are intentions to extend JSDL to enable it to describe such resources.

·        The current specification stipulates that conformant implementations must be able to parse all the elements and attributes defined in the spec, but doesn’t require that any of them be supported.  Thus, a scheduling service that does nothing could claim to be compliant as long as it can correctly parse JSDL documents.  For interoperability purposes, I would argue that the spec should define a minimum set of elements that any compliant service must support.  Otherwise clients will not be able to make any assumptions about what they can specify in a JSDL document; in particular, client applications that programmatically generate job submission requests will not be possible, since they can’t assume that any valid JSDL document will actually be accepted by any given job submission service.

·        I have a number of questions about data staging:

·        Although the notions of working directory and environment variables are defined in the POSIX extension, they are implicitly assumed in the data staging section of the core specification.  This implies to me that either (a) data staging should be made an extension, or (b) these concepts should be made a normative, required part of the core specification.

·        Recursive directory copying can be specified, but is not required to be supported by any job submission service.  This makes it difficult to write applications that programmatically define their data staging needs, since under the current design they cannot determine whether any given job submission service implements recursive directory copying.  In practice this may mean that programmatically generated job submissions will only ever stage lists of individual files.
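
In other words, the only portable idiom may be one DataStaging element per file, along the following lines (the URI, the file name, and the use of the well-known HOME file system are purely illustrative):

    <jsdl:DataStaging>
      <jsdl:FileName>run42/input.dat</jsdl:FileName>
      <jsdl:FilesystemName>HOME</jsdl:FilesystemName>
      <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
      <jsdl:DeleteOnTermination>true</jsdl:DeleteOnTermination>
      <jsdl:Source>
        <jsdl:URI>http://data.example.org/run42/input.dat</jsdl:URI>
      </jsdl:Source>
    </jsdl:DataStaging>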

·        The current definitions of the well-known file systems seem imprecise to me.  In particular:

·        What are the navigation rules associated with each?  Can you cd out of the subtree that each represents?  ROOT almost certainly does not allow that.  Is there an assumption that one can cd out of HOME or TMP or SCRATCH?  Hopefully not, since that would make these file systems even more Unix/Linux-centric, plus one would now need to specify what clients can expect to see when they do so.

·        What is ROOT intended to be used for?  Are there assumptions about what resides under root?  Are there assumptions about what an application can read/write under the ROOT subtree?  (ROOT also seems like the most Unix-specific of the four file system types defined.)

·        What are the sharing/consistency semantics of each file system in situations where a job is a multi-node application running on something like a cluster?  Is HOME visible to all compute nodes in a data-consistent manner?  I’m guessing that TMP would be assumed to be strictly local to each compute node, so that things like MPI applications would need to be cognizant that they are writing multiple files to multiple separate storage systems when they write to a file in TMP – and furthermore that data staging of such files after a job has run will result in multiple files that all map to the same target file.

·        Can other users write over or delete your data in TMP and/or SCRATCH?  Is data in these file systems visible to other users or does each job get its own private TMP and SCRATCH?

·        How long does data in SCRATCH stay around?  Without some normative definition of data lifetime (or at least a normative lower bound), clients will have to assume that the data can vanish arbitrarily, and things like multi-job workflows will be very difficult to write if they try to take advantage of SCRATCH space to avoid unnecessary data staging actions to/from a computing facility.

·        From an interoperability and programmatic submission point of view, it is important to know which transports any given job submission service can be expected to support.  This seems like another area where a normative minimal set, which all job submission services must implement, needs to be defined.
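
As things stand, a client can legitimately write a stage-in request against any URI scheme, with no guarantee that the receiving service implements that transport (gsiftp below is purely illustrative):

    <jsdl:Source>
      <!-- nothing in the spec says whether a compliant service
           must support this, or any other, URI scheme -->
      <jsdl:URI>gsiftp://data.example.org/run42/input.dat</jsdl:URI>
    </jsdl:Source>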

 

Given these questions, as well as the mandate for the HPC profile to define a simple base interface (that can cover the HPC use case of submitting jobs to a compute cluster), I would like to present the following straw man proposal for feedback from this community:

·        Restructure the JSDL specification as a small core specification that must be universally implemented (i.e. not just parsable, but also supported by all compliant job submission services) and a number of optional extension profiles.

·        Declare concepts such as executable path, command-line arguments, environment variables, and working directory to be generic, and include them in the core JSDL specification rather than the POSIX extension.  This may enable the core specification to support things like Windows-based jobs (TBD).  The goal here is to define a core JSDL specification that in and of itself could enable job submission to a fairly wide range of execution subsystems, including both the Unix/Linux/POSIX world and the Windows world.
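
Concretely, these are the elements that today live only in the POSIX extension (names per the jsdl-posix schema; the values are invented for illustration), even though each has an obvious counterpart on Windows:

    <jsdl:Application>
      <jsdl-posix:POSIXApplication
          xmlns:jsdl-posix="http://schemas.ggf.org/jsdl/2005/11/jsdl-posix">
        <jsdl-posix:Executable>/opt/solver/bin/solve</jsdl-posix:Executable>
        <jsdl-posix:Argument>--iterations=100</jsdl-posix:Argument>
        <jsdl-posix:Environment name="LICENSE_SERVER">lic1.example.org</jsdl-posix:Environment>
        <jsdl-posix:WorkingDirectory>/home/me/run42</jsdl-posix:WorkingDirectory>
      </jsdl-posix:POSIXApplication>
    </jsdl:Application>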

·        Move data staging to an extension.

·        Create precise definitions of the various concepts introduced in the data staging extension, including normative requirements about whether or not one can change directory up and out of a file system’s root directory, etc.

·        Define which transports are expected to be implemented by all compliant services.

·        Move the various enumeration types – e.g. for CPU architecture and OS – to separate specification documents so that they can evolve without requiring corresponding and constant revision of the core JSDL specification.

·        Define extension profiles (eventually, not right away) that enable richer description of hardware and software requirements, such as details of CPU architecture or OS capabilities.  As part of this, move optimization hints, such as the CPU speed and network bandwidth elements, out of the JSDL core and into a separate extension profile.

·        Embrace the issue of how to specify available resources at an execution subsystem.  Start by defining a base case that allows the description of compute clusters by creating a compound JSDL document that consists of an outer element that ties together a sequence of individual JSDL elements, each of which describes a single compute node of a compute cluster.  Define an explicit notion of extension profiles that could define other ways of describing computational resources beyond just an array of simple JSDL descriptions.
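
A purely hypothetical sketch of what such a compound document might look like (the hpcp wrapper elements are invented here for illustration; only the inner Resources content is JSDL 1.0):

    <hpcp:ComputeCluster xmlns:hpcp="urn:example:hpc-profile-strawman"
                         xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
      <!-- 32 identical compute nodes -->
      <hpcp:ComputeNode count="32">
        <jsdl:Resources>
          <jsdl:CPUArchitecture>
            <jsdl:CPUArchitectureName>x86_64</jsdl:CPUArchitectureName>
          </jsdl:CPUArchitecture>
          <jsdl:IndividualPhysicalMemory>
            <jsdl:Exact>8.0E9</jsdl:Exact>
          </jsdl:IndividualPhysicalMemory>
        </jsdl:Resources>
      </hpcp:ComputeNode>
      <!-- 4 large-memory nodes, described the same way -->
      <hpcp:ComputeNode count="4">
        <jsdl:Resources>
          <jsdl:IndividualPhysicalMemory>
            <jsdl:Exact>6.4E10</jsdl:Exact>
          </jsdl:IndividualPhysicalMemory>
        </jsdl:Resources>
      </hpcp:ComputeNode>
    </hpcp:ComputeCluster>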

 

Now, as presented above, my straw man proposal looks like suggestions for changes that might go into a JSDL 1.1 or JSDL 2.0 specification.  In the near term, the HPC profile working group will be exploring what can be done with just JSDL 1.0 and restrictions to that specification.  The restrictions would correspond to disallowing those parts of the JSDL 1.0 specification that the above proposal advocates moving to extension profiles.  The group will also explore whether a restricted version of the POSIX extension could be used to cover the most common Windows cases.

 

 

Marvin.