Hi,
Coming from the point of view of the HPC Profile working group, I have
several questions about JSDL, as well as some straw man thoughts about how JSDL
should/could relate to the HPC Profile specification that I’m involved
with. Some of my questions lead me to restrictions on JSDL that an HPC
profile specification might make. Other questions lead to potential
changes that might be made as part of creating future versions of JSDL.
(I’m well aware that JSDL 1.0 was meant as a starting point rather than the
final word on job submission descriptions and so please interpret my questions
as being an attempt at constructive suggestions rather than a criticism of a
very fine first step by the JSDL working group.)
At a high level, there are several general questions that came up when
reading the JSDL 1.0 specification:
·
Can JSDL documents describe jobs
other than Linux/Unix/Posix jobs? For example, things like mount points
and mount sources do not map in a completely straightforward manner to how
file systems are provided in the Windows world.
·
Is JSDL expressive enough to
describe all the needs of a job? For example, it is unclear how one would
specify a requirement for something like a particular instruction set variation
of the x86 (IA-32) architecture (e.g. the SSE3 version of the Pentium) or how one
would specify that AMD processors are required rather than Intel ones (because
the optimized libraries and the optimizations generated by the compiler used
will differ for each). For another example, it is unclear how one would
specify that all the compute nodes used for something like an MPI job should
have the same hardware.
·
How will JSDL’s normative
set of enumeration values for things like processor architecture and operating
system be kept up-to-date and relevant? Also, how should things like
operating system version get specified in a normative manner that will enable
interoperability among multiple clients and job scheduling services? For
example, things like Linux and Windows versions are constantly being introduced,
each with potentially significant differences in capabilities that a job might
depend on. Without a normative way of specifying these constantly
evolving version sets it will be difficult, if not impossible, to create
interoperable job submission clients and job scheduling services (including
meta-scheduling services where multiple schedulers must interoperate with each
other).
·
Although JSDL specifies a means of
including additional non-normative elements and attributes in a document,
non-normative extensions make interoperability difficult. This implies
the need for normative extensions to JSDL beyond the Posix extension currently
described in the 1.0 specification. Are there plans to define additional
extension profiles to address the above questions surrounding expressive power
and normative descriptions of things like current OS types and versions?
·
If one accepts the need for a
variety of extension profiles then this raises the question of what should be
in the base case. For example, it could be argued that data staging
– with its attendant aspects such as mount points and mount sources
– should be defined in an extension rather than in the core specification
that will need to cover a variety of systems beyond just
Linux/Unix/Posix. Similarly, one might argue that the base case should
focus on what’s functionally
necessary to execute a job correctly and should leave things that are
“optimization hints”, such as CPU speed and network bandwidth
specifications, to extension profiles.
·
How are concepts such as
IndividualCPUSpeed and IndividualNetworkBandwidth intended to be defined and
used in practice? I understand the concept of specifying things like the
amount of physical memory or disk space that a job will require in order to be
able to run. However, CPU speed and network bandwidth don’t
represent functional requirements for a job – meaning that a job will
correctly run and produce the same results irrespective of the CPU speed and
network bandwidth available to it. Also, the current definitions seem
fuzzy: the megahertz number for a CPU does not tell you how fast a given
compute node will be able to execute various kinds of jobs, given all the
various hardware factors that can affect the performance of a processor
(consider the presence/absence of floating point support, the memory caching
architecture, etc.). Similarly, is network bandwidth meant to represent
the theoretical maximum of a compute node’s network interface card?
Is it expected to take into account the performance of the switch that the
compute node is attached to? Since switch performance is partially a
function of the pattern of (aggregate) traffic going through it, the network
bandwidth that a job such as an MPI application can expect to receive will
depend on the type of communications
patterns employed by the application. How should this aspect of network
bandwidth be reflected – if at all – in the network bandwidth
values that a job requests and that compute nodes advertise?
·
JSDL is intended for describing
the requirements of a job being submitted for execution. To enable
matchmaking between submitted jobs and available computational resources there
must also be a way of describing existing/available resources. While much
of JSDL can be used for this purpose, it is also clear that various extensions
are necessary. For example, to describe a compute cluster requires that
one be able to specify the resources for each compute node in the cluster
(which may be a heterogeneous lot). Similarly, to describe a compute node
with multiple network interfaces would require an extension to the current
model, which assumes that only a single instance of such things can
exist. This raises the question of whether something other than JSDL is
intended to be used for describing available computational resources or whether
there are intentions to extend JSDL to enable it to describe such resources.
·
The current specification
stipulates that conformant implementations must be able to parse all the
elements and attributes defined in the spec, but doesn’t require that any
of them be supplied. Thus, a scheduling service that does nothing could
claim to be compliant as long as it can correctly parse JSDL documents.
For interoperability purposes, I would argue that the spec should define a minimum
set of elements that any compliant service must be able to supply. Otherwise
clients will not be able to make any assumptions about what they can specify in
a JSDL document and, in particular, client applications that programmatically
submit jobs will not be possible, since they can’t
assume that any valid JSDL document will actually be accepted by any given
job submission service.
·
I have a number of questions about
data staging:
·
Although the notions of working
directory and environment variables are defined in the posix extension, they
are implicitly assumed in the data staging section of the core
specification. This implies to me that either (a) data staging should be made an
extension or (b) these concepts should be made a normative, required part of the core
specification.
·
Recursive directory copying can be
specified, but is not required to be supplied by any job submission
service. This makes it difficult to write applications that
programmatically define their data staging needs since they cannot in the
current design determine whether any given job submission service implements
recursive directory copying. In practice this may mean that
programmatically generated job submissions will only ever use lists of
individual files to stage.
·
The current definitions of the
well-known file systems seem imprecise to me. In particular:
·
What are the navigation rules
associated with each? Can you cd out of the subtree that each
represents? ROOT almost certainly does not allow that. Is there an
assumption that one can cd out of HOME or TMP or SCRATCH? Hopefully not,
since that would make these file systems even more Unix/Linux-centric, plus one
would now need to specify what clients can expect to see when they do so.
·
What is ROOT intended to be used
for? Are there assumptions about what resides under root? Are there
assumptions about what an application can read/write under the ROOT subtree?
(ROOT also seems like the most Unix-specific of the four file system types
defined.)
·
What are the sharing/consistency
semantics of each file system in situations where a job is a multi-node
application running on something like a cluster? Is HOME visible to all
compute nodes in a data-consistent manner? I’m guessing that TMP
would be assumed to be strictly local to each compute node, so that things like
MPI applications would need to be cognizant that they are writing multiple
files to multiple separate storage systems when they write to a file in TMP
– and furthermore that data staging of such files after a job has run
will result in multiple files that all map to the same target file.
·
Can other users write over or delete
your data in TMP and/or SCRATCH? Is data in these file systems visible to
other users or does each job get its own private TMP and SCRATCH?
·
How long does data in SCRATCH stay
around? Without some normative definition – or at least a normative
lower bound – on data lifetime clients will have to assume that the data
can vanish arbitrarily and things like multi-job workflows will be very difficult
to write if they try to take advantage of SCRATCH space to avoid unnecessary
data staging actions to/from a computing facility.
·
From an interoperability and
programmatic submission point of view, it is important to know which transports
any given job submission service can be expected to support. This seems
like another area where a normative minimal set that all job submission
services must implement needs to be defined.
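To make the parse-versus-supply concern above a bit more concrete, here is a rough Python sketch of what a programmatic client is effectively forced to do today: walk a JSDL document and compare the elements it uses against whatever set a particular service happens to honor. The SUPPORTED set and the function name are purely hypothetical (JSDL 1.0 defines no such minimum set, which is exactly the problem), and the namespace URI is the JSDL 1.0 one as I understand it.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 core namespace (as I understand it from the 1.0 specification).
JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"

# Hypothetical: the elements one particular service is able to honor.
# Nothing in JSDL 1.0 defines or advertises such a set.
SUPPORTED = {
    "JobDefinition", "JobDescription", "JobIdentification", "JobName",
    "Application", "ApplicationName", "Resources", "TotalCPUCount", "Exact",
}

def unsupported_elements(jsdl_text, supported=SUPPORTED):
    """Return the sorted local names of JSDL elements not in `supported`."""
    root = ET.fromstring(jsdl_text)
    prefix = "{" + JSDL_NS + "}"
    found = set()
    for elem in root.iter():
        # Only consider elements in the JSDL core namespace.
        if elem.tag.startswith(prefix):
            name = elem.tag[len(prefix):]
            if name not in supported:
                found.add(name)
    return sorted(found)

EXAMPLE = """\
<jsdl:JobDefinition xmlns:jsdl="http://schemas.ggf.org/jsdl/2005/11/jsdl">
  <jsdl:JobDescription>
    <jsdl:JobIdentification>
      <jsdl:JobName>demo</jsdl:JobName>
    </jsdl:JobIdentification>
    <jsdl:Resources>
      <jsdl:IndividualCPUSpeed>
        <jsdl:LowerBoundedRange>2.0E9</jsdl:LowerBoundedRange>
      </jsdl:IndividualCPUSpeed>
    </jsdl:Resources>
    <jsdl:DataStaging>
      <jsdl:FileName>input.dat</jsdl:FileName>
      <jsdl:CreationFlag>overwrite</jsdl:CreationFlag>
    </jsdl:DataStaging>
  </jsdl:JobDescription>
</jsdl:JobDefinition>
"""

if __name__ == "__main__":
    # Lists the elements this hypothetical service would not honor.
    print(unsupported_elements(EXAMPLE))
```

The point of the sketch is that the client has no normative way to obtain SUPPORTED; with a minimum required set in the spec, a client could at least rely on that subset without asking.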
Given these questions, as well as the mandate for the HPC profile to
define a simple base interface (that can cover the HPC use case of submitting
jobs to a compute cluster), I would like to present the following straw man
proposal for feedback from this community:
·
Restructure the JSDL specification
as a small core specification that must be universally implemented – i.e.
not just parsable, but also suppliable by all compliant job submission services
– and a number of optional extension profiles.
·
Declare concepts such as
executable path, command-line arguments, environment variables, and working
directory to be generic and include them in the core JSDL specification rather
than the posix extension. This may enable the core specification to
support things like Windows-based jobs (TBD). The goal here is to define
a core JSDL specification that in-and-of-itself could enable job submission to
a fairly wide range of execution subsystems, including both the
Unix/Linux/Posix world and the Windows world.
·
Move data staging to an extension.
·
Create precise definitions of the
various concepts introduced in the data staging extension, including normative
requirements about whether or not one can change directory up and out of a file
system’s root directory, etc.
·
Define which transports are
expected to be implemented by all compliant services.
·
Move the various enumeration types
– e.g. for CPU architecture and OS – to separate specification
documents so that they can evolve without requiring corresponding and constant
revision of the core JSDL specification.
·
Define extension profiles
(eventually, not right away) that enable richer description of hardware and
software requirements, such as details of the CPU architecture or OS
capabilities. As part of this, move optimization hints, such as the CPU speed
and network bandwidth elements, out of the JSDL core and into a separate
extension profile.
·
Embrace the issue of how to
specify available resources at an execution subsystem. Start by defining
a base case that allows the description of compute clusters by creating a
compound JSDL document that consists of an outer element that ties together a
sequence of individual JSDL elements, each of which describes a single compute
node of a compute cluster. Define an explicit notion of extension
profiles that could define other ways of describing computational resources
beyond just an array of simple JSDL descriptions.
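As a rough illustration of the last item, the following Python sketch builds such a compound document: an outer element wrapping a sequence of JSDL Resources elements, one per compute node. The outer namespace, the ClusterDescription element name and the helper functions are all hypothetical (nothing like them exists in JSDL 1.0); only the inner Resources elements use JSDL 1.0 names.

```python
import xml.etree.ElementTree as ET

# JSDL 1.0 core namespace (as I understand it from the 1.0 specification).
JSDL_NS = "http://schemas.ggf.org/jsdl/2005/11/jsdl"
# Hypothetical namespace for the outer wrapper; not defined by any spec.
CLUSTER_NS = "urn:example:hpc-profile:cluster"

ET.register_namespace("jsdl", JSDL_NS)
ET.register_namespace("cl", CLUSTER_NS)

def q(ns, name):
    """Build an ElementTree qualified name of the form {namespace}name."""
    return "{" + ns + "}" + name

def node_description(cpu_arch, cpu_count, memory_bytes):
    """Describe one compute node as a JSDL Resources element."""
    res = ET.Element(q(JSDL_NS, "Resources"))
    arch = ET.SubElement(res, q(JSDL_NS, "CPUArchitecture"))
    ET.SubElement(arch, q(JSDL_NS, "CPUArchitectureName")).text = cpu_arch
    for tag, value in (("IndividualCPUCount", cpu_count),
                       ("IndividualPhysicalMemory", memory_bytes)):
        rng = ET.SubElement(res, q(JSDL_NS, tag))
        ET.SubElement(rng, q(JSDL_NS, "Exact")).text = str(float(value))
    return res

def cluster_description(nodes):
    """Wrap per-node descriptions in a (hypothetical) outer cluster element."""
    cluster = ET.Element(q(CLUSTER_NS, "ClusterDescription"))
    for node in nodes:
        cluster.append(node_description(**node))
    return cluster

def architectures(cluster):
    """Return the set of CPU architecture names found across all nodes."""
    return {e.text for e in cluster.iter(q(JSDL_NS, "CPUArchitectureName"))}

if __name__ == "__main__":
    cluster = cluster_description([
        {"cpu_arch": "x86_64", "cpu_count": 4, "memory_bytes": 8 * 2**30},
        {"cpu_arch": "x86", "cpu_count": 2, "memory_bytes": 2 * 2**30},
    ])
    print(ET.tostring(cluster, encoding="unicode"))
    # More than one architecture means the cluster is a heterogeneous lot.
    print(len(architectures(cluster)) > 1)
```

A scheduler could answer questions such as "are all nodes the same hardware?" from a description like this, which is one of the things a job description alone cannot express today.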
Now, as presented above, my straw man proposal looks like suggestions
for changes that might go into a JSDL-1.1 or JSDL-2.0 specification. In
the near-term, the HPC profile working group will be exploring what can be done
with just JSDL-1.0 and restrictions to that specification. The
restrictions would correspond to disallowing those parts of the JSDL-1.0
specification that the above proposal advocates moving to extension
profiles. The working group will also explore whether a restricted version of the posix
extension could be used to cover most common Windows cases.
Marvin.