Hi,

Participants: Daniel, Mariusz, Roger, Andre (SAGA), Peter

Line 707 - Reaction on reaching soft / hard limits
- Grid Engine: Signal depends on particular limit type
- Agreement that crossing a hard limit should lead to FAILED state of
the DRMAA job
- Agreement to remove softResourceLimits completely, since DRMAA cannot
promise any kind of common semantics, and since the attribute is not
important enough to add it as opaque concept (as with slots)

I promised to do some research, so:

We are mixing different resources whose limits serve different purposes
and therefore have different associated policies:

enum ResourceLimitType {
    CORE_FILE_SIZE, CPU_TIME, DATA_SEG_SIZE, FILE_SIZE,
    OPEN_FILES, STACK_SIZE, VIRTUAL_MEMORY, WALLCLOCK_TIME
};

Let's take the first one:

CORE_FILE_SIZE and Grid Engine

man queue_conf: "The remaining parameters in the queue configuration
template specify per job soft and hard resource limits as implemented
by the setrlimit(2) ..."

man setrlimit: "RLIMIT_CORE Maximum size of core file. When 0 no core
dump files are created. When non-zero, larger dumps are truncated to
this size."

The difference between the soft and the hard limit is defined as follows:
"The hard limit acts as a ceiling for the soft limit: an
unprivileged process may only set its soft limit to a value in the
range from 0 up to the hard limit, and (irreversibly) lower its hard
limit."
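
To make the soft/hard relationship concrete, here is a minimal sketch using
Python's standard resource module (Unix only). The helper name
set_soft_core_limit is made up for illustration; only getrlimit(), setrlimit()
and RLIMIT_CORE come from the OS interface quoted above.

```python
import resource

def set_soft_core_limit(new_soft):
    """Move the soft RLIMIT_CORE within [0, hard]; the hard limit is untouched."""
    soft, hard = resource.getrlimit(resource.RLIMIT_CORE)
    # The hard limit is the ceiling for the soft limit, so clamp the request
    # unless the hard limit is unlimited.
    if hard != resource.RLIM_INFINITY and new_soft > hard:
        new_soft = hard
    resource.setrlimit(resource.RLIMIT_CORE, (new_soft, hard))
    return resource.getrlimit(resource.RLIMIT_CORE)

if __name__ == "__main__":
    # A soft limit of 0 disables core dumps; the process itself keeps running.
    print(set_soft_core_limit(0))
```

Note that nothing here terminates the process: the limit only changes what
happens to a core file if the process dies for some other reason.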

Exceeding other limits, such as OPEN_FILES, would merely cause calls like
open() to fail with an error, which the application can handle and still
exit with status 0.
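
A small sketch of that case, assuming Linux and Python's standard resource
module. The function name open_until_limit and the soft limit of 32 are
arbitrary choices for illustration. The process hits the descriptor limit,
gets an error from open(), and carries on.

```python
import errno
import os
import resource

def open_until_limit(soft_limit=32):
    """Lower the soft RLIMIT_NOFILE, open files until open() fails, then restore."""
    old_soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
    if hard != resource.RLIM_INFINITY:
        soft_limit = min(soft_limit, hard)
    resource.setrlimit(resource.RLIMIT_NOFILE, (soft_limit, hard))
    fds = []
    hit = None
    try:
        # Bounded loop: at most soft_limit + 1 attempts are needed.
        for _ in range(soft_limit + 1):
            fds.append(os.open(os.devnull, os.O_RDONLY))
    except OSError as exc:
        # The limit shows up as an ordinary error, not as a signal.
        hit = exc.errno
    finally:
        for fd in fds:
            os.close(fd)
        resource.setrlimit(resource.RLIMIT_NOFILE, (old_soft, hard))
    return hit, len(fds)

if __name__ == "__main__":
    err, opened = open_until_limit()
    print(errno.errorcode.get(err), opened)
```

The caller is free to treat the error as recoverable, so from the DRM
system's point of view the job can still finish normally.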

So the agreement that "crossing a hard limit should lead to FAILED"
should apply only to some of the limits, e.g. WALLCLOCK_TIME and
CPU_TIME.
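
For contrast, a sketch of the CPU_TIME case, assuming Linux semantics as
described in setrlimit(2): crossing the soft RLIMIT_CPU delivers SIGXCPU,
whose default action terminates the process, and reaching the hard limit
delivers SIGKILL. The names CHILD and run_cpu_hog are made up for this
example, and the run burns about one second of CPU time.

```python
import signal
import subprocess
import sys

# Child program: lower only the soft CPU limit to 1 second, then spin forever.
CHILD = (
    "import resource\n"
    "soft, hard = resource.getrlimit(resource.RLIMIT_CPU)\n"
    "resource.setrlimit(resource.RLIMIT_CPU, (1, hard))\n"
    "while True:\n"
    "    pass\n"
)

def run_cpu_hog():
    """Run the child and return its exit status as reported by subprocess."""
    proc = subprocess.run([sys.executable, "-c", CHILD], timeout=20)
    # A negative return code -N means the child was terminated by signal N.
    return proc.returncode

if __name__ == "__main__":
    rc = run_cpu_hog()
    print("terminated by SIGXCPU" if rc == -signal.SIGXCPU else rc)
```

Here the application gets no chance to recover unless it installs a signal
handler, which is why this kind of limit maps more naturally to a failed job.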

That's an issue. I see basically three options here: 

1) We define the hard limit violation behavior per parameter. In this case, we could add the soft limits again with the same approach.
2) We declare that job termination MAY happen at any time after the violation, and stick with leaving out the soft limits.
3) We drop resource limits completely.

Number 1 is the most explicit (== good), but demands careful research at the operating system level. Number 2 is our usual safety net. Number 3 is as explicit as number 1, but people may miss the feature. And no, doing it the 'slots' way is not an option ;-) ...
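
As a starting point for the research that option 1 would need, here is a
hypothetical per-parameter table. The pairing of each ResourceLimitType with
an rlimit, and the default behavior listed, are my reading of setrlimit(2)
on Linux; none of this is DRMAA text, and other operating systems and DRM
systems may differ.

```python
# Hypothetical table for option 1: (assumed rlimit, default effect per
# setrlimit(2) on Linux, whether the process is terminated by default).
HARD_LIMIT_BEHAVIOR = {
    "CORE_FILE_SIZE": ("RLIMIT_CORE", "core dump is truncated", False),
    "CPU_TIME": ("RLIMIT_CPU", "SIGXCPU at soft limit, SIGKILL at hard limit", True),
    "DATA_SEG_SIZE": ("RLIMIT_DATA", "brk()/sbrk() fail with ENOMEM", False),
    "FILE_SIZE": ("RLIMIT_FSIZE", "SIGXFSZ; EFBIG if the signal is handled", True),
    "OPEN_FILES": ("RLIMIT_NOFILE", "open() and friends fail with EMFILE", False),
    "STACK_SIZE": ("RLIMIT_STACK", "SIGSEGV on stack overflow", True),
    "VIRTUAL_MEMORY": ("RLIMIT_AS", "brk()/mmap() fail with ENOMEM", False),
    # No rlimit counterpart; enforcement is up to the DRM system.
    "WALLCLOCK_TIME": (None, "enforced by the DRM system", None),
}

def kills_by_default(limit_type):
    """True or False per the table; None where the DRM system decides."""
    return HARD_LIMIT_BEHAVIOR[limit_type][2]

if __name__ == "__main__":
    for name in HARD_LIMIT_BEHAVIOR:
        print(name, kills_by_default(name))
```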

Best regards,
Peter.