Re: [Nsi-wg] ServiceException needs further details

21 Dec 2011

      On Thu, 15 Dec 2011, John MacAuley wrote:
...
I took an action to start the error handling discussion so that we, as a
group, can document the error messages and behaviors.  I would like to
start it off with when an NSIServiceException is returned as a SOAP fault
to a request, and when it is returned in a specific failed response
message.
OpenDRAC ...
[snip]

So a lot of these are more policy than mechanism, and could be subject to 
change. Let's focus on the error codes.

What does the SVC prefix stand for? (And why a prefix at all, and why not 
"ERR" or "ERROR", which would be somewhat more intuitive?)
...
Here is a list of error messages currently implemented in OpenDRAC.  The
list continues to expand.  I have kept the text generic with the specific
error values being returned in the associated attribute list.  We will
also need to agree on the format of the message/errorId.
I think we also need a plan with the error codes and their classification. 
Do we provide the errors in order to tell a user what went wrong (in which 
case a string will suffice), or do we provide error codes so a client can 
intelligently handle some cases, or both?

The answer is probably the latter, with some semantics for errorId, which 
can enable the client to automatically classify and potentially recover 
from the error. The distinction between text and variables is somewhat 
artificial and only makes sense for missing or invalid parameters. If we 
assume that the error string is for humans only, the distinction between:

text: Missing parameters: Start time, Dest STP.

and

text: Missing parameters
variables: ["Start time", "Dest STP"]

is just unneeded complexity. If a client is missing a parameter, it 
probably won't be able to change the request and fill it out automatically 
by looking at the error response. The client should just be fixed to send 
the parameter in the first place.
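
To illustrate the point, here is a minimal sketch (Python, with a
hypothetical helper name; nothing here is taken from OpenDRAC or the NSI
schema) showing that the structured form collapses into the flat string
with a one-liner, so a human reader gains nothing from the split:

```python
from typing import List


def render_error(text: str, variables: List[str]) -> str:
    """Flatten a (text, variables) pair into a single human-readable string.

    Hypothetical helper for illustration only.
    """
    if not variables:
        return text
    return f"{text}: {', '.join(variables)}."


flat = render_error("Missing parameters", ["Start time", "Dest STP"])
```

Here `flat` is the same string as the first "text:" variant above.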

What could make sense is that the NSI agent replies that it understands 
the request, but could for some reason not fulfill it, e.g., a path could 
not be found. In this case the client could retry the request elsewhere. 
However, the client should not care about why the request could not be 
fulfilled, as it is highly unlikely that it would be useful anyway (it 
would, however, be useful to report back to the user if a second or third 
request fails consecutively).
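
A rough sketch of that client behaviour, in Python. The exception class,
the function name, and the shape of the agent list are all assumptions
made up for this example; they are not part of any NSI implementation:

```python
from typing import Callable, List, Optional, Tuple


class RequestNotFulfilled(Exception):
    """Hypothetical error: the agent understood the request but could not fulfil it."""


def reserve_with_fallback(
    agents: List[Tuple[str, Callable[[dict], str]]],
    request: dict,
) -> Tuple[Optional[str], List[str]]:
    """Try each agent in turn and return (connection id, collected failures).

    The client does not inspect why an agent failed; it only keeps the
    reasons so they can be shown to the user if every attempt fails.
    """
    failures: List[str] = []
    for name, reserve in agents:
        try:
            return reserve(request), failures
        except RequestNotFulfilled as exc:
            failures.append(f"{name}: {exc}")
    return None, failures
```

Any other kind of error (e.g. a bad request) is deliberately not caught
here, since retrying it elsewhere would not help.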
...
MISSING_PARAMETER, "SVC0001", "Invalid or missing parameter"
UNSUPPORTED_OPTION, "SVC0002", "Parameter provided contains an unsupported value which MUST be processed"
How is this different from "invalid" in the previous one? Is this an "I 
know this should be supported, but it isn't"?

Both of these would probably be equivalent to HTTP 400 (BAD_REQUEST), in 
which case a request should not be retried without being modified. While I 
can see the distinction between a missing, invalid, or unsupported 
parameter, the end result is the same - human intervention is needed.

In the case where the service knows the semantics, but hasn't implemented 
it, the HTTP 501 (NOT_IMPLEMENTED) would be suitable.
...
ALREADY_EXISTS, "SVC0003", "Schedule already exists for connectionId"
Maybe "CONNECTION_EXISTS" or "CONNECTION_CONFLICT" as name.

This would be equivalent to HTTP 409 (CONFLICT).
...
DOES_NOT_EXIST, "SVC0004", "Schedule does not exists for connectionId"
Maybe "CONNECTION_NONEXISTENT".

This would be equivalent to HTTP 404 (NOT_FOUND)
...
MISSING_SECURITY, "SVC0005", "Invalid or missing user credentials"
The term "Missing security" is highly misleading. I strongly suggest 
something else, perhaps "UNAUTHORIZED".

Would be equivalent to HTTP 401 (UNAUTHORIZED).
...
TOPOLOGY_RESOLUTION_STP, "SVC0006", "Could not resolve STP in Topology database"
TOPOLOGY_RESOLUTION_STP_NSA, "SVC0007", "Could not resolve STP to managing NSA"
3 or 4 consecutive nouns following each other make a rather poor error 
name IMHO. Also topology and NSI are so interwoven, that we don't really 
need the topology word. How about "UNKNOWN_STP"?

Do we expect the latter error to ever come up? (We know the STP, but not 
the NSA for it - I would call this a topology description error.)

These would correspond to HTTP 422 (UNPROCESSABLE_ENTITY).
...
PATH_COMPUTATION_NO_PATH, "SVC0008", "Path computation failed to resolve route for reservation"
Do we really need to say path twice? How about "NO_PATH_FOUND"?

For HTTP this would probably also be 422, though this one does not have a 
clear fit.
...
INVALID_STATE, "SVC0009", "Connection state machine is in invalid state for received message"
Invalid state has a bad ring to it. How about "INVALID_TRANSITION"?

For HTTP this would be 422, though 405/406 could be misused for them.
...
INTERNAL_ERROR, "SVC0010", "An internal error has caused a message processing failure"
Would correspond to 500.
...
INTERNAL_NRM_ERROR, "SVC0011", "An internal NRM error has caused a message processing failure"
The distinction between NSA and NRM is an artificial one, and in some 
cases they are the same (e.g., OpenNSA can speak directly to JunOS boxes). 
For the client, the result is the same: "The thing in the other end didn't 
work". For humans/operators the distinction is important, but I would say 
the error code is for clients, and the error string for humans.
...
STP_ALREADY_IN_USE, "SVC0012", "Specified STP already in use"
I would call this "STP_UNAVAILABLE", as we are dealing with a time span for 
the reservation. "In use" reflects a current situation, which is rarely 
the case for us. The message should be something like "Specified STP not 
available in specified time span".

In HTTP this one is a bit tricky, but 422 is probably the best fitting.
...
BANDWIDTH_NOT_AVAILABLE, "SVC0013", "Insufficent bandwidth available for reservation"
Would people like to add to the list?
Maybe something for a connection which used to exist, but is now 
terminated or no longer available. I know this could fall under 
"DOES_NOT_EXIST", or "INVALID_STATE", but none of these actually capture 
what happened.

This would be equivalent to 410 (GONE).

Furthermore, something stating that the resource is not available for the 
specified user could be appropriate (corresponding to 401 (UNAUTHORIZED) 
in HTTP).

I've given mappings from the error codes to HTTP status codes. Most 
mappings are straightforward, but a couple are edge cases and can be 
discussed. While not perfect, HTTP codes are well understood by many 
developers, have clear semantics for request retry and modification, and 
have been well tested over a significant amount of time. Why do we need to 
invent our own? Of course we would only adopt the 400/500 class codes, as 
the other classes do not make sense with our current protocol model.
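
For reference, the mappings suggested in this mail can be collected in one
table. This is only a sketch in Python: the dictionary and function names
are invented for illustration, the SVC0011 entry is an assumption (the
mail only argues that clients should treat it like SVC0010), and SVC0013
is left out because no mapping was proposed for it:

```python
# Hypothetical table: OpenDRAC error id -> HTTP status code proposed above.
SVC_TO_HTTP = {
    "SVC0001": 400,  # MISSING_PARAMETER           -> BAD_REQUEST
    "SVC0002": 400,  # UNSUPPORTED_OPTION          -> BAD_REQUEST (or 501)
    "SVC0003": 409,  # ALREADY_EXISTS              -> CONFLICT
    "SVC0004": 404,  # DOES_NOT_EXIST              -> NOT_FOUND
    "SVC0005": 401,  # MISSING_SECURITY            -> UNAUTHORIZED
    "SVC0006": 422,  # TOPOLOGY_RESOLUTION_STP     -> UNPROCESSABLE_ENTITY
    "SVC0007": 422,  # TOPOLOGY_RESOLUTION_STP_NSA -> UNPROCESSABLE_ENTITY
    "SVC0008": 422,  # PATH_COMPUTATION_NO_PATH    -> UNPROCESSABLE_ENTITY
    "SVC0009": 422,  # INVALID_STATE               -> UNPROCESSABLE_ENTITY
    "SVC0010": 500,  # INTERNAL_ERROR              -> INTERNAL_SERVER_ERROR
    "SVC0011": 500,  # INTERNAL_NRM_ERROR (assumed same as SVC0010)
    "SVC0012": 422,  # STP_ALREADY_IN_USE          -> UNPROCESSABLE_ENTITY
}


def error_class(error_id: str) -> str:
    """Classify an error id as a 'client' (4xx) or 'server' (5xx) fault.

    Returns 'unknown' for ids that have no proposed mapping.
    """
    status = SVC_TO_HTTP.get(error_id)
    if status is None:
        return "unknown"
    return "client" if 400 <= status < 500 else "server"
```

A client could use `error_class` to decide whether the request itself needs
fixing ("client") or whether the fault lies with the provider ("server").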

     Best regards, Henrik

  Henrik Thostrup Jensen <htj at ndgf.org>
  NORDUnet / Nordic Data Grid Facility.