Hi all,
The following is dial-in information for Wednesday's NSI call.
Time: 7:00 PDT, 10:00 EDT, 15:00 GMT, 16:00 CET, 24:00 JST
1. Dial Toll-Free Number: 866-740-1260 (U.S. & Canada)
2. International participants dial Toll Number: 303-248-0285,
   or find an International Toll-Free Number at http://www.readytalk.com/intl
3. Enter 7-digit access code 8937606, followed by “#”
Agenda:
1. Firewall issues: John Macauley
2. Error Handling: Henrik
3. Other topics
Thanks
Inder
Henrik's email attached:
--
Failure scenarios and recovery for the NSI protocol versions 1.0 and 1.1
== Introduction ==
The main focus will be on control plane interaction: how to deal with message
loss and crashes, and how to recover from them.

With the exception of the forcedEnd primitive, all NSI control plane message
interactions happen like this:
  Requester NSA                      Provider NSA
  operation               ->
                          <-         operation received ack
                          <-         operation result
  operation result ack    ->
The main idea behind the separation of the operation and the operation result
is that they may be separated by a significant time; for the provision
operation in particular, days or months may pass between the operation and
the result.
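A minimal sketch of the requester-side bookkeeping this two-phase exchange implies. The names (`PendingOperation`, `ACK_TIMEOUT`) are illustrative assumptions, not part of the NSI specification:

```python
import time
from dataclasses import dataclass, field
from typing import Optional

ACK_TIMEOUT = 30.0  # assumed seconds to wait for the operation received ack

@dataclass
class PendingOperation:
    connection_id: str
    operation: str                  # e.g. "reserve", "provision"
    sent_at: float = field(default_factory=time.time)
    acked: bool = False             # has the operation received ack arrived?
    result: Optional[str] = None    # operation result; may arrive much later

    def ack_overdue(self, now: Optional[float] = None) -> bool:
        """True if the ack timeout expired without an ack arriving."""
        now = time.time() if now is None else now
        return not self.acked and (now - self.sent_at) > ACK_TIMEOUT
```

Note that the result message is deliberately not covered by this timeout: a provision result may legitimately arrive days or months after the operation, so only the received ack is timed.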
For failure scenarios, the loss of any of the four messages should be
considered, along with crashes of one or both of the NSAs at any point in
time. These failure scenarios can be generalized to the availability of an
NSA, i.e., it does not matter whether it is the network or the NSA that is
down; the distinction is whether the NSA received the message or not.
In general the problem is to ensure that the (intended) state of a connection
is kept in sync. There are two significant problems in the current protocol:

* No clear semantics for the operation received ack
* No clear division of responsibility between requester and provider

Both of these are semantic issues (i.e., behavior), and hence solving them
should not require any changes to the wire protocol.
From a theoretical point of view, and assuming an asynchronous network model
(note that "asynchronous" means something different in distributed systems
than in networking), the problem is impossible to solve. Taking a slightly
less pessimistic view (i.e., a partially synchronous network model), it
becomes possible to recover from some failures. Taking a pragmatic approach,
most errors are recoverable, given that the network and NSAs become
functional at some point in time.
== Control Plane Failure Scenarios & Recovery ==
The following goes through a range of failure scenarios and describes how to
recover from them. Note that some of the scenarios can be solved in multiple
ways. I've taken the approach that it is the responsibility of the requester
to ensure that the connection at the provider is in the required state.
A: Requester NSA did not receive the operation received ack.
Note: This failure is equivalent to not being able to dispatch the message
(here the failure just occurs earlier).
Note: If the operation result message is received within the timeout, this
case can be ignored.
Potential causes: message loss, network outage, provider NSA is down.
If the requester NSA has not received the operation received ack after a
certain amount of time, it must assume that the connection could not be
created or the state not changed. This can be dealt with in multiple ways:

1. Do nothing (hope the provider comes up again).
2. Find an alternative circuit.
3. Tear down the connection and send operation failure up the tree.

Which strategy to choose is policy dependent and is up to the individual
implementation and organization. OpenNSA currently does 3.
For the sake of preventing stale connections, the requester can keep a list
of "dead" connections. The status of these connections can then be checked
at certain intervals via query, and a control primitive for fixing the
status sent if needed.
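A sketch of that "dead connection" sweep under stated assumptions: `query_provider` and `fix_status` stand in for the real NSI query and control primitives, and are not spec-defined names.

```python
def reconcile_dead_connections(dead, query_provider, fix_status):
    """Re-check parked connections; return those still unreachable.

    dead:           list of (connection_id, wanted_state) pairs
    query_provider: callable returning the provider-side state for a
                    connection, or None if the provider is unreachable
    fix_status:     callable sending a corrective control primitive
    """
    still_dead = []
    for conn_id, wanted_state in dead:
        state = query_provider(conn_id)
        if state is None:                   # provider still down; keep parked
            still_dead.append((conn_id, wanted_state))
        elif state != wanted_state:         # state diverged; send a fix
            fix_status(conn_id, wanted_state)
    return still_dead
```

Run periodically, this keeps the requester (not the provider) responsible for converging the connection state, matching the division of responsibility argued for below.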
B: Provider NSA could not deliver the operation received ack
This situation is a special case of scenario A, seen from the provider's
point of view.
Repeated delivery attempts can be tried, but this is only an incremental
improvement/optimization and does not affect the end result.
The provider should not try to change the state of the connection beyond the
latest received primitive from the requester (do the least surprising
thing). It is up to the requester to discover the current state (via query)
and change it if needed.
Since it is the responsibility of the requester to discover the state, there
is no need for the provider to perform a "reverse query". In fact, using
reverse query for connection state updates may cause more harm than good:
having the provider change the connection status automatically may not be
what the requester wants (it might have compensated somehow), does not
follow the principle of least surprise, and leaves control of the connection
with two parties.
Alternatively, a "Hi, I'm alive; sorry for the downtime" primitive could be
introduced from provider to requester, which the requester can then use to
fire off any controlling primitives. This is, however, just an optimization.
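If such an alive announcement existed, the requester-side handler could be as small as the sketch below; the primitive itself and all names here are hypothetical.

```python
def on_provider_alive(provider_id, parked, recheck):
    """On a (hypothetical) alive announcement, immediately re-check that
    provider's parked connections instead of waiting for the next
    periodic sweep. `parked` maps provider id -> list of connection ids."""
    for conn_id in parked.pop(provider_id, []):
        recheck(conn_id)
```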
C: Provider NSA could not deliver the operation result message
This case should be handled as described in scenario B.
D: Requester NSA did not receive the operation result message.
This case should be handled as described in scenario A.
E: Operation result ack was not received.
This case should be handled as described in scenario B.
== Data Plane Failure Scenarios & Recovery ==
Data plane failures are somewhat different from control plane failures. I am
not well-versed in networking and NRMs, but will try to come up with a
strategy.
In general, I see two sorts of failures:
1. The failure is happening in my local domain.
2. The failure is happening outside my local domain.
This might be an overly simplistic view of things.
We assume that any fail-over, etc. has also failed, so the failure cannot be
corrected (if it can be corrected quickly, it probably should be).
The further handling of a data plane failure will probably be policy
dependent. For some users the network might be completely unusable after a
failure, whereas others would like to try to have it repaired. However,
trying to decide / figure out where and how this policy should be enforced
is a rather tricky process, and probably out of scope for NSI for now.
Instead I would suggest sending terminate messages downwards and forcedEnd
upwards. Once this propagates to the initial requester, a policy-correct
action can be taken. I.e., convert a data plane failure into a control
plane issue.
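The suggested escalation can be sketched as below; `send_terminate` and `send_forced_end` are placeholders for the real NSI primitives, not actual API calls.

```python
def escalate_data_plane_failure(children, parent, send_terminate, send_forced_end):
    """Convert a data plane failure into a control plane issue:
    terminate messages downwards, forcedEnd upwards, so the initial
    requester can apply its own policy once the signal propagates."""
    for child in children:          # terminate messages downwards
        send_terminate(child)
    send_forced_end(parent)         # forcedEnd upwards
```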
== Recommendation / Action items ==
* Make the exact semantics of the operation received ack clear
  Recommendation:
  - The message has been received (duh)
  - The request is sane
  - The request has been serialized (crash safe)
  - The specified connection exists (for provision, release, terminate)
  - The request was authorized
  This has the following implications:
  - Once the operation received ack has been received by the requester, the
    connection should show up in a query result. If we cannot expect the
    connection to show up after receipt, the primitive should be removed,
    as it has no semantic value.
  - Failing early will save message exchanges and time.
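A provider-side sketch of those ack semantics, with requests modeled as plain dicts and the helper callables (`is_sane`, `serialize`, `is_authorized`) as assumptions rather than spec-defined calls:

```python
# Operations that require an already-existing connection before acking.
NEEDS_EXISTING_CONNECTION = {"provision", "release", "terminate"}

def should_ack(request, known_connections, is_sane, serialize, is_authorized):
    """Only ack once the request is sane, refers to an existing connection
    (where required), is authorized, and has been crash-safely serialized."""
    if not is_sane(request):
        return False
    if (request["operation"] in NEEDS_EXISTING_CONNECTION
            and request["connection_id"] not in known_connections):
        return False
    if not is_authorized(request):
        return False
    serialize(request)  # persist before acking, so the ack implies crash safety
    return True
```

With these checks in place, an ack guarantees the connection will appear in a subsequent query result, which is exactly the semantic value the recommendation asks the primitive to carry.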
* Make it clear which of the NSAs has the responsibility for what
  Recommendation:
  - The provider is the authority for connection status (duh)
  - Keeping connection state synchronized is the responsibility of the
    requester
  This has the following implications:
  - Any (non-scheduled) connection state change must only be done at the
    initiative of the requester
  - The requester query interface is not needed.
--