Hi all, I've attached a draft of the error handling section that Inder and I came up with for the NSI Architecture document. This is a rough first draft, and there are some obvious portions missing, but it gives an idea of where we are heading. Comments are most welcome. Thanks. - Chin
Hi Chin, all,

Nice description. What I can add as an extension here is that the user may request different "levels" of resiliency. We are researching various such options at the inter-domain level for AutoBAHN. We defined three attributes that are important from the user perspective for service resiliency (in the sense of circuit availability), each of which implies a different reaction to failures. The attributes are:

- protection vs. restoration mode
- intra-domain resiliency enabled / disabled
- diversification of domains / domain re-use

The first attribute defines whether the backup path is also scheduled/active (protection) or is searched for at the moment of failure (restoration). Restoration is somewhat like an immediate reservation and can fail, while protection is assumed to succeed. This influences the design of the reservation protocol, as it must also negotiate this attribute. In consequence it determines the probability that the resources on the backup path will be available, and how quickly they can be reached (switching time).

The second attribute defines whether domains have mechanisms to solve issues internally, without affecting the global path (which was also included in your paper). If so, this kind of failure is transparent to NSI (NSI MAY be notified), as it can be solved by local network controllers. If at least one domain on a path does not support intra-domain resiliency, we consider the whole path not to support it (yes, it's a simplification :) ).

The last attribute defines whether the backup path should avoid the domains/links of the primary path. In that case, if a whole domain on the path fails, the backup path is not affected and can be used immediately (except for the circuit source and destination domains, which must be common).

We made a matrix of those attributes (values 0/1 for each of the three options, in all reasonable combinations) and arrived at five levels of resiliency that could be requested by users. Now it is a matter of how deep we wish to go with this for NSI. I personally think that resiliency can be weaker or stronger, and may be charged (even virtually) differently according to users' credits/requirements.

Best regards
Radek

________________________________________________________________________
Radoslaw Krzywania
Network Research and Development
Poznan Supercomputing and Networking Center
radek.krzywania@man.poznan.pl
+48 61 858 20 28
http://www.man.poznan.pl
________________________________________________________________________
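To make the attribute matrix concrete, here is a rough Python sketch of how the three binary attributes could be enumerated; the attribute names are illustrative placeholders, and the rule that reduces the eight combinations to the five requestable levels mentioned above is not reproduced here.

    from itertools import product

    # The three 0/1 resiliency attributes described above (names are
    # illustrative placeholders, not AutoBAHN terminology).
    ATTRIBUTES = ("protection_not_restoration",
                  "intra_domain_resiliency",
                  "domain_diverse_backup")

    # Enumerate every 0/1 combination of the three attributes.  The matrix
    # described above keeps only the "reasonable" combinations, yielding
    # five requestable resiliency levels; that selection rule is not
    # reproduced here.
    for combo in product((0, 1), repeat=len(ATTRIBUTES)):
        print(dict(zip(ATTRIBUTES, combo)))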
This is very nice. A couple of comments/suggestions/questions:

1) I think all the actions you suggest for the transport plane failure are actually taken in the NRM or Service Plane. I may be wrong, but that is how it seems to me. If so, then I think it would be helpful to describe the transport device/plane signalling the failure to the NRM at different times. Something like this would have made it easier for me to follow.

2) I don't understand the local and remote distinction in the Service Plane failure discussion. Perhaps local means the NRM and remote means reachable through NSI?

3) I am wondering how service plane failures are discovered? Is it some sort of session failure?

John
John, If I may add my $0.02. On Apr 22, 2010, at 2:03 PM, John Vollbrecht wrote:
This is very nice.
A couple of comments/suggestions/questions:
1) I think all the actions you suggest for the transport plane failure are actually taken in the NRM or Service Plane. I may be wrong, but that is how it seems to me. If so, then I think it would be helpful to describe the transport device/plane signalling the failure to the NRM at different times. Something like this would have made it easier for me to follow.
I agree that transport plane failure actions are either handled by the Service Plane aka NSA (for example, "reserve alternative local resources") or in the transport plane (for example, switch to backup). I do not understand what you mean by "describe the transport device/plane signalling failure to the NRM" - can you please elaborate? The intention of this section was to indicate the error cases which would result in notification to the RA and possible cancellation of a connection. There are cases highlighted where the errors are handled completely by the Service Plane or the Transport Plane with no need for notification to the user/RA.
2) I don't understand the local and remote distinction in the Service Plane failure discussion. Perhaps local means the NRM and remote means reachable through NSI?
Local implies failure of one's own domain's RA or PA. Remote means failure of the remote RA or PA. The two cases are diagrammatically the same - the difference is in the context.
3) I am wondering how service plane failures are discovered? Is it some sort of session failure?
There are a couple of assumptions here: 1. There is reliable messaging between the RA and PA. 2. There is a timeout if responses are not received from the RA/PA (possibly after multiple tries). This timeout could also be caused by a management network failure between the RA and PA. Hope this helps - thanks for your feedback. Inder
John
---
Inder Monga
imonga@es.net
http://www.es.net
http://100gbs.lbl.gov
(510) 499 8065 (c)
(510) 486 6531 (o)
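As an aside on the timeout assumption above (point 2), here is a minimal sketch of how an RA or PA might detect a service plane failure through retries and a per-attempt timeout; the retry count, timeout value, callable names and exception are assumptions for illustration, not part of NSI.

    class ServicePlaneFailure(Exception):
        # The peer RA/PA did not answer; this could be a crashed NSA or a
        # failed management network - the caller cannot tell which.
        pass

    MAX_TRIES = 3          # assumed retry budget
    TIMEOUT_SECONDS = 30   # assumed per-try response timeout

    def send_reliably(send_request, wait_for_response):
        # Reliable messaging (assumption 1): resend until acknowledged.
        # Timeout (assumption 2): give up after MAX_TRIES silent attempts
        # and let the caller move the connection to a failure/cleanup state.
        for _ in range(MAX_TRIES):
            send_request()
            response = wait_for_response(timeout=TIMEOUT_SECONDS)
            if response is not None:
                return response
        raise ServicePlaneFailure("no response from peer RA/PA after retries")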
On Apr 22, 2010, at 7:11 PM, Inder Monga wrote:
John,
If I may add my $0.02.
On Apr 22, 2010, at 2:03 PM, John Vollbrecht wrote:
This is very nice.
A couple of comments/suggestions/questions:
1) I think all the actions you suggest for the transport plane failure are actually taken in the NRM or Service Plane. I may be wrong, but that is how it seems to me. If so, then I think it would be helpful to describe the transport device/plane signalling the failure to the NRM at different times. Something like this would have made it easier for me to follow.
I agree that transport plane failure actions are either handled by the Service Plane aka NSA (for example, "reserve alternative local resources") or in the transport plane (for example, switch to backup). I do not understand what you mean by "describe the transport device/plane signalling failure to the NRM" - can you please elaborate?
I think there should be a statement something like "Transport plane failures are communicated to the NRM. The NRM deals with these based on the state of the NRM at the time it learns of the failure." The idea is to make it clear that this section is about how to deal with transport failures reported to the service plane. One might use NSA instead of NRM - I am not sure which would be more appropriate. I may not have explained this well - please ask questions if it is not clear.
The intention of this section was to indicate the error cases which would result in notification to the RA and possible cancellation of a connection. There are cases highlighted where the errors are handled completely by the Service Plane or the Transport Plane with no need for notification to the user/RA.
I note that the RA and PA are both in the service plane. Presumably when an NSA with an RA receives a fail message from the PA, the segment/aggregate section of the NSA also has state, and how the NSA deals with the message depends on the state of the NSA.
2) I don't understand the local and remote distinction in the Service Plane failure discussion. Perhaps local means the NRM and remote means reachable through NSI?
Local implies failure of one's own domain's RA or PA. Remote means failure of the remote RA or PA. The two cases are diagrammatically the same - the difference is in the context.

This is still confusing to me. If the session between the RA and PA fails, then isn't everything a local failure, whichever side you are on? If a PA tries to send a message and it doesn't make it, how does it know whether the message got there or not? If it is by a timeout that it notices the session has failed, this also seems to make both sides equivalent.
3) I am wondering how service plane failures are discovered? Is it some sort of session failure?
There are a couple of assumptions here: 1. There is reliable messaging between the RA and PA. 2. There is a timeout if responses are not received from the RA/PA (possibly after multiple tries). This timeout could also be caused by a management network failure between the RA and PA.
These seem like they could have different consequences. I agree that the service plane failures are less well defined so far. It is good you are thinking about them and starting the discussion. The issue of whether NSAs (or only NRMs) keep state after a connection is reserved is another issue that impacts this. John
Hope this helps - thanks for your feedback.
Inder
John
On Apr 26, 2010, at 10:34 AM, John Vollbrecht wrote:
On Apr 22, 2010, at 7:11 PM, Inder Monga wrote:
John,
If I may add my $0.02.
On Apr 22, 2010, at 2:03 PM, John Vollbrecht wrote:
This is very nice.
A couple of comments/suggestions/questions:
1) I think all the actions you suggest for the transport plane failure are actually taken in the NRM or Service Plane. I may be wrong, but that is how it seems to me. If so, then I think it would be helpful to describe the transport device/plane signalling the failure to the NRM at different times. Something like this would have made it easier for me to follow.
I agree that transport plane failure actions are either handled by the Service Plane aka NSA (for example, "reserve alternative local resources") or in the transport plane (for example, switch to backup). I do not understand what you mean by "describe the transport device/plane signalling failure to the NRM" - can you please elaborate?
I think there should be a statement something like "Transport plane failures are communicated to the NRM. The NRM deals with these based on the state of the NRM at the time it learns of the failure." The idea is to make it clear that this section is about how to deal with transport failures reported to the service plane. One might use NSA instead of NRM - I am not sure which would be more appropriate. I may not have explained this well - please ask questions if it is not clear.
John, Good point. The assumption is that the transport plane failures are somehow communicated up to the resource manager (NRM) and the reservation manager (NRM/NSA). The mechanism by which that happens is out of scope of the architecture document.
The intention of this section was to indicate the error cases which would result in notification to the RA and possible cancellation of a connection. There are cases highlighted where the errors are handled completely by the Service Plane or the Transport Plane with no need for notification to the user/RA.
I note that the RA and PA are both in the service plane. Presumably when an NSA with an RA receives a fail message from the PA, the segment/aggregate section of the NSA also has state, and how the NSA deals with the message depends on the state of the NSA.
Agreed. The aggregation of confirms, the communication of failures to child RA/PA pairs, etc., all happen based on state kept in a particular NSA. What we have to discuss is how much of that state is stored in recoverable, non-volatile storage, i.e. if the NSA software/computer/agent crashes and recovers, does it recover all state or not? What if something was "in flight"? These are the cases the state machines and the protocol must be resilient to, i.e. they must recover to a stable state.
2) I don't understand the local and remote distinction in the Service Plane failure discussion. Perhaps local means the NRM and remote means reachable through NSI?
Local implies failure of one's own domain's RA or PA. Remote means failure of the remote RA or PA. The two cases are diagrammatically the same - the difference is in the context. This is still confusing to me. If the session between the RA and PA fails, then isn't everything a local failure, whichever side you are on? If a PA tries to send a message and it doesn't make it, how does it know whether the message got there or not? If it is by a timeout that it notices the session has failed, this also seems to make both sides equivalent.
Well, yes, I sort of agree. The distinction is not that clear. A local failure is when the local RA/PA crashes and then comes back up: how does it recover state, interact with neighboring RA/PA pairs, and deal with missed provision times or state inconsistencies that might occur? The remote case is when my peer RA/PA crashes or becomes unavailable: what mechanisms/state transitions does that trigger to clean up and arrive at a stable state, regardless of the time it takes for the peer to recover from its failure?
3) I am wondering how service plane failures are discovered? Is it some sort of session failure?
There are a couple of assumptions here: 1. There is reliable messaging between the RA and PA. 2. There is a timeout if responses are not received from the RA/PA (possibly after multiple tries). This timeout could also be caused by a management network failure between the RA and PA.
These seem like they could have different consequences.
Absolutely - the effects of both are different. But from a peer NSA perspective, it should not care whether the cause of the failure is 1) or 2) - it should have the mechanisms to recover to a consistent state. Hope this helps, Inder
I agree that the service plane failures are less well defined so far. It is good you are thinking about them and starting the discussion. The issue of whether NSAs (or only NRMs) keep state after a connection is reserved is another issue that impacts this.
John
Hope this helps - thanks for your feedback.
Inder
John
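To illustrate the crash-recovery point discussed above, here is a minimal sketch in which reservation state is written to non-volatile storage before each transition and any "in-flight" state is rolled back to a stable state on restart; the state names, rollback rules and JSON file are assumptions for illustration, not the NSI state machine.

    import json
    from pathlib import Path

    STATE_FILE = Path("nsa_reservations.json")   # assumed non-volatile store

    # Assumed stable states, and assumed rollback for states that were
    # "in flight" when the NSA crashed or the peer stopped responding.
    STABLE = {"reserved", "provisioned", "released", "failed"}
    ROLLBACK = {"reserving": "failed",
                "provisioning": "reserved",
                "releasing": "provisioned"}

    def save(reservations):
        # Persist every transition before acting on it.
        STATE_FILE.write_text(json.dumps(reservations))

    def recover():
        # On restart, reload persisted state and resolve anything in flight
        # so the NSA comes back in a consistent, stable state.
        if not STATE_FILE.exists():
            return {}
        reservations = json.loads(STATE_FILE.read_text())
        for conn_id, state in reservations.items():
            if state not in STABLE:
                reservations[conn_id] = ROLLBACK.get(state, "failed")
        save(reservations)
        return reservations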
Hi All, An updated draft based on comments. We attached a table at the front to summarize and to use for discussion. Looking forward to discussing this tomorrow. Thanks, Inder
Peoples,

Had someone show up in my office, so I missed the conversation over "Resource change from available to not available." I thought I would provide some input on the topic based on my DRAC experiences.

I think there are three types of events that can initiate a topology change that should be understood when defining the error handling. Two of these are actually not errors but normal operating procedures within a network:

1. Physical network failure resulting in a topology change - typically the temporary removal of a link from the topology with no knowledge of when it will be restored.

2. The permanent removal of a link from the topology by a network administrator. Actually, this one should include the reconfiguration of the network, where an entire node could be removed.

3. The temporary removal of a link by a network administrator for maintenance purposes. This will typically have a defined start and end time based on the maintenance window.

#1 is interesting in that it impacts existing schedules in an in-service state, reserved schedules not yet in service, and any new reservation requests.

a) Those schedules in service that use the links impacted by the topology change may undergo some type of restoration. If this was a protected circuit then the underlying transport will restore the service and we may not want to do anything about it. If this was an unprotected service then perhaps a re-dial could be initiated by the NRM in an attempt to achieve a lazy restore.

b) Depending on the estimated length of the temporary topology change, we may need to recompute the paths of those schedules reserved but not yet provisioned. We should not recompute the paths from the point of failure to the end of time, but only for some predefined floating window, optimistic enough to give the failure time to recover while reducing the number of schedules that would be recomputed. For example, a floating one-hour window would mean all reservations up to an hour in the future that could be impacted by the failure are recomputed. If the failure is cleared and the topology is restored, then there is a one-hour window that should already have been cleared. The interesting side-effect is that we now have a window of time to make sure the link remains trouble-free. The question is: have we blocked that link from use, or can a new schedule use the remaining hour if it comes in after the trouble has cleared?

c) If a new reservation request for a future point in time arrives while a failure has taken the link out of the topology, do we remove the link from path computation, or do we add an optimistic guard time after which we can assume the link will be restored?

#2 is different from a fault condition in that an administrator has removed the link from the topology. We can model this gracefully if we can have a high-priority (preemptive) administrative reservation that blocks the bandwidth on a link from the point in time the link will be removed through until infinity. Any schedules this preemptive schedule impacts will need to be recomputed as discussed in the previous example, or, if provisioned, switched to protection / re-dialed to restore. At some point on or after the start of the preemptive schedule, the link can be permanently removed from the topology and the reservation blocking that link cleared.

#3 is similar to #2 except there is a defined end time for the preemptive schedule blocking the link. Only reservations overlapping with the maintenance window would need to be recomputed. Obviously, any provisioned schedules would need to be switched to protection or re-dialed to restore.

John.
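To make the floating recompute window in #1(b) concrete, here is a rough sketch: only reservations on the failed link that start within the window are re-pathed, and later ones are left alone on the optimistic assumption that the failure will clear; the one-hour window, the reservation fields and the recompute hook are assumptions for illustration.

    from datetime import timedelta

    RECOMPUTE_WINDOW = timedelta(hours=1)   # assumed floating window

    def handle_link_failure(failed_link, reservations, now, recompute_path):
        # `reservations` are assumed to expose .links, .start_time and .state;
        # `recompute_path` stands in for whatever path computation the NRM uses.
        # In-service circuits are left to protection switching / re-dial and
        # are not touched here.
        horizon = now + RECOMPUTE_WINDOW
        for r in reservations:
            if failed_link not in r.links or r.state != "reserved":
                continue
            if r.start_time <= horizon:
                # Starts within the floating window: re-path it off the link.
                recompute_path(r, exclude={failed_link})
            # Reservations starting after the horizon are left untouched in
            # the hope that the failure clears before they are provisioned.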
John, Great points about administrative and maintenance procedures. We would have to make an assumption that the NSA/NRM gets an event with the right "notification" of the reason for the topology change - through the OSS/network management platform. Otherwise, we will not be able to differentiate between the causes of the topology change, and will not be able to estimate the duration of that change as in the case of maintenance. We can assume the default case to be #1 if not notified of the exact cause. Thanks, Inder
I think the unwritten suggestion was that network administrators should have the facilities in the NRM to identify/manage #2 and #3 in a graceful fashion to avoid chaos during scheduled maintenance. They would need to incorporate the additional procedures into their process. We just need to make sure we support the concept of preemptive schedules on topological links (although I know some people who would like a general preemptive schedule to whack existing schedules if needed). John.
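Here is a minimal sketch of the preemptive administrative schedule idea for cases #2 and #3: a blocking reservation covers the link for the maintenance window (open-ended for permanent removal), and overlapping user reservations are flagged for recompute; the field names and the "None means until infinity" convention are assumptions for illustration.

    from dataclasses import dataclass
    from datetime import datetime
    from typing import Optional

    @dataclass
    class AdminBlock:
        # Preemptive reservation that blocks all bandwidth on one link.
        link_id: str
        start: datetime
        end: Optional[datetime]   # None = blocked "until infinity" (case #2)

        def overlaps(self, res_start: datetime, res_end: datetime) -> bool:
            return res_end > self.start and (self.end is None or res_start < self.end)

    def apply_admin_block(block, reservations, recompute_path):
        # Flag every overlapping reservation on the blocked link for
        # recompute (or protection switch / re-dial if already in service).
        for r in reservations:
            if block.link_id in r.links and block.overlaps(r.start_time, r.end_time):
                recompute_path(r, exclude={block.link_id})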
Seems like this is a good place to think about the relationship between the Management and Service planes. This is, I think, different from the relationship between the Transport and Service planes. Interesting - the planes picture might come into its own. John
participants (5)
- Chin Guok
- Inder Monga
- John MacAuley
- John Vollbrecht
- Radek Krzywania