12:48PM: We are aware that some customers are currently experiencing a loss of network connectivity to/from their VPS. We are investigating the issue and will update this post with further information once we know more.
Update 12:52PM: The Data Centre have just posted a status update advising that a network incident has occurred, causing the issue, and that they are investigating.
We can see that all affected customer VPS are running fine; this is purely an issue with the Data Centre’s network. Once connectivity has been restored, traffic to/from your VPS will resume without any action needed on your part.
Update 12:57PM: Network connectivity appears to have been restored by the Data Centre. We are waiting for an update from them, but it looks like the incident may be over.
Update 1:04PM: Network connectivity was restored by the Data Centre at about 12:56PM and has been stable since, so we are treating the incident as over for now. Our monitoring systems detected that the incident started at about 12:40PM, so the total outage lasted about 15 minutes. As soon as we hear from the Data Centre what the reason for the outage was, we will post details here.
Reason for outage
The day after the outage, the Data Centre in which we are located sent us an explanation of what happened, which we have posted below. The incident affected all of the Data Centre’s customers. Thankfully they dealt with it very swiftly. In all the time we have been with them this is the only significant incident we have experienced, which is a testament to the excellent facility and the fantastic staff who work there, whom we have met on many occasions and have always been highly impressed by. If any of our customers have questions about this incident, please do not hesitate to contact us.
"As part of a routine firewall deployment, some new hardware was connected to our network and BGP sessions were configured on our route reflectors for these new devices.
Unfortunately, when these new sessions connected to our route reflectors, they caused an issue with the software we use to power them: the length of time taken to process and send the full BGP table to the new devices caused the “hold-timer” to expire on a few of the other sessions. Those sessions then disconnected and reconnected, requiring the full table to be sent to them as well, which caused further sessions to do the same, ultimately resulting in a loop whereby the route reflectors were continuously dealing with sessions disconnecting and reconnecting.
This continued until our Network Engineers identified the issue and were able to log into the route reflectors and manually restart them to clear all the BGP sessions; the first restart was done at 12:50 and the second at 12:58. All BGP sessions were re-established, with all routes fully exchanged, by 13:00, and traffic once again began to flow normally.
It has become apparent from this incident that we have reached a limit with the current route reflector software we are using, and as such we are now in the process of replacing these with both improved hardware and alternative software which has better handling of long-running tasks, so that they do not cause CPU starvation of important tasks such as maintaining hold-timers.
Initially we will run these new route reflectors alongside the old ones to prove their stability, before ultimately retiring the old devices. A further maintenance window will be scheduled for this, about which you will be notified in due course."
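For customers curious about the mechanism the Data Centre describes, the toy Python sketch below models the feedback loop: a single-threaded route reflector that cannot send keepalives while it is busy pushing the full BGP table lets other peers’ hold timers expire, and the reconnecting peers then make it even busier. All names and timings here are made up for illustration; this is not the Data Centre’s actual software or configuration.

```python
# Hypothetical sketch of the hold-timer cascade described above.
# Numbers and peer names are illustrative only.

HOLD_TIMER = 90          # seconds a peer waits without keepalives before dropping the session
FULL_TABLE_SEND = 120    # seconds spent sending the full BGP table to one newly connected peer

def simulate(rounds=3):
    """Model a single-threaded route reflector that cannot send keepalives
    while it is busy pushing the full table to new sessions."""
    pending = ["new-firewall-1", "new-firewall-2"]    # sessions waiting for a full table
    established = ["edge-router-1", "edge-router-2"]  # previously stable sessions

    for r in range(1, rounds + 1):
        print(f"round {r}: sending full table to {pending}")
        busy_for = FULL_TABLE_SEND * len(pending)

        if busy_for > HOLD_TIMER:
            # While the reflector is busy, it sends no keepalives, so the
            # established peers' hold timers expire and those sessions drop.
            print(f"  busy {busy_for}s > hold timer {HOLD_TIMER}s -> {established} drop")
            # The dropped peers reconnect and now also need the full table,
            # so the next round is even busier: the loop feeds itself.
            pending, established = established, pending
        else:
            print("  table sent within the hold timer, sessions stay up")
            break

simulate()
```

The replacement software the Data Centre mentions is intended to break this loop at the first step, by not letting a long-running table send starve the keepalive and hold-timer handling.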