Yesterday's Outage

Andrew
All American, Administrator
Joined: Oct 12, 2011
Messages: 12,735
Yesterday around 1PM our server became unreachable. It became unreachable not because the server itself was having issues but because the network the server is attached to was having issues. Unfortunately, this is the third such incident from our host since our server was moved out of the Dallas datacenter and into the Provo datacenter. It is now very apparent that our host is incapable of maintaining a network on a consistent basis or providing any type of redundancy. These are the types of mistakes that we can't live with.

We will be moving servers in the coming weeks and we will keep you posted. We have a lot of custom code and configuration, and our site takes up a large amount of space, so we have some prep work ahead of us and the move will not be instant.

Andrew
 

Less talk. More wins.
 
Sounds like Dorito blaming his players

292d, 7h, 54m, 50s

That is the uptime of our actual server. Yesterday was not a server crash. This is the message we got from our host. Clearly they blame a firmware bug, but they make no mention of the redundancy that should be in place on their network.
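For anyone who wants to check the arithmetic on that uptime figure, here is a minimal sketch that converts it to seconds and days. The function name `uptime_to_seconds` is made up for illustration; only the uptime string itself comes from our server.

```python
import re

# Seconds per unit used in the uptime string (d, h, m, s).
UNIT_SECONDS = {"d": 86400, "h": 3600, "m": 60, "s": 1}


def uptime_to_seconds(text: str) -> int:
    """Convert a string like '292d, 7h, 54m, 50s' to total seconds."""
    total = 0
    # Each match is a (number, unit) pair such as ('292', 'd').
    for value, unit in re.findall(r"(\d+)\s*([dhms])", text):
        total += int(value) * UNIT_SECONDS[unit]
    return total


uptime = uptime_to_seconds("292d, 7h, 54m, 50s")
print(uptime)          # total seconds the server has been up
print(uptime / 86400)  # the same figure expressed in days
```

The point is simply that the machine itself has been running continuously for most of a year, which a crash or reboot yesterday would have reset.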

Greetings,

While almost all services have been restored as of this time, I'd like to first note that we're still working towards tying up the final loose ends. These remaining issues are absolutely a priority for us currently. In the meantime we do want to provide some more information and answer everyone's questions with what details we do have available currently.

Q: What happened?
A: We experienced a degradation of network service in one of our data centers due to a firmware bug in one of our vendor’s hardware solutions. This was an undocumented bug and we worked with our partner to diagnose the issue and deployed a firmware update to the systems to remediate the problem. Only websites that were being served by this hardware were affected.

Q: Was this related to any previous outage?
A: No, this is unrelated to any previous outages.

Q. Have you identified the problem?
A. Yes, we have isolated the problem to this firmware failure and the downstream effects that resulted from it. We have reviewed our entire network to make sure this problem will not occur elsewhere.

Q. Why did it take so long to address the problem?
A. We started to address the problem immediately when we began to see performance issues. The root cause of the problem was complicated to diagnose because it was an undocumented bug in the software of a vendor’s hardware solution. Full service for some customers was restored immediately, but some servers were not visible on our network. We apologize for any downtime that you experienced. The servers continued to operate during this entire period, which means that at no point in time was your data at risk. The problem was access to the servers because of the firmware issue.

Q. What happened to any email that was sent to me while this firmware issue was affecting the network?
A. There is good and bad news. Unfortunately, any message that was sent to you while we were experiencing this issue would not have been delivered. However, the sender should receive a notice that their mail wasn't delivered, and most mail servers will continue to try to re-send that email at periodic intervals for anywhere from 2 to 7 days. While we cannot guarantee that any emails sent to you will be delivered, there is a very good chance that they will arrive...slightly delayed.

Q: How has Endurance's involvement with HostGator affected the situation?
A. Actually, this was not a result of Endurance. In fact, the team at our corporate headquarters was tremendously helpful in our recovery effort. They stayed with us throughout the entire incident. By committing the resources of the entire company, including technicians, customer service reps, and engineers, we were able to swarm the problem and address it as quickly as possible.

Q. Why did you leave SoftLayer?
A. We moved out of SoftLayer to be able to more fully control our server environment to provide a better customer experience. We work really hard to prevent issues like this from happening. We recognize that this transition has not been as smooth as either you or we would like and we take the issues that have occurred very seriously. We believe in the long run this is the best environment to deliver service to you.

Q. Do I have to worry about this happening again?
A. We would like to say that we will never have a network service outage again, but realistically that isn’t something we can promise. What we can assure you is that we are continually taking steps to audit and improve the performance of our infrastructure, and investing a large amount of capital and people to do this.

Last and certainly not least, I want to thank everyone for your extreme patience throughout this. We realize the situation is hugely frustrating, but we look forward to getting this resolved for you all and hopefully moving forward stronger.
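To make the host's email answer a bit more concrete, here is a rough sketch of how a sending mail server's retry queue would interact with an outage. Everything in it is an assumption for illustration: the function name `first_successful_retry`, the fixed 30-minute retry interval, the placeholder date, and the outage end time are all hypothetical, and real mail servers use their own retry schedules. Only the 2-day give-up window is taken from the low end of the range the host mentions.

```python
from datetime import datetime, timedelta
from typing import Optional


def first_successful_retry(
    sent_at: datetime,
    outage_end: datetime,
    retry_every: timedelta,
    give_up_after: timedelta,
) -> Optional[datetime]:
    """Return the time of the first delivery attempt made once the
    network is reachable again, or None if the sender gives up first."""
    attempt = sent_at
    deadline = sent_at + give_up_after
    while attempt <= deadline:
        if attempt >= outage_end:
            return attempt  # network is back, this attempt gets through
        attempt += retry_every  # still down, sender waits and retries
    return None  # queue lifetime expired before the outage ended


# Hypothetical numbers: mail sent at 2PM on a placeholder date,
# network back at 8PM, sender retries every 30 minutes for up to 2 days.
sent = datetime(2000, 1, 1, 14, 0)
delivered = first_successful_retry(
    sent_at=sent,
    outage_end=datetime(2000, 1, 1, 20, 0),
    retry_every=timedelta(minutes=30),
    give_up_after=timedelta(days=2),
)
print(delivered)
```

Under these assumptions the message arrives late rather than being lost. It would only bounce for good if the outage outlasted the sender's give-up window, which is why the host says delivery is likely but not guaranteed.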

 

Did not read.

What page of the Binder is that on?

All I see in the table of contents is Excuses: System Outages; see also Defense
 

Trust the process...
 
Was there no due diligence when choosing this hosting company to make sure redundant backbone connections were in place? Sorry, had to ask.
 