What Can We Learn from Cloudpocalypse?

As most of you already know, Amazon EC2 had some serious issues last week and this week that took down several big Web 2.0 services. Unfortunately, Get Localization was also affected. Our problems started on Thursday afternoon (UTC), just as AWS was recovering from the first outage. We had survived the first wave of crashes, but the second one took us down.

This was an unfortunate incident, and the easiest response would be to blame Amazon for it. However, that would be wrong. Here's why:

Our company has been using AWS for almost three years now, and this was the first time anything like this happened to us. It must have been a really stressful and difficult situation for the AWS engineers, and despite that they were able to recover their systems for most customers relatively quickly, considering the complexity of the problem and how many were affected (the whole data center).

But the main thing here is that we could have done things better. The point is that we can't blame others when something like this happens, especially when we are in the cloud.

There is no server or infrastructure that never crashes, and the cloud is no different in that sense. What the cloud can offer is an easier and better infrastructure for managing these issues, but only when it is used correctly. The biggest problem for us was that our volumes became unreachable: the data itself was not lost, but the connection between our servers and the volumes was broken. We could have restored the servers and launched new volumes in an unaffected availability zone, but our database servers were down, so we were not able to mirror the most current data onto the new volumes.
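For illustration, here is a minimal sketch of the kind of recovery step that would have been possible with a recent snapshot: creating a fresh volume from that snapshot in an unaffected availability zone and attaching it to a replacement instance. It uses Python with boto3, and the snapshot, instance, and zone identifiers are hypothetical; this is a sketch of the idea, not our actual tooling.

```python
import boto3

# Assumptions: us-east-1 region, a recent snapshot of the database volume,
# and a replacement instance already running in an unaffected zone.
ec2 = boto3.client("ec2", region_name="us-east-1")

SNAPSHOT_ID = "snap-0123456789abcdef0"   # hypothetical snapshot of the DB volume
INSTANCE_ID = "i-0123456789abcdef0"      # hypothetical replacement DB instance
TARGET_ZONE = "us-east-1d"               # an availability zone not hit by the outage

# Create a new volume from the snapshot in the unaffected zone.
volume = ec2.create_volume(
    SnapshotId=SNAPSHOT_ID,
    AvailabilityZone=TARGET_ZONE,
    VolumeType="gp2",
)
volume_id = volume["VolumeId"]

# Wait until the volume is ready before attaching it.
ec2.get_waiter("volume_available").wait(VolumeIds=[volume_id])

# Attach the volume to the replacement instance as its data disk.
ec2.attach_volume(
    VolumeId=volume_id,
    InstanceId=INSTANCE_ID,
    Device="/dev/sdf",
)
print(f"Volume {volume_id} attached to {INSTANCE_ID} in {TARGET_ZONE}")
```

The catch, of course, is that a snapshot is only as fresh as the last time it was taken, which is exactly the problem we ran into.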

We have secondary backups as well, but they are not real-time. We could have restored from them and lost a couple of hours of work, but instead we decided to wait until AWS fixed the problem. That unfortunately took 20 long hours.
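The gap between backups is what makes that trade-off painful. Below is a hedged sketch of the kind of scheduled snapshot job that narrows the window, again in Python with boto3; the volume ID and the hourly schedule are assumptions for the example, not a description of our actual setup.

```python
import boto3
from datetime import datetime, timezone

ec2 = boto3.client("ec2", region_name="us-east-1")

VOLUME_ID = "vol-0123456789abcdef0"  # hypothetical database volume

def snapshot_volume(volume_id: str) -> str:
    """Take a point-in-time EBS snapshot; run this from cron, e.g. hourly."""
    timestamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%MZ")
    response = ec2.create_snapshot(
        VolumeId=volume_id,
        Description=f"Hourly backup of {volume_id} at {timestamp}",
    )
    return response["SnapshotId"]

if __name__ == "__main__":
    print("Created snapshot:", snapshot_volume(VOLUME_ID))
```

The more frequently something like this runs, the less work is on the line when you have to decide between restoring from backup and waiting for a fix.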

Happily, we didn't lose any data, and our systems have been up and running since last Friday, but this is not something we take lightly.

So what did we learn from this?

- Elastic Block Store (EBS) is clearly a weak point. It is persistent storage, but the connection between an instance and its volume is not reliable: it can disconnect and bring the whole database server down. We are working out a way to replace EBS with a more reliable solution so this cannot happen again.

- We cannot trust that availability zones within one region are isolated from each other. This time the problem hit all availability zones in the N. Virginia region, which was not supposed to happen.

- The most critical piece is our API, which must not go down. Its outages are visible to our customers' customers, and we cannot accept that. Getting rid of EBS and distributing across different regions should bring the stability and reliability we need (see the sketch after this list).
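As a concrete example of what "distributing across different regions" can mean in practice, here is a minimal sketch that copies an EBS snapshot from us-east-1 into a second region, so a restore remains possible even if the original region is unavailable. It uses Python with boto3, and the snapshot ID and backup region are hypothetical; it illustrates the idea rather than our production configuration.

```python
import boto3

SOURCE_REGION = "us-east-1"              # the affected region (N. Virginia)
BACKUP_REGION = "us-west-1"              # an independent region for disaster recovery
SNAPSHOT_ID = "snap-0123456789abcdef0"   # hypothetical snapshot in the source region

# copy_snapshot is called against the destination region's endpoint.
ec2_backup = boto3.client("ec2", region_name=BACKUP_REGION)

copy = ec2_backup.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description=f"Cross-region copy of {SNAPSHOT_ID} from {SOURCE_REGION}",
)
print("Snapshot copy started in", BACKUP_REGION, "->", copy["SnapshotId"])
```

With copies like this in a second region, a regional outage becomes a restore-and-redirect exercise instead of a 20-hour wait.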

In retrospect, we could have done better, but on the bright side we learned a lot about how to make our service more reliable. Amazon EC2 is a remarkable platform that gives you the power to distribute your applications all over the world, but we need to keep in mind what Stan Lee once said: "With great power there must also come great responsibility." It's our responsibility to make the system reliable.

See also: Amazon explains what went wrong