Storm Clouds in VA, Happy CloudMiners all over
One of the benefits our users enjoy is running their applications on our infrastructure platform. A great example of how this pays off came during the recent Great AWS Outage of 2012. Our API, and more importantly our users’ apps, stayed up despite Amazon’s system-wide failures.
It all started on Friday night, when our website and API monitoring systems started sending us emails that the website was misbehaving and requests were timing out. An investigation commenced.
Reports started flowing in on Twitter that large swaths of the Internet were down (Netflix, Instagram, Pinterest, and others).
The AWS status page started describing outages in RDS, some availability zones, and EBS systems, attributing them to a power loss due to electrical storms in the area.
Clearly Amazon AWS was not having a good day.
A quick check confirmed that the website was indeed down: it was having trouble reaching the database instance powering it (running on Amazon RDS).
I quickly spun up another database server from a very recent snapshot, but due to the prevailing RDS issues it took about two hours to start. Once it did, the website was back up and running normally.
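For the curious, here's roughly what that recovery step looks like with today's tooling (a minimal sketch using boto3, which didn't exist at the time; the instance and snapshot identifiers are hypothetical, not our actual names):

```python
import boto3

rds = boto3.client("rds", region_name="us-east-1")

# Restore a brand-new DB instance from the most recent snapshot.
# Identifiers here are placeholders for illustration only.
rds.restore_db_instance_from_db_snapshot(
    DBInstanceIdentifier="website-db-recovery",
    DBSnapshotIdentifier="website-db-snapshot-2012-06-29",
)

# Block until the new instance is available before repointing the site at it.
waiter = rds.get_waiter("db_instance_available")
waiter.wait(DBInstanceIdentifier="website-db-recovery")
```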
Our primary concern, though, was the API servers. None of our users’ critical operations depend on the website, but all of their apps are powered by our APIs.
As it turned out, one of our API database servers stopped responding as well.
But there was no need to panic! Our system is built to handle exactly these kinds of failures. Other database servers picked up the load and the API servers kept happily serving requests (though there were some brief ELB issues as well).
The biggest impact on our API servers was that some requests took longer to respond while the database failover was in progress, but they did eventually complete. Taking the failed API database out of the configuration immediately resolved those issues, and we’ll make sure this won’t be a problem in the future.
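Our exact failover mechanics aren't shown here, but the core idea is simple enough to sketch: keep a pool of database endpoints, health-check them, and only route queries to the ones that respond. A minimal illustration (the hostnames, port, and timeout are made up, and a real health check would run an actual test query):

```python
import socket

# Hypothetical pool of redundant database endpoints across AZs.
DB_ENDPOINTS = [
    ("api-db-1.example.internal", 5432),
    ("api-db-2.example.internal", 5432),
    ("api-db-3.example.internal", 5432),
]

def is_healthy(host, port, timeout=1.0):
    """A bare TCP connect check; real checks would execute a test query."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def live_endpoints():
    """Filter the pool down to endpoints that currently respond, so a
    failed server is dropped from rotation instead of timing out every
    request routed to it."""
    return [ep for ep in DB_ENDPOINTS if is_healthy(*ep)]
```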
A potential crisis averted! And some lessons learned.
Running on the cloud is not trivial. In this incident, we had two database servers fail completely (and possibly irrecoverably) with no adverse effects. But most developers run only one database server or application server because of the complexity and cost of redundancy; for them, a single server failure would be catastrophic.
Our API uptime is very important to us because lots of developers depend on us to keep their apps running and their end users happy. This time, our redundancy and failover provisions kept our primary Amazon servers up and serving customers. Had that failed, we were ready to fail over the entire stack to other cloud providers. While others went down, sites and apps powered by CloudMine stayed up.
Always, always have backups. We recovered our RDS instance from a fresh snapshot, and had snapshots of our API database ready to spin up if the need arose.
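If you're on RDS, taking your own snapshots on top of the automated backups is cheap insurance. Roughly, with modern tooling (again boto3, with a hypothetical instance identifier):

```python
import boto3
from datetime import datetime, timezone

rds = boto3.client("rds", region_name="us-east-1")

# Take a manual snapshot in addition to RDS's automated backups.
# The instance identifier is a placeholder.
stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d-%H%M")
rds.create_db_snapshot(
    DBInstanceIdentifier="api-db-1",
    DBSnapshotIdentifier=f"api-db-1-manual-{stamp}",
)
```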
Both of the DB servers that failed (RDS and API) were in the same availability zone, us-east-1d, which seems to have been hit harder than the rest. The key to CloudMine staying up and not losing any data was having multiple redundant API and database servers spread across several availability zones.
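Spreading servers across zones can be as simple as round-robining launches over the AZ list. A sketch (the AMI, instance type, and zone list are placeholders, not our actual configuration):

```python
import boto3
from itertools import cycle

ec2 = boto3.client("ec2", region_name="us-east-1")

# Round-robin launches across several AZs so that one zone's failure
# can't take out every copy of a given server role.
ZONES = cycle(["us-east-1a", "us-east-1b", "us-east-1d"])

def launch_api_server():
    return ec2.run_instances(
        ImageId="ami-12345678",  # hypothetical AMI
        InstanceType="m1.large",
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": next(ZONES)},
    )
```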
Rebooting misbehaving instances in AWS during system-wide issues tends to make things worse. Spinning up new instances and/or EBS volumes takes many API actions and resources, which places even more load on an already hurting system. In these cases, rebooting usually fails anyway, or at least isn’t the quick fix people expect. Plan for other contingencies.