Our take on Amazon Web Services’ outage on Tuesday
As you may have heard, AWS had a major outage on its S3 platform in the US East (us-east-1) region earlier in the week, causing a surprisingly large part of the internet to break.
A few of the sites taken offline include Docker’s Registry Hub, Trello, Travis CI, GitHub and GitLab, Quora, Medium, Signal, Slack, Imgur, Twitch.tv, Razer, heaps of publications that stored images and other media in S3, Adobe’s cloud, Zendesk, Heroku, Coursera, Bitbucket, Autodesk’s cloud, Twilio, Mailchimp, Citrix, Expedia, Flipboard, and Yahoo! Mail (list stolen from this Register article). So this was a big and well-publicised issue, and rightly so.
I’ve seen quite a lot of posts on social media with varying levels of hilarity, bullshit and FUD (fear, uncertainty, doubt) in them.
All hosting providers have outages. Fact. I ran a hosting company for 12 years, so I know this first-hand. Wherever computers and humans mix, there will never be such a thing as an un-downable, unbreakable system.
PLANNING TO FAIL
That said, a properly architected public cloud solution shouldn’t have been badly affected by a failure like this. Replicating your S3 buckets to another region, for example, would have reduced the impact of the outage on your application.
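To make the replication point concrete, here is a minimal sketch of how a cross-region replication rule might be set up with boto3's `put_bucket_replication` call. The bucket names, the IAM role ARN and the rule ID are all made up for illustration, and it assumes versioning is already enabled on both buckets and that the role is allowed to replicate objects. Treat it as a starting point rather than a finished setup.

```python
def build_replication_config(role_arn: str, destination_bucket: str) -> dict:
    """Build the ReplicationConfiguration payload for put_bucket_replication."""
    return {
        "Role": role_arn,  # IAM role S3 assumes to copy objects (hypothetical)
        "Rules": [
            {
                "ID": "replicate-everything",  # hypothetical rule name
                "Priority": 1,
                "Status": "Enabled",
                "Filter": {"Prefix": ""},  # empty prefix matches every object
                "DeleteMarkerReplication": {"Status": "Disabled"},
                "Destination": {"Bucket": f"arn:aws:s3:::{destination_bucket}"},
            }
        ],
    }


def enable_replication(source_bucket: str, config: dict, region: str = "us-east-1") -> None:
    """Apply the rule to the source bucket. Needs boto3 and AWS credentials."""
    import boto3  # third-party; imported here so the builder above has no dependencies

    s3 = boto3.client("s3", region_name=region)
    s3.put_bucket_replication(Bucket=source_bucket, ReplicationConfiguration=config)


# Example with placeholder names (not run here, since it calls AWS):
# config = build_replication_config(
#     "arn:aws:iam::123456789012:role/example-replication-role",
#     "my-app-assets-replica",
# )
# enable_replication("my-app-assets", config)
```

Replication only copies the data. Your application still has to know to read from the second bucket when the first one is unavailable.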
Properly architected cloud solutions should be designed for failure, which means spanning multiple availability zones (AZs) and regions when required.
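Designing for failure on the read side can be as simple as trying a second region when the first one doesn't answer. The sketch below is a generic illustration, not anything AWS-specific: `read_with_failover` and the two stand-in fetchers are hypothetical names, and in a real application each fetcher would wrap a call to a bucket in a different region.

```python
from typing import Callable, Sequence


def read_with_failover(key: str, fetchers: Sequence[Callable[[str], bytes]]) -> bytes:
    """Try each region's fetcher in order and return the first successful read."""
    errors = []
    for fetch in fetchers:
        try:
            return fetch(key)
        except Exception as exc:  # broad on purpose: any failure moves us on
            errors.append(exc)
    raise RuntimeError(f"all {len(fetchers)} regions failed for {key!r}: {errors}")


# Stand-ins for real per-region reads (hypothetical, for illustration only).
def primary_down(key: str) -> bytes:
    raise ConnectionError("primary region unavailable")


def replica_ok(key: str) -> bytes:
    return b"contents of " + key.encode()


# The primary fails, so the read is served from the replica.
data = read_with_failover("logo.png", [primary_down, replica_ok])
```

The order of the list is the order of preference, so the usual choice is the closest or cheapest region first.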
The cloud increases – rather than eliminates – the need for a well-designed system architecture.
However, as with anything, there’s always a balance between cost and uptime, so it’s not as simple as saying ‘you should’ve designed something that was more resilient’.
We should all know that sticking a single server in a ‘traditional’ hosting company’s datacentre, without any replication to another facility, is going to result in an outage at some point. But again, that’s a cost/business decision. The same goes for multi-region replication. It’s about balancing business needs and cost.
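A bit of back-of-the-envelope arithmetic shows what that trade-off looks like. The sketch below assumes a made-up 99.9% availability figure for a single region and assumes that regions fail independently of each other, which is optimistic. The function names are mine, and none of the numbers are any provider's published figures.

```python
HOURS_PER_YEAR = 24 * 365


def combined_availability(single: float, copies: int) -> float:
    """Chance that at least one of `copies` independent deployments is up."""
    return 1 - (1 - single) ** copies


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at the given availability."""
    return (1 - availability) * HOURS_PER_YEAR


single_region = 0.999  # hypothetical figure, not a real SLA
two_regions = combined_availability(single_region, 2)

print(f"one region:  {expected_downtime_hours(single_region):.2f} hours down per year")
print(f"two regions: {expected_downtime_hours(two_regions) * 60:.2f} minutes down per year")
```

On those assumptions a second region cuts expected downtime from hours to well under a minute a year, but you pay for the second copy all year round. Whether that is worth it depends on what an hour of downtime costs your business.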
And if your plan following this outage is to move to Google Cloud or Azure, then you’re barking up the wrong tree.
The major cloud providers will generally give you as much redundancy as you’re willing to pay for. It’s up to you to decide how much you need, and work with a good architect to ensure you’re getting your money’s worth.