Architect for Failure

I’m lucky enough to be part of lots of business and cloud communities, both online and away from a keyboard. It’s a great way to keep your skills sharp and absorb other people’s views and ideas. Recently, there was a thread in one community asking about the availability statistics for an AWS service. I forget which one, and it doesn’t really matter either, as I shall explain here. Because, you see, Cloud Uptime Is Not A Number. We must architect for failure.

Everything Fails, All The Time

Dr Werner Vogels famously coined this phrase, which is a staple of how my team and I architect environments to this day. Everything fails, all the time.

What Dr Vogels is talking about is the failure of an individual component of a system, or a particular service. The quote isn’t designed to fill you with dread, or to make you feel as though AWS is a terrible cloud platform. Rather it is evidence that realistic expectations are set, right from the top of the AWS technical tree. Things are going to go wrong, and we should all be prepared for that eventuality.

What About Service Level Agreements?

Service Level Agreements, or SLAs, are the way service providers reassure their customers that the service they’re paying for is of a given standard. For example, the SLA for AWS Elastic Compute Cloud (EC2) is 99.99% – four nines. That means that they can have a little over four minutes of downtime per month, and there’s nothing you can do about it. The SLA didn’t promise you any more than that.
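To make the arithmetic concrete, here is a minimal Python sketch that turns an availability percentage into a monthly downtime budget. It assumes a 30-day month, and the function name is purely illustrative.

```python
def downtime_budget_minutes(availability_percent: float, days_in_month: int = 30) -> float:
    """Minutes of downtime permitted per month at a given availability percentage."""
    total_minutes = days_in_month * 24 * 60
    return total_minutes * (100.0 - availability_percent) / 100.0


if __name__ == "__main__":
    # Compare two, three and four nines over an assumed 30-day month.
    for sla in (99.0, 99.9, 99.99):
        print(f"{sla}% -> {downtime_budget_minutes(sla):.2f} minutes per month")
```

Over a 30-day month this works out to roughly 432 minutes at 99%, 43.2 minutes at 99.9%, and 4.32 minutes at 99.99%. Each extra nine cuts the budget by a factor of ten.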

But wait, there’s, err, less. That 99.99% availability is for a region, not an availability zone. AWS are saying you will be able to get an EC2 instance to run, somewhere in a region, for 99.99% of the time. But what happens if you only have one EC2 instance in one availability zone? What if you don’t have an auto-scaling policy to start a server up somewhere else if yours fails? You’re out of luck, that’s what!
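As a rough illustration of why a single instance in a single availability zone is the weak spot, the sketch below estimates the chance that at least one of several independent copies is running. The 99% per-zone figure is a made-up number for the example, not an AWS statistic, and treating failures as independent is itself an assumption.

```python
def combined_availability(single_availability: float, copies: int) -> float:
    """Probability that at least one of `copies` independent replicas is up.

    `single_availability` is a fraction between 0 and 1. Independence between
    replicas is an assumption; correlated failures make the real figure lower.
    """
    if copies < 1:
        raise ValueError("copies must be at least 1")
    return 1.0 - (1.0 - single_availability) ** copies


if __name__ == "__main__":
    hypothetical_zone_availability = 0.99  # assumed value, not an AWS figure
    for copies in (1, 2, 3):
        result = combined_availability(hypothetical_zone_availability, copies)
        print(f"{copies} zone(s): {result:.6f}")
```

The numbers themselves matter less than the shape: every extra independent copy multiplies the chance of a total outage by the failure rate of one copy.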

All this isn’t to say that AWS are pulling a fast one, or that you shouldn’t trust that AWS can handle your production workloads. No, they’re just setting out realistic expectations of what is in their power to deliver. The rest is up to you.

And this is where the AWS Well-Architected Framework comes in. Specifically, the Reliability pillar. Because things will fail, you should keep this in mind when designing your cloud infrastructure. You have that one key legacy server which can’t be interrupted and can’t be scaled out to work across multiple servers? That’s still your problem.

How about using AWS Backup to hook into Data Lifecycle Manager, snapshot the server every 12 hours, make an image out of it, and use the image in a Launch Configuration? That’s a recipe I used for a client with exactly those requirements. So now, if the server goes down, the target group will report an unhealthy host and the auto-scaling group will trigger a new server to be spun up. That auto-scaling group will deploy an EC2 instance as per the Launch Configuration, and the image it launches from will only be as old as the last server snapshot.
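One step of that recipe, choosing which image the Launch Configuration should point at, can be sketched in plain Python. This is not the client's actual tooling. The record shape loosely mirrors the `ImageId`, `State` and `CreationDate` fields that EC2 reports for images, the function names and image IDs are hypothetical, and the 12-hour limit comes from the snapshot schedule above.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional


def parse_creation_date(value: str) -> datetime:
    """Parse an ISO 8601 UTC timestamp such as '2024-01-01T12:00:00.000Z'."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%fZ").replace(tzinfo=timezone.utc)


def newest_available_image(images: list[dict]) -> Optional[dict]:
    """Return the most recently created image that is ready to launch, if any."""
    available = [image for image in images if image.get("State") == "available"]
    if not available:
        return None
    return max(available, key=lambda image: parse_creation_date(image["CreationDate"]))


def is_stale(image: dict, now: datetime, max_age: timedelta = timedelta(hours=12)) -> bool:
    """True if the image is older than the snapshot interval allows."""
    return now - parse_creation_date(image["CreationDate"]) > max_age


if __name__ == "__main__":
    # Hypothetical image records; the IDs are placeholders.
    images = [
        {"ImageId": "ami-older", "State": "available", "CreationDate": "2024-01-01T00:00:00.000Z"},
        {"ImageId": "ami-newer", "State": "available", "CreationDate": "2024-01-01T12:00:00.000Z"},
        {"ImageId": "ami-pending", "State": "pending", "CreationDate": "2024-01-02T00:00:00.000Z"},
    ]
    now = datetime(2024, 1, 1, 18, 0, tzinfo=timezone.utc)
    chosen = newest_available_image(images)
    print(chosen["ImageId"], "stale" if is_stale(chosen, now) else "fresh")
```

A staleness check like this is worth alerting on: if the newest image is older than the snapshot interval, the backup job has quietly stopped and a replacement server would come up with old data.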

So you can see that even the most old-fashioned services can be brought, kicking and screaming, into the new world of highly available cloud. Failures are routine in all IT environments, but they are embraced in the cloud.

SLA Payback

And even if a cloud provider does miss their SLA, all you get is a partial bill credit. Imagine your EC2 spend is $1,000 per month, and there is one hour of downtime for your environment at 09:00 on a Monday morning. Does a 10% credit, $100, compensate you and your business adequately?
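The example can be worked through in a few lines of Python. The 10% credit rate is inferred from the $100 on $1,000 figures above rather than read from any provider's current terms, and the hourly business loss is a hypothetical number included only for comparison.

```python
def monthly_uptime_percent(downtime_minutes: float, days_in_month: int = 30) -> float:
    """Uptime for the month as a percentage, given total minutes of downtime."""
    total_minutes = days_in_month * 24 * 60
    return 100.0 * (total_minutes - downtime_minutes) / total_minutes


def service_credit(monthly_bill: float, credit_percent: float) -> float:
    """Bill credit owed when the provider applies `credit_percent` to the bill."""
    return monthly_bill * credit_percent / 100.0


if __name__ == "__main__":
    uptime = monthly_uptime_percent(downtime_minutes=60)
    credit = service_credit(monthly_bill=1000.0, credit_percent=10.0)  # assumed rate
    hypothetical_loss_per_hour = 5000.0  # assumed business cost, purely illustrative
    print(f"Uptime for the month: {uptime:.3f}%")
    print(f"Service credit: ${credit:.2f}")
    print(f"Shortfall: ${hypothetical_loss_per_hour - credit:.2f}")
```

One hour of downtime in a 30-day month is about 99.861% uptime, well short of four nines. The credit is tied to what you spend with the provider, not to what the outage cost you.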

And again, I’m not trying to just say mean things about cloud providers. What I’m trying to do is point out the agreements that we all enter into with the likes of AWS, Google Cloud and Microsoft Azure.

The Wrong Question

Hopefully this illustrates why asking about the availability of one AWS service is probably the wrong question. We’re picking on EC2 here because it’s the easiest one to get to grips with. There’s a good chance a lot of you reading this have been in charge of making sure servers in a data centre keep blinking their lights and spinning their disks. We know how hard it is to achieve!

If you imagine you had your own data centre, with redundant power, connectivity and on and on, are you saying nothing could go wrong? Of course not. A comet might hit your building! Make a plan for that comet, even when you don’t have to care about the building because you use the cloud. Plan, and test your plans. And please architect for failure.
