Understanding Disaster Recovery

It is easy to think of the cloud as an amazing place where nothing could ever go wrong. While that isn’t true, if you use some of the more modern Software as a Service (SaaS) products, the bad things are somebody else’s problem to fix. But what if you, like us, still run lots of virtual machines in the cloud? What would actually happen if one of your cloud provider’s datacentres flooded? Do you have a plan? Today, we’re understanding disaster recovery.

What is a Disaster – in this Context?

That’s actually a really good question, and it might seem obvious on the face of it – but is it? Of course if your cloud provider’s datacentre catches fire, or floods, and your servers happen to be running from that location, you may well lose service. But what if your cloud provider loses network access at one site? What if one of their facilities suffers a security compromise? Are these disasters too?

Well, let’s take a step back from the technology, and think about what a disaster for your business looks like. If your online shop can’t take payments for an hour, is that a disaster? If you’re unable to log on to submit that tender in time, is that a disaster? For us, if we can’t get to our online monitoring tools to ensure our clients’ systems are all online and healthy, that is certainly a disaster.

So firstly, have a think about the risks which might affect your business. They might be connected with the cloud or technology, but a lot of them won’t be. Make a risk matrix and score the risks based on how likely they are to happen, and the impact if they do happen.
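If a spreadsheet feels like overkill, the scoring can be sketched in a few lines of Python. This is only an illustration: the 1 to 5 scales, the example risks and the numbers attached to them are all assumptions made up for the example, not figures from a real assessment.

```python
from dataclasses import dataclass


@dataclass
class Risk:
    name: str
    likelihood: int  # assumed scale: 1 (rare) to 5 (almost certain)
    impact: int      # assumed scale: 1 (minor) to 5 (business-threatening)

    @property
    def score(self) -> int:
        # A common simple scoring scheme: likelihood multiplied by impact.
        return self.likelihood * self.impact


def rank_risks(risks):
    """Return the risks ordered from highest score to lowest."""
    return sorted(risks, key=lambda risk: risk.score, reverse=True)


# Hypothetical entries, loosely based on the examples in this post.
risks = [
    Risk("Cloud datacentre flood", likelihood=1, impact=5),
    Risk("Online shop cannot take payments", likelihood=3, impact=4),
    Risk("Monitoring tools unreachable", likelihood=2, impact=5),
]

for risk in rank_risks(risks):
    print(f"{risk.score:>2}  {risk.name}")
```

The risks at the top of the list are the ones to plan for first.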

Disaster Mitigation

We’re talking about disaster recovery here, but prevention is always better than cure. In this context, the best thing to do is to have multiple redundant assets in different locations. So if you rely on a Domain Controller running in the cloud to be able to log onto your computer, you should make sure there are at least two of them. Need to make sure that special data is never ever lost, no matter what? Yep, keep a couple of copies.

We covered how Regions and Availability Zones work in our How The Cloud Works blog post last week. It’s almost like we plan this, right?

Disaster Recovery

Planning

This is so square, but you just have to have a plan. There’s no getting away from it. Your plan should include things like:

Who needs to be available to make decisions?
Who needs to be available to handle the technology?
Where is all the information you might need in a disaster situation stored and who can access it?
Are key systems going to be available? Comms? Password managers? Bastion servers?
How will you communicate the incident with your users and who will do this?

Recovery Point Objective – RPO

A recovery point objective describes how much data you can stand to lose in the event of a disaster situation. You might decide that no data loss is ever acceptable, and this would be the attitude taken by banks and traders. But that degree of resilience comes at huge cost, and isn’t usually required.

Keep in mind that you’re going to be paying for the replication of your data across the cloud provider’s networks. The more data you replicate, and the more often you replicate it, the more it is all going to cost. In the context of the significant problems which would constitute a disaster situation for your organisation, decide on a suitable RPO. Once you know what that should be, whether it is 5 minutes or 5 days, you can build a solution to deliver this objective.
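To sanity-check a replication schedule against an RPO, a rough model helps. The sketch below assumes data is copied in periodic batches, and that a copy only counts once it has fully arrived at the other site. Under that assumption, the worst-case loss is the gap between copies plus the time a copy takes to transfer. The function names and the example timings are hypothetical.

```python
from datetime import timedelta


def worst_case_data_loss(interval: timedelta, transfer_time: timedelta) -> timedelta:
    # Just before a new copy finishes arriving, the newest complete copy
    # is one full interval plus one transfer time old.
    return interval + transfer_time


def meets_rpo(rpo: timedelta, interval: timedelta, transfer_time: timedelta) -> bool:
    """True if the schedule keeps worst-case loss within the RPO."""
    return worst_case_data_loss(interval, transfer_time) <= rpo


# Hypothetical schedule: copy every 15 minutes, each copy takes 5 minutes.
print(meets_rpo(timedelta(hours=1), timedelta(minutes=15), timedelta(minutes=5)))
print(meets_rpo(timedelta(minutes=5), timedelta(minutes=15), timedelta(minutes=5)))
```

Continuous or synchronous replication behaves differently, so treat this as a starting point for the conversation rather than a guarantee.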

Recovery Time Objective – RTO

Recovery Time Objective describes how long it is acceptable to take to get services back online. This number can be vastly different from your RPO.

Again, talk about this in the context of your business. What would be the impact of your online services being offline for half a business day, for example? If it would mean that a couple of hundred staff can do no work, but you still have to pay them, it’s probably worth investing in technology and processes which give you a short RTO.
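Putting a number on that helps when you ask for budget. Here is a back-of-the-envelope sketch. The headcount, hourly cost and outage length are invented for illustration, and it only counts idle wages, not lost sales or reputation.

```python
def idle_staff_cost(staff: int, hourly_cost: float, outage_hours: float) -> float:
    """Wages paid to staff who cannot work while services are offline."""
    return staff * hourly_cost * outage_hours


# Assumed figures: 200 staff, 25 per hour each, offline for 4 hours.
long_outage = idle_staff_cost(staff=200, hourly_cost=25, outage_hours=4)

# The same outage if a shorter RTO got services back within 30 minutes.
short_outage = idle_staff_cost(staff=200, hourly_cost=25, outage_hours=0.5)

print(f"Half-day outage:   {long_outage:,.0f}")
print(f"30-minute outage:  {short_outage:,.0f}")
print(f"Saved per incident: {long_outage - short_outage:,.0f}")
```

If the saving per incident is larger than what the shorter RTO costs you to deliver, the investment is easy to justify.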

Practice Makes Perfect

An untested DR plan is no DR plan at all. DR plans can be tested in a few different ways, with varying levels of impact on the organisation you’re trying to protect.

For example, have you ever tested that the data you replicate to the other availability zone is intact and works? Why not spin up a replica of your environment in an isolated network, plug in your replicated data, and see what happens?
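A first, cheap check is to confirm that the replicated files match the originals byte for byte. The sketch below assumes both copies are reachable as ordinary directories (for example, mounted volumes). The function names are our own, not part of any cloud provider’s tooling. It tells you the data is intact, not that your applications will start with it, so it complements the isolated-network test rather than replacing it.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hash a file in chunks so large files do not need to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_mismatches(primary: Path, replica: Path) -> list[str]:
    """List files under primary that are missing from, or differ in, replica."""
    problems = []
    for source in sorted(primary.rglob("*")):
        if not source.is_file():
            continue
        relative = source.relative_to(primary)
        copy = replica / relative
        if not copy.is_file():
            problems.append(f"missing: {relative}")
        elif sha256_of(source) != sha256_of(copy):
            problems.append(f"differs: {relative}")
    return problems


# Example usage with hypothetical mount points:
# for problem in find_mismatches(Path("/mnt/primary"), Path("/mnt/replica")):
#     print(problem)
```

An empty list means every file in the primary copy has an identical twin in the replica.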

Why not arrange a dry run of the DR plan out of hours? See if everyone actually does have access to everything they need. Check that everyone knows what they should do and who they should report to.

After you have tested all of these pieces in isolation, and you’re happy they perform as expected, then you can schedule a full DR test. It is true that organisations like Amazon and Google don’t interrupt your service to perform their DR tests, but let’s be honest, none of us have their budget, do we? So don’t be scared to get buy-in from your clients about DR tests. So long as you get the message right, and reassure them that you do this because you’re a responsible provider, they will appreciate it. Give them input into the planning meetings and make them feel important, and a part of the success of the project. Teamwork!

Never Ending Story

We have scratched the surface of disaster recovery planning here, but the truth is that it is a huge topic, and it is never finished. The key thing to remember is that it all relates to what’s important for your business. The business should demand the highest standards from the technology available. Know what you want to achieve, and work together to achieve it.
