Machine learning powering ops monitoring

I’m an old-fashioned ops guy at heart. I came up at a company that had a big, loud, cold datacentre, whirring and flashing all day and all night. That kind of thing still does it for me. If you’ve got a steady workload, and the facility and the staff to keep a menagerie of servers happy, good on you.

But that’s not the way the IT landscape looks today for most folks. Now we’re all about microservices; we’re about running the skinniest, most load-balanced, redundant kit we can. Where ten years ago we might have had one huge database server for the whole world’s users of our app, now we have one close to each major group of users. But looking after lots of smaller services has its tribulations too.

Tending the flock

If you ever work with me (and you totally should), you’ll hear me talking about your IT assets being cattle, not pets. When a node is taking too long to respond, we do the kindest thing and replace it with a younger, healthier node. But how do we know when it’s time to retire Old Yeller? How do we know when we’re seeing more 404 errors than we should? How long is worryingly long for a database write transaction?

Performance is a relative term, isn’t it? We’ve all seen slow response times logging into the company systems on a Monday morning. We all know that if we’re at work past 10PM, the backups are going to eat all the network bandwidth. And we all know that when James is on his fourth coffee by 9:43, his project is going sideways!

The metric who cried wolf

This is the cool bit. This is where us ops guys get to use a smattering of machine learning to solve all of these problems. Well, okay, we can’t help James with his project… yet.

Let’s say you have a nightly deploy of your app at 03:00 every morning. We know latency spikes during the deploy, so we always ignore that CloudWatch alarm. Not very helpful. But what if we just increase the threshold for the alarm, so that it takes a bigger spike in latency to alert us? Well, now we could have poor user performance at random times through the day, our users are fed up, and unless we’re keeping a close watch on our dashboards, we don’t know about it.

AWS CloudWatch anomaly detection uses machine learning to look back over a given metric’s values over the past two weeks and build a model of what to expect, so an alarm only fires when the metric strays outside that expected band. So we no longer receive erroneous alarms at 03:02 every morning during the nightly deploy, because CloudWatch has learned that’s when to expect higher latency, just for a few minutes. And yet we can keep a tight watch over that metric at all other times.
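To make that concrete, here’s a minimal sketch of what such an alarm looks like via the PutMetricAlarm API: one query for the raw metric, plus an ANOMALY_DETECTION_BAND expression that the alarm compares it against. The metric, load balancer name and band width here are hypothetical examples; you’d pass the resulting dict to boto3’s cloudwatch.put_metric_alarm(**params).

```python
def anomaly_alarm_params(metric_name, namespace, dimensions, band_width=2):
    """Build PutMetricAlarm parameters for an anomaly-detection alarm.

    The alarm goes into ALARM state when the metric climbs above the
    upper edge of the band CloudWatch has learned for it.
    """
    return {
        "AlarmName": f"{metric_name}-anomaly",           # hypothetical name
        "ComparisonOperator": "GreaterThanUpperThreshold",
        "EvaluationPeriods": 3,
        "ThresholdMetricId": "ad1",                       # the band below
        "TreatMissingData": "notBreaching",
        "Metrics": [
            {
                # The raw metric we want to watch
                "Id": "m1",
                "MetricStat": {
                    "Metric": {
                        "Namespace": namespace,
                        "MetricName": metric_name,
                        "Dimensions": dimensions,
                    },
                    "Period": 300,
                    "Stat": "Average",
                },
                "ReturnData": True,
            },
            {
                # The learned "expected" band, band_width standard
                # deviations either side of the predicted value
                "Id": "ad1",
                "Expression": f"ANOMALY_DETECTION_BAND(m1, {band_width})",
                "Label": "Expected latency",
                "ReturnData": True,
            },
        ],
    }


# Example: watch ALB response time (load balancer name is made up)
params = anomaly_alarm_params(
    "TargetResponseTime",
    "AWS/ApplicationELB",
    [{"Name": "LoadBalancer", "Value": "app/my-alb/1234567890abcdef"}],
)
# To create the alarm for real:
#   import boto3
#   boto3.client("cloudwatch").put_metric_alarm(**params)
```

Note there’s no fixed Threshold value anywhere: ThresholdMetricId points at the band expression instead, which is exactly what lets the alarm tolerate the 03:00 spike while staying tight the rest of the day.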

Unknown unknowns

These are super-simple examples of what can be achieved with machine learning, but we can go further. What about streaming all of your web service logs into CloudWatch Logs and using machine learning to see when you might be under attack? What about analysing your memory consumption to identify when you might have leaks, or worse, when something is executing that you didn’t want? Anomaly detection can help us with so much, and save us from having to know our systems as intimately as we otherwise might need to. It all helps us to get to a utopia where services look after themselves, and are replaced automatically when required.

You can learn more about AWS CloudWatch Anomaly Detection here: https://docs.aws.amazon.com/AmazonCloudWatch/latest/monitoring/CloudWatch_Anomaly_Detection.html. And you can talk to us about how we can deploy this in your environment by contacting us here.
