Translating Failures into Service-Level Objectives

We’ll be walking back from failure to reliability by using SLOs, with examples from three recent outages.
Oct 10th, 2023 10:24am
Featured image by Raywoo from Shutterstock.

Let’s face it, nobody loves to fail. But failures are our best teachers, and if we let them, they can turn into great opportunities for learning so we can be successful in the future.

Today we’ll be walking back from failure to reliability by using service-level objectives, also known as SLOs, with examples from three recent outages that affected many of us.

But before we dive in, let’s do a quick little refresher on SLOs and SLO practices. If you’d like to get a more in-depth look into them, we invite you to check out our intro post on SLOs.

SLO Refresher

SLOs are one of the tools that we have embraced from site reliability engineering (SRE) that allows us to measure reliability over time. Before we set an SLO, we have to set an SLI, or service-level indicator.

An SLI is a metric, a thing that you measure. More specifically, it’s a two-part measurement, typically a count of good events compared against a count of total events, that answers the question, “What should I be measuring and observing?”

SLIs are the building blocks of SLOs. SLOs help answer the question, “What is the reliability goal of this service?”

When working with SLOs, you should keep the following in mind:

  1. Create SLOs with SLIs that are as close to customer impact as possible.
  2. Not everything has to be an SLO.
  3. Don’t create too many SLOs.
  4. Make SLOs living, breathing things that you revisit and refine.
  5. Make SLOs actionable. This is accomplished by tying them to your observability data.
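
To make this concrete, here is a minimal sketch, in Python, of how an SLI ratio could be checked against an SLO target. The function names and request counts are invented for illustration and aren’t tied to any particular monitoring vendor.

```python
# A minimal sketch of evaluating an SLI against an SLO target from raw event
# counts. Function names and the request numbers are invented for illustration;
# this is not tied to any particular monitoring vendor.

def sli(good_events: int, total_events: int) -> float:
    """The SLI as the ratio of good events to total events."""
    return good_events / total_events if total_events else 1.0

def slo_status(good_events: int, total_events: int, target: float = 0.9995) -> dict:
    """Compare the measured SLI against the SLO target and report the
    remaining error budget, expressed as a number of bad events."""
    allowed_bad = total_events * (1 - target)
    actual_bad = total_events - good_events
    measured = sli(good_events, total_events)
    return {
        "sli": measured,
        "slo_met": measured >= target,
        "error_budget_remaining": allowed_bad - actual_bad,
    }

# Example: 10,000,000 requests in the last 28 days, 4,200 of which failed.
print(slo_status(good_events=9_995_800, total_events=10_000_000))
# {'sli': 0.99958, 'slo_met': True, 'error_budget_remaining': 800.0}
```

Whatever is left in that error budget is what you get to spend on releases, experiments and the occasional bad day.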

The Outages

Now that we’ve had our quick refresher, let’s dig into our outages. But first, it’s important to point out that we are not by any means picking on these failures or on the engineers who worked on these incidents. If anything, we want to send out massive #HugOps to anyone who has worked on these outages or knows the pain of carrying a pager. 💜

Our goal today is to show how we can learn from failure and how you can use incidents to improve your system’s overall reliability.

OK, let’s get started!

Outage 1: X (Twitter), March 6, 2023

In this outage, users were presented with the following error message, “Your current API plan does not include access to this endpoint, please see https://developer.twitter.com/en/docs/twitter-api for more information.” Yikes.

Read more about the X (Twitter) outage.

Let’s try to create an SLO out of this. Since users weren’t able to hit the API endpoint, it stands to reason that we would want our SLO to reflect that, by saying:

99.95% availability of Twitter API in the last 28 days.

Unfortunately, that is not customer-facing enough. In this particular case, the API was actually available, so that wasn’t the problem. This meant if we’d used the above SLO, it would not have been breached, and your team would not have been alerted for this incident.

Instead, we would want something more customer-facing like this one:

99.95% successful responses in the last 28 days.

SLOs should be customer-facing (close to customer impact). This means that rather than tie your SLOs directly to an API, you should tie them to the API’s client instead. Why? Because if the API is down, you have nothing to measure, and therefore you have no SLOs.
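
As a rough sketch of what “customer-facing” means in practice, the snippet below classifies each response the way a caller of the API would experience it, so a response that technically arrives but carries an error payload still counts as a bad event. It assumes the Python requests library and an invented counter structure; it is not Twitter’s actual instrumentation.

```python
# A sketch of a client-side "successful responses" SLI: each response is
# classified the way a caller of the API experiences it, not the way the
# server reports its own health. The counter shape is an invented illustration.
import requests  # assumes the third-party 'requests' package is installed

def is_good_event(resp: requests.Response) -> bool:
    """Count an event as good only if the caller got a usable answer."""
    if not (200 <= resp.status_code < 300):
        return False
    # The API can be "available" yet still hand back an error payload, as it
    # did in March 2023; a server-side availability SLI would miss that.
    return "does not include access to this endpoint" not in resp.text

def record_event(resp: requests.Response, counters: dict) -> None:
    """Feed one observed response into the good/total counts behind the SLI."""
    counters["total"] += 1
    if is_good_event(resp):
        counters["good"] += 1
```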

But why use a 28-day period? You could technically use another time period; however, by using 28 days, it standardizes the SLO over a four-week period that you can compare month-to-month. This allows you to see if you’re drifting into failure as you release your features into production. In addition, this sets a good example of a proper SLO practice and standard to follow within your SRE organization.
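
For a sense of scale, here is the quick arithmetic behind that target: 99.95% over a 28-day window leaves roughly 20 minutes of error budget, assuming you express the budget in time rather than request counts.

```python
# The quick arithmetic behind a 99.95% target measured over 28 days.
window_minutes = 28 * 24 * 60                   # 40,320 minutes in the window
error_budget_minutes = (1 - 0.9995) * window_minutes
print(f"{error_budget_minutes:.2f} minutes")    # 20.16 minutes of budget
```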

Outage 2: Spotify, January 2023

In this outage, users started experiencing issues with Spotify functionality. The trouble was triggered by scheduled maintenance of Spotify’s GitHub Enterprise (GHE) instance, which caused the company’s internal DNS resolvers to fail. Once GHE was brought back up, an invalid configuration was applied to the DNS resolvers. With that invalid configuration, the resolvers entered the dreaded CrashLoopBackOff state and were unable to serve responses to internal DNS queries. This led to cascading failures across Spotify’s systems, including internal tooling, which in turn led to a longer triage time.

Read more about the Spotify outage here.

Since we’re having GitHub config issues, why not create this SLO:

99.99% successful access to config from GitHub in the last 28 days.

Or perhaps this one:

99.99% availability to internal tooling API in the last 28 days.

Unfortunately, neither of the above is customer-facing, and neither reflects the actual issue. So a better SLO would be:

99.99% successful valid configurations sent to DNS resolvers in the last 28 days.

It is important to remember that when we experience an outage, we must revisit our SLOs and check whether they actually covered it: Was the SLO breached? If it was not, then engineers would not have been alerted, which means it was not a good enough SLO, and we must go back to the drawing board.
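
Here is a minimal sketch of how that valid-configuration SLI might be counted. validate_config and push_to_resolvers are hypothetical stand-ins for whatever pipeline actually ships resolver configuration; this is not Spotify’s real tooling.

```python
# A sketch of counting the "successful valid configurations sent to DNS
# resolvers" SLI. validate_config and push_to_resolvers are hypothetical
# stand-ins, not Spotify's real pipeline.

def validate_config(config: dict) -> bool:
    # Placeholder check; a real validator would lint the resolver config
    # before it ever reaches production resolvers.
    return bool(config.get("zones"))

def push_to_resolvers(config: dict) -> bool:
    # Placeholder delivery step; assume it succeeds for this sketch.
    return True

def deliver_config(config: dict, counters: dict) -> None:
    """One SLI event: good only if the config is both valid and delivered."""
    counters["total"] += 1
    if validate_config(config) and push_to_resolvers(config):
        counters["good"] += 1

counters = {"good": 0, "total": 0}
deliver_config({"zones": ["internal.example.com"]}, counters)
deliver_config({}, counters)  # an invalid (empty) config burns error budget
print(counters)               # {'good': 1, 'total': 2}
```

The point is that the good/total counting happens at the step the consumers of the config actually depend on: a configuration only counts as good if it is both valid and delivered.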

Outage 3: The Reddit Pi Day Outage

This outage occurred when Reddit engineers attempted to upgrade one of their Kubernetes clusters. Shortly after the upgrade, engineers noticed that site traffic had come to a halt. As they began troubleshooting, they noticed that their pods were taking an extremely long time to start and stop and that their container images were taking a very long time to pull.

In addition, someone noticed that they were getting a lot of timeouts in the API server logs for write operations, but not specifically on the “writes” themselves. Instead, the timeouts were happening while attempting to call the admission controllers on the cluster.

The team eventually decided to restore from backup, but even that was not without its own set of challenges. The backup instructions were out of date, and among other things, they were initially unable to apply TLS certificates, due to a hostname mismatch.

Read about the Reddit outage here.

We might be tempted to create SLOs that look like this:

99.99% availability for API server in the last 28 days.

And this:

99.99% successful connections to DNS server in the last 28 days.

But as you may have noticed by now, these are not the SLOs that we are looking for. First off, we don’t want to have too many SLOs.

Secondly, as far as the first SLO above is concerned, checking the availability of the API server wasn’t a very good indicator of the underlying problem.

This quote from Alex Hidalgo best sums it up:

“If someone performs a request call against the API and they receive a response in a timely manner and free of errors, they still are not going to be happy if they can’t understand what that response is.”
Hidalgo, Alex, “Implementing Service Level Objectives,” O’Reilly Media.

Lastly, you don’t want to create an SLO for your DNS server. That should come from your cloud provider or DNS provider and is something that would be on their plate, not yours.

A better SLO would be what Reddit actually has in place right now:

99.95% availability on overall services in the last 28 days.

An SLO shouldn’t care what failed; you just know that you weren’t able to make the request. The “how” is explained by your telemetry, such as your traces, metrics and logs.
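
Here is a sketch of what such a failure-agnostic SLI could look like measured from the edge: a request is either served in a reasonable time or it isn’t, regardless of whether DNS, the API server or an admission controller was the culprit. The status-code rule and the latency tolerance below are assumptions for illustration, not Reddit’s actual definitions.

```python
# A sketch of a coarse, failure-agnostic SLI measured from the edge: a request
# is either served in a reasonable time or it isn't, no matter which component
# was the culprit. The status-code rule and the 5-second latency tolerance are
# assumptions for illustration, not Reddit's actual definitions.

def is_served(status_code: int | None, latency_seconds: float) -> bool:
    """Good event = a non-5xx response that arrived within a tolerable time."""
    if status_code is None:             # connection error: no response at all
        return False
    if status_code >= 500:
        return False
    return latency_seconds <= 5.0

print(is_served(200, 0.3))    # True: the user got an answer
print(is_served(None, 30.0))  # False: why it failed is a question for telemetry
```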

This outage also reminds us of some practices from DevOps/SRE that we should always be following:

Having good rollout practices in place, ensuring that your systems are observable and keeping your runbooks up to date all make a real difference when you’re trying to bring your systems back up, whether it’s due to an outage or maintenance.

Where Do We Go from Here?

We can be proactive about failure, and about creating SLOs, by practicing chaos engineering and running game days.

Chaos engineering is the discipline of experimenting on a system to build confidence in the system’s capability to withstand turbulent conditions in production. We inject failure into our systems to see how they would react if that failure were to happen on its own. This allows us to learn from failure, document it and prepare for failures like it. We can start practicing these types of experiments with game days.

A game day is a time when your team or organization comes together to do chaos engineering. This can look different for each organization, depending on the organization’s maturity and architecture. These can be in the form of tabletop exercises, open source/internal tooling such as Chaos Monkey or LitmusChaos, or vendors like Harness Chaos Engineering or Gremlin. No matter how you go about starting this practice, you can get comfortable with failure and continue to build a culture of embracing failure at your organization. This also allows you to continue checking on those SLOs we just set up.
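
If you want something even smaller than adopting a chaos tool, a game day can start with a toy fault injector like the sketch below: deliberately fail a fraction of calls and check whether the resulting SLO burn actually shows up in your dashboards and alerts. The wrapper and failure rate here are invented for illustration; they are not Chaos Monkey, LitmusChaos or any vendor’s API.

```python
# A toy fault injector for a small-scale game day: deliberately fail a fraction
# of calls, then check whether the resulting SLO burn shows up in your
# dashboards and alerts. This is an invented illustration, not Chaos Monkey,
# LitmusChaos or any vendor's API.
import random

def with_injected_failures(func, failure_rate: float = 0.05):
    """Wrap `func` so that roughly `failure_rate` of calls raise an error."""
    def wrapper(*args, **kwargs):
        if random.random() < failure_rate:
            raise RuntimeError("injected failure (game day experiment)")
        return func(*args, **kwargs)
    return wrapper

def handle_request(user_id: int) -> str:
    return f"ok:{user_id}"

flaky_handler = with_injected_failures(handle_request, failure_rate=0.10)
counts = {"good": 0, "total": 0}
for i in range(1_000):
    counts["total"] += 1
    try:
        flaky_handler(i)
        counts["good"] += 1
    except RuntimeError:
        pass  # each injected failure is a spent piece of error budget
print(counts["good"] / counts["total"])  # should hover around 0.90
```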

Final Thoughts

We would like to remind folks that y’all can learn from incidents at other companies, like the ones we have presented here today. Incident databases like thevoid.community can help you find and learn about more failures, to help future-proof your own organization.

Now go forth and SLO on!

This blog post is based on our SLOConf 2023 talk of the same name. Be sure to check it out.
