SLAs Are Just Promises You Plan to Break
I’ve been writing software since before the internet had web pages. Back then, an SLA was simple: “It works when I’m in the office.” Nobody asked questions. Nobody filed tickets. If the system was down at 3 AM, that was God’s problem, not mine.
Now we have SLAs, SLOs, SLIs, error budgets, burn rates, toil metrics, and an entire industry of people who get paid to measure how broken things are instead of fixing them. Progress.
Let me tell you how Service Level Agreements actually work, from someone who has violated approximately 340 of them.
What an SLA Really Is
An SLA is a document that says: “We promise the system will work X% of the time, and if it doesn’t, we’ll do something vague as compensation.”
The key insight: the compensation is always less than the pain caused. Nobody has ever gone bankrupt from SLA credits. But many engineers have gone bald from honoring them.
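To put hypothetical numbers on that asymmetry (nothing below comes from a real contract; every figure is invented):

```python
# The compensation gap, with made-up but representative numbers.
monthly_invoice = 10_000    # what you pay the vendor per month, USD
credit_rate = 0.05          # a typical SLA credit: 5% of one invoice
revenue_per_hour = 25_000   # what an outage costs you, USD/hour
outage_hours = 4

credit = monthly_invoice * credit_rate
pain = revenue_per_hour * outage_hours
print(f"Credit: ${credit:,.0f}. Pain: ${pain:,.0f}. Gap: ${pain - credit:,.0f}")
```

That gap is the business model.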
Here’s a table I’ve compiled over 47 years of ignoring agreements:
| SLA Claim | What It Means | What Actually Happens |
|---|---|---|
| 99.9% uptime | Down 8.76 hours/year | Down during every demo |
| 99.99% uptime | Down 52 minutes/year | Down for 52 minutes on Black Friday |
| 99.999% uptime | Down 5 minutes/year | Down for 5 minutes… in the CEO’s pitch to investors |
| 100% uptime | You’ve been lied to | You’ve definitely been lied to |
The math checks out. The human element does not.
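You can check the arithmetic column yourself; it's one multiplication per row, assuming a 365-day year:

```python
# Allowed downtime per year for each availability target.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (99.9, 99.99, 99.999, 100.0):
    allowed = MINUTES_PER_YEAR * (1 - target / 100)
    print(f"{target}% uptime -> {allowed:.1f} minutes of downtime per year")
```

The first row works out to 525.6 minutes, which is the 8.76 hours above. Scheduling them during demos is left to the universe.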
Error Budgets: Permission Slips to Fail
The modern SRE movement invented something called an “error budget.” This is the amount of downtime you’re allowed (one minus your SLO target, converted into minutes) before people start asking questions.
I want to be clear: we invented a budget for failure.
In accounting, budgets are for things you want to spend carefully. Food. Marketing. R&D. We have now applied this framework to breaking production. Wally from the Dilbert comics would be proud — he spent 30 years figuring out how to do nothing and get paid for it. We spent 30 years figuring out how to break things and frame it as a feature.
The natural consequence: engineers now look at an error budget and think, “We haven’t used our downtime allowance this quarter. Let me deploy this risky change before December.”
You built a system that incentivizes controlled chaos. Congratulations on your Six Sigma.
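For the record, the permission slip takes four lines to compute. A minimal sketch, assuming a 99.9% SLO over a 90-day quarter (the incident-log number is invented):

```python
# How much failure we are officially allowed this quarter.
SLO = 0.999
QUARTER_MINUTES = 90 * 24 * 60  # 129,600

budget = QUARTER_MINUTES * (1 - SLO)  # ~129.6 minutes of sanctioned downtime
burned = 12.0                         # hypothetical: minutes lost to incidents so far
remaining = budget - burned

print(f"Budget: {budget:.1f} min, burned: {burned:.1f}, remaining: {remaining:.1f}")
if remaining > 60:
    print("Plenty left. Deploy the risky change before December.")
```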
SLOs: SLAs for People Too Busy to Read SLAs
An SLO is an “internal” SLA that your team sets for themselves. Nobody will sue you if you miss it. Your manager will send a disappointed Slack message, which is worse.
Here’s the thing nobody tells you: SLOs are aspirational fiction.
The correct way to set an SLO is:
- Look at your current uptime
- Subtract 0.1%
- Call it your “target”
You will always meet this SLO. Everyone will congratulate you. You will get promoted. The system hasn’t improved at all, but now you have a dashboard showing you’re “meeting your objectives,” and dashboards are what matter.
```python
# The SLO-setting algorithm, from first principles.
def calculate_actual_uptime():
    return 99.7  # whatever the dashboard said last time anyone looked

last_quarter_actual = 99.2  # hypothetical number for a bad quarter

current_uptime = calculate_actual_uptime()
slo_target = current_uptime - 0.1  # Ambitious but achievable

# If you failed last quarter, use this instead:
slo_target = last_quarter_actual - 0.5  # Even more achievable

print(f"Our SLO target is {slo_target:.1f}%")
print("We are committed to reliability.")
print("(Source: trust me)")
```
Incident Response Playbooks: Documentation for Problems You Caused
The height of SLA culture is the incident response playbook — a document that explains what to do when everything is on fire.
I’ve been on-call since on-call was a pager the size of a brick. In 47 years, I have followed an incident response playbook exactly zero times. Here’s why:
- When things break, the playbook never covers the specific way they broke this time
- The person who wrote the playbook left the company
- Nobody can find the playbook
- The wiki is down (see: incident)
The correct incident response playbook is:
```
Step 1: Panic
Step 2: Blame the last deploy
Step 3: Roll back the last deploy
Step 4: Panic again when the rollback doesn't help
Step 5: Restart everything
Step 6: Check if it was a DNS issue (it was)
Step 7: Mark incident resolved
Step 8: Write a post-mortem blaming DNS
Step 9: Never actually fix the DNS issue
```
This covers 94% of all incidents. The remaining 6% involve cloud providers, and there’s nothing you can do about those anyway. XKCD agrees.
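Step 6 is the only step worth automating. A minimal sketch (the hostname is hypothetical; substitute whatever just paged you):

```python
# Step 6, automated: check whether DNS still resolves your service.
import socket

try:
    socket.getaddrinfo("api.example.com", 443)  # hypothetical hostname
    print("DNS resolves. Return to Step 2.")
except socket.gaierror as exc:
    print(f"It was DNS. It is always DNS. ({exc})")
```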
Uptime SLAs in Practice: A Retrospective
Let me share some real wisdom about what “X nines of uptime” means in practice.
Three nines (99.9%): You get 8.76 hours of downtime per year. This sounds like a lot until your system goes down during a product launch, a board presentation, and the company all-hands — all within the same fiscal quarter.
Four nines (99.99%): 52 minutes per year. Teams that advertise four nines usually achieve this by defining “downtime” very creatively. Slow responses don’t count. Errors under 5% don’t count. Database timeouts don’t count if you squint.
Five nines (99.999%): Five minutes per year. Systems that claim this either have nothing happening on them, or they have a marketing team that’s better at their job than your engineers are at theirs.
Six nines (99.9999%): Reserved for systems that have never been observed to fail because they’ve never been used.
The Honest SLA Template
After decades of industry experience, I present the only honest SLA ever written:
“The system will be available when it feels like it, subject to change without notice, for reasons that will be explained in a post-mortem you won’t read. Credits will be issued in the form of a 5% discount on next month’s invoice. The vendor is not responsible for: cosmic rays, misconfigured BGP routes, someone’s backhoe hitting fiber somewhere in Ohio, a junior engineer’s ‘quick change,’ Leap Day, or any event described as ‘unprecedented.’”
Sign here. Initial here. The legal team says this is enforceable in 12 jurisdictions.
The Real Solution
The honest truth is that reliable systems come from caring about reliability, not from writing agreements about it.
But caring about reliability requires:
- Proper monitoring (expensive)
- Runbook maintenance (boring)
- Architecture that degrades gracefully (difficult)
- Engineers who sleep (apparently optional)
- Actually reading the post-mortems (nobody does)
Much easier to write an SLA, add a 9 to the percentage, and let the incident manager worry about it.
I’ve been a senior engineer for 47 years. My systems have an uptime SLA of 99.9999%, if you calculate it starting from when I last rebooted the server, excluding the past seven incidents, and only during business hours in UTC+3.
That’s not a lie. That’s just creative measurement.
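For the auditors, here is the methodology in full. Every number below is invented, which is rather the point:

```python
# Creative uptime: start the clock at the last reboot and quietly
# exclude the seven known incidents. (Restricting the math to business
# hours in UTC+3 is left as an exercise for the auditor.)
from datetime import datetime, timedelta, timezone

TZ = timezone(timedelta(hours=3))
last_reboot = datetime(2024, 11, 1, 9, 0, tzinfo=TZ)   # measurement epoch
now = datetime(2024, 11, 29, 17, 0, tzinfo=TZ)

measured_hours = (now - last_reboot).total_seconds() / 3600
excluded_hours = 7 * 2.0   # seven incidents, two hours each, gone
downtime_hours = 0.0005    # roughly two seconds survive the exclusions

uptime = 100 * (1 - downtime_hours / (measured_hours - excluded_hours))
print(f"Uptime: {uptime:.4f}%")  # six nines, exactly as advertised
```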
The author has violated 340 SLAs across 12 companies and 6 countries. Legal is still catching up. The monitoring system that would detect his current SLA violations is currently down.