The Real Meaning of SLIs, SLOs, and Error Budgets
SLIs, SLOs, and error budgets sound like corporate buzzwords. But when you understand them properly, they transform how you balance reliability with velocity.
"We need 99.99% uptime!"
I've heard this demand from stakeholders countless times. It sounds reasonable. Who doesn't want four nines of reliability?
But here's what that actually means:
- 99.99% uptime = about 53 minutes of allowed downtime per year
- That's about 4.3 minutes per month
- Or about 1 minute per week
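The arithmetic behind those bullets is worth checking yourself. A minimal sketch (assuming a 365-day year) that converts an availability target into allowed downtime:

```python
def allowed_downtime_minutes(slo: float, period_minutes: float) -> float:
    """Downtime permitted by an availability SLO over a given period."""
    return (1 - slo) * period_minutes

YEAR = 365 * 24 * 60   # 525,600 minutes
MONTH = YEAR / 12      # ~43,800 minutes
WEEK = 7 * 24 * 60     # 10,080 minutes

print(round(allowed_downtime_minutes(0.9999, YEAR), 1))   # ~52.6 minutes/year
print(round(allowed_downtime_minutes(0.9999, MONTH), 1))  # ~4.4 minutes/month
print(round(allowed_downtime_minutes(0.9999, WEEK), 1))   # ~1.0 minute/week
```

Swap in 0.999 or 0.995 to see how dramatically each extra nine shrinks your room to maneuver.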
Can you deploy a critical security patch, troubleshoot a database issue, and handle an unexpected traffic spike—all in under 4 minutes per month?
No? Then you don't actually want 99.99% uptime. You want something else.
This is where SLIs, SLOs, and error budgets come in. They're not buzzwords. They're a framework for having honest conversations about reliability vs velocity.
Let me explain what they really mean and why they matter.
SLI: Service Level Indicator
An SLI is a metric that matters to your users.
Not CPU usage. Not disk I/O. Not how pretty your dashboard looks.
What do your users actually care about?
Bad SLI Examples:
- "Server CPU under 80%"
- "Database queries under 100ms"
- "Fewer than 10 errors per minute"
These are implementation details. Your users don't care about your CPU usage. They care about whether your service works.
Good SLI Examples:
- "Percentage of HTTP requests that return in under 200ms"
- "Percentage of login attempts that succeed"
- "Percentage of search queries that return results in under 1 second"
Notice the pattern? Good SLIs are:
- User-centric: Measured from the user's perspective
- Measurable: You can actually track them
- Meaningful: They correlate with user satisfaction
The SLI Formula
Most SLIs follow this pattern:
SLI = (Good Events / Total Events) × 100%
Example: If you handled 10,000 requests and 9,950 succeeded:
SLI = (9,950 / 10,000) × 100% = 99.5%
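In code, the formula is a one-liner. A minimal sketch using the numbers above (the zero-traffic convention is an assumption, not a standard):

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI as the percentage of good events over total events."""
    if total_events == 0:
        return 100.0  # no traffic: treat as meeting the SLI (a convention, pick your own)
    return good_events / total_events * 100

print(sli(9_950, 10_000))  # 99.5
```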
Choosing Your SLIs
Start with the basics. For most web services:
- Availability: Can users reach you?
  - "% of requests that return a 2xx or 3xx status code"
- Latency: Are you fast enough?
  - "% of requests that complete in under 200ms"
- Quality: Does it work correctly?
  - "% of API calls that return correct results"
You don't need dozens of SLIs. Start with 3-5 that matter most to your users.
SLO: Service Level Objective
An SLO is your target for an SLI.
If your SLI measures success rate, your SLO says: "We want 99.9% success rate."
Setting SLOs
Here's the secret: Your SLO should be based on what users actually need, not what sounds impressive.
Ask:
- What reliability do users actually need?
- What's the business impact of missing that target?
- What can we realistically achieve and maintain?
Example SLOs:
For a critical payment system:
- Availability: 99.95% of payment requests succeed
- Latency: 99% of payment requests complete in under 1 second
For a social media feed:
- Availability: 99.5% of feed loads succeed
- Latency: 95% of feeds load in under 2 seconds
Notice the payment system has tighter targets? That's intentional. Different services need different reliability levels.
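One lightweight way to keep per-service targets explicit is a shared registry your tooling can read. A sketch of the two examples above (the service names and field structure here are illustrative assumptions, not a standard format):

```python
# Hypothetical SLO registry; field names are an illustrative choice.
SLOS = {
    "payments": {
        "availability": 0.9995,  # 99.95% of payment requests succeed
        "latency": {"threshold_ms": 1000, "target": 0.99},
    },
    "social-feed": {
        "availability": 0.995,   # 99.5% of feed loads succeed
        "latency": {"threshold_ms": 2000, "target": 0.95},
    },
}
```

Keeping targets in one place makes the intentional differences between services visible instead of buried in alert rules.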
The Trap of Too-High SLOs
Here's a mistake I see constantly: setting SLOs at the highest level you've ever achieved.
"We were 99.95% available last quarter, so let's make that our SLO!"
Bad idea. Here's why:
- You had good luck: No major incidents, no big deployments, no unexpected traffic spikes
- You can't maintain it: Achieving something once doesn't mean you can sustain it
- You'll block all changes: Any risky work (like major refactors) threatens your SLO
- You'll burn out your team: Unrealistic targets create constant stress
Better approach: Set your SLO slightly below your historical performance. This gives you room to take risks, deploy changes, and handle incidents without constantly being out of SLO.
Error Budget: The Game Changer
Error budgets are where SRE gets really interesting.
If your SLO is 99.9% success rate, that means:
- 99.9% of requests should succeed
- 0.1% can fail
That 0.1% is your error budget. It's how much unreliability you can tolerate.
Why Error Budgets Matter
They transform the reliability conversation from emotional to mathematical.
Without error budgets:
- "We can't deploy today, it's too risky!"
- "We need to slow down and focus on reliability!"
- "We can't try this new feature, it might break things!"
With error budgets:
- "We've used 30% of our error budget this month. We can take reasonable risks."
- "We've used 95% of our error budget. Feature freeze until we improve reliability."
- "We have error budget to spare. Let's try that risky refactor."
Calculating Error Budgets
If your SLO is 99.9% over 30 days:
Total requests: 10,000,000
Allowed failures: 10,000,000 × 0.1% = 10,000 failed requests
That's your error budget.
If you have an incident that causes 5,000 failed requests, you've consumed 50% of your error budget.
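That accounting translates directly into code. A minimal sketch, assuming count-based budgets as in the example above:

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Allowed failures for a count-based SLO over a window."""
    return round(total_requests * (1 - slo))

def budget_consumed(failures: int, budget: int) -> float:
    """Percentage of the error budget spent so far."""
    return failures / budget * 100

budget = error_budget(10_000_000, 0.999)
print(budget)                          # 10000 allowed failures
print(budget_consumed(5_000, budget))  # 50.0 (% consumed)
```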
Using Error Budgets
This is where it gets powerful.
Scenario 1: You have error budget remaining
You can:
- Deploy more aggressively
- Try risky refactors
- Experiment with new features
- Push changes on Friday (controversial, but the math supports it)
Scenario 2: You've exhausted your error budget
You must:
- Freeze feature launches
- Focus on reliability work
- Fix the issues causing failures
- Improve testing and deployment processes
This removes the political argument. The data decides.
Error Budget Policies
Define ahead of time what happens at different budget levels:
Example policy:
- 90-100% budget remaining: Full speed ahead. Ship features, take risks.
- 50-90% budget remaining: Normal operations. Balance features and reliability.
- 10-50% budget remaining: Caution. Increase review of risky changes.
- 0-10% budget remaining: Feature freeze. All hands on reliability.
- Budget exhausted: Hard stop on features until reliability improves.
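Those tiers translate naturally into a lookup, which is handy for wiring into dashboards or CI gates. A sketch (the tier labels come from the policy above; the function itself is an illustrative assumption):

```python
def policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to the example policy tiers."""
    if budget_remaining_pct <= 0:
        return "Hard stop on features until reliability improves"
    if budget_remaining_pct <= 10:
        return "Feature freeze: all hands on reliability"
    if budget_remaining_pct <= 50:
        return "Caution: increase review of risky changes"
    if budget_remaining_pct <= 90:
        return "Normal operations: balance features and reliability"
    return "Full speed ahead: ship features, take risks"

print(policy(95))  # Full speed ahead: ship features, take risks
print(policy(5))   # Feature freeze: all hands on reliability
```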
This gives teams clear guidance without emotional debates.
Putting It All Together
Let's walk through a real example.
Service: E-commerce checkout
SLI: Percentage of checkout requests that complete successfully in under 3 seconds
SLO: 99.9% over a 30-day rolling window
Measurement period: Last 30 days
- Total checkout attempts: 5,000,000
- Successful completions under 3s: 4,995,000
- Current SLI: 99.9%
Error Budget:
- Allowed failures: 5,000,000 × 0.1% = 5,000
- Actual failures: 5,000
- Budget consumed: 100%
Decision: Feature freeze. Focus on reliability.
Action items:
- Identify why 5,000 checkouts failed or were slow
- Fix the root causes
- Improve monitoring to detect issues faster
- Add automated testing to prevent regressions
- Don't lift feature freeze until current SLI improves
Two Weeks Later
Measurement period: Last 30 days (new window)
- Total checkout attempts: 5,200,000
- Successful completions under 3s: 5,197,400
- Current SLI: 99.95%
Error Budget:
- Allowed failures: 5,200,000 × 0.1% = 5,200
- Actual failures: 2,600
- Budget consumed: 50%
Decision: Resume feature work. We're back in good standing.
See how the math removes emotion? No arguments about "feeling" like things are stable. The data shows we're meeting our SLO.
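Plugging both measurement windows into the same arithmetic confirms the decisions. A minimal sketch:

```python
def budget_consumed_pct(total: int, good: int, slo: float) -> float:
    """Percentage of the error budget consumed in a window."""
    failures = total - good
    budget = total * (1 - slo)
    return failures / budget * 100

# Window 1: budget fully consumed, so feature freeze
print(round(budget_consumed_pct(5_000_000, 4_995_000, 0.999)))  # 100

# Window 2, two weeks later: back to 50%, so resume feature work
print(round(budget_consumed_pct(5_200_000, 5_197_400, 0.999)))  # 50
```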
Common Mistakes
Mistake 1: Too Many SLIs
You don't need to measure everything. Pick 3-5 metrics that truly matter to users.
Too many SLIs = diluted focus and alert fatigue.
Mistake 2: Internal SLIs
"Our database query time is under 50ms"—that's not an SLI. Users don't care about your database.
Focus on the user experience, not internal implementation details.
Mistake 3: SLOs Without Error Budgets
Setting SLOs is good. Using error budgets to make decisions is where the real power comes from.
Don't stop halfway.
Mistake 4: Not Enforcing Error Budget Policy
If you hit 0% error budget and keep shipping features anyway, you've defeated the purpose.
The policy only works if you actually follow it.
Mistake 5: Setting SLOs You Can't Measure
If you can't accurately measure your SLI, you can't track your SLO or error budget.
Invest in instrumentation first.
Starting Your SLI/SLO Journey
Here's how to begin:
Week 1: Identify Your SLIs
Answer: "What do users care about most?"
Pick 3-5 metrics. Start simple.
Week 2: Start Measuring
Get the data flowing. You need historical data to set realistic SLOs.
Week 3: Analyze Historical Performance
What's your actual reliability over the past 3 months? Don't cherry-pick your best period.
Week 4: Set Initial SLOs
Set them slightly below your historical average. Give yourself room to operate.
Week 5: Calculate and Track Error Budgets
Build dashboards that show error budget consumption in real-time.
Week 6: Define Your Error Budget Policy
What happens at 90%, 50%, 10%, and 0% budget? Document it.
Month 2 and Beyond: Iterate
Adjust SLOs based on what you learn. Too tight? Loosen them. Too loose? Tighten them.
This is a journey, not a destination.
The Bigger Picture
SLIs, SLOs, and error budgets are more than metrics. They're a framework for:
- Honest conversations about reliability vs velocity
- Data-driven decisions instead of politics
- Empowering teams to take appropriate risks
- Balancing user happiness with engineering velocity
They replace "Should we deploy this?" arguments with "Do we have error budget?" math.
They transform reliability from a vague goal into a measurable practice.
Conclusion
SLIs measure what users care about. SLOs set targets for those measurements. Error budgets tell you how much room you have to take risks.
Together, they create a framework for building reliable systems without sacrificing velocity.
Start small. Measure what matters. Set realistic targets. Use error budgets to make decisions.
And remember: 100% reliability is the wrong goal. The right goal is the level of reliability your users need, achieved at a sustainable pace your team can maintain.
That's the real meaning of SLIs, SLOs, and error budgets.
How does your team handle the reliability vs velocity tradeoff? Are you using SLOs and error budgets? Let me know your experience or questions.