The Real Meaning of SLIs, SLOs, and Error Budgets
SLIs, SLOs, and error budgets sound like corporate buzzwords. But when you understand them properly, they transform how you balance reliability with velocity.
"We need 99.99% uptime!"
I've heard this demand from stakeholders countless times. It sounds reasonable. Who doesn't want four nines of reliability?
But here's what that actually means:
- 99.99% uptime = about 53 minutes of allowed downtime per year
- That's about 4.3 minutes per month
- Or about 1 minute per week
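The arithmetic behind those bullets is worth checking yourself. A minimal sketch (assuming a 365-day year) that converts an availability target into allowed downtime:

```python
def allowed_downtime_minutes(slo: float, period_minutes: float) -> float:
    """Downtime permitted by an availability SLO over a given period."""
    return (1 - slo) * period_minutes

YEAR = 365 * 24 * 60   # 525,600 minutes
MONTH = YEAR / 12      # ~43,800 minutes
WEEK = 7 * 24 * 60     # 10,080 minutes

print(round(allowed_downtime_minutes(0.9999, YEAR), 1))   # ~52.6 minutes/year
print(round(allowed_downtime_minutes(0.9999, MONTH), 1))  # ~4.4 minutes/month
print(round(allowed_downtime_minutes(0.9999, WEEK), 1))   # ~1.0 minute/week
```

Swap in 0.999 or 0.995 to see how dramatically each extra nine shrinks your room to maneuver.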
Can you deploy a critical security patch, troubleshoot a database issue, and handle an unexpected traffic spike—all in under 4 minutes per month?
No? Then you don't actually want 99.99% uptime. You want something else.
This is where SLIs, SLOs, and error budgets come in. They're not buzzwords. They're a framework for having honest conversations about reliability vs velocity.
Let me explain what they really mean and why they matter.
SLI: Service Level Indicator
An SLI is a metric that matters to your users.
Not CPU usage. Not disk I/O. Not how pretty your dashboard looks.
What do your users actually care about?
Bad SLI Examples:
- "Server CPU under 80%"
- "Database queries under 100ms"
- "Fewer than 10 errors per minute"
These are implementation details. Your users don't care about your CPU usage. They care about whether your service works.
Good SLI Examples:
- "Percentage of HTTP requests that return in under 200ms"
- "Percentage of login attempts that succeed"
- "Percentage of search queries that return results in under 1 second"
Notice the pattern? Good SLIs are:
- User-centric: Measured from the user's perspective
- Measurable: You can actually track them
- Meaningful: They correlate with user satisfaction
The SLI Formula
Most SLIs follow this pattern:
SLI = (Good Events / Total Events) × 100%
Example: If you handled 10,000 requests and 9,950 succeeded:
SLI = (9,950 / 10,000) × 100% = 99.5%
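In code, the formula is a one-liner. A minimal sketch using the numbers above (the zero-traffic convention is an assumption, not a standard):

```python
def sli(good_events: int, total_events: int) -> float:
    """SLI as the percentage of good events over total events."""
    if total_events == 0:
        return 100.0  # no traffic: treat as meeting the SLI (a convention, pick your own)
    return good_events / total_events * 100

print(sli(9_950, 10_000))  # 99.5
```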
Choosing Your SLIs
Start with the basics. For most web services:
- Availability: Can users reach you?
  - "% of requests that return a 2xx or 3xx status code"
- Latency: Are you fast enough?
  - "% of requests that complete in under 200ms"
- Quality: Does it work correctly?
  - "% of API calls that return correct results"
You don't need dozens of SLIs. Start with 3-5 that matter most to your users.
SLO: Service Level Objective
An SLO is your target for an SLI.
If your SLI measures success rate, your SLO says: "We want 99.9% success rate."
Setting SLOs
Here's the secret: Your SLO should be based on what users actually need, not what sounds impressive.
Ask:
- What reliability do users actually need?
- What's the business impact of missing that target?
- What can we realistically achieve and maintain?
Example SLOs:
For a critical payment system:
- Availability: 99.95% of payment requests succeed
- Latency: 99% of payment requests complete in under 1 second
For a social media feed:
- Availability: 99.5% of feed loads succeed
- Latency: 95% of feeds load in under 2 seconds
Notice the payment system has tighter targets? That's intentional. Different services need different reliability levels.
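One lightweight way to keep per-service targets explicit is a shared registry your tooling can read. A sketch of the two examples above (the service names and field structure here are illustrative assumptions, not a standard format):

```python
# Hypothetical SLO registry; field names are an illustrative choice.
SLOS = {
    "payments": {
        "availability": 0.9995,  # 99.95% of payment requests succeed
        "latency": {"threshold_ms": 1000, "target": 0.99},
    },
    "social-feed": {
        "availability": 0.995,   # 99.5% of feed loads succeed
        "latency": {"threshold_ms": 2000, "target": 0.95},
    },
}
```

Keeping targets in one place makes the intentional differences between services visible instead of buried in alert rules.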
The Trap of Too-High SLOs
Here's a mistake I see constantly: setting SLOs at the highest level you've ever achieved.
"We were 99.95% available last quarter, so let's make that our SLO!"
Bad idea. Here's why:
- You had good luck: No major incidents, no big deployments, no unexpected traffic spikes
- You can't maintain it: Achieving something once doesn't mean you can sustain it
- You'll block all changes: Any risky work (like major refactors) threatens your SLO
- You'll burn out your team: Unrealistic targets create constant stress
Better approach: Set your SLO slightly below your historical performance. This gives you room to take risks, deploy changes, and handle incidents without constantly being out of SLO.
Error Budget: The Game Changer
Error budgets are where SRE gets really interesting.
If your SLO is 99.9% success rate, that means:
- 99.9% of requests should succeed
- 0.1% can fail
That 0.1% is your error budget. It's how much unreliability you can tolerate.
Why Error Budgets Matter
They transform the reliability conversation from emotional to mathematical.
Without error budgets:
- "We can't deploy today, it's too risky!"
- "We need to slow down and focus on reliability!"
- "We can't try this new feature, it might break things!"
With error budgets:
- "We've used 30% of our error budget this month. We can take reasonable risks."
- "We've used 95% of our error budget. Feature freeze until we improve reliability."
- "We have error budget to spare. Let's try that risky refactor."
Calculating Error Budgets
If your SLO is 99.9% over 30 days:
Total requests: 10,000,000
Allowed failures: 10,000,000 × 0.1% = 10,000 failed requests
That's your error budget.
If you have an incident that causes 5,000 failed requests, you've consumed 50% of your error budget.
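That accounting translates directly into code. A minimal sketch, assuming count-based budgets as in the example above:

```python
def error_budget(total_requests: int, slo: float) -> int:
    """Allowed failures for a count-based SLO over a window."""
    return round(total_requests * (1 - slo))

def budget_consumed(failures: int, budget: int) -> float:
    """Percentage of the error budget spent so far."""
    return failures / budget * 100

budget = error_budget(10_000_000, 0.999)
print(budget)                          # 10000 allowed failures
print(budget_consumed(5_000, budget))  # 50.0 (% consumed)
```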
Using Error Budgets
This is where it gets powerful.
Scenario 1: You have error budget remaining
You can:
- Deploy more aggressively
- Try risky refactors
- Experiment with new features
- Push changes on Friday (controversial, but the math supports it)
Scenario 2: You've exhausted your error budget
You must:
- Freeze feature launches
- Focus on reliability work
- Fix the issues causing failures
- Improve testing and deployment processes
This removes the political argument. The data decides.
Error Budget Policies
Define ahead of time what happens at different budget levels:
Example policy:
- 90-100% budget remaining: Full speed ahead. Ship features, take risks.
- 50-90% budget remaining: Normal operations. Balance features and reliability.
- 10-50% budget remaining: Caution. Increase review of risky changes.
- 0-10% budget remaining: Feature freeze. All hands on reliability.
- Budget exhausted: Hard stop on features until reliability improves.
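Those tiers translate naturally into a lookup, which is handy for wiring into dashboards or CI gates. A sketch (the tier labels come from the policy above; the function itself is an illustrative assumption):

```python
def policy(budget_remaining_pct: float) -> str:
    """Map remaining error budget (%) to the example policy tiers."""
    if budget_remaining_pct <= 0:
        return "Hard stop on features until reliability improves"
    if budget_remaining_pct <= 10:
        return "Feature freeze: all hands on reliability"
    if budget_remaining_pct <= 50:
        return "Caution: increase review of risky changes"
    if budget_remaining_pct <= 90:
        return "Normal operations: balance features and reliability"
    return "Full speed ahead: ship features, take risks"

print(policy(95))  # Full speed ahead: ship features, take risks
print(policy(5))   # Feature freeze: all hands on reliability
```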
This gives teams clear guidance without emotional debates.
Putting It All Together
Let's walk through a real example.
Service: E-commerce checkout
SLI: Percentage of checkout requests that complete successfully in under 3 seconds
SLO: 99.9% over a 30-day rolling window
Measurement period: Last 30 days
- Total checkout attempts: 5,000,000
- Successful completions under 3s: 4,995,000
- Current SLI: 99.9%
Error Budget:
- Allowed failures: 5,000,000 × 0.1% = 5,000
- Actual failures: 5,000
- Budget consumed: 100%
Decision: Feature freeze. Focus on reliability.
Action items:
- Identify why 5,000 checkouts failed or were slow
- Fix the root causes
- Improve monitoring to detect issues faster
- Add automated testing to prevent regressions
- Don't lift feature freeze until current SLI improves
Two Weeks Later
Measurement period: Last 30 days (new window)
- Total checkout attempts: 5,200,000
- Successful completions under 3s: 5,197,400
- Current SLI: 99.95%
Error Budget:
- Allowed failures: 5,200,000 × 0.1% = 5,200
- Actual failures: 2,600
- Budget consumed: 50%
Decision: Resume feature work. We're back in good standing.
See how the math removes emotion? No arguments about "feeling" like things are stable. The data shows we're meeting our SLO.
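Plugging both measurement windows into the same arithmetic confirms the decisions. A minimal sketch:

```python
def budget_consumed_pct(total: int, good: int, slo: float) -> float:
    """Percentage of the error budget consumed in a window."""
    failures = total - good
    budget = total * (1 - slo)
    return failures / budget * 100

# Window 1: budget fully consumed, so feature freeze
print(round(budget_consumed_pct(5_000_000, 4_995_000, 0.999)))  # 100

# Window 2, two weeks later: back to 50%, so resume feature work
print(round(budget_consumed_pct(5_200_000, 5_197_400, 0.999)))  # 50
```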
Common Mistakes
Mistake 1: Too Many SLIs
You don't need to measure everything. Pick 3-5 metrics that truly matter to users.
Too many SLIs = diluted focus and alert fatigue.
Mistake 2: Internal SLIs
"Our database query time is under 50ms"—that's not an SLI. Users don't care about your database.
Focus on the user experience, not internal implementation details.
Mistake 3: SLOs Without Error Budgets
Setting SLOs is good. Using error budgets to make decisions is where the real power comes from.
Don't stop halfway.
Mistake 4: Not Enforcing Error Budget Policy
If you hit 0% error budget and keep shipping features anyway, you've defeated the purpose.
The policy only works if you actually follow it.
Mistake 5: Setting SLOs You Can't Measure
If you can't accurately measure your SLI, you can't track your SLO or error budget.
Invest in instrumentation first.
Starting Your SLI/SLO Journey
Here's how to begin:
Week 1: Identify Your SLIs
Answer: "What do users care about most?"
Pick 3-5 metrics. Start simple.
Week 2: Start Measuring
Get the data flowing. You need historical data to set realistic SLOs.
Week 3: Analyze Historical Performance
What's your actual reliability over the past 3 months? Don't cherry-pick your best period.
Week 4: Set Initial SLOs
Set them slightly below your historical average. Give yourself room to operate.
Week 5: Calculate and Track Error Budgets
Build dashboards that show error budget consumption in real-time.
Week 6: Define Your Error Budget Policy
What happens at 90%, 50%, 10%, and 0% budget? Document it.
Month 2 and Beyond: Iterate
Adjust SLOs based on what you learn. Too tight? Loosen them. Too loose? Tighten them.
This is a journey, not a destination.
The Bigger Picture
SLIs, SLOs, and error budgets are more than metrics. They're a framework for:
- Honest conversations about reliability vs velocity
- Data-driven decisions instead of politics
- Empowering teams to take appropriate risks
- Balancing user happiness with engineering velocity
They replace "Should we deploy this?" arguments with "Do we have error budget?" math.
They transform reliability from a vague goal into a measurable practice.
Conclusion
SLIs measure what users care about. SLOs set targets for those measurements. Error budgets tell you how much room you have to take risks.
Together, they create a framework for building reliable systems without sacrificing velocity.
Start small. Measure what matters. Set realistic targets. Use error budgets to make decisions.
And remember: 100% reliability is the wrong goal. The right goal is the level of reliability your users need, achieved at a sustainable pace your team can maintain.
That's the real meaning of SLIs, SLOs, and error budgets.
How does your team handle the reliability vs velocity tradeoff? Are you using SLOs and error budgets? Let me know your experience or questions.