How Small Cloud Changes Create Big Outages
A single line configuration change took down our entire production system. Here's what happened and the lessons learned about change management in cloud environments.
"It's just a small change. What could go wrong?"
Famous last words in production. I've said them. You've probably said them. And we've both learned the hard way that in cloud systems, there's no such thing as a "small" change.
Let me tell you about the time a one-line configuration change took down our entire production system for 45 minutes during peak traffic.
The Change That Broke Everything
We were running a Kubernetes cluster on AWS EKS. The cluster was stable, handling thousands of requests per second without issues. We wanted to optimize costs by enabling cluster autoscaling more aggressively.
The change? One line in the cluster autoscaler configuration:
```yaml
# Before
scale-down-delay-after-add: 10m

# After
scale-down-delay-after-add: 1m
```
We decreased the delay from 10 minutes to 1 minute. The idea was to scale down faster when traffic dropped, saving money on unused nodes.
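For context, here's roughly where that knob lives. The snippet above is in values-file style; on a raw manifest, the cluster autoscaler runs as a Deployment and scale-down behavior is set via command-line flags on its container. This is an illustrative sketch, not our actual manifest (the image tag and surrounding details are assumptions):

```yaml
# Sketch of a cluster-autoscaler Deployment; only the flags matter here.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
spec:
  template:
    spec:
      containers:
        - name: cluster-autoscaler
          image: registry.k8s.io/autoscaling/cluster-autoscaler:v1.28.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --scale-down-delay-after-add=10m  # the value we dropped to 1m
            - --scale-down-unneeded-time=10m    # how long a node must be idle first
```

Note that `scale-down-delay-after-add` interacts with other scale-down flags like `scale-down-unneeded-time`, which is part of why changing one in isolation is riskier than it looks.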
We tested it in staging. It worked fine. We rolled it out to production.
Within 15 minutes, everything was on fire.
What Actually Happened
Here's the cascade of failures that one-line change triggered:
Stage 1: The Scale-Down
Traffic dropped after a busy period (normal daily pattern). The autoscaler, now much more aggressive, immediately started terminating nodes.
No problem so far. This is exactly what we wanted.
Stage 2: The Pod Rescheduling
When nodes terminate, Kubernetes reschedules the pods to other nodes. Our pods had anti-affinity rules to spread across availability zones. This meant each terminated pod had to find a node in a specific zone.
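One common shape for such a spreading rule looks like this (an illustrative sketch; the label names are assumptions, and our actual rules were stricter than this soft preference):

```yaml
# Pod template fragment: prefer spreading replicas across availability zones.
# A pod evicted from a draining node now competes for capacity in whichever
# zone keeps the spread balanced.
spec:
  affinity:
    podAntiAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 100
          podAffinityTerm:
            labelSelector:
              matchLabels:
                app: api
            topologyKey: topology.kubernetes.io/zone
```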
During the reschedule, our database connection pools momentarily spiked. Still not a problem—connection pools are designed for this.
Stage 3: The Hidden Bottleneck
Here's where things got interesting. We had a Redis cluster used for session storage. Each pod maintained a connection to Redis.
When dozens of pods rescheduled simultaneously, they all tried to establish new Redis connections at the same time. Redis hit its maxclients limit.
New connections were refused. Pods couldn't start. They failed their readiness checks. Kubernetes kept trying to reschedule them, creating even more connection attempts.
Stage 4: The Cascade
Now we had pods failing to start, which increased load on the remaining healthy pods. Those pods started timing out on Redis operations. More pods failed health checks. The autoscaler saw resource pressure and tried to scale up. New pods came online and immediately tried to connect to Redis.
The entire system was in a death spiral.
Why This Happened
Looking back, the failure had several contributing factors:
1. Hidden Resource Limits
We knew about Redis's maxclients limit. We thought we'd sized it appropriately for our peak load. What we didn't account for was the thundering herd problem during rapid rescheduling.
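To make the surge math concrete, here's a hypothetical back-of-envelope calculation (every number below is invented for illustration, not our real fleet size). The key insight is that during a reschedule wave, old connections can linger while replacement pods reconnect, so attempts briefly exceed the steady state:

```python
# Back-of-envelope surge sizing (all numbers hypothetical).
def peak_connections(pods: int, pool_size: int, reschedule_fraction: float) -> int:
    """Worst-case concurrent Redis connections during a reschedule wave.

    Rescheduled pods' old sockets may not have timed out yet when their
    replacements open fresh ones, so those connections count twice.
    """
    steady = pods * pool_size
    surge = int(pods * reschedule_fraction) * pool_size
    return steady + surge

# What we sized maxclients for: steady state only.
print(peak_connections(200, 10, 0.0))  # 2000

# What actually hits Redis if half the fleet reschedules at once.
print(peak_connections(200, 10, 0.5))  # 3000
```

If you size maxclients for the first number, the second number is an outage.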
Lesson: Don't just size for steady-state. Size for worst-case surge scenarios.
2. Aggressive Timing
The 1-minute scale-down delay was too aggressive. It didn't give the system enough time to stabilize before potentially scaling down again. Pods were terminating while their replacements were still initializing.
Lesson: Slower is often safer in distributed systems. Don't optimize for speed at the cost of stability.
3. Lack of Circuit Breakers
When Redis started refusing connections, our pods just kept retrying. No backoff. No circuit breaker. They hammered Redis until Kubernetes gave up and restarted them.
Lesson: Always implement circuit breakers and exponential backoff for external dependencies.
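Here's a minimal sketch of what that looks like in application code. This is illustrative Python, not our actual client wrapper, and the thresholds and timings are made up; the point is the shape: stop calling a failing dependency for a cooldown period, and jitter your retries so a fleet of pods doesn't reconnect in lockstep.

```python
import random
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is rejected fast."""


class CircuitBreaker:
    """Minimal circuit breaker: after `threshold` consecutive failures,
    reject calls for `cooldown` seconds instead of hammering the dependency."""

    def __init__(self, threshold=5, cooldown=30.0, clock=time.monotonic):
        self.threshold = threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("circuit open; not calling dependency")
            self.opened_at = None  # half-open: let one probe call through
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.failures >= self.threshold:
                self.opened_at = self.clock()
            raise
        self.failures = 0  # success closes the circuit
        return result


def backoff_delays(attempts, base=0.1, cap=10.0):
    """Exponential backoff with full jitter: spreads reconnects out in time
    so a rescheduled fleet doesn't thundering-herd the dependency."""
    return [random.uniform(0, min(cap, base * 2 ** n)) for n in range(attempts)]
```

Had our pods wrapped their Redis connection attempts in something like this, the refused connections would have turned into fast, spaced-out retries instead of a retry storm.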
4. Testing Didn't Match Production
Our staging environment had far fewer pods than production. We never hit the Redis connection limit in testing because we never generated enough concurrent connection attempts.
Lesson: Load testing needs to simulate not just steady-state load, but also worst-case operational scenarios like rapid scaling events.
The Fix
We fixed it in layers:
Immediate (During the Incident)
- Reverted the configuration change: Back to the 10-minute delay
- Manually scaled up Redis: Increased the maxclients limit
- Restarted the most-affected pods: In controlled batches to avoid another thundering herd
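For reference, the knob we raised lives in redis.conf (the value below is illustrative, not our actual setting):

```
# redis.conf — raise the client connection ceiling.
# The effective limit is also bounded by the process file-descriptor limit,
# so the OS-level ulimit -n must be raised to match.
maxclients 20000
```

On a running instance the same change can be applied live with CONFIG SET maxclients, which is what you want mid-incident when a restart would drop every session.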
Short-term (That Week)
- Added connection pooling middleware: Limited concurrent connections per pod
- Implemented circuit breakers: Using Istio service mesh
- Set pod disruption budgets: Limited how many pods could be rescheduled simultaneously
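A pod disruption budget is a few lines of YAML; here's a sketch of the kind we added (names and the 90% figure are illustrative). Voluntary disruptions, including the autoscaler draining a node, will not proceed if they would push availability below the budget:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb
spec:
  minAvailable: 90%   # node drains pause rather than evict below this
  selector:
    matchLabels:
      app: api
```

This alone would have turned our mass reschedule into a slow, controlled trickle.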
Long-term (Following Month)
- Rewrote our autoscaling strategy: Used custom metrics instead of just CPU/memory
- Implemented connection rate limiting: On both application and infrastructure levels
- Enhanced monitoring: Added metrics for connection pool exhaustion, Redis connection counts, and pod scheduling latency
- Created automated load tests: That simulate scaling events, not just steady traffic
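The custom-metrics piece looks roughly like this (a sketch with invented names and targets; it assumes a metrics adapter such as prometheus-adapter is serving the custom metrics API). Scaling on request rate, with a scale-down stabilization window, reacts to what the application actually experiences instead of raw CPU:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: api-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: api
  minReplicas: 6
  maxReplicas: 60
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300  # damp rapid scale-down swings
```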
Lessons Learned
This incident taught us several critical lessons about cloud changes:
1. Small Changes Aren't Small
In distributed systems, changing one value can have ripple effects across the entire architecture. Every change is potentially big.
2. Test Operations, Not Just Features
We tested that autoscaling worked. We didn't test what happens when autoscaling and pod rescheduling happen aggressively during a connection surge.
3. Understand Your Limits
We knew Redis had a connection limit. We didn't understand how that limit interacted with our deployment strategy.
4. Layers of Defense
No single fix would have prevented this. We needed circuit breakers, rate limiting, pod disruption budgets, and better monitoring. Defense in depth matters.
5. Rollback is a Feature
The fastest way to fix a production issue caused by a change is to revert the change. Design for easy rollbacks.
Preventing Similar Issues
Here's what I recommend for avoiding these kinds of failures:
Before Making Changes
- Map dependencies: Understand what your change touches, including indirect dependencies
- Define success criteria: What metrics indicate this change is working correctly?
- Plan rollback: Know exactly how to undo the change before you make it
- Test operationally: Simulate the operational impact, not just the functional behavior
During Rollout
- Roll out gradually: Use canary deployments or blue-green deployments
- Monitor actively: Don't just deploy and walk away. Watch the metrics.
- Set time-based rollback: If you haven't verified success within X minutes, auto-rollback
- Have a rollback champion: One person's only job is watching for issues and calling rollback if needed
After Deployment
- Watch for delayed effects: Some issues don't appear immediately
- Document what changed: Future you will thank present you
- Update runbooks: If you learned something new, add it to the runbooks
- Share learnings: Post-mortems aren't just for big outages
Conclusion
Cloud systems are complex. Small changes can have big impacts. But you can minimize risk by:
- Understanding your system's limits and dependencies
- Testing operational scenarios, not just happy paths
- Implementing defense in depth (circuit breakers, rate limiting, disruption budgets)
- Rolling out changes gradually with clear rollback plans
- Learning from every incident, no matter how small
That "small" one-line change taught us more about our system than months of smooth operation. It made us better engineers and our system more resilient.
The next time you're about to deploy a "small" change, remember: in production, every change is big. Treat it that way.
Have you had a "small change" cause a big outage? I'd love to hear your story. Contact me or share in the comments.