Cloud Architecture · 2025-01-03

How Small Cloud Changes Create Big Outages

A single line configuration change took down our entire production system. Here's what happened and the lessons learned about change management in cloud environments.

"It's just a small change. What could go wrong?"

Famous last words in production. I've said them. You've probably said them. And we've both learned the hard way that in cloud systems, there's no such thing as a "small" change.

Let me tell you about the time a one-line configuration change took down our entire production system for 45 minutes during peak traffic.

The Change That Broke Everything

We were running a Kubernetes cluster on AWS EKS. The cluster was stable, handling thousands of requests per second without issues. We wanted to optimize costs by enabling cluster autoscaling more aggressively.

The change? One line in the cluster autoscaler configuration:

# Before
scale-down-delay-after-add: 10m

# After
scale-down-delay-after-add: 1m

We decreased the delay from 10 minutes to 1 minute. The idea was to scale down faster when traffic dropped, saving money on unused nodes.

We tested it in staging. It worked fine. We rolled it out to production.

Within 15 minutes, everything was on fire.

What Actually Happened

Here's the cascade of failures that one-line change triggered:

Stage 1: The Scale-Down

Traffic dropped after a busy period (normal daily pattern). The autoscaler, now much more aggressive, immediately started terminating nodes.

No problem so far. This is exactly what we wanted.

Stage 2: The Pod Rescheduling

When nodes terminate, Kubernetes reschedules the pods to other nodes. Our pods had anti-affinity rules to spread across availability zones. This meant each terminated pod had to find a node in a specific zone.
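A zone-spreading anti-affinity rule like ours looks roughly like this in the pod spec (the `app: web` label is illustrative, not our actual config):

```yaml
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      - labelSelector:
          matchLabels:
            app: web                        # hypothetical app label
        # Don't co-locate two replicas in the same availability zone.
        topologyKey: topology.kubernetes.io/zone
```

The `required` variant is strict: if no node in the right zone has room, the pod stays Pending, which is why rescheduling under these rules is slower and burstier than a plain rollout.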

During the reschedule, our database connection pools momentarily spiked. Still not a problem—connection pools are designed for this.

Stage 3: The Hidden Bottleneck

Here's where things got interesting. We had a Redis cluster used for session storage. Each pod maintained a connection to Redis.

When dozens of pods rescheduled simultaneously, they all tried to establish new Redis connections at the same time. Redis hit its maxclients limit.

New connections were refused. Pods couldn't start. They failed their readiness checks. Kubernetes kept trying to reschedule them, creating even more connection attempts.
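For reference, the cap that bit us is a single directive in redis.conf; the value shown is Redis's documented default:

```
maxclients 10000
```

Once that many clients are connected, Redis refuses new connections with a max-clients error rather than queueing them.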

Stage 4: The Cascade

Now we had pods failing to start, which increased load on the remaining healthy pods. Those pods started timing out on Redis operations. More pods failed health checks. The autoscaler saw resource pressure and tried to scale up. New pods came online and immediately tried to connect to Redis.

The entire system was in a death spiral.

Why This Happened

Looking back, the failure had several contributing factors:

1. Hidden Resource Limits

We knew about Redis's maxclients limit. We thought we'd sized it appropriately for our peak load. What we didn't account for was the thundering herd problem during rapid rescheduling.

Lesson: Don't just size for steady-state. Size for worst-case surge scenarios.
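This lesson is easy to quantify. A back-of-the-envelope sketch (all numbers hypothetical, chosen only to show the shape of the problem) illustrates how a mass reschedule turns a comfortable steady state into a surge past the limit:

```python
# Hypothetical surge sizing for a Redis maxclients of 10000.
pods_rescheduled = 60    # pods evicted at roughly the same time
conns_per_pod = 50       # Redis connection pool size per pod
retry_factor = 3         # extra attempts from pods failing readiness and retrying

steady_state = pods_rescheduled * conns_per_pod                      # 3000
surge = steady_state + pods_rescheduled * conns_per_pod * retry_factor

print(steady_state, surge)  # 3000 12000 -- comfortably under the limit,
                            # then 20% past it during the herd
```

Steady-state sizing said we had 3x headroom; the retry amplification erased it.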

2. Aggressive Timing

The 1-minute scale-down delay was too aggressive. It didn't give the system enough time to stabilize before potentially scaling down again. Pods were terminating while their replacements were still initializing.

Lesson: Slower is often safer in distributed systems. Don't optimize for speed at the cost of stability.

3. Lack of Circuit Breakers

When Redis started refusing connections, our pods just kept retrying. No backoff. No circuit breaker. They hammered Redis until Kubernetes gave up and restarted them.

Lesson: Always implement circuit breakers and exponential backoff for external dependencies.
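The backoff half of that lesson fits in a few lines. Here's a minimal sketch (not our production code) of capped exponential backoff with full jitter, wrapping any zero-argument connect callable, such as a wrapper around a Redis client's connect call:

```python
import random
import time

def with_backoff(connect, max_attempts=6, base=0.5, cap=30.0):
    """Retry `connect` with capped exponential backoff and full jitter.

    `connect` is any zero-argument callable that raises ConnectionError
    on failure and returns a connection on success.
    """
    for attempt in range(max_attempts):
        try:
            return connect()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts; surface the failure to the caller
            # Full jitter: sleep a random amount up to the capped delay,
            # so a herd of pods spreads out instead of retrying in lockstep.
            delay = min(cap, base * (2 ** attempt))
            time.sleep(random.uniform(0, delay))
```

The jitter is the important part: without it, every rescheduled pod retries on the same schedule and the herd just arrives in waves. A circuit breaker adds the complementary behavior of stopping retries entirely once the dependency is clearly down.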

4. Testing Didn't Match Production

Our staging environment had far fewer pods than production. We never hit the Redis connection limit in testing because we never generated enough concurrent connection attempts.

Lesson: Load testing needs to simulate not just steady-state load, but also worst-case operational scenarios like rapid scaling events.

The Fix

We fixed it in layers:

Immediate (During the Incident)

  1. Reverted the configuration change: Back to the 10-minute delay
  2. Manually scaled up Redis: Increased the maxclients limit
  3. Restarted the most-affected pods: In controlled batches to avoid another thundering herd

Short-term (That Week)

  1. Added connection pooling middleware: Limited concurrent connections per pod
  2. Implemented circuit breakers: Using Istio service mesh
  3. Set pod disruption budgets: Limited how many pods could be rescheduled simultaneously
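The pod disruption budget from step 3 is a small Kubernetes resource. A sketch along these lines caps how many pods voluntary evictions (like autoscaler scale-downs) can remove at once; the name and label are illustrative:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb                # hypothetical name
spec:
  minAvailable: "80%"          # keep at least 80% of replicas up during evictions
  selector:
    matchLabels:
      app: web                 # hypothetical app label
```

With this in place, the autoscaler can still drain nodes, but it has to do so gradually enough that the reconnect surge stays bounded.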

Long-term (Following Month)

  1. Rewrote our autoscaling strategy: Used custom metrics instead of just CPU/memory
  2. Implemented connection rate limiting: On both application and infrastructure levels
  3. Enhanced monitoring: Added metrics for connection pool exhaustion, Redis connection counts, and pod scheduling latency
  4. Created automated load tests: That simulate scaling events, not just steady traffic
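For item 1, a custom-metric autoscaler can be sketched as an `autoscaling/v2` HorizontalPodAutoscaler. The metric name and targets below are hypothetical; the point is scaling on a signal that reflects the real bottleneck (connection saturation) rather than CPU:

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: web-hpa                # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: web                  # hypothetical deployment
  minReplicas: 10
  maxReplicas: 100
  metrics:
    - type: Pods
      pods:
        metric:
          name: redis_connection_saturation   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "700m"                # scale before pools saturate
```

This assumes a metrics adapter exposing the custom metric; the benefit is that scaling decisions and the scarce resource are finally measuring the same thing.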

Lessons Learned

This incident taught us several critical lessons about cloud changes:

1. Small Changes Aren't Small

In distributed systems, changing one value can have ripple effects across the entire architecture. Every change is potentially big.

2. Test Operations, Not Just Features

We tested that autoscaling worked. We didn't test what happens when autoscaling and pod rescheduling happen aggressively during a connection surge.

3. Understand Your Limits

We knew Redis had a connection limit. We didn't understand how that limit interacted with our deployment strategy.

4. Layers of Defense

No single fix would have prevented this. We needed circuit breakers, rate limiting, pod disruption budgets, and better monitoring. Defense in depth matters.

5. Rollback is a Feature

The fastest way to fix a production issue caused by a change is to revert the change. Design for easy rollbacks.

Preventing Similar Issues

Here's what I recommend for avoiding these kinds of failures:

Before Making Changes

  • Map dependencies: Understand what your change touches, including indirect dependencies
  • Define success criteria: What metrics indicate this change is working correctly?
  • Plan rollback: Know exactly how to undo the change before you make it
  • Test operationally: Simulate the operational impact, not just the functional behavior

During Rollout

  • Roll out gradually: Use canary deployments or blue-green deployments
  • Monitor actively: Don't just deploy and walk away. Watch the metrics.
  • Set time-based rollback: If you haven't verified success within X minutes, auto-rollback
  • Have a rollback champion: One person's only job is watching for issues and calling rollback if needed

After Deployment

  • Watch for delayed effects: Some issues don't appear immediately
  • Document what changed: Future you will thank present you
  • Update runbooks: If you learned something new, add it to the runbooks
  • Share learnings: Post-mortems aren't just for big outages

Conclusion

Cloud systems are complex. Small changes can have big impacts. But you can minimize risk by:

  • Understanding your system's limits and dependencies
  • Testing operational scenarios, not just happy paths
  • Implementing defense in depth (circuit breakers, rate limiting, disruption budgets)
  • Rolling out changes gradually with clear rollback plans
  • Learning from every incident, no matter how small

That "small" one-line change taught us more about our system than months of smooth operation. It made us better engineers and our system more resilient.

The next time you're about to deploy a "small" change, remember: in production, every change is big. Treat it that way.


Have you had a "small change" cause a big outage? I'd love to hear your story. Contact me or share in the comments.