The Retry Storm: When Your Resilience Code Causes the Outage

A downstream service gets slow. Not down — slow. Latency climbs from 50ms to 800ms for about ninety seconds, the kind of blip that happens a few times a week and nobody notices.

Except this time the whole system goes down for forty minutes.

The postmortem finds no bug. Every service did exactly what it was configured to do. The retries retried, the timeouts timed out, the health checks checked health. The resilience machinery worked as designed — and the design was the problem.

This is a retry storm, and it's one of the most common ways that the code meant to keep you up is the code that takes you down.

How a blip becomes an outage

Start with the slow service. A dependency it calls gets briefly slow, so its own responses get slow.

Its callers have retries configured — sensible, every resilience guide recommends them. A request takes too long, the timeout fires, the caller retries. Now the slow service is receiving its original traffic plus a wave of retries. Its load just went up while it was already struggling.

More load means more slowness. More slowness means more timeouts. More timeouts mean more retries. The service is now receiving two or three times its normal traffic, all because it got slightly slow and its callers "helpfully" responded by sending more requests.

Each retry holds a connection and a thread or coroutine on the caller's side while it waits. So the callers start exhausting their own connection pools and thread budgets. Now they're slow. Now their callers start retrying. The failure climbs up the dependency graph, one layer at a time, each layer amplifying traffic for the layer below.

Meanwhile the load balancer's health checks are timing out against the slow instances, so it marks them unhealthy and removes them from rotation — concentrating all the traffic onto the few instances still passing checks, which immediately fall over too.

Within a couple of minutes the original ninety-second blip is a full-system outage being actively sustained by every piece of resilience tooling you installed. The dependency that started it recovered long ago. The storm doesn't need it anymore. It feeds on itself.

Why naive retries are the core mistake

Retries make sense for one specific failure: a request that failed for a transient, independent reason. A packet dropped. One instance hiccuped. Retry, and you'll probably hit a healthy path.

The retry logic implicitly assumes the failure is independent of load. That assumption is exactly false during the failure mode that matters. When a service is slow because it's overloaded, a retry doesn't route around the failure — it is more of the failure. You're responding to "this service has too much traffic" by sending it more traffic.

Three configuration mistakes turn this from a risk into a guarantee.

Fixed-interval retries. If everyone retries after exactly one second, the retries arrive in synchronized waves. The service gets hit with a thundering herd at t+1s, t+2s, t+3s. Without jitter, retries cluster instead of spreading.

Retries stacked at every layer. Service A retries 3 times calling B, B retries 3 times calling C. A single user request can become nine requests to C. Retry budgets multiply down the call stack. Three layers of "just 3 retries" is a 27x amplification factor.

No upper bound on concurrent retries. Each service treats its retry budget as a per-request decision. Nothing tracks the aggregate: across all in-flight requests, how much of my outbound traffic is retries right now? Without that number, there's no way to notice the storm forming.

The mechanisms that actually contain it

Resilience isn't "retry on failure." It's "behave correctly when the dependency is unhealthy" — and when a dependency is overloaded, the correct behavior is to send it less traffic, not more.

Exponential backoff with jitter. Don't retry at a fixed interval. Back off exponentially — 1s, 2s, 4s — and add randomness so retries spread across a window instead of arriving in a synchronized wave. This is the single highest-value change, and it's a few lines of code.

Circuit breakers. Track the failure rate to each dependency. When it crosses a threshold, stop calling that dependency entirely for a cooldown period — fail fast locally instead. The breaker gives the struggling service room to recover instead of pinning it under retry load. It also stops you from burning your own threads waiting on calls that are going to fail anyway.

Retry budgets, not retry counts. Cap retries as a fraction of total traffic — "retries may not exceed 10% of outbound requests" — rather than a per-request count. A per-request count has no idea what the rest of the system is doing. A budget does: when many requests are failing at once, the budget is exhausted and the system stops amplifying. Per-request retries fail open under load; budgets fail closed.

Deadline propagation. Pass a deadline through the call chain. If the user-facing request has already spent its 3-second budget, the service four layers deep should not start a fresh set of retries against work whose result will be discarded. Retrying work that nobody is waiting for is pure amplification.

Load shedding. A service that's overloaded should reject excess requests immediately with a clear "try later" signal, not queue them and serve them all slowly. A fast rejection lets the caller's circuit breaker engage. A slow success keeps every caller's thread parked and the storm fed.

The mindset shift

The instinct behind a retry storm is generous: when something fails, try harder. Don't give up on the user. That instinct is right for an independent failure and exactly wrong for an overload failure — and overload is the failure mode that turns blips into outages.

So the question to ask of any resilience mechanism is not "does this help a single request succeed?" It's "what does this do to aggregate load when many requests are failing at the same time?" A mechanism that increases load under widespread failure is not resilience. It's a positive feedback loop wearing a resilience costume.

Test for it directly. In a game day, make a dependency slow — not down, slow — and watch what your own traffic does. If your outbound request rate to the struggling service goes up, you've found a storm waiting to happen. Better to find it on a Tuesday afternoon than at 2am.

Good resilience code makes a struggling system's life easier. Look honestly at yours and make sure it isn't doing the opposite.

How a blip becomes an outage

Why naive retries are the core mistake

The mechanisms that actually contain it

The mindset shift

Related posts

Subscribe to new posts