Tools: Retry Strategies Compared: Constant vs Exponential Backoff vs Jitter in Go (With Simulation)

Tools: Retry Strategies Compared: Constant vs Exponential Backoff vs Jitter in Go (With Simulation)

Source: Dev.to

The Problem ## The Four Strategies ## 1. Constant Retry ## 2. Exponential Backoff ## 3. Full Jitter ## 4. Decorrelated Jitter ## The Simulation Setup ## Results ## Constant Retry ## Exponential Backoff ## Full Jitter ## Decorrelated Jitter ## When to Use What ## Try It Yourself ## References Your server goes down. There are 1000 clients already waiting. What happens when it comes back? Picture a server handling 200 requests per second. A dependency goes down for about 10 seconds; every in-flight request fails, and all clients start retrying. When your server recovers, all 1000 clients retry at once. This is called the thundering herd problem, and your retry strategy determines whether your server recovers in seconds or drowns under a wave of redundant/unnecessary requests. Most enginners knows they should use exponential backoff, but fewer know it can still causes syncronized spikes. Even fewer have seen what decorrelated jitter actually does to the request distribution curve. I built a simulation of 1000 concurrent Go clients, a server with fixed capacity, and four retry strategies fighting for recovery, to test each strategy. The simulation tries to show a scenario that production systems face: fixed server capacity, a hard outage window, and clients that keep retrying until they succeed. Every request is tracked per-second, so you can see the thundering herd form, peak, and (hopefully) dissipate. Each strategy is a function with the same signature: given the attempt number and the previous delay, return how long to sleep before the next retry. Fixed delay, no backoff. Formula: delay = 1ms (constant) The worst case retry pattern: a tiny fixed delay with no backoff. It is what happens when engineers write a quick retry loop without thinking about backoff Double the delay on each attempt, starting at 100ms and capped at 10 seconds. Formula: delay = min(base * 2^attempt, cap) Better than constant retry, but there is a catch. All 1000 clients started at the same time, so they all hit attempt 1 at t=100ms, attempt 2 at t=300ms, attempt 3 at t=700ms. Randomize the delay between 0 and the exponential backoff value. Formula: delay = random(0, min(base * 2^attempt, cap)) This is one of the strategies analyzed in the AWS Architecture Blog, where it showed the fewest total calls. Because each client picks a random delay, they no longer retry at the same time. Instead of 1000 clients all hitting the server at once, they arrive spread out. Each delay is random between base and 3 * previous_delay. Formula: delay = random(base, prev * 3) (capped) This one comes from the same AWS blog post. Instead of using the attempt number, each delay depends on the previous delay. Why 3? It is just a practical choice from AWS. The multiplier controls how fast delays grow. With 3, the average delay grows about 1.5x per retry (the midpoint of random(base, prev*3)). That's enough to spread clients out without making them wait too long. You could use 2 or 4 and it would still work, just with different tradeoffs. The server counts how many requests it gets each second. If it's still in the outage window or already at capacity, it says no. Each client is a goroutine that keeps retrying until it gets a success: Parameters for all runs: We count every request per second and draw it as a bar chart in the terminal. You can literally see the thundering herd. Each client retries every 1ms. With 1000 clients, the server gets hundreds of thousands of requests per second, but it can only handle 200. Over 8 million wasted requests to serve 1000 clients. Also, the flood doesn't stop when the server comes back it keeps going for 4 more seconds. It only stops because every time a client gets served, it stops retrying. Fewer clients means fewer retries, and by second 14 all 1000 are done. Look at the histogram: spikes at around seconds 0, 1, 3, 6, 12, 22, 32, 42, 52 with nothing in between. All 1000 clients double their delay at the same rate, so they all retry at the same time. The gaps between spikes grow but the problem stays, you can see that every spike is a mini thundering herd. Wasted requests dropped from 8 million to just 9,000. But the p99 latency is 52 seconds. No spikes. The histogram shows a smooth curve going down from ~5000 at second 0 to 13 at second 19. Each client picks a random delay, so they stop hitting the server not all at once. Same smooth shape as full jitter, no spikes. A bit more noise though, 10695 wasted requests vs 8468, and 137 requests over capacity. Why? If a client randomly gets a short delay, the next one is based on that short delay, so it can keep retrying fast for a bit. Full jitter won every metric here. But we're testing the worst case, all 1000 clients break at once. That doesn't really happen in prod. Clients fail at different times, and decorrelated jitter is better at that. Each client figures out its own rhythm instead of everyone following the same retry schedule. Constant retry: Never in production. Exponential backoff without jitter: Only when you have very few clients, the synchronization problem does not manifest if there is no herd. Full jitter: The default choice. Best results in this simulation and the simplest to implement. Decorrelated jitter: Similar performance, but it works better in scenarios where clients don't all fail at the same time. Each client adapts its delay based on its own history, which handles staggered and rolling failures more naturally. Clone the repo and run each strategy: Each run takes 15-60 seconds, depending on the strategy. The ASCII histograms are generated live from the simulation Templates let you quickly answer FAQs or store snippets for re-use. Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment's permalink. Hide child comments as well For further actions, you may consider blocking this person and/or reporting abuse CODE_BLOCK: type Strategy func(attempt int, prevDelay time.Duration) time.Duration Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: type Strategy func(attempt int, prevDelay time.Duration) time.Duration CODE_BLOCK: type Strategy func(attempt int, prevDelay time.Duration) time.Duration CODE_BLOCK: func constantRetry(_ int, _ time.Duration) time.Duration { return 1 * time.Millisecond } Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: func constantRetry(_ int, _ time.Duration) time.Duration { return 1 * time.Millisecond } CODE_BLOCK: func constantRetry(_ int, _ time.Duration) time.Duration { return 1 * time.Millisecond } COMMAND_BLOCK: func exponentialBackoff(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return d } Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: func exponentialBackoff(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return d } COMMAND_BLOCK: func exponentialBackoff(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return d } COMMAND_BLOCK: func fullJitter(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return time.Duration(rand.Int63n(int64(d))) } Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: func fullJitter(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return time.Duration(rand.Int63n(int64(d))) } COMMAND_BLOCK: func fullJitter(attempt int, _ time.Duration) time.Duration { d := time.Duration(float64(baseSleep) * math.Pow(2, float64(attempt))) if d > capSleep { d = capSleep } return time.Duration(rand.Int63n(int64(d))) } COMMAND_BLOCK: func decorrelatedJitter(_ int, prev time.Duration) time.Duration { if prev < baseSleep { prev = baseSleep } d := time.Duration(rand.Int63n(int64(prev)*3-int64(baseSleep))) + baseSleep if d > capSleep { d = capSleep } return d } Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: func decorrelatedJitter(_ int, prev time.Duration) time.Duration { if prev < baseSleep { prev = baseSleep } d := time.Duration(rand.Int63n(int64(prev)*3-int64(baseSleep))) + baseSleep if d > capSleep { d = capSleep } return d } COMMAND_BLOCK: func decorrelatedJitter(_ int, prev time.Duration) time.Duration { if prev < baseSleep { prev = baseSleep } d := time.Duration(rand.Int63n(int64(prev)*3-int64(baseSleep))) + baseSleep if d > capSleep { d = capSleep } return d } COMMAND_BLOCK: type Server struct { mu sync.Mutex capacity int downFor time.Duration start time.Time requests map[int]int // second -> request count accepted map[int]int // second -> accepted count } Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: type Server struct { mu sync.Mutex capacity int downFor time.Duration start time.Time requests map[int]int // second -> request count accepted map[int]int // second -> accepted count } COMMAND_BLOCK: type Server struct { mu sync.Mutex capacity int downFor time.Duration start time.Time requests map[int]int // second -> request count accepted map[int]int // second -> accepted count } CODE_BLOCK: func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) { start := time.Now() attempt := 0 prevDelay := baseSleep for { if srv.Do() { metrics.Record(time.Since(start), attempt) return } delay := strategy(attempt, prevDelay) prevDelay = delay attempt++ time.Sleep(delay) } } Enter fullscreen mode Exit fullscreen mode CODE_BLOCK: func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) { start := time.Now() attempt := 0 prevDelay := baseSleep for { if srv.Do() { metrics.Record(time.Since(start), attempt) return } delay := strategy(attempt, prevDelay) prevDelay = delay attempt++ time.Sleep(delay) } } CODE_BLOCK: func clientLoop(srv *Server, strategy Strategy, metrics *Metrics) { start := time.Now() attempt := 0 prevDelay := baseSleep for { if srv.Do() { metrics.Record(time.Since(start), attempt) return } delay := strategy(attempt, prevDelay) prevDelay = delay attempt++ time.Sleep(delay) } } COMMAND_BLOCK: git clone https://github.com/RafaelPanisset/retry-strategies-simulator cd retry-strategies-simulator go run main.go -strategy=constant go run main.go -strategy=backoff go run main.go -strategy=jitter go run main.go -strategy=decorrelated Enter fullscreen mode Exit fullscreen mode COMMAND_BLOCK: git clone https://github.com/RafaelPanisset/retry-strategies-simulator cd retry-strategies-simulator go run main.go -strategy=constant go run main.go -strategy=backoff go run main.go -strategy=jitter go run main.go -strategy=decorrelated COMMAND_BLOCK: git clone https://github.com/RafaelPanisset/retry-strategies-simulator cd retry-strategies-simulator go run main.go -strategy=constant go run main.go -strategy=backoff go run main.go -strategy=jitter go run main.go -strategy=decorrelated - 1000 clients, all starting simultaneously - Server capacity: 200 req/s - Outage duration: 10 seconds - Server rejects requests during outage and when at capacity - Marc Brooker, "Exponential Backoff and Jitter" - Marc Brooker, "Timeouts, retries, and backoff with jitter" - Google Cloud, "Retry Strategy"