# Timeouts, Retries, and Backoff With Jitter

![rw-book-cover](https://a0.awsstatic.com/libra-css/images/logos/aws_logo_smile_1200x630.png)

## Metadata
- Author: [[Amazon Web Services, Inc.]]
- Full Title: Timeouts, Retries, and Backoff With Jitter
- Category: #articles
- Summary: Failures can occur when one service calls another, leading to delays and potential resource exhaustion. To manage this, clients use timeouts, retries, and backoff strategies to avoid overwhelming the system. By incorporating randomness (jitter) in retries, systems can reduce congestion and improve overall reliability.
- URL: https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/

## Highlights
- *Retries* allow clients to survive these random partial failures and short-lived transient failures by sending the same request again. ([View Highlight](https://read.readwise.io/read/01jk1wqevx0ac7b205n2qe1y89))
- It's not always safe to retry. A retry can increase the load on the system being called, if the system is already failing because it’s approaching an overload. To avoid this problem, we implement our clients to use *backoff*. This increases the time between subsequent retries, which keeps the load on the backend even. ([View Highlight](https://read.readwise.io/read/01jk1wsfahm7zswmj8yargdfm2))
- To avoid this problem, we employ *jitter*. This is a random amount of time added before making or retrying a request to help prevent large bursts by spreading out the arrival rate. ([View Highlight](https://read.readwise.io/read/01jk1wv8q1edj9ayrzbs7r203q))
- Typically, the most difficult problem is choosing a timeout value to set. Setting a timeout too high reduces its usefulness, because resources are still consumed while the client waits for the timeout. Setting the timeout too low has two risks:
    - Increased traffic on the backend and increased latency, because too many requests are retried.
    - A small increase in backend latency leading to a complete outage, because all requests start being retried. ([View Highlight](https://read.readwise.io/read/01jk1xbxsrk6q94x042pej2x8y))
- Retries are “selfish.” In other words, when a client retries, it spends more of the server's time to get a higher chance of success. Where failures are rare or transient, that's not a problem. This is because the overall number of retried requests is small, and the tradeoff of increasing apparent availability works well. When failures are caused by overload, retries that increase load can make matters significantly worse. They can even delay recovery by keeping the load high long after the original issue is resolved. ([View Highlight](https://read.readwise.io/read/01jk1xf8cyge2zc2q9fwq5g1by))
- Retries are similar to a powerful medicine: useful in the right dose, but able to cause significant damage when used too much. Unfortunately, in distributed systems there's almost no way to coordinate between all of the clients to achieve the right number of retries. ([View Highlight](https://read.readwise.io/read/01jk1xfm42jt3k7t2fmswgb3cm))
- The preferred solution that we use at Amazon is a *backoff*. Instead of retrying immediately and aggressively, the client waits some amount of time between tries. The most common pattern is an *exponential backoff*, where the wait time is increased exponentially after every attempt. Exponential backoff can lead to very long backoff times, because exponential functions grow quickly. ([View Highlight](https://read.readwise.io/read/01jk1xghbeqbrvkaa8yqwqah09))
- If all the failed calls back off to the same time, they cause contention or overload again when they are retried. Our solution is jitter. Jitter adds some amount of randomness to the backoff to spread the retries around in time. ([View Highlight](https://read.readwise.io/read/01jk1xp7am82k4z7aehfj8gnfj))
- In distributed systems, transient failures or latency in remote interactions are inevitable. Timeouts keep systems from hanging unreasonably long, retries can mask those failures, and backoff and jitter can improve utilization and reduce congestion on systems. ([View Highlight](https://read.readwise.io/read/01jk1xqqf4bk1890byfsw35eyq))
- Retries can amplify the load on a dependent system. If calls to a system are timing out, and that system is overloaded, retries can make the overload worse instead of better. We avoid this amplification by retrying only when we observe that the dependency is healthy. We stop retrying when the retries are not helping to improve availability. ([View Highlight](https://read.readwise.io/read/01jk1wrhc8evt1tev4wsq0yw2n))
- At Amazon, we design our systems to tolerate and reduce the probability of failure, and avoid magnifying a small percentage of failures into a complete outage. To build resilient systems, we employ three essential tools: timeouts, retries, and backoff. ([View Highlight](https://read.readwise.io/read/01jk1wjnnafwrpxehgma424q4v))
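
The highlights describe timeouts, retries, exponential backoff, and jitter only in prose, so here is a minimal Python sketch of how the pieces might fit together. It is an illustration under stated assumptions, not the article's implementation: `TransientError`, `backoff_delay`, `call_with_retries`, and all default values are hypothetical names and numbers chosen for this example. The cap on the wait is an assumption added to keep the exponential growth bounded, and the jitter shown picks a uniform random wait between zero and the current ceiling, which is one common choice among several.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for failures that are safe to retry."""


def backoff_delay(attempt, base=0.1, cap=5.0, rng=random):
    """Return a jittered wait in seconds before retry number `attempt` (0-based)."""
    # Exponential growth, capped so waits cannot grow without bound (assumption).
    ceiling = min(cap, base * (2 ** attempt))
    # Jitter: pick uniformly in [0, ceiling] so clients do not retry in lockstep.
    return rng.uniform(0, ceiling)


def call_with_retries(operation, max_attempts=4, timeout=1.0,
                      base=0.1, cap=5.0, sleep=time.sleep, rng=random):
    """Call operation(timeout) until it succeeds or attempts run out.

    `operation` is assumed to enforce the per-attempt timeout itself and to
    raise TimeoutError or TransientError on a retryable failure.
    """
    for attempt in range(max_attempts):
        try:
            return operation(timeout)
        except (TransientError, TimeoutError):
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(backoff_delay(attempt, base, cap, rng))


if __name__ == "__main__":
    calls = {"n": 0}

    def flaky(timeout):
        # Fails twice, then succeeds, to exercise the retry path.
        calls["n"] += 1
        if calls["n"] < 3:
            raise TransientError("try again")
        return "ok"

    print(call_with_retries(flaky))
```

Bounding `max_attempts` limits how much extra load a single client can add, but the sketch does not check whether the dependency is healthy before retrying. The highlights name that check as the way to avoid amplifying an overload, so a real client would need something beyond this loop.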