Utilizing “Jitter” to Avoid Operational Bitter

I came across the Amazon Builders’ Library article “Timeouts, retries, and backoff with jitter” and thought the jitter section deserved a closer look, so this post explains the whys and hows of using jitter in distributed systems, with some code examples.

Basil A.
5 min read · Nov 11, 2023

Let’s say we have a thousand client applications that all call a back-end application at the start of every minute. In the first second of each minute a thousand requests arrive at once, while for the remaining 59 seconds there is no traffic at all. A spike like this can easily overload the back end, so a significant portion of the requests may come back as “503 Service Unavailable”. So how can we address this problem?

One obvious solution would be to scale up the back-end compute resources to absorb a spike of a thousand requests in the same second. However, this increases cost unnecessarily when a simpler option needs no additional capacity: have every client sleep for a random duration between 0 and 60 seconds before sending its request. This spreads the load across the whole minute instead of concentrating it into a one-second spike that knocks the back-end servers down. The technique of adding a random client-side delay is called “Jitter”, and it turns the spike of a thousand requests in one second into an average of about 17 requests per second (1000 requests divided by 60 seconds).
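To make the arithmetic concrete, here is a small simulation sketch of the scenario above. The function name `requests_per_second` and the fixed seed are my own choices for illustration, and the exact numbers will vary with the seed; the point is only that the busiest second ends up far below 1000.

```python
import random
from collections import Counter

def requests_per_second(num_clients, window_seconds, rng):
    """Count how many clients land in each one-second bucket of the window."""
    buckets = Counter()
    for _ in range(num_clients):
        # Each client sleeps a random amount of time within the window
        delay = rng.uniform(0, window_seconds)
        # Clamp so a delay of exactly window_seconds falls in the last bucket
        buckets[min(int(delay), window_seconds - 1)] += 1
    return buckets

rng = random.Random(42)  # fixed seed only so the sketch is repeatable
buckets = requests_per_second(1000, 60, rng)

print("Without jitter: 1000 requests in second 0")
print(f"With jitter: busiest second has {max(buckets.values())} requests")
print(f"Average per second: {1000 / 60:.1f}")
```

Note that random delays do not produce a perfectly flat line: some seconds will see more than the average and some fewer, but nothing close to the original spike.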

What are the use-cases for Jitter?

Jitter is most useful for periodic jobs that run at fixed times, for example clients that all run a job every day at noon and at midnight. Another use case is connection retry logic. Retries usually wait for some constant back-off time before trying again, and adding a jitter delay to that back-off keeps clients from all retrying at the same moment.
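As a rough sketch of the retry case, the helper below waits for a constant back-off plus a random jitter between attempts. The name `call_with_retries`, its parameters, and the injectable `sleep` argument are assumptions made for this illustration, not part of any particular library.

```python
import random
import time

def call_with_retries(operation, max_attempts=3, base_delay=2.0,
                      max_jitter=1.0, sleep=time.sleep):
    """Call operation(), retrying on failure with constant back-off plus jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except Exception:
            if attempt == max_attempts:
                raise  # out of attempts, surface the last error
            # Constant back-off plus a random extra so clients do not retry in lockstep
            delay = base_delay + random.uniform(0, max_jitter)
            sleep(delay)

# Demo: an operation that fails twice and then succeeds. The delays are
# recorded instead of actually slept so the sketch runs instantly.
attempts = {"count": 0}

def flaky_operation():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("503 Service Unavailable")
    return "ok"

recorded_delays = []
result = call_with_retries(flaky_operation, sleep=recorded_delays.append)
print(result, recorded_delays)
```

In real code you would leave `sleep` at its default so the client actually waits, and you would likely catch only the exceptions that are safe to retry.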

In essence, any synchronized swarm of client requests hitting a back end is undesirable. To use resources efficiently, the synchronized requests should be desynchronized, and this can be achieved by adding a random jitter delay to each client request.

Show Me The Code

Here’s a simple Python snippet showing how a method can be written to generate a jitter number of seconds:

import random

# Use this method to generate a random number of seconds to be used
# in the sleep method call before running the request
def calculate_jitter(max_jitter):
    jitter = random.uniform(0, max_jitter)
    return jitter

Now let’s try out the method:

# Call the method several times to generate random jitter
# values between 0 and 60 seconds
max_jitter = 60

for i in range(0, 5):
    jitter = calculate_jitter(max_jitter)
    print(f"Jitter value of {jitter} seconds")

Output
======
Jitter value of 22.506912382348325 seconds
Jitter value of 44.338073323819 seconds
Jitter value of 43.17004817723227 seconds
Jitter value of 52.178563740974404 seconds
Jitter value of 8.81804446628742 seconds

Why random.uniform(..)?
random.uniform(..) is a good fit for jitter calculations because it draws from the uniform distribution, where every delay in the range is equally likely. With enough clients, the delays therefore spread roughly evenly across the 60 seconds in our example. For this kind of load spreading, prefer a uniform distribution over distributions that cluster values around a particular point, since clustering recreates a smaller version of the spike.
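To illustrate the difference, here is a small comparison sketch. It draws the same number of delays from a uniform distribution and from a bell-shaped one, then compares the busiest one-second bucket. The Gaussian parameters (centred on 30 seconds with a spread of 10) and the helper name `busiest_second` are arbitrary choices for the demonstration.

```python
import random
from collections import Counter

def busiest_second(delays, window_seconds):
    """Return the request count of the most crowded one-second bucket."""
    counts = Counter(min(int(d), window_seconds - 1) for d in delays)
    return max(counts.values())

rng = random.Random(7)  # fixed seed only so the sketch is repeatable
window = 60
n = 60_000

# Every delay in the window is equally likely
uniform_delays = [rng.uniform(0, window) for _ in range(n)]
# Bell-shaped delays centred on 30 seconds, clamped into the window
gauss_delays = [min(max(rng.gauss(30, 10), 0), window) for _ in range(n)]

uniform_peak = busiest_second(uniform_delays, window)
gauss_peak = busiest_second(gauss_delays, window)

print(f"Busiest second with uniform delays: {uniform_peak}")
print(f"Busiest second with bell-shaped delays: {gauss_peak}")
```

With these settings the bell-shaped delays pile up around the 30-second mark, so their busiest second is noticeably more crowded than with uniform delays.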

“Random Jitter” versus “Consistent Jitter”

Although the example above generates Random Jitter values, Marc Brooker’s article recommends that, for scheduled work, we avoid picking a new random jitter on every run and instead use Consistent Jitter. This simply means that instead of returning a ‘different’ random value every time calculate_jitter is called, we return the ‘same’ value within the scope of a specific client host.

How can “Consistent Jitter” be achieved code-wise?

This can be achieved by calling `random.seed()` with a value derived from a unique client identifier, like the hostname or IP address, right before generating the jitter. Seeding with the same value each time forces the random generator to return the same jitter value on every call.

Code example:

import socket
import random
import struct
import hashlib

# Calculate Jitter using the ip_address as the seed
def calculate_jitter(ip_address, max_jitter):
    seed = get_ip_seed(ip_address)
    # Note: this re-seeds Python's shared module-level generator
    random.seed(seed)
    # Generate a jitter value between 0 and max_jitter
    jitter = random.uniform(0, max_jitter)
    return jitter

# The method just returns an integer hash from the ip_address passed
# to be used as a random seed
def get_ip_seed(ip_address):
    # Convert IP address to a consistent byte format
    packed_ip = socket.inet_pton(socket.AF_INET if '.' in ip_address else socket.AF_INET6, ip_address)

    # Hash the packed IP address
    hashed_ip = hashlib.sha256(packed_ip).digest()

    # Take the first 8 bytes of the hash and unpack them to an integer
    # This produces a 64-bit integer from the hash
    seed = struct.unpack("!Q", hashed_ip[:8])[0]
    return seed

Now let’s run the code from two client machines, the first client is 192.168.1.1:

# Calling method 3 times to demonstrate how the same random value is always
# returned within the client
max_jitter = 60

print("Calling several times for 192.168.1.1 returns same value which is desired")
for i in range(0, 3):
    jitter = calculate_jitter("192.168.1.1", max_jitter)
    print(f"Jitter value of {jitter} seconds")


Output:
=======

Calling several times for 192.168.1.1 returns same value which is desired
Jitter value of 56.93017370138931 seconds
Jitter value of 56.93017370138931 seconds
Jitter value of 56.93017370138931 seconds

The second client host is 192.168.2.2, and it returns the same value of about 34.46 seconds no matter how many times calculate_jitter is called.

print("Calling several times for 192.168.2.2 returns same value which is desired")
for i in range(0, 3):
    jitter = calculate_jitter("192.168.2.2", max_jitter)
    print(f"Jitter value of {jitter} seconds")

Output:
=======

Calling several times for 192.168.2.2 returns same value which is desired
Jitter value of 34.45937924251295 seconds
Jitter value of 34.45937924251295 seconds
Jitter value of 34.45937924251295 seconds

Notice above that for the same client IP address, the same value is returned no matter how many times we call the method. This is the desired behavior for “Consistent Jitter”.
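One caveat with seeding is that `random.seed()` resets Python’s shared module-level generator, which may affect other code in the same process that relies on `random`. A possible variation, sketched below, is to use a private `random.Random` instance seeded with the client identifier. The name `consistent_jitter` is hypothetical, and using the hostname as the identifier is just one option.

```python
import random
import socket

def consistent_jitter(client_id, max_jitter):
    """Return the same jitter for the same client_id without touching global state."""
    # A private generator seeded with the client identifier (a string here)
    rng = random.Random(client_id)
    return rng.uniform(0, max_jitter)

hostname = socket.gethostname()
first = consistent_jitter(hostname, 60)
second = consistent_jitter(hostname, 60)
print(f"Jitter for {hostname}: {first} seconds (repeat call equal: {first == second})")
```

Because the generator is local to the function, other uses of the `random` module elsewhere in the application are left alone.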

But then, why use “Consistent Jitter” when “Random Jitter” is working well?

Consistent Jitter is preferred for operational reasons: it makes troubleshooting easier, since each client host applies the same delay every time while the delays still vary across hosts. So when a host is down or behaving abnormally, it is easier to spot in the metrics and logs.

To clarify further, let’s say we have three hosts A, B and C running a job every minute. With “Consistent Jitter”, A will delay by (let’s say) 17 seconds, B by 26 seconds and C by 35 seconds, and each host will use the same delay every minute. So when C is down we can easily see it in the metrics, since the data point at XX:35 stops showing up every minute.

In contrast, “Random Jitter” won’t have the same effect, since host C would use a different jitter value every minute and there would be no fixed slot to watch.
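The A, B and C example can be sketched as a tiny check. It assumes we know each host’s fixed offset and can see which seconds of a given minute received a request; the `missing_hosts` helper and the shape of the observed data are made up for this illustration.

```python
# Consistent jitter offsets from the example above (seconds past the minute)
offsets = {"A": 17, "B": 26, "C": 35}

def missing_hosts(offsets, observed_seconds):
    """Return hosts whose fixed offset did not show up in this minute's metrics."""
    return sorted(host for host, second in offsets.items()
                  if second not in observed_seconds)

# Minute 1: every host reports in at its usual second
print(missing_hosts(offsets, {17, 26, 35}))  # []

# Minute 2: nothing arrives at XX:35, which points straight at host C
print(missing_hosts(offsets, {17, 26}))      # ['C']
```

With Random Jitter there is no stable mapping from a second to a host, so a check like this would not be possible.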

What’s the Takeaway

Adding jitter to large numbers of synchronized client requests is an efficient way to desynchronize them for better resource utilization, and a cost-efficient one too, since it spares us from scaling up back-end infrastructure unnecessarily. The code examples above showed how jitter is applied and the two types of jitter (Random and Consistent). I won’t dwell on which type to use: one can simply start with Random Jitter immediately and move to Consistent Jitter later if required.

References

Timeouts, retries, and backoff with jitter — Marc Brooker
https://aws.amazon.com/builders-library/timeouts-retries-and-backoff-with-jitter/
