Failure resiliency and retry delay planning

When dealing with networked or external data sources I've learned the hard way that all code should be designed as to expect failures. It is significantly cheaper and easier to bake a graceful handling of errors this from the start rather than attempt to do it later on. A common first step in combating failures and providing resiliency is to have your logic retry an operation in case of a failure.

But how should you retry?

Simple retries

The most simplistic retry mode is to simply surround all your code with a while loop that executes your block a predefined number of times, ala:

int retry = 0;
  // Operation

  if( true == MyFlakyOperation() )
while ( ++retry < 6 )

The problem with this approach is that it completely ignores the most likely underlying reason for the failure. Congestion or resource load on the remote end could be causing your calls (and many others) to intermittently fail as the server cannot handle the incoming requests. In this case your naive implementation might actually be contributing to making the situation even worse.

So how do we solve this?

Spacing out retries

One common approach to spacing out retries is called exponential backoff. This algorithm uses a predefined feedback (e.g. retry count) to systematically increase wait times between repeated executions of the same code to avoid congestion.

Example of exponential spacing based on 4sec base wait time.<br>The vertical bars indicate retry points.

Example of exponential spacing based on 4sec base wait time.
The vertical bars indicate retry points.

The idea is that with every additional retry that is required it is more likely that the system we're communicating with is heavily congested and needs more "breathing" space to resolve its current backlog of requests.

Example of backoff retries

Below is an example of a very simple C++ algorithm snippet that performs this kind of exponential backoff based on 4sec intervals:

int success = OK;
int retry = 0;
  // Operation

  success = MyFlakyOperation();

   // Sleep if operation was not success

   if (success != OK)
       int sec = static_cast<int>(std::pow(4, retry));
while ( ++retry < 6 && success != OK)

In this example my algorithm has a maximum running time with full retry count of a whooping 22 min and 44 seconds! (4+16+64+256+1024 = 1364sec).

How much does the waiting time increase?

Care must be taken when choosing the interval to increment by when using a naive approach as my example above. Below is a table listing the waiting times in seconds for each retry for 2-7 second intervals.

Remember that your maximum running time is the cumulative waiting numbers for all intervals!

Retry# 2sec 3sec 4sec 5sec 6sec 7sec
1 2 3 4 5 6 7
2 4 9 16 25 36 49
3 8 27 64 125 216 343
4 16 81 256 625 1,296 2,401
5 32 243 1,024 3,125 7,776 16,807
6 64 729 4,096 15,625 46,656 117,649
7 128 2,187 16,384 78,125 279,936 823,543
8 256 6,561 65,536 390,625 1,679,616 5,764,801
9 512 19,683 262,144 1,953,125 10,077,696 40,353,607
10 1,024 59,049 1,048,576 9,765,625 60,466,176 282,475,249

So using 7 sec as a base and allowing up to 10 retries, the total maximum waiting time will be just shy of 10,5 years!

Worth considering...

Software Developer
For hire

Developer & Programmer with +15 years professional experience building software.

Seeking WFH, remoting or freelance opportunities.