When dealing with networked or external data sources, I've learned the hard way that all code should be designed to expect failures. It is significantly cheaper and easier to bake graceful error handling in from the start than to try to retrofit it later. A common first step in combating failures and providing resiliency is to have your logic retry an operation in case of failure.
But how should you retry?
Simple retries
The simplest retry mode is to surround your code with a loop that executes your block a predefined number of times, along these lines:
int retry = 0;
do
{
    // Attempt the operation; stop retrying as soon as it succeeds
    if (MyFlakyOperation())
        break;
}
while (++retry < 6);
The problem with this approach is that it completely ignores the most likely underlying reason for the failure: congestion or resource load on the remote end could be causing your calls (and many others) to fail intermittently because the server cannot handle the incoming requests. In this case your naive implementation might actually be making the situation even worse.
So how do we solve this?
Spacing out retries
One common approach to spacing out retries is called exponential backoff. This algorithm uses feedback (e.g. the retry count) to systematically increase the wait time between repeated executions of the same code in order to avoid congestion.
The idea is that with every additional retry, it becomes more likely that the system we're communicating with is heavily congested and needs more "breathing" room to work through its current backlog of requests.
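Put differently, the wait time is just a function of the retry count. As a minimal sketch (the name BackoffSeconds and its parameters are my own, purely for illustration):
// Illustrative only: seconds to wait before retry number `retryCount`
// (1 for the first retry, 2 for the second, and so on) for a given base.
int BackoffSeconds(int base, int retryCount)
{
    return static_cast<int>(std::pow(base, retryCount)); // requires <cmath>
}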
Example of backoff retries
Below is an example of a very simple C++ snippet that performs this kind of exponential backoff with a base of 4 seconds:
// Requires <cmath>, <chrono> and <thread>
int success = OK;
int retry = 0;
do
{
    // Attempt the operation
    success = MyFlakyOperation();
    // If it failed and there are retries left, back off before the next attempt
    if (success != OK && retry + 1 < 6)
    {
        int sec = static_cast<int>(std::pow(4, retry + 1));
        std::this_thread::sleep_for(std::chrono::seconds(sec));
    }
}
while (++retry < 6 && success != OK);
In this example, if every attempt fails, my algorithm has a maximum total waiting time of a whopping 22 minutes and 44 seconds! (4+16+64+256+1024 = 1364 sec)
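If you want to sanity-check that figure, or experiment with other bases, a small helper along these lines will do. TotalBackoffSeconds is my own illustrative name and not part of the snippet above:
#include <iostream>

// Illustrative sketch: total maximum waiting time in seconds when every
// attempt fails, i.e. base^1 + base^2 + ... + base^waits.
long long TotalBackoffSeconds(long long base, int waits)
{
    long long total = 0;
    long long wait = 1;
    for (int n = 1; n <= waits; ++n)
    {
        wait *= base;   // wait is now base^n
        total += wait;
    }
    return total;
}

int main()
{
    // 6 attempts with a base of 4 means at most 5 waits in between
    std::cout << TotalBackoffSeconds(4, 5) << " sec\n"; // prints 1364 sec
}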
How much does the waiting time increase?
Care must be taken when choosing the base with a naive approach like my example above. Below is a table listing the waiting time in seconds for each retry, for bases from 2 to 7 seconds. Remember that your maximum running time is the sum of the waiting times across all retries!
Retry # | 2 sec base | 3 sec base | 4 sec base | 5 sec base | 6 sec base | 7 sec base |
---|---|---|---|---|---|---|
1 | 2 | 3 | 4 | 5 | 6 | 7 |
2 | 4 | 9 | 16 | 25 | 36 | 49 |
3 | 8 | 27 | 64 | 125 | 216 | 343 |
4 | 16 | 81 | 256 | 625 | 1,296 | 2,401 |
5 | 32 | 243 | 1,024 | 3,125 | 7,776 | 16,807 |
6 | 64 | 729 | 4,096 | 15,625 | 46,656 | 117,649 |
7 | 128 | 2,187 | 16,384 | 78,125 | 279,936 | 823,543 |
8 | 256 | 6,561 | 65,536 | 390,625 | 1,679,616 | 5,764,801 |
9 | 512 | 19,683 | 262,144 | 1,953,125 | 10,077,696 | 40,353,607 |
10 | 1,024 | 59,049 | 1,048,576 | 9,765,625 | 60,466,176 | 282,475,249 |
So using 7 seconds as the base and allowing up to 10 retries, the total maximum waiting time will be just shy of 10.5 years!
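For what it's worth, feeding those numbers into the TotalBackoffSeconds sketch from the previous section backs this up:
std::cout << TotalBackoffSeconds(7, 10) << " sec\n"; // prints 329554456 sec, roughly 10.45 years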
Worth considering...