Best Practices for Retry
A retry is a mechanism that monitors a request and, when it detects a failure, automatically repeats the request. A retry should only be considered when there is a chance of success, for example in response to error codes 500 (Internal Server Error) and 503 (Service Unavailable). Retrying on any and every error is inadvisable, as it wastes resources and can produce unexpected results.
Examples of errors that would not warrant a retry include 400 (Bad Request) and 401 (Unauthorized). These, like nearly all 4xx errors, are client errors, meaning that a retry will not succeed unless the client alters the request. (A few 4xx codes, such as 429 Too Many Requests, are exceptions, since the same request may succeed after a wait.)
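To make the distinction concrete, here is a minimal sketch of a retry loop that only repeats a request for the two status codes named above. The names RETRYABLE_STATUS_CODES, is_retryable, and send_with_retry are made up for illustration, and send stands in for whatever function actually performs the request and returns a status code.

```python
from typing import Callable

# Assumption: only the two codes named in this post are treated as retryable.
# A real configuration might extend this set.
RETRYABLE_STATUS_CODES = {500, 503}


def is_retryable(status_code: int) -> bool:
    """Return True if a response with this status is worth retrying."""
    return status_code in RETRYABLE_STATUS_CODES


def send_with_retry(send: Callable[[], int], max_retries: int = 2) -> int:
    """Call send() until it returns a non-retryable status or retries run out.

    `send` is a hypothetical callable that performs the request and
    returns the HTTP status code of the response.
    """
    status = send()
    retries = 0
    while is_retryable(status) and retries < max_retries:
        status = send()
        retries += 1
    return status
```

A 400 or 401 comes straight back to the caller after a single attempt, while a 503 is retried up to max_retries times.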
In determining the viability of a retry for a request that has resulted in an error, one should consider that the results of requests are not being tracked, so there is no mechanism to prevent a request from being sent to the same broken host again and again. Because our service may be sending multiple requests, and other services may be sending requests as well, it is not guaranteed that the second attempt will be routed to a working host. This can waste resources and degrade the user experience.
There is a helpful equation to aid in determining your max-retry setting: the chance of routing to a working host is equal to the number of working hosts divided by the total number of hosts. For example, suppose there are two hosts, one working and one broken, and the first request attempt returns a 5xx error. The chance of this happening on the first attempt was 50%. If the request is retried, the chance of the retry also being routed to the bad host is again 50%, so the chance that both the original attempt and the first retry fail is 25% (50% x 50%). The chance that every attempt fails is halved by each additional retry.
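The arithmetic can be sketched in a few lines. This assumes each attempt is routed to a host at random and independently of earlier attempts, which a real load balancer may not do; the function name is made up for illustration.

```python
def chance_all_attempts_fail(working_hosts: int, total_hosts: int, attempts: int) -> float:
    """Chance that every one of `attempts` tries lands on a broken host.

    Assumes random, independent routing on each attempt.
    """
    chance_bad_host = 1 - working_hosts / total_hosts
    return chance_bad_host ** attempts


# Two hosts, one of them working: the original attempt plus up to three retries.
for attempts in range(1, 5):
    print(attempts, chance_all_attempts_fail(1, 2, attempts))
```

With one working host out of two, the chance that everything fails is 0.5 after one attempt, 0.25 after two, and 0.125 after three, which gives a feel for how quickly extra retries stop paying off.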
This concept can help make your retry mechanism more efficient and effective. However, as previously mentioned, not all failed requests should be retried. In addition to considering the class and type of error returned, one should ensure that the error information presented makes sense and that the retry mechanism is able to respond dynamically to different errors.
In a recent interview I was introduced to the concept of idempotence. In terms of a RESTful service call, a request is idempotent if a client can make it multiple times and the result is the same as if it had been made once. To get more specific, idempotent HTTP methods are those that (surprise, surprise) can be called many times with the same outcome. For example, POST is not an idempotent method, as calling it multiple times does not produce the same result every time: each call may create a new resource. On the other hand, GET is an idempotent method, as the server is only retrieving a resource (not posting or updating). It has no effect on the resource, so it can be performed over and over with the same result.
Now, why is this relevant to retries? Suppose a client wants to create a resource through a POST request but receives a server timeout, and the request is retried. Because POST is not idempotent, if the timeout occurred while the response was being returned rather than while the request was being sent, the server will process the POST more than once and may create duplicate resources.
For example, a client is purchasing a concert ticket. They input their payment information and click submit, triggering an HTTP POST request. This is a very popular artist (perhaps Beyoncé), so right when the ticket sale goes live, millions of clients are attempting to connect to the server to purchase tickets.
Our client’s request results in a 503 error (Service Unavailable). In this situation the error could be the result of an overloaded server (in which case the POST request wouldn’t go through), or the failure could have occurred on the way back to the client, after the server had already processed the request. A 503 error does not indicate where the failure occurred. If the client’s POST request did go through (the timeout occurred while the response was being sent back to the client rather than while the request was being sent to the server) and a retry occurs, the client’s payment method will be charged twice (and we will have an angry customer). Knowing whether or not the request is idempotent is crucial in determining the configuration of a retry mechanism.
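One way to build this into a retry decision is to check the HTTP method as well as the status code. The sketch below is only an illustration: the function name is made up, the set of idempotent methods follows the HTTP specification rather than anything specific to this post, and the retryable status codes are just the two discussed earlier.

```python
# Methods the HTTP specification defines as idempotent.
IDEMPOTENT_METHODS = {"GET", "HEAD", "PUT", "DELETE", "OPTIONS", "TRACE"}

# Assumption: only the two status codes discussed in this post are retried.
RETRYABLE_STATUS_CODES = {500, 503}


def should_retry(method: str, status_code: int) -> bool:
    """Retry only when the method is idempotent AND the error may be temporary."""
    return (
        method.upper() in IDEMPOTENT_METHODS
        and status_code in RETRYABLE_STATUS_CODES
    )


print(should_retry("GET", 503))   # idempotent method, temporary error
print(should_retry("POST", 503))  # not idempotent, so a retry risks a double charge
```

Under this rule the ticket purchase above would not be retried automatically, which avoids the double charge at the cost of surfacing the error to the client.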
While writing out this example, I wondered how something like this is prevented. During my research I came across the concept of a cryptographic nonce. If a retry in a case such as my example is possible, being able to identify duplicate transactions is essential. Although the concept is much more complex than I will be covering in this blog post, a cryptographic nonce is a number, used only once, that is attached to a request and allows the server to detect duplicate requests.
Coming back to our Beyoncé concert example, let’s say the client’s POST request includes a nonce. The request again results in a 503 error, the timeout comes from the response side, and the ticket was successfully purchased. If the request is retried and directed to another upstream service host, that same nonce is supplied along with the client’s input information. This service host uses the attached nonce, in addition to the other identifying information within the request, to figure out whether both requests originated from the same client request and should be treated as one instead of two separate requests. So, as intended, the client only purchases one ticket.
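Here is a rough sketch of how a server might use a nonce to recognize a retried purchase. Everything in it is hypothetical: TicketService, its in-memory dictionary, and the order IDs are stand-ins, and in a real deployment the record of seen nonces would have to be shared by every host that could receive the retry.

```python
import uuid


class TicketService:
    """Hypothetical in-memory ticket service that deduplicates by nonce."""

    def __init__(self) -> None:
        # Maps (customer, nonce) to the order created for that request.
        self._orders = {}
        self.charges = 0  # how many times a payment was actually taken

    def purchase(self, customer: str, nonce: str) -> str:
        key = (customer, nonce)
        if key in self._orders:
            # Same client request seen before: return the original order
            # instead of charging again.
            return self._orders[key]
        self.charges += 1
        order_id = f"order-{self.charges}"
        self._orders[key] = order_id
        return order_id


service = TicketService()
nonce = str(uuid.uuid4())  # generated once by the client, reused on retry

first = service.purchase("client-1", nonce)
retry = service.purchase("client-1", nonce)  # retry after the lost response
print(first == retry, service.charges)
```

The retry returns the same order as the first attempt and the charge counter stays at one, while a request carrying a new nonce would be treated as a genuinely new purchase.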
Having knowledge of error codes, the HTTP Request-Response Cycle, and an understanding of retry flow will allow you to implement and configure a retry mechanism that enhances the user experience instead of detracting from it.
In my next post I will be taking a deep dive into the HTTP Request-Response Cycle!