Transient Faults: Determining an appropriate retry count and interval

Develop your skills for autoscaling on Azure with this course from Cloud Academy. Learn how to improve your team's development skills and understand how they relate to scalable solutions. This course also covers how to analyze and deal with transient faults.

This course is made up of 19 lectures that guide you through the process from beginning to end.

Learning Objectives

  • Learn how to develop applications for autoscale
  • Prepare for the Azure AZ-303 certification
  • Design and implement code that addresses singleton application instances


Intended Audience

This course is recommended for:

  • IT Professionals preparing for Azure certification
  • IT Professionals who need to develop applications that can autoscale


There are no prerequisites for this training course, although an understanding of Microsoft Azure will prove helpful.



When determining an appropriate retry count and interval, it's critical that you optimize both so that they match the use case. If you do not retry a sufficient number of times, the application may be unable to complete its operation and, as a result, is likely to experience a failure. Conversely, if you retry too many times, or with intervals that are too short between retries, the application may tie up resources such as threads, memory, and connections for too long, adversely affecting application performance. The values that you set for the time interval and the number of retry attempts are going to depend on the type of operation that you are attempting. Operations that are part of a user interaction should be handled with short intervals and only a few retries, so that users are not left waiting for a response. On the flip side, an operation that is part of a long-running critical workflow might require a longer wait time between attempts.
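As a rough illustration of matching the retry count and interval to the use case, here is a minimal Python sketch. The names `RetryPolicy`, `TransientError`, `INTERACTIVE`, `BACKGROUND`, and `run_with_retry` are hypothetical, and the numeric values are placeholder assumptions rather than recommendations from the course.

```python
import time
from dataclasses import dataclass
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical marker for faults that are worth retrying."""


@dataclass(frozen=True)
class RetryPolicy:
    max_retries: int         # retries allowed after the initial attempt
    interval_seconds: float  # fixed wait between attempts


# Illustrative values only (assumptions); tune them for the real workload.
INTERACTIVE = RetryPolicy(max_retries=2, interval_seconds=0.5)
BACKGROUND = RetryPolicy(max_retries=8, interval_seconds=30.0)


def run_with_retry(
    operation: Callable[[], T],
    policy: RetryPolicy,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call operation, retrying transient faults according to policy."""
    for attempt in range(policy.max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == policy.max_retries:
                raise  # out of retries: surface the failure to the caller
            sleep(policy.interval_seconds)
    raise AssertionError("unreachable")
```

A user-facing call would pass `INTERACTIVE` so that a failure surfaces quickly, while a long-running workflow step would pass `BACKGROUND` and tolerate longer waits.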

For a long-running workflow like this, you might also want to retry more times than you would as part of a user interaction. Determining the appropriate intervals between retries is quite often the most difficult part of designing a successful strategy. Common strategies use the following types of retry intervals: exponential back-off, incremental intervals, regular intervals, immediate retry, and randomization. With exponential back-off, an application waits a short time before attempting the first retry. It then waits exponentially longer between each subsequent retry. For example, it might first retry an operation after 5 seconds, then after 10 seconds, then 20 seconds, then 40 seconds, and so on. When leveraging incremental intervals, an application waits a short time before the first retry. It then incrementally increases the amount of time between each subsequent retry. An application leveraging regular intervals waits for the same amount of time between each retry attempt. For example, it may retry an operation every two seconds. Immediate retry is self-explanatory. This type of retry interval is helpful in cases where a transient fault is extremely short.

In cases such as this, retrying an operation immediately is often the appropriate action because it may well succeed if the fault has already cleared in the time it takes the application to send the next retry request. Keep in mind, however, that you should never use more than one immediate retry attempt. Instead, you should switch to a different strategy if the first immediate retry fails. Any one of the retry strategies that we just discussed can include randomization in order to prevent multiple instances of the client from sending subsequent retry attempts at the same time. Generally speaking, you should use one of the exponential back-off strategies when dealing with background operations. The immediate or regular interval retry strategies should be saved for interactive operations. In all cases, however, be sure to choose a delay and retry count so that the worst-case total latency across all retry attempts remains within the required end-to-end latency. It's also important to consider the combination of all factors that contribute to the maximum overall timeout for a retried operation. Such factors include the time required for a failed connection to produce an initial response, the delay between retry attempts, and the maximum number of retries. When you add all of these up, they can in fact result in very long overall operation times.
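The interval strategies described above can be sketched as small Python functions that map a retry number (1 for the first retry) to a delay in seconds. All function names here are hypothetical. The 5-second base and the doubling factor for exponential back-off, the 2-second base and step for incremental intervals, and the 20% jitter spread are assumptions chosen for illustration.

```python
import random
from typing import Callable, List, Optional

Strategy = Callable[[int], float]  # retry number (1-based) -> delay in seconds


def exponential_delay(retry: int, base: float = 5.0) -> float:
    """Delay doubles on each retry: base, 2*base, 4*base, ..."""
    return base * (2 ** (retry - 1))


def incremental_delay(retry: int, base: float = 2.0, step: float = 2.0) -> float:
    """Delay grows by a fixed step: base, base+step, base+2*step, ..."""
    return base + step * (retry - 1)


def regular_delay(retry: int, interval: float = 2.0) -> float:
    """Same delay before every retry."""
    return interval


def immediate_then(strategy: Strategy) -> Strategy:
    """Retry once with no delay, then switch to another strategy."""
    def delay(retry: int) -> float:
        if retry == 1:
            return 0.0  # the single immediate retry
        return strategy(retry - 1)
    return delay


def with_jitter(strategy: Strategy, spread: float = 0.2,
                rng: Optional[random.Random] = None) -> Strategy:
    """Scale each delay by a random factor in [1 - spread, 1 + spread]
    so that multiple client instances do not retry at the same moment."""
    generator = rng or random.Random()
    def delay(retry: int) -> float:
        return strategy(retry) * generator.uniform(1 - spread, 1 + spread)
    return delay


def schedule(strategy: Strategy, retries: int) -> List[float]:
    """List the delays before retries 1..retries."""
    return [strategy(n) for n in range(1, retries + 1)]
```

Under these assumptions, `schedule(exponential_delay, 4)` returns `[5.0, 10.0, 20.0, 40.0]`, `schedule(incremental_delay, 4)` returns `[2.0, 4.0, 6.0, 8.0]`, and `schedule(immediate_then(exponential_delay), 3)` returns `[0.0, 5.0, 10.0]`. Wrapping any of them in `with_jitter` adds the randomization.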

This is especially true if you are leveraging an exponential delay strategy. Processes that need to meet specific service level agreements must be handled in a way that ensures that the overall operation time, including timeouts and delays, remains within the defined SLA levels. While it's important to remain aggressive with retry strategies, you can also become too aggressive. Intervals that are too short, combined with too many retries, can adversely impact the target resource or service and, as a result, can prevent it from recovering from its overloaded state. As such, it may continue to refuse requests. Be sure to consider the timeout of operations when setting retry intervals. Doing so allows you to avoid launching subsequent attempts immediately, as would happen when the timeout is similar to or identical to the retry interval value. It's important to use the type of exception and any data that it provides, along with error codes and messages returned from the service, to optimize the interval and the number of retries that you configure.
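To make the arithmetic concrete, the following Python sketch adds up the worst-case time for a retried operation and compares it with an SLA. The function names, the 10-second per-attempt timeout, and the 120-second SLA are hypothetical values chosen for illustration, and the calculation assumes that every attempt runs for its full timeout before failing.

```python
from typing import Sequence


def worst_case_seconds(attempt_timeout: float, delays: Sequence[float]) -> float:
    """Upper bound on total time when every attempt times out.

    There is one initial attempt plus one attempt per delay, and each
    attempt may take up to attempt_timeout before it fails.
    """
    attempts = len(delays) + 1
    return attempts * attempt_timeout + sum(delays)


def fits_sla(attempt_timeout: float, delays: Sequence[float],
             sla_seconds: float) -> bool:
    """True if the worst case stays within the SLA."""
    return worst_case_seconds(attempt_timeout, delays) <= sla_seconds


def max_retries_within(attempt_timeout: float, delays: Sequence[float],
                       sla_seconds: float) -> int:
    """Largest number of retries (a prefix of delays) whose worst case
    stays within the SLA. Returns 0 if not even one retry fits."""
    allowed = 0
    for n in range(1, len(delays) + 1):
        if worst_case_seconds(attempt_timeout, delays[:n]) > sla_seconds:
            break
        allowed = n
    return allowed


# Hypothetical numbers: 10 s timeout per attempt, delays of 5, 10, 20, 40 s.
delays = [5, 10, 20, 40]
total = worst_case_seconds(10, delays)        # 5 attempts * 10 s + 75 s of delay
retries = max_retries_within(10, delays, 120)
```

With these numbers, `total` is 125 seconds, which exceeds a 120-second SLA, and `retries` is 3, so the fourth retry would have to be dropped or the delays shortened.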

About the Author

Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.

In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.

In his spare time, Tom enjoys camping, fishing, and playing poker.