Best Practices, Workloads and Use Cases

This course from Kevin McGrath, VP of Architecture at Spotinst, explains how to use spot instances, which draw on excess capacity from cloud providers such as AWS, Microsoft Azure, and Google Cloud Platform, to reduce the cost of cloud computing.

Intended Audience

Everyone working with Cloud compute workloads, from start-ups to large corporations.


Prerequisites

A basic understanding of cloud computing and cloud computing billing models. If you are new to cloud computing, we recommend completing our What is Cloud Computing course first.

Learning Objectives

This course will enable you to:

  • Recognize and explain how to run and manage workloads on excess cloud capacity using Spot Instances. 
  • Recognize and explain the risks and benefits of the spot market.
  • Recognize and implement Spot Instances to reduce cloud compute costs.



- [Instructor] Section Three. Best Practices: Workloads and Use Cases. Topic One: Overview of Spot Instance use cases.
As with most new technologies, development and test environments are a great way to start using spot instances to save money. These environments often cost a significant amount and are usually less critical than production environments, making them a great first step in cultivating a spot strategy. There are some situations to consider. First, some development environments have persistent instances that engineers can log into at any time. In such cases, stateful spot instances configured for longevity are recommended. Enabling mechanisms for in-place backup and recovery keeps engineers moving forward with little to no interruption.
Second, when working with continuous integration, it can be costly to keep on-demand instances waiting for new commits. A pool of spot instances that grows and contracts with demand will reduce the overhead and keep engineers from asking for more money to provision more resources. Jenkins builds are a great use case here, as jobs are typically short and can be delegated to run on a pool of workers. The pool can be scaled as needed by the Jenkins master, and everything can run on spot capacity.
It is also important to know that not all resources need to be on all the time. This is where scheduling comes into play. Resources that can, for example, be shut off outside normal business hours should be scheduled to shut down, either hibernated or terminated, and relaunched when needed again. This will drive down costs further and allow your engineering teams to scale more as new code and applications are added. To summarize, development and test environments are a great first step when starting the process of spot instance integration.
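The business-hours scheduling idea above can be sketched as a small decision function. This is a minimal illustration, not part of any product: the 08:00 to 18:00, Monday to Friday window and the names `should_be_running` and `desired_action` are assumptions chosen for the example.

```python
from datetime import datetime, time

# Hypothetical schedule: weekdays only, 08:00 (inclusive) to 18:00 (exclusive).
WORKDAYS = (0, 1, 2, 3, 4)  # Monday=0 ... Friday=4
START = time(8, 0)
END = time(18, 0)


def should_be_running(now, start=START, end=END, workdays=WORKDAYS):
    """Return True if a dev/test resource should be up at the given local time."""
    return now.weekday() in workdays and start <= now.time() < end


def desired_action(now, currently_running):
    """Decide what a scheduler should do with the resource right now."""
    wanted = should_be_running(now)
    if wanted and not currently_running:
        return "launch"
    if not wanted and currently_running:
        return "shut_down"  # hibernate or terminate, depending on the workload
    return "none"
```

A scheduler would call `desired_action` periodically and launch or shut down the spot instances accordingly.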
Data Processing. Amazon Elastic MapReduce, or EMR, is a service that streamlines big data processing, providing a fast and cost-effective managed Hadoop framework for distributing and processing vast amounts of data across a pool of Amazon EC2 instances. When creating clusters with instance fleets, Amazon EMR can automatically provision spot capacity across a variety of instance types, select optimal Amazon EC2 availability zones, and blend Spot and On-Demand capacity to minimize overall cost.
When running a persistent, long-lived EMR cluster with a consistent workload, lowering cost during peak-demand periods is quite easy. In this case, it is recommended to launch the master and core nodes as On-Demand so they can handle the normal persistent capacity, then launch Spot instances for the task nodes to handle peak-load requirements. For clusters where lowering cost is more important than the time to complete, or where losing part of a workload for a short period of time is acceptable, it is recommended to run the entire cluster as Spot instances to benefit from the best savings possible. When testing a new application to prepare it for launch into a production environment, run the entire cluster of master, core, and task nodes as Spot instances to reduce cost during the testing period.
Let's now focus a bit on running task nodes specifically as Spot instances. Task nodes process data but do not hold persistent data in HDFS. If these nodes terminate due to Spot preemption, no data is lost and the effect on the cluster is minimal. When launching one or more task instance groups as Spot instances, Amazon EMR provisions as many task nodes as possible using the configured Spot price. Launching task instance groups as Spot instances is a strategic way to expand the capacity of your cluster while minimizing costs. Next, we will focus on running container workloads on Spot instances.
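The EMR purchasing guidance above (On-Demand master and core nodes with Spot task nodes for persistent clusters, all Spot for cost-driven or test clusters) can be sketched as a helper that builds an instance group list. The dictionary keys follow the `InstanceGroups` structure accepted by the EMR `RunJobFlow` API; the scenario names, node counts, and the `m5.xlarge` instance type are assumptions made for illustration.

```python
def emr_instance_groups(scenario, core_count=2, task_count=4,
                        instance_type="m5.xlarge"):
    """Build EMR instance groups for one of three hypothetical scenarios.

    "persistent"  -> master/core On-Demand, task nodes Spot
    "cost_driven" -> everything Spot (interruptions are acceptable)
    "testing"     -> everything Spot (pre-production test cluster)
    """
    if scenario == "persistent":
        stable_market = "ON_DEMAND"
    elif scenario in ("cost_driven", "testing"):
        stable_market = "SPOT"
    else:
        raise ValueError(f"unknown scenario: {scenario}")

    def group(role, market, count):
        return {
            "Name": f"{role.lower()}-{market.lower()}",
            "InstanceRole": role,      # MASTER, CORE, or TASK
            "Market": market,          # ON_DEMAND or SPOT
            "InstanceType": instance_type,
            "InstanceCount": count,
        }

    return [
        group("MASTER", stable_market, 1),
        group("CORE", stable_market, core_count),
        # Task nodes hold no HDFS data, so they run on Spot in every scenario.
        group("TASK", "SPOT", task_count),
    ]
```

The returned list could be passed as `Instances={"InstanceGroups": ...}` when creating a cluster through an EMR client.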
Containers are a solution to the problem of how to get software to run reliably when moved from one computing environment to another. This could be from a developer's laptop to a test environment, from a staging environment into production, or perhaps from a physical machine in a data center to a virtual machine in a private or public cloud. The abstraction layer that containers provide means that the instances in a cluster do not need to be of similar instance types. A t2.micro, for instance, is able to run alongside an m4.16xlarge, and containers will be distributed accordingly. This opens up more instance types as potential candidates, making it possible to diversify across more spot markets and use spot instances safely and effectively.
We will now focus on two container orchestrators, Kubernetes and ECS. Although both Kubernetes and ECS will automatically scale, replicate, and restart containers, they do not manage the underlying compute. To ensure pods and tasks are gracefully rescheduled, it is important to proactively manage spot instance termination. For Kubernetes, it is very important to watch for the AWS two-minute termination warning and follow these steps. First, detach the instance from any elastic load balancers it is currently attached to. Second, mark the instance as unschedulable. This will prevent new pods from being scheduled to that node, but will not affect any existing pods on the node. It is important to have capacity ready before existing instances terminate. If too many instances terminate before new compute is available, Kubernetes will not have enough resources to reschedule the pods. When working with Kubernetes, it is important to watch not only the underlying compute but also the resources that the pods being scheduled require. These two need to match so that all applications can run efficiently. The Spotinst Kubernetes Autoscaler is a solution that does just this, matching pod resource requirements to available compute capacity.
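The termination-handling steps described above for Kubernetes can be sketched as follows. AWS publishes the interruption notice in instance metadata at `http://169.254.169.254/latest/meta-data/spot/instance-action` as a small JSON document; the sketch only parses such a body and lists the two steps in order. The function names and the step descriptions are assumptions for illustration, and fetching the metadata and actually running the steps are left out.

```python
import json
from datetime import datetime, timezone


def parse_spot_notice(body):
    """Return the scheduled interruption time (UTC), or None if no notice."""
    if not body:
        return None
    notice = json.loads(body)
    if notice.get("action") not in ("terminate", "stop", "hibernate"):
        return None
    when = datetime.strptime(notice["time"], "%Y-%m-%dT%H:%M:%SZ")
    return when.replace(tzinfo=timezone.utc)


def seconds_remaining(interruption_time, now):
    """Seconds left before the instance is reclaimed."""
    return (interruption_time - now).total_seconds()


def shutdown_steps(node_name, behind_load_balancer):
    """List the steps from the text, in order, for one Kubernetes node."""
    steps = []
    if behind_load_balancer:
        steps.append("detach instance from load balancer")
    # `kubectl cordon` marks the node unschedulable; existing pods keep running.
    steps.append(f"kubectl cordon {node_name}")
    return steps
```

Replacement capacity still has to be requested separately, since cordoning only stops new pods from landing on the node.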
Keeping pod requirements and compute capacity matched ensures that applications stay available and that compute never falls out of step with what an application needs.
A similar process is used with Amazon Elastic Container Service, also known as ECS. ECS has a concept of draining. When nodes are set to draining, ECS prevents new tasks from being placed on them. Service tasks that are marked as Pending are also prevented from running on these instances. Again, it is very important to have new compute capacity available before any spot instances are reclaimed by Amazon.
It is also important to understand ECS topology before any other scaling logic is applied. ECS has the concept of clusters, services, and tasks. Each cluster has a set of services, and each service has running tasks. Each task definition tells the scheduler how many resources the application requires. Depending on the makeup of the cluster and the nodes within that cluster, there could be 10 c3.large machines and 10 c4.xlarge machines, with a total capacity of 61,440 CPU units and 113 GB of RAM. But if a single task requires 16 GB of RAM, it would not be scheduled on that cluster, because no single node can handle the workload. This is one of the bigger challenges when working with containers and trying to abstract away the underlying compute.
Amazon ECS reports on CPU reservation, CPU utilization, memory reservation, and memory utilization. It is extremely important to map all of these metrics to all the task definitions and cluster instances that are available to the scheduler. This is a challenge for those trying to run containers on top of abstracted compute. It is also why products such as the Spotinst Autoscaler exist: to correctly match node and instance requirements with those of the ECS scheduler. Next, we will focus on running ELB workloads with spot instances.
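The ECS capacity example above can be checked with a short calculation. The sketch assumes nominal instance specifications (c3.large: 2 vCPUs and 3.75 GiB; c4.xlarge: 4 vCPUs and 7.5 GiB), the ECS convention of 1,024 CPU units per vCPU, and an empty cluster with nothing reserved yet; the class and function names are made up for illustration.

```python
from dataclasses import dataclass

CPU_UNITS_PER_VCPU = 1024  # ECS expresses CPU as units; 1 vCPU = 1024 units


@dataclass(frozen=True)
class Node:
    instance_type: str
    vcpus: int
    memory_gib: float

    @property
    def cpu_units(self):
        return self.vcpus * CPU_UNITS_PER_VCPU


def cluster_totals(nodes):
    """Aggregate CPU units and memory across the whole cluster."""
    return sum(n.cpu_units for n in nodes), sum(n.memory_gib for n in nodes)


def fits_on_some_node(nodes, task_cpu_units, task_memory_gib):
    """A task must fit on ONE node; aggregate capacity is not enough."""
    return any(
        n.cpu_units >= task_cpu_units and n.memory_gib >= task_memory_gib
        for n in nodes
    )


# The cluster from the text: 10 x c3.large plus 10 x c4.xlarge.
cluster = [Node("c3.large", 2, 3.75)] * 10 + [Node("c4.xlarge", 4, 7.5)] * 10
```

Here `cluster_totals(cluster)` gives 61,440 CPU units and 112.5 GiB of memory (the roughly 113 GB quoted above), yet `fits_on_some_node(cluster, 1024, 16)` is false because the largest node offers only 7.5 GiB.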
AWS Spot Fleet is fully integrated with Elastic Load Balancing to enable the attachment and detachment of spot instances in a scaling group. Spot Fleet, however, will not automatically drain instances for you. When the two-minute warning is received from AWS, it is important to handle it by detaching the instance from the ELB, so traffic is gracefully handled while the instance is shutting down. Heterogeneous scaling groups are a great use case for AWS Spot Fleet when integrated with AWS Auto Scaling groups. It is important to know that Auto Scaling groups cannot contain both On-Demand and spot instances at the same time. To get the desired effect of having enough capacity when spot instances go away, two Auto Scaling groups need to be created, one with Spot instances and one with On-Demand. Metrics from each group need to be monitored so that one group can scale up while the other is scaling down. Third-party integrations, such as Spotinst, will automatically blend On-Demand and Spot capacity into a single scaling group. The provider will also handle all the attachments and detachments to any target groups or elastic load balancers that are needed to keep the application highly available.
Finally, we will discuss running stateful workloads with spot instances. Data integrity and consistency are crucial when managing workloads. This is not trivial when working with spot instances, which are conceptually ephemeral and can be revoked at any given moment. Running a stateful application requires constant snapshots of data, keeping any ENIs associated with private IP addresses, and EBS volume lifecycle management. This is one of the most complex use cases to run with spot. A Spotinst Elastigroup, for instance, will create automatic scheduled snapshots of the AMI and any attached EBS volumes. Using these options, it is possible to maintain data persistence within a cluster.
When instance replacement occurs, it is extremely important to take a final snapshot and then use this snapshot to create a new spot instance that matches the state of the old instance. If snapshots are not taken regularly, the change rate on the EBS volume might be too high to quickly recreate a new spot instance with the desired stateful data. If the instance is customized with any additional EBS volumes, it is important to either snapshot or move those volumes to the new instance as well. Lastly, private IPs are important. Launch the original instance with a custom ENI. This allows the network interface to be moved from one instance to another while retaining its private IP address.
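The replacement flow described above (final snapshot, then a new instance built from that snapshot that reuses the same ENI) can be sketched roughly as below. This is an illustrative outline under stated assumptions, not Spotinst's implementation: `ec2` is expected to behave like a boto3 EC2 client (`create_snapshot` and the `snapshot_completed` waiter), the function names are hypothetical, and the ENI is assumed to be free again because the old instance has already gone away.

```python
def take_final_snapshots(ec2, volume_by_device, instance_id):
    """Snapshot each EBS volume one last time and wait until all complete.

    volume_by_device maps a device name (e.g. "/dev/xvdf") to a volume ID.
    Returns a mapping of device name to the new snapshot ID.
    """
    snapshot_by_device = {}
    for device, volume_id in sorted(volume_by_device.items()):
        response = ec2.create_snapshot(
            VolumeId=volume_id,
            Description=f"final snapshot before replacing {instance_id}",
        )
        snapshot_by_device[device] = response["SnapshotId"]
    if snapshot_by_device:
        # Volumes cannot be created from a snapshot until it has completed.
        ec2.get_waiter("snapshot_completed").wait(
            SnapshotIds=list(snapshot_by_device.values())
        )
    return snapshot_by_device


def replacement_launch_params(snapshot_by_device, eni_id):
    """Launch parameters that restore the volumes and reuse the custom ENI."""
    return {
        "BlockDeviceMappings": [
            {"DeviceName": device, "Ebs": {"SnapshotId": snapshot_id}}
            for device, snapshot_id in sorted(snapshot_by_device.items())
        ],
        # Reusing the existing ENI keeps the private IP address.
        "NetworkInterfaces": [{"NetworkInterfaceId": eni_id, "DeviceIndex": 0}],
    }
```

With a real client created via `boto3.client("ec2")`, the returned parameters would be merged with the image, instance type, and spot market options before launching the replacement.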

About the Author

Kevin McGrath is the VP of Architecture for Spotinst, specializing in Cloud Native and Serverless product designs. Kevin is responsible for researching and evaluating innovative technologies and processes leaning on his extensive background in DevOps and delivering Software as a Service within the communications and IT infrastructure industry.

Kevin started his career 20 years ago at USinternetworking (USi), the first Application Service Provider (ASP). It was here that he began delivering enterprise applications as a service in one of the first multi-tenant shared datacenter environments. After USinternetworking was acquired by AT&T, Kevin served in the office of the CTO at Sungard Availability Services, where he specialized in migrating legacy workloads to cloud native and Serverless architectures.

Kevin holds a B.A. in Economics from the University of Maryland and a Master's in Technology Management from University of Maryland University College.