Azure Compute Infrastructure
Microsoft Azure offers services for a wide variety of compute-related needs, including traditional compute resources like virtual machines, as well as serverless and container-based services. In this course, you will learn how to design a compute infrastructure using the appropriate Azure services.
Some of the highlights include:
- Designing highly available implementations using fault domains, update domains, availability sets, scale sets, availability zones, and multi-region deployments
- Ensuring business continuity and disaster recovery using Azure Backup, System Center DPM, and Azure Recovery Services
- Creating event-driven functions in a serverless environment using Azure Functions and Azure Logic Apps
- Designing microservices-based applications using Azure Container Service, which supports Kubernetes, and Azure Service Fabric, which is Microsoft’s proprietary container orchestrator
- Deploying high-performance web applications with autoscaling using Azure App Service
- Managing and securing APIs using Azure API Management and Azure Active Directory
- Running compute-intensive jobs on clusters of servers using Azure Batch and Azure Batch AI
Learning Objectives
- Design Azure solutions using virtual machines, serverless computing, and microservices
- Design web solutions using Azure App Service
- Run compute-intensive applications using Azure Batch
Intended Audience
- People who want to become Azure cloud architects
- People preparing for Microsoft’s 70-535 exam (Architecting Microsoft Azure Solutions)
Prerequisites
- General knowledge of IT architecture
Some industries, such as medical research and weather forecasting, need massive amounts of compute power for their applications. Now, with the rise of artificial intelligence, the need for high-performance computing (HPC) is spreading to almost every organization.
Since you can spin up large clusters of VMs on Azure, it’s a great place for running HPC applications. You can build your own solution or you can use one of Microsoft’s offerings to make it easier.
One solution is to use Microsoft HPC Pack, which is a set of tools for building an HPC cluster. HPC Pack has been around since before Azure existed, but running it on Azure VMs is a lot easier than running it on-premises.
Microsoft has another offering that was specifically designed for the cloud, though. It’s called Azure Batch. This service manages the underlying infrastructure and HPC software, but still gives you the ability to specify what compute resources you need. The Batch service itself doesn’t even cost anything, but you still have to pay for the compute resources, of course.
Suppose you work for a digital animation company and you need to render the images in a movie. First, you’d upload the data files, which would be animation scene files in this case, to Azure Storage. You’d also upload the application that would process these data files to Azure Storage. Then you’d create a Batch pool of compute nodes. This is when you would tell it what size of VMs you want, how many to put in the pool, what operating system to run, etc. Next, you’d create a job to run on the pool. Then, you’d add tasks to the job. These tasks would be scheduled to automatically run on the pool by the Batch service. The tasks would run the application you uploaded on the data files you uploaded. When the tasks are done, the output files, which would be render files in this case, could be transferred to Azure Storage.
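The workflow above can be sketched as a small in-memory model. This is only an illustration of the pool, job, and task relationship: the `Pool` and `Job` classes, the `schedule` function, and the placeholder VM size and image names are all hypothetical and are not part of any Azure SDK. The real Batch scheduler is also more sophisticated than the round-robin assignment assumed here.

```python
from dataclasses import dataclass, field


@dataclass
class Pool:
    """Hypothetical stand-in for a Batch pool: VM size, node count, OS."""
    vm_size: str
    node_count: int
    os_image: str


@dataclass
class Job:
    """Hypothetical stand-in for a Batch job bound to one pool."""
    pool: Pool
    tasks: list = field(default_factory=list)

    def add_task(self, command: str, input_file: str) -> None:
        self.tasks.append({"command": command, "input": input_file})


def schedule(job: Job) -> dict:
    """Assign tasks to nodes round-robin (a simplification, assumed here)."""
    assignment = {node: [] for node in range(job.pool.node_count)}
    for i, task in enumerate(job.tasks):
        assignment[i % job.pool.node_count].append(task["input"])
    return assignment


# The scene files and the rendering application are assumed to be uploaded already.
scene_files = [f"scene_{n:03d}.dat" for n in range(5)]

# Create the pool. The VM size and image name are placeholder strings.
pool = Pool(vm_size="example-size", node_count=2, os_image="example-linux-image")

# Create a job on the pool and add one task per scene file.
job = Job(pool=pool)
for scene in scene_files:
    job.add_task(command=f"render {scene}", input_file=scene)

# Map each node index to the scene files it would process.
plan = schedule(job)
```

In this model, each task carries its own input file, which is what lets the scheduler hand tasks to any node in the pool.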
To make all of this work, you need to design the application so that multiple copies of it can run in parallel. Each node in the pool should take one part of the data and process it without having to communicate with any of the other nodes and without storing data locally. This makes it an “embarrassingly parallel” workload that can easily scale.
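As a rough local analogy, the following sketch shows the shape of an embarrassingly parallel workload using Python's standard library. The `render_frame` function is a hypothetical placeholder for the real application, and threads on one machine stand in for the nodes of a pool. The point is only that each unit of work depends on its own input and nothing else.

```python
from concurrent.futures import ThreadPoolExecutor


def render_frame(frame_number: int) -> str:
    """Process one frame using only its own input: no shared state, no peer calls."""
    # Placeholder for real work: derive an output name from the input alone.
    return f"frame_{frame_number:04d}.png"


def render_all(frame_numbers, workers: int = 4):
    """Run every frame independently; map() returns results in input order."""
    with ThreadPoolExecutor(max_workers=workers) as executor:
        return list(executor.map(render_frame, frame_numbers))


outputs = render_all(range(6))
```

Because no frame depends on another, changing the `workers` value changes only how fast the work finishes, not the result.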
It’s also possible to run tightly coupled workloads on Azure Batch. These are applications where the nodes do need to communicate with each other. That’s normally done using the Message Passing Interface (or MPI).
When you specify what type of VMs to put in the pool, in many cases you can improve performance by selecting VMs with graphics processing units (or GPUs) on them. Many compute-intensive applications work well with these specialized processors. Alternatively, you can use traditional CPUs, but choose HPC-optimized VMs that have high-performance components. In addition to fast CPUs and storage, some of the HPC VMs also have large memory capacity. For MPI applications, you can also choose VMs with low-latency, high-bandwidth networking.
These high-performance VMs are pretty expensive, but fortunately, you can take advantage of low-priority VMs to save a huge amount of money. Low-priority VMs typically cost between 65% and 80% less than normal-priority VMs. The difference is that low-priority VMs may not be available when you need them, because they run in Azure’s surplus capacity. Even if you’re able to allocate low-priority VMs for a job, they could be preempted and you would lose them. Now you can see why they’re so cheap.
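To get a feel for the savings, here is a small cost calculation. The hourly rate, node count, and duration are made-up numbers for illustration, not real Azure prices. Only the 65% to 80% discount range comes from the text above.

```python
def pool_cost(hourly_rate: float, hours: float, nodes: int, discount: float = 0.0) -> float:
    """Cost of running `nodes` VMs for `hours`, with a fractional discount applied."""
    if not 0.0 <= discount <= 1.0:
        raise ValueError("discount must be between 0 and 1")
    return hourly_rate * (1.0 - discount) * hours * nodes


# Hypothetical job: 10 nodes for 8 hours at an assumed rate of 1.00 per node-hour.
normal = pool_cost(hourly_rate=1.00, hours=8, nodes=10)
low_priority_worst = pool_cost(hourly_rate=1.00, hours=8, nodes=10, discount=0.65)
low_priority_best = pool_cost(hourly_rate=1.00, hours=8, nodes=10, discount=0.80)
```

With these assumed numbers, the job that costs 80 at the normal rate would cost roughly 28 at a 65% discount and roughly 16 at an 80% discount, before accounting for any work that has to be repeated after a preemption.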
The great thing about Azure Batch is that it’s ideally suited to using low-priority VMs. If you’re running an embarrassingly parallel job and some of the VMs get preempted, it’s not a big deal, because the job will keep running on the remaining VMs, and the interrupted tasks will be automatically requeued. You can even allocate a certain number of dedicated VMs to guarantee that your job will keep running no matter what happens to the low-priority VMs.
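The requeue behavior can be illustrated with a toy simulation. This is not how the Batch service is implemented. The `run_with_preemption` function and the rule that a task is preempted at most once are assumptions made to keep the sketch short.

```python
from collections import deque


def run_with_preemption(task_ids, preempted_on_first_try):
    """Run tasks from a queue; a task preempted on its first attempt is requeued."""
    queue = deque(task_ids)
    attempts = {task_id: 0 for task_id in task_ids}
    completed = []
    while queue:
        task_id = queue.popleft()
        attempts[task_id] += 1
        if task_id in preempted_on_first_try and attempts[task_id] == 1:
            # The node was reclaimed mid-task: put the task back on the queue.
            queue.append(task_id)
            continue
        completed.append(task_id)
    return completed, attempts


# Task "t2" loses its node once; the other tasks are unaffected.
completed, attempts = run_with_preemption(["t1", "t2", "t3", "t4"], {"t2"})
```

Every task still completes. The preempted task simply finishes later and costs one extra attempt, which is why short, independent tasks tolerate preemption well.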
Of course, MPI-based applications aren’t well suited to using low-priority VMs because if the application loses a VM, you’d probably have to rerun the entire job. Applications with long-running tasks are also a poor fit because it would be time-consuming to rerun tasks that get interrupted.
Another decision to make is how long to keep your Batch pools running. If you only run jobs periodically, then it would make sense to create a pool when you need to run a job and delete it when the job is finished.
If you need a job to start immediately and you know when you’re going to start it, then you can create a pool ahead of time. If you run jobs almost all of the time, then you should leave your pools running all of the time. If you always have jobs running, but the load varies a lot, then you can scale the pool up and down as needed.
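The pool lifetime guidance above can be written down as a simple decision function. The function name, parameter names, and returned strings are hypothetical. The sketch only encodes the rules of thumb from the text, and real decisions would also weigh cost and startup time.

```python
def pool_strategy(runs_almost_always: bool,
                  load_varies: bool = False,
                  scheduled_immediate_start: bool = False) -> str:
    """Pick a pool lifetime strategy from the usage pattern described in the text."""
    if runs_almost_always:
        if load_varies:
            return "keep the pool running and scale it up and down"
        return "keep the pool running"
    if scheduled_immediate_start:
        return "create the pool ahead of time"
    return "create the pool per job and delete it afterward"


# Example: jobs run only occasionally, with no need for an instant start.
occasional = pool_strategy(runs_almost_always=False)
```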
As with most Azure services, you can run Azure Batch from the portal, the CLI, or from your code. All three methods provide rich monitoring capabilities. For example, if you run it from your code, you can request the status of all of the tasks in a job. You can call the Get Task Counts operation to find out how many tasks are active, running, and completed, as well as how many succeeded and failed.
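To show what a task-count summary looks like, here is a local tally over a list of task records. It is not a call to the Batch API. The record layout (`state` and `result` keys) and the `task_counts` function are assumptions chosen to mirror the categories named above.

```python
from collections import Counter


def task_counts(tasks):
    """Tally tasks by state, and completed tasks by result."""
    states = Counter(task["state"] for task in tasks)
    results = Counter(
        task["result"] for task in tasks if task["state"] == "completed"
    )
    return {
        "active": states["active"],
        "running": states["running"],
        "completed": states["completed"],
        "succeeded": results["success"],
        "failed": results["failure"],
    }


# Hypothetical snapshot of a job with five tasks.
snapshot = [
    {"state": "active", "result": None},
    {"state": "running", "result": None},
    {"state": "completed", "result": "success"},
    {"state": "completed", "result": "success"},
    {"state": "completed", "result": "failure"},
]
counts = task_counts(snapshot)
```

Note that succeeded and failed are breakdowns of the completed count, so in this model they always add up to it.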
If you need to run a machine learning workload, then you should probably use Azure Batch AI rather than just Azure Batch. Although their names are almost identical, these two services are actually fairly different from each other. With Batch AI, you create clusters rather than pools, and the jobs have to use a machine learning library, such as TensorFlow or CNTK. Just like Azure Batch, though, it takes care of the implementation details, including autoscaling.
And that’s it for compute-intensive applications.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).