Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. Spark handles not only data analytics tasks but also machine learning.
In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.
Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using the Azure Machine Learning service.
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule. After that, we’ll show you how to train a machine learning model. Finally, we’ll go through several ways to deploy a trained model as a prediction service.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
- Train a machine learning model using Databricks
- Deploy a Databricks-trained machine learning model as a prediction service
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for either analytics or machine learning workloads
Prerequisites
- Prior experience with Azure and at least one programming language
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
Once you have a trained model that seems to work well, you’ll probably want to use it in a production environment. Currently, deploying a Databricks-trained model on Azure is not as straightforward as it is with some other tools. For example, Azure Machine Learning Studio has a very easy way to deploy a trained model as a web service. You can do the same with a model that was trained using Databricks, but there are several ways to do it, and all of them require quite a bit more work.
First, you need to save the trained model. Then you can import it into the production system where you want to run it. There are three options for saving the model: MLWriter, MLeap, and Databricks ML Model Export.
MLWriter is a Spark class that can save ML components. It’s an abstract class, and Spark’s ML components expose it through their write method. For example, to save a pipeline, you call the pipeline’s write method to get a writer and then call the save method on that writer. Loading a saved model on another system is just as easy. All you need to do is call the read and load methods on the pipeline class.
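As a rough sketch of what saving and loading look like in PySpark, the helpers below wrap the write, save, read, and load calls. The function names and the overwrite flag are hypothetical conveniences, not part of Spark, and the sketch assumes you are saving a fitted pipeline (a PipelineModel). The PySpark import is done lazily, so the loading helper only needs Spark on the system where it is actually called.

```python
def save_pipeline(pipeline, path, overwrite=False):
    """Save a Spark ML pipeline (or fitted PipelineModel) through its MLWriter.

    `pipeline` is any object with a write() method that returns an MLWriter.
    """
    writer = pipeline.write()
    if overwrite:
        # Replace whatever an earlier run left at this path.
        writer = writer.overwrite()
    writer.save(path)


def load_pipeline_model(path):
    """Load a fitted pipeline on a system that has PySpark installed."""
    # Imported here so that this module can be imported without Spark.
    from pyspark.ml import PipelineModel

    return PipelineModel.read().load(path)
```

On a Databricks cluster, you might call save_pipeline(model, "/mnt/models/example", overwrite=True), where the path is only a placeholder, and then call load_pipeline_model with the same path on the production system.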
The production system where you load the saved pipeline needs to support Spark, of course. There are a number of options for running Spark on Azure, such as HDInsight or a Data Science Virtual Machine. The best option, though, is probably the Azure Machine Learning service, which can run a variety of Python-based frameworks, such as Spark, TensorFlow, and PyTorch. In fact, the Azure ML service deploys trained models on other Azure compute services, such as Azure Container Instances. One of the Azure ML service’s best deployment options is AKS, the Azure Kubernetes Service. It’s a container-based service that autoscales up and down as needed.
Although you can easily access the Azure ML service from Databricks, it still requires quite a bit of code to set up a prediction service. At a high level, here are the steps you need to take. First, you need to install the Azure ML SDK as a library in Databricks. Then you can attach that library to one or more of the Databricks clusters. Once you’ve done that, you’ll be able to call the Azure ML service from any of the Databricks clusters that have the library attached to them.
If you don’t already have an Azure ML workspace, you’ll need to create one. The workspace is where you keep track of everything related to your machine learning activities. In this case, you’re going to use it for model deployment, so you need to register your trained model in the workspace.
Next, you need to create a scoring script. This script will be used to load input data, feed it into the model, and return a prediction. It has to include an init function and a run function.
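To make the shape of a scoring script more concrete, here is a minimal sketch with an init function and a run function. The stand-in model class, the "data" key in the input, and the "predictions" key in the output are all assumptions made for illustration. A real script would load the registered model inside init instead of creating a stand-in.

```python
import json

model = None  # populated by init()


class _MeanModel:
    """Hypothetical stand-in for a trained model: predicts the mean of each row."""

    def predict(self, rows):
        return [sum(row) / len(row) for row in rows]


def init():
    """Runs once when the service starts: load the model into memory."""
    global model
    # Assumption: a real script would load the registered model here.
    model = _MeanModel()


def run(raw_data):
    """Runs for every request: parse the JSON input, score it, return the result."""
    try:
        rows = json.loads(raw_data)["data"]
        return {"predictions": model.predict(rows)}
    except Exception as exc:
        # Report the problem to the caller rather than crashing the service.
        return {"error": str(exc)}
```

Keeping the model load in init means it happens once per container rather than once per request, which is the main reason the two functions are separate.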
After that, you need to create a container image. The image should contain the trained model, the scoring script, and any dependencies that are required by either the model or the script. Dependencies are usually handled using Conda.
Once you have a container image, you have a number of options for where to deploy it. While you’re still developing your prediction service, a good option is Azure Container Instances. This service makes it easy to spin up a single container instance. If you expect to have a low volume of requests to your prediction service, then putting it on an Azure Container Instance may be a reasonable choice for production too.
In most cases, though, you’ll likely want to deploy it to Azure Kubernetes Service. AKS has the advantage of being able to scale up and down based on demand. It’s a full container orchestration service, and it’s becoming the standard for running containers on Azure.
There are also a couple of other deployment options that are more specialized. One option is to deploy to field-programmable gate arrays, or FPGAs, which Azure makes available through Project Brainwave. These are reconfigurable chips that can be programmed specifically for running deep neural networks, so they’re extremely fast when used for that purpose.
Another option is to deploy to Azure IoT Edge devices. These are Windows and Linux devices that are not in the Azure cloud. If you deploy to a device that has the Azure IoT Edge runtime, then that device can perform predictions without having to use the cloud, so the response time is faster.
Regardless of which deployment target you’ve used, you need to create a web service on it to make it easily accessible. Then you can send data to the URL of the web service, and it will send predictions back.
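Sending data to the web service can be sketched with nothing but the Python standard library. The function names, the "data" key in the request body, and the bearer-token header are assumptions for illustration. The actual input format depends on your scoring script, and authentication depends on how the service was deployed.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri, rows, api_key=None):
    """Build a POST request that carries the input rows as a JSON body."""
    body = json.dumps({"data": rows}).encode("utf-8")
    headers = {"Content-Type": "application/json"}
    if api_key:
        # Assumption: the service expects a bearer token when auth is enabled.
        headers["Authorization"] = f"Bearer {api_key}"
    return urllib.request.Request(
        scoring_uri, data=body, headers=headers, method="POST"
    )


def get_predictions(scoring_uri, rows, api_key=None, timeout=30):
    """Send the rows to the web service and return its decoded JSON response."""
    request = build_scoring_request(scoring_uri, rows, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

You would pass the URL of your deployed web service as scoring_uri, along with a few rows of test data.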
You can find a set of notebooks that show you an example of how to code all of these steps at this URL. I’ve already imported these notebooks into my Databricks workspace. I won’t go through the details, but I’ll give you a quick overview of what’s in the notebooks.
The first notebook creates an Azure ML workspace. The next two notebooks ingest the data and build, train, and save the model. The only part we’re interested in is where it saves the model. This line of code is basically the same as what I showed you before except that it also includes the overwrite method, just in case the file already exists from a previous run.
The fourth notebook takes care of the deployment, so it’s the one you’ll want to look at the most. It registers the trained model in the Azure ML workspace, creates a scoring script, creates a Conda file, creates a container image, deploys the image to an Azure Container Instance, creates a web service, and sends some test data to the web service.
At the beginning of this lesson, I mentioned that there are three options for saving a trained model. So far, I’ve only talked about the first of those: MLWriter. Databricks also supports two export formats: Databricks ML Model Export and MLeap. These formats are platform-independent, so you can load models on both Spark and non-Spark systems.
The Databricks ML Model Export format is the same as what MLWriter produces except that it’s in JSON and it includes some extra metadata. The advantage of including the extra metadata is that the saved model or pipeline can be run on a non-Spark system. To implement a prediction service, you need to install the dbml-local runtime library and make calls to its APIs.
MLeap is open source. Somewhat surprisingly, Databricks actually recommends MLeap rather than its own export format. In addition to Spark, MLeap also supports scikit-learn and TensorFlow. Like Databricks ML Model Export, MLeap has an execution engine for running pipelines. When you export a pipeline, it goes into an MLeap Bundle. The execution engine is called the MLeap Runtime. You can implement a prediction service by running an MLeap Bundle on any system that has the MLeap Runtime installed.
And that’s it for deploying a trained model.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).