Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. Spark handles not only data analytics tasks but also machine learning.
In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.
Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule. After that, we’ll show you how to train a machine learning model. Finally, we’ll go through several ways to deploy a trained model as a prediction service.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
- Train a machine learning model using Databricks
- Deploy a Databricks-trained machine learning model as a prediction service
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for either analytics or machine learning workloads
Prerequisites
- Prior experience with Azure and at least one programming language
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
Now that you know the basics of how Azure Databricks works, it’s time to use it to train a machine learning model. We’re going to train a model for the most common machine learning example there is: the MNIST handwritten digits dataset. There’s a very good chance you’re already familiar with the MNIST dataset, so I’m not going to go through it in great detail. I’m just going to use it as an example of how to do machine learning in Databricks.
The MNIST dataset contains 70,000 images of handwritten, single-digit numbers. The goal is to train a model to recognize these images, and then be able to recognize new handwritten digits that it hasn’t seen before.
There is a wide variety of ways we could do this. Probably the most common approach is to use a neural network, and the most common tool for building a neural network is TensorFlow. That’s definitely an option in Databricks. To use TensorFlow or scikit-learn or PyTorch or almost any other machine learning framework, we need to spin up a cluster that includes the Databricks Runtime for Machine Learning. When you create the cluster, you need to change the “Databricks Runtime Version” to one that has “ML” in it.
At this point, you might be saying, “Hey, why didn’t we choose an ML cluster before? Now we have to create another cluster.” Well, actually, we don’t have to because we can use the machine learning framework that comes with Spark. It’s called MLlib, and it comes preinstalled on all of the Databricks runtime versions.
MLlib is pretty similar to other ML frameworks, so again, I won’t go into it in great detail, but I do need to explain the concept of pipelines. An MLlib pipeline defines the workflow for a machine learning job. The workflow typically includes steps like preprocessing data and training a model.
For example, if you were creating a model for spam filtering, you would need to read in a large number of emails, split them into individual words, convert those words into features of the model, and then run a machine learning algorithm to produce a model that fits the data.
The stages shown in blue are called transformers. They transform one dataframe into another. For example, the tokenizer transforms the raw text into words.
The stage shown in red is called an estimator. It’s an algorithm that takes a dataframe and produces a model. In this example, a logistic regression algorithm is used to create a model that classifies emails as either spam or not spam.
You may have noticed that the model produced by the algorithm is colored blue. That’s because technically it’s a transformer. The model takes a dataframe as input and turns it into a set of predictions.
This pipeline would be used to train a model, but when it’s time to test the model, you need a similar, but slightly different, pipeline. This one is exactly the same except there’s no algorithm because the model has already been trained. In the testing phase, you need to feed a set of data into the model and see how accurate its predictions are. By using the same data preprocessing steps as you did in the training pipeline, you can ensure that the test data will be in the right format for the model.
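To make the transformer and estimator roles a bit more concrete, here is a rough sketch in plain Python. It does not use Spark at all. The class names, the crude "spam word" algorithm, and the sample emails are all made up for illustration; the only thing it tries to mimic is the shape of a pipeline, where transformers have a transform step, the estimator has a fit step, and the model that comes out of fit is itself a transformer.

```python
# Toy stand-ins for pipeline stages. These are NOT the MLlib classes.

class Tokenizer:
    """Transformer: turns raw text into a list of lowercase words."""
    def transform(self, rows):
        return [{**row, "words": row["text"].lower().split()} for row in rows]


class SpamWordModel:
    """The trained model. It is a transformer: it adds a prediction to each row."""
    def __init__(self, spam_words):
        self.spam_words = spam_words

    def transform(self, rows):
        return [
            {**row, "prediction": int(any(w in self.spam_words for w in row["words"]))}
            for row in rows
        ]


class SpamWordEstimator:
    """Estimator: learns from labeled rows and returns a model."""
    def fit(self, rows):
        spam = {w for row in rows if row["label"] == 1 for w in row["words"]}
        ham = {w for row in rows if row["label"] == 0 for w in row["words"]}
        # Keep only words that appear in spam and never in legitimate mail.
        return SpamWordModel(spam - ham)


def fit_pipeline(stages, rows):
    """Run the transformers in order, fit the final estimator, return the fitted stages."""
    *transformers, estimator = stages
    for stage in transformers:
        rows = stage.transform(rows)
    return transformers + [estimator.fit(rows)]


def run_pipeline(fitted_stages, rows):
    """Apply every fitted stage (all of them are transformers now) to new data."""
    for stage in fitted_stages:
        rows = stage.transform(rows)
    return rows


training = [
    {"text": "Win a free prize now", "label": 1},
    {"text": "Meeting notes for Monday", "label": 0},
]
fitted = fit_pipeline([Tokenizer(), SpamWordEstimator()], training)

test_rows = run_pipeline(
    fitted,
    [{"text": "Claim your free prize"}, {"text": "Monday meeting agenda"}],
)
predictions = [row["prediction"] for row in test_rows]
print(predictions)
```

Notice that the fitted pipeline reuses the same tokenizer that was used in training, which is the point made above about keeping the test data in the right format for the model.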
OK, now we’re ready to look at the MNIST handwritten digits example. The Databricks documentation includes a complete notebook that will guide us through everything. It’s at this URL. To run the code in this notebook ourselves, we need to import it to our workspace. The way to do that is a little bit clunky. First, click “Get notebook link”. Then click the “Import Notebook” button. This just brings up a URL that you need to copy to the clipboard. Then you have to go to the Databricks console, click Workspace, and then in the Workspace menu, select “Import”. Now choose to import from a URL and paste the URL here. Then you can finally import it.
Alright, now this notebook has a mix of documentation and code, which is ideally what you want to have in a notebook so it’s easy to understand what the code does. If this type of code doesn’t look familiar to you, that might be because it’s in Scala, which it tells you up here. If you haven’t used Scala before, don’t worry because it’s pretty similar to other languages, and we’re not going to go into the code in great detail anyway.
The first section loads the MNIST data from the databricks-datasets folder, which is where we found the sample data in the last lesson too. It loads both the training data and the test data. Let’s run the cell using the menu. Notice that there are a couple of other options in the menu. You can run all of the cells above this one or all of the cells below it. In both cases, it runs this cell as well, not just the ones above or below it. To save time as we go through the code, let’s run all of the cells in this notebook right now. That way, we won’t have to wait for each section of code to run when we get to it.
Before we can run any code in this notebook, we have to attach it to a cluster. You can see that it says “Detached” here right now. To attach it to the Spark cluster that we created before, just select it from the menu. Now it’s attached, but the cluster still isn’t running, so we have to “Start Cluster”. This is going to take a while, so I’ll fast forward.
Alright, the cluster is running now. Another way to run all of the code in a notebook is to click “Run All” up here. That’s what I’m going to do.
OK, so it loaded 60,000 training images and 10,000 test images. Then it displays the training data. The output is in a scrollable box. Each row has a label, which is the correct digit for this image, and the features, which are the pixels in the image.
The next few cells train the model. This notebook uses a decision tree, which is a bit unusual because the MNIST data is usually run through a neural network. Decision trees don’t work quite as well for image recognition, but they’re still pretty good.
This cell just does some imports. The next cell creates a pipeline with two stages. The first stage runs StringIndexer. This adds a new column to the data called “indexedLabel”. It turns the labels (which are the digits 0 through 9, in this case) into numeric indexes, starting at 0. Wait a minute. The labels are already numeric and they start at 0, so why do we need to do this? Well, actually, we don’t need to in this case. I think they just included it in this example so you could see what you’d normally do. Since the labels in most datasets are strings, such as apple or orange, you’d need to convert them into numeric values, but for the MNIST dataset, the labels are already in the form that we need.
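If you want a feel for what this kind of label indexing does, here is a small plain-Python sketch. It is not the MLlib implementation. It imitates StringIndexer's default ordering, where the most frequent label gets index 0, and the function name, the fruit labels, and the alphabetical tie-break are assumptions I've made for the example.

```python
from collections import Counter


def fit_string_indexer(labels):
    """Map each distinct label to an index, most frequent label first."""
    counts = Counter(labels)
    # Sort by descending count. Ties are broken by label text so the result
    # is deterministic (the tie-break rule is an assumption of this sketch).
    ordered = sorted(counts, key=lambda label: (-counts[label], str(label)))
    return {label: float(index) for index, label in enumerate(ordered)}


fruit = ["apple", "orange", "apple", "banana", "apple", "orange"]
mapping = fit_string_indexer(fruit)
indexed = [mapping[label] for label in fruit]
print(mapping)
print(indexed)
```

With this ordering, the index a label receives depends on how often it occurs in the data, not on what the label says.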
The second stage of the pipeline is the Decision Tree Classifier algorithm.
The next cell runs the training data through the pipeline. This creates a decision tree. The great thing about decision trees is we can see exactly how they work. This cell displays the tree. You can see that it looks at feature 350 first. Remember that each feature is a pixel in the image. If feature 350’s value is less than this number, then it looks at feature 568, and so on, until it reaches a prediction as to which digit is represented by this image.
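Here is a rough plain-Python sketch of how a tree like this arrives at a prediction. The feature numbers 350 and 568 are the ones mentioned above. Everything else (the thresholds, feature 435, and the predicted digits at the leaves) is invented for illustration and does not come from the trained model.

```python
# Hypothetical two-level tree. Only feature numbers 350 and 568 come from the
# walkthrough; thresholds, feature 435, and leaf predictions are made up.
tree = {
    "feature": 350, "threshold": 1.5,
    "left": {
        "feature": 568, "threshold": 0.5,
        "left": {"prediction": 1},
        "right": {"prediction": 9},
    },
    "right": {
        "feature": 435, "threshold": 0.5,
        "left": {"prediction": 3},
        "right": {"prediction": 0},
    },
}


def predict(node, features):
    """Follow the tree from the root until a leaf (a prediction) is reached."""
    while "prediction" not in node:
        go_left = features[node["feature"]] <= node["threshold"]
        node = node["left"] if go_left else node["right"]
    return node["prediction"]


image = [0.0] * 784      # a blank 28x28 image, flattened into 784 pixel features
image[568] = 200.0       # light up a single pixel
digit = predict(tree, image)
print(digit)
```

Each level of the tree asks one question about one pixel, which is why a tree with only a few levels can be read and understood directly.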
This is pretty cool, but the notebook doesn’t include code that tells us how accurate this model’s predictions are. To do that, we can add a new cell here. There are a couple of ways to do it, but the easiest way is to put your mouse pointer between the cells and a plus sign will appear. Click that to get a new cell. Then copy and paste these lines from the readme file in the GitHub repository for this course. This code runs the test data through the decision tree and evaluates the accuracy. You can see that it comes to just under 70% accurate. That’s pretty amazing considering how simple the decision tree is.
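In case it helps, accuracy here just means the fraction of test images where the predicted digit matches the label. This is not the code from the readme file, only a minimal plain-Python sketch of the calculation, and the labels and predictions in it are made up.

```python
def accuracy(labels, predictions):
    """Fraction of predictions that match the true labels."""
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must be the same length")
    correct = sum(1 for label, pred in zip(labels, predictions) if label == pred)
    return correct / len(labels)


true_digits = [7, 2, 1, 0, 4, 1, 4, 9, 5, 9]   # made-up labels
predicted = [7, 2, 1, 0, 9, 1, 4, 9, 6, 7]     # made-up predictions
score = accuracy(true_digits, predicted)
print(score)
```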
The rest of the notebook tries to fine-tune a couple of hyperparameters, which are parameters that are set before the model gets trained. The first one we look at is maxDepth, which specifies how deep the tree can be. By default, it’s set to 5, so that’s how many decision layers were in the tree above.
This block of code trains 8 decision trees, each with a different number of decision layers. The number of layers ranges from 0 to 7 because “until 8” in Scala means up to, but not including, 8. The next cell creates the evaluator that will calculate the accuracy. Finally, this block of code will run the test data through each of the 8 decision trees and evaluate the accuracy.
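If you are more used to Python than Scala, the “until” range behaves like Python's built-in range, which also leaves out the upper bound. A tiny sketch of the comparison:

```python
# Scala's `0 until 8` excludes the upper bound, like Python's range(0, 8).
depths = list(range(0, 8))

# Scala's `0 to 8` would include the upper bound; in Python that is range(0, 9).
inclusive = list(range(0, 9))

print(depths)
print(inclusive)
```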
The results are displayed in this graph. You can see that as you add more layers to the decision tree, it gets more accurate. Here’s the one that we ran earlier. It has 5 decision layers, and it has an accuracy of 70%. By adding a couple more layers, we can get to 79%. Adding even more layers will probably not increase the accuracy by much, if at all, and we risk overfitting if we add too many layers.
This notebook tries to tweak one more hyperparameter to see how much of a difference it makes in the accuracy. This hyperparameter is called maxBins. The images in the MNIST database are in grayscale. This means the pixel values are in a continuous range depending on how light or dark they are. Since there’s a huge number of potential values for each pixel, it can take a very long time to train the model. To speed up the training, the algorithm divides these values into a smaller number of “bins”. For example, if you divide the range of values into only two bins, then each pixel is either black or white, with nothing in between.
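Here is a simplified plain-Python sketch of the binning idea. It assumes pixel values from 0 to 255 and uses equal-width bins, and the function name is made up. Spark chooses its bin boundaries from the data itself, so treat this only as an illustration of what reducing the number of bins does to the values.

```python
def bin_index(value, num_bins, max_value=255):
    """Return which of num_bins equal-width bins a pixel value falls into."""
    width = (max_value + 1) / num_bins
    # Clamp to the last bin so max_value never lands outside the range.
    return min(int(value // width), num_bins - 1)


pixels = [0, 30, 127, 128, 200, 255]
two_bins = [bin_index(p, 2) for p in pixels]      # effectively black or white
eight_bins = [bin_index(p, 8) for p in pixels]    # eight shades of gray
print(two_bins)
print(eight_bins)
```

With two bins, every pixel collapses to one of two values, which is the black-or-white case described above.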
The question is, “How much difference does it make if you drastically reduce the number of bins?” The answer is, “not much.” Although this graph makes it look like the accuracy fluctuates wildly as you reduce the number of bins, you have to look at the scale of the y-axis. All of these accuracy rates are .75 something. So the accuracy is pretty good even on black and white images.
And that’s it for training a machine learning model.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).