Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s biggest advantages is that it processes data in memory, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. Spark handles not only data analytics tasks but also machine learning.
In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.
Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule. After that, we’ll show you how to train a machine learning model. Finally, we’ll go through several ways to deploy a trained model as a prediction service.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
- Train a machine learning model using Databricks
- Deploy a Databricks-trained machine learning model as a prediction service
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for either analytics or machine learning workloads
Prerequisites
- Prior experience with Azure and at least one programming language
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
I hope you enjoyed learning about Azure Databricks. Let’s do a quick review of what you learned.
Apache Spark is an open-source framework for doing big data processing. Azure Databricks is a managed implementation of Spark in the cloud.
A Databricks workspace is where you store your notebooks and other related items. Although a notebook has a default programming language, you can run code in another language by starting a cell with a percent sign followed by the name of that language (for example, %sql or %scala).
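As a rough illustration of the percent-sign convention, the sketch below models three notebook cells as plain strings and uses a hypothetical helper, `cell_language`, to work out which language each cell would run in. The helper, the sample cell contents, and the table and file names are assumptions made for illustration; in a real notebook, Databricks itself interprets the `%python`, `%sql`, `%scala`, and `%r` magic commands.

```python
# Language magic commands that switch a single cell away from the
# notebook's default language.
LANGUAGE_MAGICS = {"%python": "python", "%sql": "sql", "%scala": "scala", "%r": "r"}


def cell_language(cell_source: str, default_language: str) -> str:
    """Return the language a cell would run in (hypothetical helper).

    A cell uses the notebook's default language unless its first
    non-blank line starts with one of the language magic commands.
    """
    for line in cell_source.splitlines():
        stripped = line.strip()
        if not stripped:
            continue  # skip leading blank lines
        first_token = stripped.split()[0]
        return LANGUAGE_MAGICS.get(first_token, default_language)
    return default_language  # an empty cell keeps the default


# Three cells from an imaginary notebook whose default language is Python.
cells = [
    "df = spark.read.csv('/tmp/example.csv', header=True)",
    "%sql\nSELECT COUNT(*) FROM example_table",
    "%scala\nval total = 1 + 2",
]

for source in cells:
    print(cell_language(source, default_language="python"))
```

Only the first non-blank line is inspected, which reflects the rule that the magic command has to open the cell.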
DBFS (or the “Databricks File System”) is a distributed filesystem that’s installed on a Databricks cluster and backed by Azure Storage.
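One practical consequence is that the same DBFS file can be addressed in two ways: Spark APIs use a `dbfs:/` URI, while ordinary local file APIs on the cluster's driver node see DBFS under the `/dbfs` mount point. The sketch below converts between the two forms. The helper names and the example path are hypothetical, and the code only manipulates strings, so it does not need a cluster to run.

```python
DBFS_SCHEME = "dbfs:/"  # prefix used when passing paths to Spark APIs
LOCAL_MOUNT = "/dbfs/"  # where DBFS appears to local file APIs on the driver


def to_local_path(dbfs_uri: str) -> str:
    """Convert 'dbfs:/a/b.csv' to '/dbfs/a/b.csv' (hypothetical helper)."""
    if not dbfs_uri.startswith(DBFS_SCHEME):
        raise ValueError(f"not a DBFS URI: {dbfs_uri!r}")
    # Drop the scheme and any extra leading slashes (e.g. 'dbfs:///a').
    return LOCAL_MOUNT + dbfs_uri[len(DBFS_SCHEME):].lstrip("/")


def to_dbfs_uri(local_path: str) -> str:
    """Convert '/dbfs/a/b.csv' to 'dbfs:/a/b.csv' (hypothetical helper)."""
    if not local_path.startswith(LOCAL_MOUNT):
        raise ValueError(f"not under the DBFS mount: {local_path!r}")
    return DBFS_SCHEME + local_path[len(LOCAL_MOUNT):]


example_uri = "dbfs:/tmp/example/trips.csv"  # hypothetical file
print(to_local_path(example_uri))
print(to_dbfs_uri(to_local_path(example_uri)))
```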
A job is a way of running an entire notebook at scheduled times. It also keeps a record of previous runs. In most cases, it’s less expensive to run a job on a new cluster than on an existing cluster, because a cluster created just for the job is billed at the automated workload price, which is lower than the interactive price. It’s also usually a good idea to select the autoscaling option so you don’t have to guess how many nodes the cluster should have.
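To make these choices concrete, here is a minimal sketch of job settings written as a Python dictionary. The field names are modeled on the request body of the Databricks Jobs REST API (version 2.0), so check the current API reference before relying on them. Every value (job name, runtime version, node type, notebook path, schedule) is a placeholder, and `uses_new_job_cluster` is a hypothetical helper.

```python
import json

# All concrete values below are placeholders, not recommendations.
job_settings = {
    "name": "nightly-example-job",
    # "new_cluster" asks for a cluster created just for each run,
    # instead of pointing at an existing interactive cluster.
    "new_cluster": {
        "spark_version": "13.3.x-scala2.12",
        "node_type_id": "Standard_DS3_v2",
        # Autoscaling: give a range rather than a fixed worker count.
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
    "notebook_task": {
        "notebook_path": "/Users/someone@example.com/example-notebook"
    },
    # Quartz cron syntax: run every day at 02:00.
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",
        "timezone_id": "UTC",
    },
}


def uses_new_job_cluster(settings: dict) -> bool:
    """True if the job creates its own cluster for each run (hypothetical helper)."""
    return "new_cluster" in settings and "existing_cluster_id" not in settings


print(uses_new_job_cluster(job_settings))
print(json.dumps(job_settings, indent=2))
```

The same choices (new cluster, autoscaling, schedule) can also be made in the job creation form in the Databricks UI.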
MLlib is Spark’s own machine learning library, and it comes preinstalled on all of the Databricks runtime versions. An MLlib pipeline defines the workflow for a machine learning job. A transformer transforms one dataframe into another. An estimator is an algorithm that takes a dataframe and produces a model.
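The relationship between transformers, estimators, and pipelines can be sketched without Spark at all. The toy code below is not MLlib's implementation: it uses lists of dictionaries in place of dataframes, and every class in it is a simplified stand-in that borrows the `fit` and `transform` method names from `pyspark.ml`. The column names and numbers are made up for illustration.

```python
from typing import Dict, List

Row = Dict[str, float]  # stand-in for one dataframe row


class Transformer:
    """Turns one 'dataframe' (list of rows) into another."""
    def transform(self, rows: List[Row]) -> List[Row]:
        raise NotImplementedError


class Estimator:
    """Learns from a 'dataframe' and produces a model (itself a Transformer)."""
    def fit(self, rows: List[Row]) -> Transformer:
        raise NotImplementedError


class AddRatio(Transformer):
    """Adds a 'ratio' column computed from two existing columns."""
    def transform(self, rows):
        return [{**r, "ratio": r["distance"] / r["duration"]} for r in rows]


class MeanModel(Transformer):
    """Model that predicts the mean label seen during training."""
    def __init__(self, mean: float):
        self.mean = mean

    def transform(self, rows):
        return [{**r, "prediction": self.mean} for r in rows]


class MeanEstimator(Estimator):
    def fit(self, rows):
        return MeanModel(sum(r["label"] for r in rows) / len(rows))


class PipelineModel(Transformer):
    """Fitted pipeline: applies each fitted stage in order."""
    def __init__(self, stages: List[Transformer]):
        self.stages = stages

    def transform(self, rows):
        for stage in self.stages:
            rows = stage.transform(rows)
        return rows


class Pipeline(Estimator):
    """Workflow definition: an ordered list of transformers and estimators."""
    def __init__(self, stages):
        self.stages = stages

    def fit(self, rows):
        fitted = []
        for stage in self.stages:
            if isinstance(stage, Estimator):
                stage = stage.fit(rows)  # estimator -> model
            fitted.append(stage)
            rows = stage.transform(rows)  # feed the next stage
        return PipelineModel(fitted)


training = [
    {"distance": 10.0, "duration": 5.0, "label": 12.0},
    {"distance": 6.0, "duration": 2.0, "label": 8.0},
]
model = Pipeline([AddRatio(), MeanEstimator()]).fit(training)
predictions = model.transform([{"distance": 9.0, "duration": 3.0}])
print(predictions)
```

Fitting the pipeline yields a pipeline model, which is itself a transformer, so the same object can be used to score new data.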
Three options for saving a trained model are MLWriter, MLeap, and Databricks ML Model Export. If you’re going to deploy the model to another Spark system, then MLWriter is the best choice. If you’re going to deploy the model to a non-Spark system, then MLeap is the recommended option. You can implement a prediction service by running an MLeap Bundle on any system that has the MLeap Runtime installed. Another good option is to deploy your trained model using Azure Machine Learning Services.
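For the Spark-to-Spark case, saving with MLWriter might look like the sketch below. It assumes `model` is a fitted `pyspark.ml` `PipelineModel` and that the path is somewhere Spark can write, such as a DBFS location. The two function names are hypothetical wrappers, and the code needs a Spark environment such as a Databricks cluster to do real work.

```python
def save_pipeline_model(model, path: str) -> None:
    """Persist a fitted PipelineModel using its MLWriter.

    `model` is assumed to be a fitted pyspark.ml PipelineModel, and
    `path` any location Spark can write to (for example a DBFS path).
    """
    # write() returns an MLWriter; overwrite() replaces files already
    # at the path instead of failing; save() writes the model.
    model.write().overwrite().save(path)


def load_pipeline_model(path: str):
    """Reload the saved model on a system that has Spark available."""
    # Imported inside the function so that this file can be imported
    # on a machine without PySpark installed.
    from pyspark.ml import PipelineModel

    return PipelineModel.load(path)
```

A call such as `save_pipeline_model(model, "dbfs:/tmp/example-model")` (a hypothetical path) on the training cluster, followed by `load_pipeline_model` with the same path on another Spark system, would complete the round trip.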
Now you know how to create a Databricks workspace, cluster, and notebook; run code in a Databricks notebook either interactively or as a job; train a machine learning model using Databricks; and deploy a Databricks-trained machine learning model as a prediction service.
To learn more about Azure Databricks, you can read Microsoft’s documentation. Also watch for new Microsoft Azure courses on Cloud Academy, because we’re always publishing new courses. Please give this course a rating, and if you have any questions or comments, please let us know. Thanks and have fun with Azure Databricks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).