Running Spark on Azure Databricks

Intermediate

Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s biggest advantages is that it can keep intermediate data in memory, which can be orders of magnitude faster than MapReduce’s approach of writing intermediate results to disk between steps.

In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.

Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning.

In this lesson, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule.
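As a rough illustration of what "running a Spark job on a schedule" involves, the sketch below builds the kind of request body that the Databricks Jobs API accepts for a scheduled notebook job. It only constructs and prints the JSON; it does not call any service. The lesson itself may well set up the schedule through the workspace UI instead, and the job name, notebook path, cluster ID, and the helper function `build_scheduled_job` are all hypothetical values chosen for illustration.

```python
import json


def build_scheduled_job(job_name, notebook_path, cluster_id, cron, timezone="UTC"):
    """Build a request body shaped like a Databricks Jobs API create-job call.

    Hypothetical helper for illustration; it returns a plain dict and
    never contacts a workspace.
    """
    return {
        "name": job_name,
        "tasks": [
            {
                "task_key": "run_notebook",
                # Run on an existing cluster rather than creating a new one.
                "existing_cluster_id": cluster_id,
                "notebook_task": {"notebook_path": notebook_path},
            }
        ],
        "schedule": {
            # Quartz cron fields: second minute hour day-of-month month day-of-week
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
            "pause_status": "UNPAUSED",
        },
    }


# All values below are placeholders, not real workspace resources.
payload = build_scheduled_job(
    job_name="nightly-report",
    notebook_path="/Users/someone@example.com/my-notebook",
    cluster_id="0000-000000-example",
    cron="0 0 2 * * ?",  # 02:00 every day
)
print(json.dumps(payload, indent=2))
```

Building the payload as plain data keeps the schedule definition easy to review and version alongside the notebook, whichever way it is eventually submitted.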

Learning Objectives

Create a Databricks workspace, cluster, and notebook
Run code in a Databricks notebook either interactively or as a job

Intended Audience

People who want to use Azure Databricks to run Apache Spark for analytics

Prerequisites

Prior experience with Azure and at least one programming language

Additional Resources

The GitHub repository for this lesson is at https://github.com/cloudacademy/azure-databricks.

About the Author

Guy Hummel

Azure and Google Cloud Content Lead

Students: 236,165
Courses: 103
Learning paths: 168

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).

Covered Topics

Big Data
Machine Learning
Analytics
Artificial Intelligence
Microsoft Azure
Analytics for Azure
Artificial Intelligence for Azure
Azure Databricks