Running Spark on Azure Databricks

Overview

This course is part of the learning path Introduction to Azure Machine Learning.

Contents

Azure Databricks
  2. Overview (1m 47s)
  3. Setup (2m 59s)
  4. Notebooks (10m 34s)
  5. Jobs (3m 58s)
  8. Summary (2m 16s)
Difficulty: Intermediate
Duration: 42m
Students: 47
Rating: 5/5

Description

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. Not only does Spark handle data analytics tasks, but it also handles machine learning.

In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.

Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.

In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule. After that, we’ll show you how to train a machine learning model. Finally, we’ll go through several ways to deploy a trained model as a prediction service.

Learning Objectives

  • Create a Databricks workspace, cluster, and notebook
  • Run code in a Databricks notebook either interactively or as a job
  • Train a machine learning model using Databricks
  • Deploy a Databricks-trained machine learning model as a prediction service

Intended Audience

  • People who want to use Azure Databricks to run Apache Spark for either analytics or machine learning workloads

Prerequisites

  • Prior experience with Azure and at least one programming language

Additional Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.

Transcript

You’re probably somewhat familiar with Spark already, but I’ll give a quick overview of both Spark and Databricks and how they work together.

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. There are plenty of other differences between the two systems, as well, but we don’t need to go into the details here.

Not only does Spark handle data analytics tasks, but it also handles machine learning. It has a library called MLlib that includes a variety of pre-built algorithms, such as logistic regression, naive Bayes, and random forest. Its built-in neural network support is limited to a basic multilayer perceptron classifier. However, you can still create deep neural networks on Spark using other machine learning frameworks, such as TensorFlow.
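
To make the MLlib mention a little more concrete, here is a minimal sketch of fitting one of those pre-built algorithms (logistic regression) with PySpark. It is not taken from the course notebooks. The toy rows, the column names, and the function name train_logistic_regression are all made up for illustration, and the sketch assumes you already have a SparkSession, which a Databricks notebook normally exposes as spark.

```python
# Hypothetical toy data: (label, feature_a, feature_b). Not from the course.
TRAINING_ROWS = [
    (0.0, 1.0, 0.5),
    (0.0, 1.5, 0.2),
    (1.0, 4.0, 3.1),
    (1.0, 5.2, 2.8),
]
FEATURE_COLUMNS = ["feature_a", "feature_b"]


def train_logistic_regression(spark):
    """Fit an MLlib logistic regression model on the toy rows above.

    `spark` is an existing SparkSession (assumed to be provided by the
    notebook environment).
    """
    # Imported inside the function so this file can be loaded on a machine
    # that does not have PySpark installed.
    from pyspark.ml.classification import LogisticRegression
    from pyspark.ml.feature import VectorAssembler

    df = spark.createDataFrame(TRAINING_ROWS, ["label"] + FEATURE_COLUMNS)

    # MLlib estimators expect the features packed into one vector column.
    assembler = VectorAssembler(inputCols=FEATURE_COLUMNS, outputCol="features")
    train_df = assembler.transform(df)

    # Configure the estimator, then fit it to get a trained model.
    lr = LogisticRegression(featuresCol="features", labelCol="label", maxIter=10)
    return lr.fit(train_df)
```

In a notebook you would call model = train_logistic_regression(spark) and then inspect model.coefficients or call model.transform on new data that has been run through the same assembler. Four rows are far too few for a meaningful model; the point is only the shape of the API.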

In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s basically a managed implementation of Spark in the cloud, so you don’t have to worry about building clusters yourself. It also has a user-friendly interface for running code on clusters interactively.

Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services, which is something I’ll show you later in this course.

Alright, now let’s get Databricks set up so we can try out some of these cool features. If you’re ready, then I’ll see you in the next lesson.

About the Author

Students: 14498
Courses: 41
Learning paths: 21

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).