Running Apache Spark on Azure Databricks

In this article, we’ll cover how to set up an Azure Databricks cluster and how to run queries in an interactive notebook. However, this article only scratches the surface of what you can do with Azure Databricks. If you would like to learn more, including how to create graphs, run scheduled jobs, and train a machine learning model, then check out my complete, video-based Running Spark on Azure Databricks course on Cloud Academy.

Watch this short video taken from the course to get an idea of what you’ll learn.

Apache Spark and Azure Databricks

Apache Spark is an open-source framework for doing big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it does in-memory processing, which can be orders of magnitude faster than the disk-based processing that MapReduce uses. There are plenty of other differences between the two systems, as well, but we don’t need to go into the details here.

Not only does Apache Spark handle data analytics tasks, but it also handles machine learning. It has a library called MLlib that includes a variety of pre-built algorithms, such as logistic regression, naive Bayes, and random forest. At the moment, it doesn’t include neural networks. However, you can still create neural networks on Spark using other machine learning frameworks, such as TensorFlow.

In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s basically a managed implementation of Apache Spark in the cloud, so you don’t have to worry about building clusters yourself. It also has a user-friendly interface for running code on clusters interactively.

Microsoft has partnered with Databricks to bring their product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.

Setup

Now let’s see how to set up an Azure Databricks environment. You need to perform two tasks:

  1. Create a Databricks workspace
  2. Spin up a compute cluster

In the Azure portal, search for databricks. When it comes up, click on it.

Search for Azure Databricks

Then click Add.

Create Azure Databricks Service for Apache Spark

The Workspace name can be anything; it doesn’t have to be globally unique. Let’s call it course. Then either create a new resource group to put it in or use an existing one. For the pricing tier, choose either Trial or Standard. The Trial tier is free for 14 days.

When Azure is finished creating the workspace, click on it. Then, when you click the Launch Workspace button, it will take you to the Databricks portal, which is separate from the Azure portal. Alright, now we can create a cluster. Click Create Cluster. Then you’ll see this screen.

Create Apache Spark Cluster

You can give the cluster any name you want. Let’s call it spark. The Cluster Mode can be either Standard or High Concurrency. We’re only going to run one job at a time, so leave it on Standard.

For the Databricks Runtime Version, you can leave the default (the exact version you see may be different). You can also leave the Python version at its default.

Make sure the Terminate after __ minutes of inactivity box is checked. It can be expensive to run a cluster, so you’ll want it to shut down automatically after it’s been idle for a while. The default is 120 minutes, but you can change it to something lower, like 60, so the cluster shuts down after an hour of inactivity instead of two.

Under Worker Type, you can see that there are lots of options for what kind of virtual machines to put in the cluster. Leave it on the default type. You’ll notice that the cluster will always have a minimum of two workers and can autoscale up to a maximum of eight workers.

OK, now click Create Cluster. It will take a little while to finish.

Running queries

Once your cluster is ready, you can execute code on it. You can do that by using a notebook. If you’ve ever used a Jupyter notebook before, then a Databricks notebook will look very familiar.

Let’s create one so you can see what I mean. The notebook will reside in a workspace, so click Workspace, open the dropdown menu, go into the Create menu, and select Notebook.

Create Azure Databricks Notebook

Let’s call it test. For the language, you can choose Python, Scala, SQL, or R. We’re going to run some simple queries, so select SQL.

Create Apache Spark Notebook

A notebook is a document where you can enter some code, run it, and the results will be shown in the notebook. It’s perfect for data exploration and experimentation because you can go back and see all of the things you tried and what the results were in each case. It’s essentially an interactive document that contains live code. You can even run some of the code again if you want.

Alright, let’s run a query. Since we haven’t uploaded any data, you might be wondering what we’re going to run a query on. Well, there’s actually lots of data we can query even without uploading any of it. Azure Databricks is integrated with many other Azure services, including SQL Database, Data Lake Storage, Blob Storage, Cosmos DB, Event Hubs, and SQL Data Warehouse, so you can access data in any of those using the appropriate connector. However, we don’t even need to do that because Databricks also includes some sample datasets.

To see which datasets are available, you can run a command in a cell. There’s one catch, though. When we created this notebook, we selected SQL as the language, so whatever we type in a cell will be interpreted as SQL. The exception is if you start the cell with a percent sign and the name of another language. For example, if you wanted to run some Python code in this SQL notebook, you would start the cell with %python and it would be interpreted properly.
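
For example, here’s what a minimal Python cell might look like in this SQL notebook. The code itself is just a placeholder to illustrate the magic command, not part of the walkthrough:

%python
# The %python line above tells Databricks to interpret this cell as Python,
# even though the notebook's default language is SQL.
print("Hello from a Python cell")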

Similarly, if you want to run a filesystem command, you just need to start it with %fs. To see what’s in the filesystem for this workspace, type:

%fs ls

The ls stands for list and will be familiar if you’ve used Linux or Unix.

To execute the command, you can either press Shift+Enter or select Run cell from the cell’s menu. I recommend Shift+Enter because not only is it faster than going to the menu, but it also automatically adds a new cell below so you can type your next command.

Cloud Academy Running Spark on Azure Databricks

You’ll notice that all of the paths start with dbfs. That’s the Databricks File System (DBFS), a distributed filesystem installed on the cluster. You don’t have to worry about losing data when you shut down the cluster, though, because DBFS is backed by Blob Storage.

The sample datasets are in the databricks-datasets folder. To list them, type:

%fs ls databricks-datasets

The one we’re going to use shows what the prices were for various personal computers in the mid-1990s. Use this command to see what’s in it: 

%fs head --maxBytes=1000 dbfs:/databricks-datasets/Rdatasets/data-001/csv/Ecdat/Computers.csv

SQL Query Code

The head command shows the first part of a file, up to the number of bytes you specify with maxBytes, which is 1,000 bytes in this case. If you don’t specify maxBytes, it defaults to about 65,000 bytes.

The first line contains the header, which shows what’s in each column, such as the price of the computer, its processor speed, and the size of its hard drive, RAM, and screen.

To run a query on this data, we need to load it into a table. A Databricks table is just an Apache Spark DataFrame, if you’re familiar with Spark. You can also think of it as being like a table in a relational database.

To load the CSV file into a table, run these commands:

DROP TABLE IF EXISTS computers;
CREATE TABLE computers
USING csv
OPTIONS (path "/databricks-datasets/Rdatasets/data-001/csv/Ecdat/Computers.csv", header "true", inferSchema "true")

The first command checks whether a table named computers already exists and, if it does, drops (deletes) it. Strictly speaking, you don’t need it yet because you haven’t created any tables, but it’s a good habit: if you run this cell again later, the table will already exist, and the CREATE TABLE statement will fail unless you drop the table first.

The second command creates the table. By setting header to true, we tell Spark that the first line of the file contains column names, so it will name the columns for us and we won’t have to do that ourselves. The inferSchema option is even more useful: it figures out the data type of each column, so we don’t have to specify that ourselves either.
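
If you’re curious about what inferSchema came up with, one quick way to check (a small extra step, not something the article requires) is to describe the table in a new cell:

DESCRIBE computers

This lists each column along with the data type that was inferred from the file.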

To see what’s in the table, run a SQL query. The simplest command is:

select * from computers

If this were a really big table, then you might not want to run a select * on it since that reads in the entire table, but it’s okay in this case.
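
On a genuinely large table, you’d usually narrow the query instead. As a rough sketch, and assuming the price and ram columns are named the way they appear in the file’s header, you could aggregate rather than pull back every row:

select ram, avg(price) as avg_price
from computers
group by ram
order by ram

For our small table, though, the plain select * is fine.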

SQL Query Results

This is the same data we saw when we ran the head command, but now it’s in a nicely formatted table.
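
And because a Databricks table is just a Spark DataFrame, you can also reach the same data from Python by combining the %python magic with the built-in spark session. Here’s a minimal sketch, not part of the original walkthrough:

%python
# Load the table we created with SQL as a DataFrame.
df = spark.table("computers")
# Show how many rows it has and the schema that was inferred.
print(df.count())
df.printSchema()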

Learn more

Check out the full video-based Running Spark on Azure Databricks course on Cloud Academy.

Apache Spark Course


Written by Guy Hummel

Guy is a certified cloud architect on all three of the major public cloud platforms: AWS, Azure, and Google Cloud Platform. He launched his first training website in 1995, and he’s been helping people learn IT technologies ever since. Guy’s passion is making complex technology easy to understand.

