Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it processes data in memory, which can be orders of magnitude faster than the disk-based processing that MapReduce uses.
In 2013, the creators of Spark started a company called Databricks. The name of their product is also Databricks. It’s a cloud-based implementation of Spark with a user-friendly interface for running code on clusters interactively.
Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for analytics
Prerequisites
- Prior experience with Azure and at least one programming language
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
Before we can run Spark, we need to spin up a compute cluster, and before we can spin up a compute cluster, we need to create a Databricks workspace.
In the Azure portal, search for “databricks”. When it comes up, click on it. Then click “Add”.
The Workspace name can be anything. It doesn’t have to be globally unique. Let’s call it “course”. Then, either create a new resource group to put it in or use an existing one. I’ll use an existing one. For the pricing tier, choose either Trial or Standard. The Trial tier is free for 14 days. I’ve used up my trial, so I’ll choose Standard.
It takes a few minutes to create the workspace, so I’ll fast forward. You might need to click “Refresh” to see it. OK, it’s done. Now click on it. When you click the “Launch Workspace” button, it will take you to the Databricks portal, which is separate from the Azure portal.
Alright, now we can create a cluster. Click on “Create Cluster”.
After this demo was recorded, Databricks changed the default user interface to this UI that’s currently in preview. If you’re following along on your own account, you can get to the user interface that will be shown in this demo by clicking “UI preview” and deselecting “New UI is enabled”. Okay, now back to the demo.
Then you have to click “Create a cluster” again. You can call it anything. I’ll call it “spark”. The Cluster Mode can be either Standard or High Concurrency. We’re only going to run one job at a time, so leave it on Standard.
For the Databricks Runtime Version, you can leave it on the default, which might differ from the version shown here. You can also leave the Python version on the default.
Make sure the automatic termination box is checked. It can be expensive to run a cluster, so you’ll want to automatically shut the cluster down if it’s been inactive for a while. The default is 120 minutes, but I’m going to change it to 60, so it will shut down after being idle for an hour.
Under Worker Type, you can see that there are lots of options for what kind of virtual machines to put in the cluster. We’ll leave it on the default type. You’ll notice that the cluster will always have a minimum of 2 workers and can autoscale up to a maximum of 8 workers.
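If you’d rather keep these settings in code than click through the form, here’s a minimal sketch of the same cluster definition as a Python dictionary. The field names follow the Databricks Clusters REST API as I understand it, so check them against the current documentation before relying on them. The helper name build_cluster_spec and the two placeholder values are just for illustration. They aren’t from the demo, and you’d need to fill in a real runtime version and worker VM type from your own workspace.

```python
import json


def build_cluster_spec(spark_version: str, node_type_id: str) -> dict:
    """Return a cluster definition that mirrors the settings chosen in the demo.

    build_cluster_spec is a hypothetical helper, not part of any Databricks
    library. It only builds a dictionary and does not contact Databricks.
    """
    return {
        "cluster_name": "spark",                 # the name used in the demo
        "spark_version": spark_version,          # Databricks Runtime Version
        "node_type_id": node_type_id,            # Worker Type (VM size)
        "autoscale": {
            "min_workers": 2,                    # minimum number of workers
            "max_workers": 8,                    # autoscaling upper limit
        },
        "autotermination_minutes": 60,           # shut down after an idle hour
    }


# Placeholder values (assumptions): replace them with a runtime version and
# VM type that are actually offered in your workspace.
spec = build_cluster_spec(
    spark_version="<runtime-version>",
    node_type_id="<worker-vm-type>",
)
print(json.dumps(spec, indent=2))
```

This only prints the definition as JSON. Sending it to your workspace would be a separate step that needs your workspace URL and an access token, which this sketch leaves out.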
OK, now click “Create Cluster”. This will take a little while, too. As soon as it’s done, you should go to the next video and run some code on the cluster before it shuts down due to inactivity. See you there.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).