Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it processes data in memory, which can be orders of magnitude faster than the disk-based processing that MapReduce uses.
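To make the in-memory point concrete, here’s a minimal PySpark sketch (the input path is hypothetical): caching a DataFrame keeps it in cluster memory, so repeated actions skip the disk read.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cache-demo").getOrCreate()

# Hypothetical input path -- any sizable CSV will do for the demo.
df = spark.read.csv("/data/events.csv", header=True, inferSchema=True)

# cache() keeps the DataFrame in cluster memory after the first action,
# so later actions skip the disk read. MapReduce, by contrast, writes
# intermediate results back to disk between stages.
df.cache()

print(df.count())  # first action: reads from disk, populates the cache
print(df.count())  # second action: served from memory
```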
In 2013, the creators of Spark started a company called Databricks. Their product, also called Databricks, is a cloud-based implementation of Spark with a user-friendly interface for running code interactively on clusters.
Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
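As a rough sketch of what that hand-off can look like, here’s a minimal example using the azureml-core SDK to register a model file produced by a Databricks training run; the workspace config, file path, and model name are assumptions for illustration, not something from this course.

```python
from azureml.core import Workspace
from azureml.core.model import Model

# Assumes a config.json for your Azure ML workspace is in the working
# directory (downloadable from the Azure portal).
ws = Workspace.from_config()

# Register a model file exported from a Spark training run on Databricks.
# The path and name here are hypothetical.
model = Model.register(workspace=ws,
                       model_path="exported/model.pkl",
                       model_name="spark-trained-model")
print(model.name, model.version)
```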
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for analytics
Prerequisites
- Prior experience with Azure and at least one programming language
Additional Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
Notebooks are great for experimentation, but what if you’ve put together a solid workflow, and you want to run it on a regular schedule? The answer is to create a job.
A job is simply a way of running an entire notebook non-interactively. The advantage of running a notebook as a job rather than running it manually is that you can schedule when it will run, and Databricks keeps a record of previous runs.
I’ll show you an example of how to create and run a job. We’ll use the notebook that we created in the last lesson. Click the Jobs icon on the left, then click “Create Job”. Let’s call it “test”. Click “Select notebook” and choose “test”, which is the notebook we created in the last lesson. We don’t need to set any job parameters or dependent libraries, but this is where you’d set them if a particular job needed them.
Now we’ll tell it which cluster to use. You can choose either an existing cluster or a new one. I already have an existing cluster, so it would seem logical to choose that one. However, running a job on an existing cluster is actually more expensive than running it on a new one. That seems strange, doesn’t it? Well, it’s because when you run a job on an existing cluster, Databricks considers it an interactive workload, so you’re charged the interactive price. If you create a new cluster for the job, and that cluster is only active while the job is running, you’re charged the automated workload price, which is lower.
So let’s tell it to use a new cluster. Check the autoscaling option so the cluster starts with just 2 workers and only scales up toward 8 if the workload requires it. We can leave everything else at the defaults. Click the “Confirm” button.
Alright, now we can tell it when to run the job. Click “Edit” next to “Schedule”. The scheduler is pretty simple: you can run the job every hour, every 2 hours, and so on, or every week, every 2 weeks, and so on. Let’s say we want to run it every day at 10 PM. You have to use 24-hour time, so 10 PM is 22:00. You can also set your local timezone. Now click “Confirm”, and that’s all you need to do. The job is set to run every day at 10 PM.
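Incidentally, everything we just did in the UI can also be scripted. Here’s a minimal sketch using the Databricks Jobs REST API; the workspace URL, token, notebook path, Spark version, and node type below are placeholder assumptions, so substitute the real values from your workspace.

```python
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-your-personal-access-token"                    # placeholder

payload = {
    "name": "test",
    "notebook_task": {"notebook_path": "/Users/you@example.com/test"},
    "new_cluster": {
        "spark_version": "5.3.x-scala2.11",   # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",    # placeholder VM size
        "autoscale": {"min_workers": 2, "max_workers": 8},
    },
    # Quartz cron fields: seconds, minutes, hours, day-of-month, month,
    # day-of-week. "0 0 22 * * ?" means every day at 22:00.
    "schedule": {
        "quartz_cron_expression": "0 0 22 * * ?",
        "timezone_id": "America/New_York",
    },
}

resp = requests.post(f"{HOST}/api/2.0/jobs/create",
                     headers={"Authorization": f"Bearer {TOKEN}"},
                     json=payload)
resp.raise_for_status()
print(resp.json())  # {'job_id': <id of the new job>}
```

Note that the payload asks for a new cluster rather than an existing one, which keeps the job billed at the automated workload rate discussed above.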
You’ll notice that there’s also an option to run the job right now. Let’s do that so you can see what happens when a job runs. It shows up under “Active runs”. You can click on the run to get more details. This updates every 6 or 7 seconds, so you don’t need to refresh the page yourself. It’s going to take a while for the cluster to spin up, so I’ll fast forward.
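The “Run Now” button has an API equivalent as well. Here’s a hedged sketch, reusing the placeholder host and token from above, that triggers a run and then polls its status, much like watching “Active runs” in the UI:

```python
import time
import requests

HOST = "https://adb-1234567890123456.7.azuredatabricks.net"  # placeholder
TOKEN = "dapi-your-personal-access-token"                    # placeholder
headers = {"Authorization": f"Bearer {TOKEN}"}

# Trigger the job immediately (the job_id comes from jobs/create above).
run = requests.post(f"{HOST}/api/2.0/jobs/run-now",
                    headers=headers, json={"job_id": 42}).json()

# Poll until the run finishes -- the UI refreshes this view for you.
while True:
    status = requests.get(f"{HOST}/api/2.0/jobs/runs/get",
                          headers=headers,
                          params={"run_id": run["run_id"]}).json()
    state = status["state"]
    if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
        print(state.get("result_state"))  # e.g. SUCCESS
        break
    time.sleep(10)
```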
OK, the job’s done and here’s the output. It’s the same as what we saw when we ran the code manually in the notebook. Of course, this isn’t the sort of notebook that you’d want to run on a schedule, but this example was just to show you how jobs work.
And that’s it for running jobs.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).