Apache Spark is an open-source framework for big data processing. It was developed as a replacement for Apache Hadoop’s MapReduce framework. Both Spark and MapReduce process data on compute clusters, but one of Spark’s big advantages is that it performs in-memory processing, which can be orders of magnitude faster than MapReduce’s disk-based processing.
In 2013, the creators of Spark founded a company called Databricks. Its flagship product, also called Databricks, is a cloud-based implementation of Spark with a user-friendly interface for running code interactively on clusters.
Microsoft has partnered with Databricks to bring its product to the Azure platform. The result is a service called Azure Databricks. One of the biggest advantages of using the Azure version of Databricks is that it’s integrated with other Azure services. For example, you can train a machine learning model on a Databricks cluster and then deploy it using Azure Machine Learning Services.
In this course, we will start by showing you how to set up a Databricks workspace and a cluster. Next, we’ll go through the basics of how to use a notebook to run interactive queries on a dataset. Then you’ll see how to run a Spark job on a schedule.
Learning Objectives
- Create a Databricks workspace, cluster, and notebook
- Run code in a Databricks notebook either interactively or as a job
Intended Audience
- People who want to use Azure Databricks to run Apache Spark for analytics
Prerequisites
- Prior experience with Azure and at least one programming language
Additional Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-databricks.
All right, the cluster’s running. Remember how we configured it to shut down if it’s inactive for 120 minutes? Well, even if you hadn’t used this cluster for over 2 hours, its configuration would still exist, so you could start it up again.
Databricks saves the configuration of a terminated cluster for 30 days if you don’t delete the cluster. If you want it to save the configuration for more than 30 days, then all you have to do is click this pin. A pinned cluster can’t be deleted.
OK, now that you have a cluster running, you can execute code on it. You can do that by using a notebook. If you’ve ever used a Jupyter notebook before, then a Databricks notebook will look very familiar.
Let’s create one so you can see what I mean. The notebook will reside in a workspace, so click “Workspace”, open the dropdown menu, go into the Create menu, and select “Notebook”. Let’s call it “test”. For the language, you can choose Python, Scala, SQL, or R. We’re going to run some simple queries, so select “SQL”.
A notebook is a document where you can enter code, run it, and see the results right in the notebook. It’s perfect for data exploration and experimentation because you can go back and see everything you tried and what the results were in each case. It’s essentially an interactive document containing live code, and you can re-run any of that code whenever you want.
Alright, let’s run a query. Since we haven’t uploaded any data, you might be wondering what we’re going to run a query on. Well, there’s actually lots of data we can query even without uploading any of it. Azure Databricks is integrated with many other Azure services, including SQL Database, Data Lake Storage, Blob Storage, Cosmos DB, Event Hubs, and SQL Data Warehouse, so you can access data in any of those using the appropriate connector. However, we don’t even need to do that because Databricks also includes some sample datasets.
To see which datasets are available, you can run a command in this cell. There’s one catch, though. When we created this notebook, we selected SQL as the language, so whatever we type in a cell will be interpreted as SQL. The exception is if you start the command with a percent sign and the name of another language. For example, if you wanted to run some Python code in a SQL notebook, you would start the cell with “%python”, and it would be interpreted properly.
Similarly, if you want to run a filesystem command, you just need to start it with “%fs”. To see what’s in the filesystem for this workspace, type “%fs ls”. The “ls” stands for “list” and will be familiar if you’ve used Linux or Unix.
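For example, three cells in this SQL notebook could look something like the following sketch (the Python line is just a placeholder, not part of the course code):

    -- A plain cell is interpreted as SQL, the notebook's default language
    SELECT 'hello' AS greeting;

    %python
    # The %python magic makes this one cell run as Python instead
    print("hello from Python")

    %fs ls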
To execute the command, you can either hit Shift-Enter, or you can select “Run cell” from this menu. I recommend using Shift-Enter because not only is that faster than going to the menu, but it also automatically brings up another cell for you so you can type another command.
You’ll notice that all of the paths start with “dbfs”. That stands for “Databricks File System”, a distributed filesystem that’s available on the cluster. You don’t have to worry about losing data when you shut down the cluster, though, because DBFS is backed by Blob Storage.
The sample datasets are in the databricks-datasets folder. To list them, type “%fs ls databricks-datasets”. I’ve created a GitHub repository with a readme file that contains all of the commands and code in this course, so you can copy and paste from there. The link to the repository is in the Additional Resources section of the course overview.
To scroll through the list, click on the table first. Then the scroll bar will appear. There are lots of sample datasets, and they cover a wide variety of areas. For example, there’s one for credit card fraud, one for genomics, and one for songs.
Most of these folders don’t have very many datasets in them, but that’s not the case with the Rdatasets folder. It has over 700 datasets in it! I have to say it’s a pretty bizarre list of datasets, though. Some of them sound like made-up titles, such as “Prussian army horse kick data”, and some come from weirdly obscure experiments, such as “The Effect of Vitamin C on Tooth Growth in Guinea Pigs”, but my absolute favorite, which sounds like it comes from a mad scientist’s lab, is “Electrical Resistance of Kiwi Fruit”.
The one we’re going to use is pretty normal in comparison, although it’s still a bit odd. It shows what the prices were for various personal computers in the mid-1990s. Use the command below to see what’s in it. The “head” command shows the beginning of a file, up to the maxBytes you specify, which is 1,000 bytes in this case. If you don’t specify maxBytes, it defaults to 65,536 bytes.
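Here’s a sketch of that command (the flag syntax and the exact path are my reconstruction of what’s on screen, based on the usual layout of the databricks-datasets folder; the course repo has the exact version):

    %fs head --maxBytes=1000 dbfs:/databricks-datasets/Rdatasets/data-001/csv/Ecdat/Computers.csv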
The first line contains the header, which shows what’s in each column, such as the price of the computer, its processor speed, and the size of its hard drive, RAM, and screen. Suppose we wanted to create a graph showing the average price of these 90s computers for each of the different memory sizes.
To run a query on this data, we need to load it into a table. If you’re familiar with Spark, a Databricks table is essentially a Spark DataFrame that’s been registered so you can query it by name. You can also think of it as being like a table in a relational database.
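As a side note, here’s a hypothetical sketch of how you could pull a registered table into a DataFrame from a Python cell once the table exists (we haven’t created the “computers” table yet, and this isn’t part of the course code):

    %python
    # Load the registered table as a Spark DataFrame and display it
    df = spark.table("computers")
    display(df)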
To load the csv file into a table, run the two commands sketched below. The first one checks whether a table named “computers” already exists, and if it does, it drops (deletes) it. You don’t strictly have to do this, since you haven’t created any tables yet, but it’s a good idea. Why? Because if you ever re-run the code in this cell, the table would already exist, and you’d get an error if you didn’t drop it first.
The second command creates the table. Note that it tells Databricks there’s a header row in the file. By setting the header option to true, the header row will be used to name the columns for us, so we won’t have to do that ourselves. The “inferSchema” option is even more useful: it figures out the data type of each column, so we don’t have to specify that ourselves either.
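Both commands look something like this (the dataset path is an assumption based on the standard databricks-datasets layout; the exact commands are in the course repo):

    -- Drop the table if it already exists so this cell can be re-run safely
    DROP TABLE IF EXISTS computers;

    -- Create a table from the csv file, using the header row for column names
    -- and letting Spark infer each column's data type
    CREATE TABLE computers
    USING csv
    OPTIONS (
      path "dbfs:/databricks-datasets/Rdatasets/data-001/csv/Ecdat/Computers.csv",
      header "true",
      inferSchema "true"
    );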
Alright, now there are a couple of ways to see what’s in the table. One way is to click “Data” in the menu on the left, and then select the table. First, it shows you the schema. It labeled all of the columns according to the header line in the csv file. Notice, though, that the first column is called “_c0”. That’s because the header didn’t have a label for that column. The first column is just the record number, and we probably won’t need to refer to it in any queries, so it doesn’t matter that it has a generic name. If there hadn’t been a header row in the csv file (or if we hadn’t set the header option to true), then all of the columns would have names like this, which would make it more difficult to write queries.
To the right of the column name, it shows the data type. In this case, it figured out that most of the columns are integers. If we didn’t use the “inferSchema” option, and we didn’t specify the data type for each column, then it would set them all to “string”. Even worse, there’s no way to change the data types after you’ve created the table, so every time you needed to perform an operation on a numeric column, you’d have to cast it as the right data type. By using the “inferSchema” option, we don’t have to worry about any of that.
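For example, without “inferSchema”, a query on the price column would need something like this hypothetical cast:

    -- If price had been read in as a string, every numeric operation
    -- would need an explicit cast like this
    SELECT AVG(CAST(price AS DOUBLE)) AS avg_price FROM computers;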
Under the schema, it shows a sample of the data in the table. This is the same data we saw when we ran the head command, but now it’s in a nicely formatted table.
While we’re here, I should point out that you can also create a table from this UI, which is a nice option because you can just point and click instead of having to write code. Unfortunately, you can’t get to the folder with the sample datasets in it from here, so we had to load in the Computers dataset using code.
To get back to the notebook, click on Workspace and select the notebook. It puts us back at the top of the notebook, which would be kind of annoying if this were a long notebook and we were doing something in the middle of it.
Another way to see what’s in the table is to run a SQL query. The simplest command would be “select * from computers”. If this were a really big table, then you might not want to run a “select *” on it since that reads in the entire table.
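On a big table, you could preview just a sample instead, something like this:

    -- Look at a handful of rows without pulling back the whole table
    SELECT * FROM computers LIMIT 10;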
OK, so it shows a table just like what we saw in the Data UI. It also includes some controls at the bottom for displaying the results. It defaults to the first one, which is “raw table”, but you can also display it as a graph using this control. It’s graphing something that’s not very useful, though, so we need to click on plot options to tell it what we want to see.
Remember that we wanted to create a graph showing the average price of the computers for each of the different memory sizes. Get rid of the keys that it put in by default, and add “ram”, which is what we want to see. The preview of the graph changes whenever we change anything on the left, which is really useful. Then, for values, get rid of “Trend”, and put “price” in there.
We need to change the aggregation, down here, because it’s set to sum right now. We need to set it to average because we want to graph the average price of these computers. The preview is looking good, so click “Apply”.
Great, it worked. So a PC with 32 gig of memory used to cost over $3,500. That’s pretty expensive, but 32 gig of memory is a lot, so that doesn’t seem right for a 90s computer, does it? Well, actually, it’s 32 meg of memory. You’ve gotta love Moore’s Law.
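By the way, you could also compute the same numbers directly in SQL instead of relying on the plot’s aggregation. A query along these lines would do it:

    -- Average price for each memory size, equivalent to the plotted aggregation
    SELECT ram, AVG(price) AS avg_price
    FROM computers
    GROUP BY ram
    ORDER BY ram;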
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).