Working With Azure Databricks
Monitoring and Optimization
Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
One of the most effective ways to do big data processing on Azure is to store your data in ADLS and then process it using Apache Spark (a faster successor to Hadoop's MapReduce engine) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- A Microsoft Azure account is recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
All right, now we finally have everything set up, so we can have a look at the data. I’m going to use SQL to read it. Copy and paste this SQL code. Notice that it starts with %sql so Spark knows this is SQL instead of Python.
This code reads data from the radio.json file we uploaded earlier into a table called radio. First, it checks if a table called radio already exists, and if it does, then it drops the table. Of course, we know that it doesn’t exist because we haven’t created it yet, but if we were to run this cell a second time, then it would exist, so it’s always a good idea to start by dropping the table. Notice that we’re able to use this simple pathname here because we mounted the filesystem in the last lesson. If we hadn’t done that, we’d have to use the long URL that starts with abfss.
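As a rough sketch, the cell looks something like this (the table name comes from the lesson; the `/mnt/adls` mount point is an assumption — substitute whatever mount point you created in the previous lesson):

```sql
%sql
-- Drop the table if a previous run created it, so the cell is rerunnable
DROP TABLE IF EXISTS radio;

-- Create a table backed by the JSON file in the mounted filesystem.
-- Without the mount, you'd need the full abfss:// URL here instead.
CREATE TABLE radio
USING json
OPTIONS (path "/mnt/adls/radio.json");
```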
Next, we’ll do a SELECT * to see all of the data. This is a really small dataset, so there aren’t very many rows to show. These records show details about a sample of users of an online radio app. The level column shows whether they have a paid subscription or not. There’s also a location column, so let’s see which locations have the most paid subscribers.
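In the notebook, that query is simply:

```sql
%sql
-- Show every row of the (small) radio table
SELECT * FROM radio;
```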
Click the chart button down here. This isn’t a very useful chart, so click Plot Options. First, remove the columns that it selected. Then drag location to the Keys box, level to the Series groupings box, and level again to the Values box. Then change Aggregation to COUNT. That looks better. Now click Apply.
Okay, have a look. Do you notice anything weird? It might not be obvious at first, but none of the locations have both free and paid users. Let’s look at the data in a different way. Go into Plot Options again and change the Display type to Pie chart. There are only four locations with paid users, and none of those locations are in the free users pie chart.
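If you'd rather skip the chart builder, the same aggregation can be expressed directly in SQL — a sketch, assuming the `location` and `level` columns described above:

```sql
%sql
-- Count users per location and subscription level, mirroring the
-- chart's Keys / Series groupings / COUNT aggregation settings
SELECT location, level, COUNT(*) AS user_count
FROM radio
GROUP BY location, level
ORDER BY user_count DESC;
```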
Okay, that was a pretty simple analysis, but now you know how you can work with ADLS data from Azure Databricks. And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).