Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage adds features of its own, including a hierarchical namespace, fine-grained security, and compatibility with Hadoop.
The most effective way to do big data processing on Azure is to store your data in ADLS and then process it using Spark (which is essentially a faster alternative to Hadoop MapReduce) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
Before you can do something useful with a filesystem, you need to get data into it. There are many different ways to do that. At the moment, there’s no way to upload data from your desktop using the portal, although this will hopefully change soon. Instead, you have to install software on your desktop. Your options include AzCopy, Azure Storage Explorer, PowerShell, or the Azure CLI.
I’m going to do something a little bit different. I’m going to use AzCopy from Cloud Shell instead of from my desktop. If you’re not familiar with it, Cloud Shell is a really handy tool. It’s a very small virtual machine that you can use to run commands.
Before you can use AzCopy, though, you need to give yourself the Storage Blob Data Contributor role or AzCopy won’t work. So go back to your filesystem in the portal. You can go directly to the storage account by typing its name in the search bar. You may not even need to type the name because it will probably be in your list of recently accessed resources.
Click Containers, then click on our datalake filesystem. Now select Access control and go to the Role assignments tab. Then click Add and select Add role assignment. Now open the Role dropdown and select Storage Blob Data Contributor. Then type your name to find your user account. Select it and click Save.
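If you prefer the command line, the same role assignment can probably be made with the Azure CLI's `az role assignment create` command. This is only a minimal sketch: the subscription ID, resource group, storage account name, and user are placeholder assumptions, not values from this course. Because of that, the script just prints the command unless you set `EXECUTE=1`.

```shell
# Sketch: assign Storage Blob Data Contributor on the datalake container.
# Every value below except CONTAINER and ROLE is a placeholder (assumption).
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"
RESOURCE_GROUP="my-resource-group"
STORAGE_ACCOUNT="mystorageaccount"
CONTAINER="datalake"
ASSIGNEE="user@example.com"
ROLE="Storage Blob Data Contributor"

# Role assignments are scoped to a resource ID; this one targets the container.
SCOPE="/subscriptions/${SUBSCRIPTION_ID}/resourceGroups/${RESOURCE_GROUP}"
SCOPE="${SCOPE}/providers/Microsoft.Storage/storageAccounts/${STORAGE_ACCOUNT}"
SCOPE="${SCOPE}/blobServices/default/containers/${CONTAINER}"

# Dry-run helper: print the command unless EXECUTE=1 is set.
EXECUTE="${EXECUTE:-0}"
run() {
  if [ "$EXECUTE" = "1" ]; then "$@"; else echo "[dry run] $*"; fi
}

run az role assignment create --assignee "$ASSIGNEE" --role "$ROLE" --scope "$SCOPE"
```

Scoping the assignment to the container rather than the whole storage account mirrors what the portal steps do when you open Access control from inside the filesystem.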
Now we can go into Cloud Shell to run AzCopy. To start it, click on the Cloud Shell icon next to the search bar.
The first time you run Cloud Shell, it will ask for your permission to create a storage account that the Cloud Shell VM will use. This isn’t the first time I’ve used Cloud Shell, which is why I didn’t get that dialog box. If you do get the dialog box, then click the option to create a storage account.
Cloud Shell supports both PowerShell and the Bash shell. You can switch between them using this menu. To use AzCopy, select the Bash option. I’ll maximize this window so it’s easier to see everything.
First, let’s get a file to upload. You can get one from the GitHub repository for this course with this wget command.
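The exact command appears on screen in the video and is not reproduced in this transcript, so here is a rough sketch of what it might look like. The repository name comes from the course's GitHub URL, but `BRANCH` and `FILE` are placeholders you would need to fill in yourself. The script only prints the command unless you set `EXECUTE=1`.

```shell
# Sketch: build the raw-file URL for the course repository and fetch it.
# BRANCH and FILE are placeholders (assumptions); the transcript does not name them.
REPO="cloudacademy/azure-data-lake-gen2"
BRANCH="<branch>"
FILE="<file-name>"
RAW_URL="https://raw.githubusercontent.com/${REPO}/${BRANCH}/${FILE}"

# Dry-run helper: print the command unless EXECUTE=1 is set.
EXECUTE="${EXECUTE:-0}"
run() {
  if [ "$EXECUTE" = "1" ]; then "$@"; else echo "[dry run] $*"; fi
}

# wget saves the file into the current Cloud Shell directory.
run wget "$RAW_URL"
```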
Then, we need to authenticate azcopy with the azcopy login command. It’ll come back with a URL and a code to enter to verify your device. Copy the code, then click on the URL. Then paste the code and select your account.
Now we can finally run this command to copy the file to the datalake filesystem. You’ll need to put the name of your own storage account here.
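The login and copy steps could be sketched roughly as follows. The storage account name and file name are placeholders, and the use of the `dfs.core.windows.net` endpoint is an assumption about what the on-screen command looked like. As a precaution, the script only prints the commands unless you set `EXECUTE=1`.

```shell
# Sketch: authenticate AzCopy, then upload one file to the datalake filesystem.
# STORAGE_ACCOUNT and LOCAL_FILE are placeholders (assumptions).
STORAGE_ACCOUNT="mystorageaccount"
FILESYSTEM="datalake"
LOCAL_FILE="<file-name>"

# Destination URL: account endpoint, then filesystem, then the target file name.
DEST_URL="https://${STORAGE_ACCOUNT}.dfs.core.windows.net/${FILESYSTEM}/${LOCAL_FILE}"

# Dry-run helper: print the command unless EXECUTE=1 is set.
EXECUTE="${EXECUTE:-0}"
run() {
  if [ "$EXECUTE" = "1" ]; then "$@"; else echo "[dry run] $*"; fi
}

# Device login: prints a URL and a code to enter in the browser.
run azcopy login

# Upload the local file to the filesystem.
run azcopy copy "$LOCAL_FILE" "$DEST_URL"
```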
It looks like it worked, but one way to check is to go back to the portal and go to Storage Explorer. Then go into Filesystems and click on datalake. There’s the file.
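Another way to check, without leaving Cloud Shell, might be to list the contents of the filesystem with `azcopy list`. This sketch assumes you are still logged in to AzCopy, uses a placeholder storage account name, and points at the account's Blob endpoint, which is an assumption rather than something shown in the course. It only prints the command unless you set `EXECUTE=1`.

```shell
# Sketch: list what is stored in the datalake filesystem.
# STORAGE_ACCOUNT is a placeholder (assumption).
STORAGE_ACCOUNT="mystorageaccount"
FILESYSTEM="datalake"
CONTAINER_URL="https://${STORAGE_ACCOUNT}.blob.core.windows.net/${FILESYSTEM}"

# Dry-run helper: print the command unless EXECUTE=1 is set.
EXECUTE="${EXECUTE:-0}"
run() {
  if [ "$EXECUTE" = "1" ]; then "$@"; else echo "[dry run] $*"; fi
}

# The uploaded file should appear in the listing.
run azcopy list "$CONTAINER_URL"
```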
And that’s it for ingesting data.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).