Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
One of the most effective ways to do big data processing on Azure is to store your data in ADLS and then process it using Spark on Azure Databricks. Spark is generally much faster than Hadoop MapReduce, largely because it keeps working data in memory.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
Let’s go through the process of creating a filesystem in Data Lake Storage. You can do this from the Azure portal, from PowerShell, or from the Azure CLI. We’re going to use the Azure portal, so open it in your browser.
Since Data Lake Storage Gen2 is built on top of Blob Storage, we need to create a storage account. In the menu on the left, click Storage Accounts. Then click Add.
For the resource group, you can either select an existing one or create a new one. I’ll create a new one called datalakegen2rg. The storage account name has to be unique across all of Azure, so I’ll start mine with “ca” for Cloud Academy and then datalakegen2. You’ll have to use something different.
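Besides being unique across Azure, a storage account name has to be 3 to 24 characters long and can contain only lowercase letters and digits. Here is a minimal sketch of a local check for that format. The function name `is_valid_storage_account_name` is hypothetical, and the check says nothing about whether the name is already taken.

```python
import re

# Azure storage account names: 3-24 characters, lowercase letters and digits only.
_NAME_PATTERN = re.compile(r"[a-z0-9]{3,24}")


def is_valid_storage_account_name(name: str) -> bool:
    """Return True if the name meets the format rules (uniqueness is not checked)."""
    return _NAME_PATTERN.fullmatch(name) is not None


if __name__ == "__main__":
    # "cadatalakegen2" is the name used in this walkthrough; the others are
    # illustrative names that break the format rules.
    for candidate in ["cadatalakegen2", "CA-DataLake", "ab"]:
        print(candidate, is_valid_storage_account_name(candidate))
```

A check like this only catches formatting mistakes early. Azure itself still decides whether the name is available when you create the account.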
Then choose a location near you. Make sure Account kind is set to StorageV2, also known as general-purpose v2. General-purpose v1 accounts don’t support the hierarchical namespace. You can leave the rest of the settings at their defaults.
Then go to the Advanced tab and change Hierarchical namespace to Enabled. This is the setting that turns regular Blob Storage into Data Lake Storage Gen2. Once hierarchical namespace is enabled on a storage account, you can’t turn it off. Okay, now click Review + create and Create. It’ll take a while to deploy, so I’ll fast-forward.
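The walkthrough uses the portal, but the same choices can be expressed in code. The sketch below is not the course's method. It swaps in the Azure SDK for Python (`azure-identity` and `azure-mgmt-storage`) and assumes the resource group already exists. The helper names and the `Standard_LRS` replication setting are assumptions made for illustration, since the walkthrough leaves replication at the portal default.

```python
def storage_account_settings(location: str) -> dict:
    """Collect the settings that mirror the portal choices in this walkthrough."""
    return {
        "kind": "StorageV2",         # general-purpose v2
        "sku_name": "Standard_LRS",  # assumption: replication is not specified in the walkthrough
        "location": location,
        "is_hns_enabled": True,      # Hierarchical namespace: Enabled
    }


def create_data_lake_account(subscription_id, resource_group, account_name, location):
    """Create a StorageV2 account with hierarchical namespace enabled."""
    # Imported here so the settings helper works without the Azure SDK installed.
    from azure.identity import DefaultAzureCredential
    from azure.mgmt.storage import StorageManagementClient
    from azure.mgmt.storage.models import Sku, StorageAccountCreateParameters

    settings = storage_account_settings(location)
    client = StorageManagementClient(DefaultAzureCredential(), subscription_id)
    poller = client.storage_accounts.begin_create(
        resource_group,
        account_name,
        StorageAccountCreateParameters(
            sku=Sku(name=settings["sku_name"]),
            kind=settings["kind"],
            location=settings["location"],
            is_hns_enabled=settings["is_hns_enabled"],
        ),
    )
    # Deployment takes a while; result() blocks until it finishes.
    return poller.result()
```

Calling `create_data_lake_account` requires a real subscription ID and valid Azure credentials, so treat it as a starting point rather than a tested deployment script.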
All right, my storage account has finished deploying, so I can go to it. Now, to create a filesystem, we have to click on Containers. That’s because filesystems are actually containers. Now click the Add File system button. You can call it whatever you want, but let’s call it datalake. That’s it for creating a filesystem.
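For comparison, here is a rough sketch of creating the same filesystem with the `azure-storage-file-datalake` package instead of the portal. The helper names are hypothetical, and the credential is whatever your setup provides, such as an account key or a token credential. The `abfss://` helper shows the URI form that Spark typically uses to address a path in the filesystem.

```python
def dfs_endpoint(account_name: str) -> str:
    """Data Lake Storage Gen2 endpoint for a storage account."""
    return f"https://{account_name}.dfs.core.windows.net"


def abfss_uri(filesystem: str, account_name: str, path: str = "") -> str:
    """URI for a path inside a filesystem, in the form used by the ABFS driver."""
    return f"abfss://{filesystem}@{account_name}.dfs.core.windows.net/{path.lstrip('/')}"


def create_filesystem(account_name: str, filesystem: str, credential):
    """Create a filesystem (container) in the given storage account."""
    # Imported here so the helpers above work without the Azure SDK installed.
    from azure.storage.filedatalake import DataLakeServiceClient

    service = DataLakeServiceClient(
        account_url=dfs_endpoint(account_name), credential=credential
    )
    return service.create_file_system(file_system=filesystem)


if __name__ == "__main__":
    # Names from this walkthrough; "raw/data.csv" is a made-up example path.
    print(dfs_endpoint("cadatalakegen2"))
    print(abfss_uri("datalake", "cadatalakegen2", "raw/data.csv"))
```

Because filesystems are containers, the new filesystem should also show up under Containers in the portal.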
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).