Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage adds a hierarchical namespace, fine-grained security, and Hadoop compatibility.
One of the most effective ways to do big data processing on Azure is to store your data in ADLS and then process it with Spark (a faster, more flexible successor to Hadoop's MapReduce engine) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
As I mentioned earlier, ADLS is intended to work with Hadoop-compatible software. The best choice for that in Azure is Databricks, which is a managed Spark service. Even though Spark and ADLS are compatible, you still need to perform some security-related steps before you can use them together.
There are three different ways to authenticate Databricks to ADLS. The easiest way is called credential passthrough. It lets you use your own Azure Active Directory credentials to access ADLS. The next way is to use a service principal. That’s essentially an identity that you assign to a service. Then you give that identity access to ADLS. The third method is to embed the storage account access key in your Spark code on Databricks. This is not recommended because it’s a security risk to have an account key in plain text in your code.
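Just so you can recognize it if you see it, the account key method is nothing more than a Spark configuration setting. Here's a rough sketch, using the storage account and filesystem names from this course and a placeholder for the key, which you should never actually paste into a notebook:

```python
# NOT recommended: the access key ends up in plain text in the notebook,
# and it grants full access to the entire storage account.
spark.conf.set(
    "fs.azure.account.key.cadatalakegen2.dfs.core.windows.net",
    "<storage-account-access-key>"
)

# Once the key is set, any abfss:// path on that account is accessible.
dbutils.fs.ls("abfss://datalake@cadatalakegen2.dfs.core.windows.net/")
```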
I’m going to show you the credential passthrough method because it’s simple and secure. However, I should point out that there is a disadvantage with this method. It only works on a Premium Databricks workspace, which can be significantly more expensive.
First, we’ll create an Azure Databricks workspace. Then we’ll spin up a Spark cluster and create a notebook. Finally, we’ll access our ADLS filesystem from the notebook.
Okay, type databricks in the search bar. There it is. Click Add. Call it databricks1. For the resource group, use the one you created before. For the pricing tier, choose Premium. As I mentioned, this is required for using credential passthrough. Then click the Create button. It’ll take a few minutes, so I’ll fast forward.
All right, it’s finished. Click Refresh to see your new workspace, then click on it, and click Launch Workspace. Now we can create a cluster. Click Clusters and Create Cluster. It seems to be stuck, so I’m going to click Refresh again.
Okay, call it cluster1. Leave the cluster mode on Standard. Make sure the Terminate after 120 minutes of inactivity box is checked so it won’t cost you much if you forget to shut the cluster down when you’re done with it. You can even change it to a lower number of minutes if you want.
Then click Advanced Options, and check the Enable credential passthrough box. Now it says that only a single user (me, in this case) is allowed to run commands on this cluster. That’s because I selected Standard for the cluster mode. If I had selected High Concurrency, then it would have enabled credential passthrough for every user on this cluster. Okay, now we can click Create Cluster.
While it’s spinning up, we can create a notebook. Click Workspace, and then in the dropdown, select Create Notebook. Call it notebook1. I’m really creative with my names, aren’t I? Change the language to Python. The cluster should be set to the cluster you just created.
Okay, let’s try to access our filesystem using an ls command. To use a filesystem command, we need to start it with %fs. Then ls. Now, we have to refer to our filesystem with a URL that starts with abfss, which stands for Azure Blob File System Secure. Then colon, slash, slash, the name of the filesystem, which is datalake in our case, at the name of the storage account, which is cadatalakegen2 for me, but it will be different for you, then .dfs.core.windows.net.
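Putting that all together, the cell looks like this (substitute your own storage account name for cadatalakegen2):

```python
%fs ls abfss://datalake@cadatalakegen2.dfs.core.windows.net/
```

If you'd rather stay in pure Python, calling dbutils.fs.ls() with that same URL does the same thing.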
To execute the command, use Shift-Enter. This first command might take a little while because the notebook has to attach to the cluster before it can run. There, it worked. There’s the file we uploaded earlier. I’ll zoom in a bit.
That’s a pretty cumbersome way to refer to our filesystem, though, so let’s mount it on this cluster to make it easier. Copy and paste these lines into the notebook. Change the storage account name here to your own.
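I won't read the mount code out loud, but for reference, the credential passthrough mount pattern from the Azure Databricks documentation looks roughly like this. Treat it as a sketch, compare it against the current docs, and swap in your own storage account name:

```python
# Tell the ABFS driver to use Azure AD credential passthrough for authentication.
configs = {
    "fs.azure.account.auth.type": "CustomAccessToken",
    "fs.azure.account.custom.token.provider.class":
        spark.conf.get("spark.databricks.passthrough.adls.gen2.tokenProviderClassName")
}

# Mount the "datalake" filesystem in the cadatalakegen2 storage account at /mnt/datalake.
dbutils.fs.mount(
    source="abfss://datalake@cadatalakegen2.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)
```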
It worked. Now we can refer to the filesystem as /mnt/datalake.
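For example, listing the mount point gives the same result as the long abfss URL did:

```python
# Same listing as before, but through the mount point.
display(dbutils.fs.ls("/mnt/datalake"))
```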
Before we move on to the next lesson, I’ll give you a quick outline of how you would authenticate using a service principal instead of credential passthrough.
First, create a service principal by registering an app in Azure Active Directory. The app, in this case, means your Databricks instance. Then go to your filesystem and assign the Storage Blob Data Contributor role to the service principal. Next, create a secret in an Azure Key Vault that the service principal can use to authenticate. Then, in your Databricks workspace, create an Azure Key Vault-backed secret scope. Finally, run a much more complicated version of the mount code I showed you earlier. This code includes the service principal ID, the names of the secret scope and secret, and the Azure AD tenant ID. Can you see why I used credential passthrough instead?
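In case you're curious, here's a rough sketch of what that more complicated mount code generally looks like, based on the pattern in the Azure Databricks documentation. The angle-bracket placeholders stand for the IDs and names from the steps above; they're not values from this course:

```python
# OAuth configuration for a service principal (app registration) instead of passthrough.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": "<service-principal-application-id>",
    # The client secret is pulled from the Key Vault-backed secret scope, not hard-coded.
    "fs.azure.account.oauth2.client.secret":
        dbutils.secrets.get(scope="<secret-scope-name>", key="<secret-name>"),
    "fs.azure.account.oauth2.client.endpoint":
        "https://login.microsoftonline.com/<azure-ad-tenant-id>/oauth2/token"
}

dbutils.fs.mount(
    source="abfss://datalake@cadatalakegen2.dfs.core.windows.net/",
    mount_point="/mnt/datalake",
    extra_configs=configs
)
```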
That’s it for this lesson. In the next lesson, you’ll see how to do some simple data analysis.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).