Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
A highly effective way to do big data processing on Azure is to store your data in ADLS and then process it using Spark (which is generally a faster alternative to Hadoop MapReduce) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
Data Lake Storage provides six different layers of security: authentication, access control, network isolation, data protection, advanced threat protection, and auditing.
ADLS supports three different authentication methods. Azure Active Directory is the ideal way to verify a user’s identity. The only potential issue is that users must be defined in the directory before they can access data. This is usually what you want because, in most cases, you don’t want unidentified people to access your data. However, there may be times when it would be cumbersome to create Azure AD accounts for everyone who needs to access your data lake.
For example, suppose you have an application that needs to give access to a large number of users for only one day. It would be a lot of unnecessary work to create Azure AD accounts for all of them and then delete them a day later. The solution is called a Shared Access Signature or SAS. You can create a SAS that only has access to specific data and has an expiry date and time, after which it is no longer valid.
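To make the idea of a scoped, expiring signature more concrete, here is a minimal sketch in Python. It is a toy model only: it does not use Azure's actual SAS format or any Azure SDK, and the key, the token layout, and the function names `make_token` and `is_valid` are all hypothetical. It just shows how signing the path, permissions, and expiry together makes a token that can't be widened or extended after it's issued.

```python
import hashlib
import hmac
from datetime import datetime, timedelta, timezone

# Hypothetical key for illustration. A real SAS is signed with the storage
# account key or a user delegation key, using Azure's own string-to-sign.
SECRET_KEY = b"example-account-key"


def make_token(path: str, permissions: str, expiry: datetime) -> str:
    """Sign the path, permissions, and expiry so none of them can be altered."""
    payload = f"{path}|{permissions}|{expiry.isoformat()}"
    signature = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    return f"{payload}|{signature}"


def is_valid(token: str, path: str, wanted: str, now: datetime) -> bool:
    """Accept the token only if it is untampered, unexpired, and in scope."""
    payload, _, signature = token.rpartition("|")
    expected = hmac.new(SECRET_KEY, payload.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(signature, expected):
        return False  # someone changed the path, permissions, or expiry
    token_path, permissions, expiry_text = payload.split("|")
    if now >= datetime.fromisoformat(expiry_text):
        return False  # past the expiry date and time
    in_scope = path == token_path or path.startswith(token_path.rstrip("/") + "/")
    return in_scope and wanted in permissions


# Example: read-only access to one folder for one day.
issued = datetime(2024, 1, 1, tzinfo=timezone.utc)
token = make_token("/datalake/reports", "r", issued + timedelta(days=1))
```

With this token, a read of `/datalake/reports/q1.csv` an hour after issue passes the check, while a write, a read two days later, or a read from a different folder does not.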
The third authentication method is called a Shared Key. I don’t recommend using this method because it’s an older approach that doesn’t have any of the advantages of the other two methods.
For access control, ADLS has two layers: roles and access control lists, or ACLs. Here are the three primary roles. An owner of a data lake has full control over everything, including permissions for other users. A contributor can read, write, and delete data, but can’t change the permissions of other users. A reader, in contrast, can only view data.
It’s really important to know that if you’re accessing ADLS from another service or from locally installed software, then you have to use slightly different roles: Storage Blob Data Owner, Storage Blob Data Contributor, and Storage Blob Data Reader. Assigning the wrong role is a frequent mistake. For example, assigning the Owner role will not give a user the permission they need when accessing ADLS from another service, which is not what you’d expect. You have to assign the Storage Blob Data Owner role to give them full access to the data lake.
Another important point is that role-based access control only works at the storage account or filesystem level, so you can’t use it for fine-grained permissions. For example, if you assign the contributor role to a user named Bob at the filesystem level, then Bob will be able to read, write, and delete all of the files and folders in that filesystem.
Microsoft offers another security layer called access control lists to handle permissions for files and folders. Each entry in an ACL specifies the read, write, and execute permissions for a specific user or group.
In this example, everyone has read, write, and execute permissions. Each access control list specifies these permissions for the owning user, the owning group, any named users (alice, in this case), and everyone else, also known as other.
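ACLs like this one are commonly written in a short POSIX-style text form, where each entry is `scope:name:permissions` and an empty name means the owning user or owning group. The sketch below parses that form into a dictionary. The function name `parse_acl` is hypothetical, and in practice named users are usually identified by their directory object ID rather than a short name like alice.

```python
from typing import Dict, Tuple


def parse_acl(acl: str) -> Dict[Tuple[str, str], str]:
    """Map (scope, name) to a permission string such as 'r-x'."""
    entries: Dict[Tuple[str, str], str] = {}
    for entry in acl.split(","):
        scope, name, perms = entry.strip().split(":")
        if scope not in {"user", "group", "mask", "other"}:
            raise ValueError(f"unknown scope: {scope!r}")
        # Each position must be its own flag (r, w, x) or a dash.
        if len(perms) != 3 or any(
            p not in (flag, "-") for p, flag in zip(perms, "rwx")
        ):
            raise ValueError(f"bad permissions: {perms!r}")
        entries[(scope, name)] = perms
    return entries


# The example above: everyone has read, write, and execute.
acl = parse_acl("user::rwx,user:alice:rwx,group::rwx,other::rwx")
```

Here `acl[("user", "")]` is the owning user's entry and `acl[("user", "alice")]` is the named user's entry.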
Let’s look at an example. In your storage account, go to Storage Explorer. Then click FILE SYSTEMS and datalake. Now click New Folder. I’ll call it example. Then right-click on the folder and select “Manage Access”. It’s showing what permissions various users have on this folder. The owner is set to $superuser. Superusers include users who have been assigned the owner role for this filesystem and users who were authenticated using a shared key. On this folder, superusers have read, write, and execute permissions.
The owning group is also set to $superuser. If you click on Owning Group, you’ll see that it only has read and execute permission. Of course, superusers still have write access to this folder because they are also its owner, and the owner has write access.
You can change the owner or the owning group or add more users or groups to the ACL, but at the moment, you can only do that using the command line or the API. Hopefully, by the time you’re watching this, it will work in the portal, too.
Finally, it shows the permissions for everyone else. In this case, they don’t have any permissions, so they won’t be able to access this folder at all. If you do want to give everyone read and execute permission for this folder, then just check the read and execute boxes and click Save.
Be aware that it’s not enough to give a user or group permissions on an individual folder. They also have to have permissions on all of the folders leading up to that folder. So, in this example, Other (i.e. everyone else) wouldn’t be able to see the example folder unless I also gave Other read and execute permissions on the root folder.
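Here is a rough sketch of that rule in Python, assuming a simplified model where we already know one identity's effective permissions on each folder. The function name `can_reach` and the dictionary layout are hypothetical. In this model, every folder above the target has to grant execute so it can be traversed, and the target itself has to grant whatever the operation needs. Read on the root is what additionally lets the folder show up when you list the root.

```python
from pathlib import PurePosixPath
from typing import Dict


def can_reach(folder_perms: Dict[str, str], target: str, needed: str) -> bool:
    """folder_perms maps a folder path to one identity's permissions, e.g. 'r-x'."""
    path = PurePosixPath(target)
    # Every ancestor, from the parent up to the root, must grant execute.
    for ancestor in path.parents:
        if "x" not in folder_perms.get(str(ancestor), "---"):
            return False
    # The target itself must grant every permission the operation needs.
    granted = folder_perms.get(str(path), "---")
    return all(flag in granted for flag in needed)


# Other has read and execute on the example folder but nothing on the root.
other_perms = {"/": "---", "/example": "r-x"}
blocked = can_reach(other_perms, "/example", "rx")

# After also granting read and execute on the root, the folder is reachable.
other_perms["/"] = "r-x"
reachable = can_reach(other_perms, "/example", "rx")
```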
Speaking of permissions, what exactly do they mean? Read and write are pretty obvious. But what about execute? Can you actually run a file in the data lake? No, you can’t. In fact, the execute permission doesn’t mean anything for a file. Folders are a different story, though.
To list the contents of a folder, you need to have both read and execute permissions on the folder. Similarly, to create new files or folders in a folder, you have to have both write and execute on the folder. So, if you’re going to give any kind of permission to a folder, you have to give execute permission too, or it won’t work.
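The two folder rules above can be written down as a small lookup table. This is only an illustration of the rules as stated here. The operation names and the `allowed` function are made up for the sketch.

```python
# Permissions required on the folder itself for each operation.
REQUIRED_ON_FOLDER = {
    "list": "rx",          # list the contents of the folder
    "create_child": "wx",  # create a new file or folder inside it
}


def allowed(operation: str, granted: str) -> bool:
    """Check a permission string such as 'r-x' against the operation's needs."""
    return all(flag in granted for flag in REQUIRED_ON_FOLDER[operation])


read_only = allowed("list", "r--")           # read without execute
read_execute = allowed("list", "r-x")        # read with execute
write_only = allowed("create_child", "-w-")  # write without execute
```

In each case, dropping execute makes the check fail, which is why granting read or write on a folder without execute doesn't work.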
OK, now for some best practices. If you need to add entries for multiple individual users, it’s usually much better to add an entry for a group that has the right members in it. So, for example, if you wanted the Engineering team to have access to a folder, then you would just assign permissions to the Engineering group instead of all of the individual Engineering users. Aside from this being easier to set up and maintain, it’s often necessary because you can only add nine custom entries to an ACL.
It’s also a good idea to set default ACLs. When you apply default ACLs to a folder, then every file or folder that gets created in it will have those ACLs too. Default ACLs have exactly the same structure as regular ACLs. Here’s how to configure them.
Go back to the access dialog for the example folder. Notice that there’s a Default checkbox under the access permissions. To assign the same permissions by default to all new files and folders in this folder, check this box, and it conveniently sets the default permissions to be the same as the access permissions for the folder itself. You can change these to whatever you like, of course, but we’ll just leave them. When you click on the other groups and users, you’ll see that the default permissions have been set appropriately for all of them. Click Save when you’re done.
Note that if you set a default ACL on a folder that already has some files and folders in it, then those existing files and folders won’t inherit the default permissions. That’s because the default permissions are only copied when a new file or folder is created.
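To see why existing items are unaffected, here is a simplified model of the copy-on-create behavior. The `Node` class and the fallback ACL are hypothetical, and the sketch leaves out details such as the umask that the real service applies. A new file receives the parent's default ACL as its access ACL, and a new folder also receives it as its own default ACL.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

# Hypothetical ACL used when the parent has no default ACL.
FALLBACK_ACL = "user::rwx,group::r-x,other::---"


@dataclass
class Node:
    access_acl: str
    default_acl: Optional[str] = None  # only meaningful on folders
    children: Dict[str, "Node"] = field(default_factory=dict)

    def create_child(self, name: str, is_folder: bool) -> "Node":
        # The default ACL is copied once, at creation time.
        acl = self.default_acl or FALLBACK_ACL
        child = Node(
            access_acl=acl,
            default_acl=self.default_acl if is_folder else None,
        )
        self.children[name] = child
        return child


folder = Node(access_acl=FALLBACK_ACL)
old_file = folder.create_child("old.csv", is_folder=False)

# Setting a default ACL now does not touch old.csv.
folder.default_acl = "user::rwx,group::r-x,other::r-x"
new_file = folder.create_child("new.csv", is_folder=False)
subfolder = folder.create_child("sub", is_folder=True)
```

After this runs, `old_file` still has the original ACL, while `new_file` and `subfolder` carry the new default.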
Now that you know how both roles and ACLs work, you might be wondering what happens if you use both of them. For example, what if you assigned the contributor role to Bob for a filesystem and then configured an ACL to give no permissions to Bob for a folder? Would Bob have access to that folder or not? Well, Bob would still have full access to that folder because an ACL can’t take permissions away from a role. You can use an ACL to add permissions, though. For example, if you assigned the reader role to Bob and then configured an ACL to give him write permission for a folder, then he would have write permission on that folder.
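One way to picture this is as a two-step check: the role is looked at first, and the ACL is only consulted if the role doesn't already grant the operation. The sketch below models that order. The role table is a simplification based on the role descriptions earlier, and the names `ROLE_GRANTS`, `ACL_FLAGS`, and `is_allowed` are made up for illustration.

```python
from typing import Optional

# Simplified view of what each role grants on the data.
ROLE_GRANTS = {
    "contributor": {"read", "write"},
    "reader": {"read"},
}

# Which ACL flag corresponds to each operation.
ACL_FLAGS = {"read": "r", "write": "w"}


def is_allowed(role: Optional[str], acl_perms: str, operation: str) -> bool:
    """Roles are checked first, so an ACL can add access but never remove it."""
    if role is not None and operation in ROLE_GRANTS[role]:
        return True
    return ACL_FLAGS[operation] in acl_perms


# Bob is a contributor, and the folder's ACL gives him nothing.
bob_contributor = is_allowed("contributor", "---", "write")

# Bob is a reader, and the folder's ACL gives him write.
bob_reader = is_allowed("reader", "-w-", "write")
```

Both results are `True`: in the first case the role wins regardless of the ACL, and in the second the ACL adds a permission the role lacks.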
OK, that’s it for access control.
The third layer of security is network isolation. You can actually set up a firewall just for your data lake. Select Firewalls and virtual networks in the Settings menu. The default is to allow access from all networks. If you click Selected networks, then a whole bunch of other configuration options appear.
First, you can enable access from specific virtual networks. Second, you can allow access from particular IP addresses.
If you want to access your data lake using other Azure services, such as Azure Backup, then you can make an exception by checking this box. Two other possible exceptions allow read access to storage logging and to storage metrics from any network.
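As a rough illustration of the IP address rules, the following sketch checks a client address against a list of allowed addresses and ranges using Python's standard `ipaddress` module. It only models the IP part of the firewall, not virtual network rules or the exceptions, and the function name and the sample addresses are hypothetical.

```python
from ipaddress import ip_address, ip_network
from typing import Iterable


def is_allowed_by_firewall(
    client_ip: str,
    allowed_ranges: Iterable[str],
    allow_all_networks: bool = False,
) -> bool:
    """Return True if the client may connect under the given rules."""
    if allow_all_networks:
        return True  # the default setting: no restriction
    address = ip_address(client_ip)
    # A rule may be a single address or a CIDR range.
    return any(address in ip_network(rule) for rule in allowed_ranges)


# Hypothetical rules using documentation-reserved address ranges.
rules = ["203.0.113.0/24", "198.51.100.7"]
inside_range = is_allowed_by_firewall("203.0.113.42", rules)
outside = is_allowed_by_firewall("192.0.2.1", rules)
```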
Now that you know how the firewall works, I recommend turning it off to avoid any potential access issues while you’re learning how to use ADLS.
The fourth layer of security is data protection. ADLS supports encryption of data both at rest and in transit. Data in transit is encrypted using HTTPS by default. Data at rest is also encrypted automatically. The only decision you have to make is whether to use a Microsoft-managed key, which is the default, or use your own customer-managed key.
If you want to change it, click on Encryption, then check the Use your own key box. This gives you two different ways of specifying a key that’s stored in Azure Key Vault.
The fifth layer of security is Advanced Threat Protection. If you enable this, it will watch for attempts to access or exploit your storage accounts. If any suspicious activities are detected, then it will send you alerts through Azure Security Center.
The sixth layer of security is auditing. ADLS logs all account management activities. To see them, click “Activity log”. Since we haven’t done much yet, there aren’t many entries.
And that’s it for security.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).