Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
A very effective way to do big data processing on Azure is to store your data in ADLS and then process it using Spark (a distributed processing engine that is typically much faster than Hadoop MapReduce) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
I hope you enjoyed learning about Azure Data Lake Storage. Let’s do a quick review of what you learned.
ADLS Gen2 is primarily used for data analytics. It’s built on top of Azure Blob Storage and adds these features: hierarchical storage, fine-grained security, and compatibility with Hadoop. The only thing you have to do to get Data Lake Storage instead of plain Blob Storage is to enable the Hierarchical namespace option on the storage account.
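To make the Hadoop compatibility a little more concrete, here is a minimal sketch of how a path in a hierarchical-namespace account is usually addressed from Spark, using the `abfss://` URI scheme. The helper name `abfss_uri` and the account and container names are made up for illustration.

```python
def abfss_uri(account: str, container: str, path: str = "") -> str:
    """Build an ABFS (secure) URI for a path in an ADLS Gen2 account."""
    # Drop leading and trailing slashes so the URI has exactly one separator.
    clean_path = path.strip("/")
    return f"abfss://{container}@{account}.dfs.core.windows.net/{clean_path}"


# Hypothetical account ("mydatalake") and container ("raw"), for illustration only.
print(abfss_uri("mydatalake", "raw", "/sales/2020/01/"))
```

On Databricks, a URI like this is what you would typically pass to a reader such as `spark.read.parquet(...)`.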
ADLS has six layers of security. For authentication, you can use Azure Active Directory, a Shared Access Signature, or a Shared Key. For access control, it has role-based access control (RBAC) for coarse-grained permissions and access control lists (ACLs) to restrict access to individual folders and files. If you need to assign the same permissions to multiple users, then you should create ACL entries for groups rather than individuals. Also, to ensure consistent permissions, set default ACLs on folders whenever possible, so that new files and subfolders inherit them.
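As a rough illustration of the group and default ACL advice, the sketch below builds the POSIX-style short-form ACL text that ADLS Gen2 tools accept, with one access entry and one matching default entry for a group. The function name `group_acl_entries` and the all-zero object ID are hypothetical, and applying the string (for example with the Azure CLI or an SDK) is left out.

```python
import re


def group_acl_entries(group_object_id: str, permissions: str,
                      include_default: bool = True) -> str:
    """Return short-form ACL entries granting a group the given permissions."""
    # Permissions must be three characters such as "r-x" or "rwx".
    if not re.fullmatch(r"[r-][w-][x-]", permissions):
        raise ValueError(f"invalid permission string: {permissions!r}")
    entries = [f"group:{group_object_id}:{permissions}"]
    if include_default:
        # A default entry on a folder is inherited by new child items.
        entries.append(f"default:group:{group_object_id}:{permissions}")
    return ",".join(entries)


# Hypothetical Azure AD group object ID, for illustration only.
print(group_acl_entries("00000000-0000-0000-0000-000000000000", "r-x"))
```

Granting access to one group this way means you only have to change the group's membership later, not the ACLs on every folder.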
To achieve network isolation, you can use the firewall to restrict access to particular virtual networks and/or IP addresses. ADLS handles data protection by always encrypting data in flight using HTTPS and giving you a choice of Microsoft-managed keys or customer-managed keys to encrypt data at rest.
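The idea behind an IP-based firewall rule can be sketched in a few lines. This is only a conceptual model of an allow-list check, not how Azure implements it, and the function name `is_allowed` and the address ranges (taken from the documentation-only IP blocks) are assumptions for illustration.

```python
import ipaddress
from typing import Iterable


def is_allowed(client_ip: str, allowed_ranges: Iterable[str]) -> bool:
    """Return True if client_ip falls inside any allowed address or CIDR range."""
    address = ipaddress.ip_address(client_ip)
    return any(address in ipaddress.ip_network(cidr) for cidr in allowed_ranges)


# Hypothetical allow-list: one office range and one single address.
rules = ["203.0.113.0/24", "198.51.100.7"]
print(is_allowed("203.0.113.25", rules))
print(is_allowed("192.0.2.1", rules))
```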
The fifth layer of security is Advanced Threat Protection. If you enable this, it will watch for attempts to access or exploit your storage accounts.
The final layer of security is auditing. All account management activities are logged for later inspection.
You can import data into ADLS from a wide variety of sources, including other Azure services, such as Data Factory, Databricks, and Stream Analytics, as well as from your local infrastructure using AzCopy, Storage Explorer, PowerShell, or the Azure CLI.
The best way to process data in ADLS is by using Spark on Azure Databricks. There are three different ways to authenticate Databricks to ADLS. The easiest way, called credential passthrough, lets you use your own Azure Active Directory credentials. The second way is to assign a service principal to your Databricks instance. The third method is to embed the storage account access key in your Spark code on Databricks, which is definitely not recommended, because anyone who can read the code gets full access to the storage account.
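For the service principal option, here is a minimal sketch of the Spark settings that the ABFS driver reads for OAuth client-credentials authentication. The helper name `service_principal_conf` and the placeholder values are assumptions; in a real notebook, the secret would come from a secret scope rather than from text in the code.

```python
def service_principal_conf(account: str, client_id: str,
                           client_secret: str, tenant_id: str) -> dict:
    """Build the Spark configuration entries for OAuth access to one account."""
    suffix = f"{account}.dfs.core.windows.net"
    return {
        f"fs.azure.account.auth.type.{suffix}": "OAuth",
        f"fs.azure.account.oauth.provider.type.{suffix}":
            "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
        f"fs.azure.account.oauth2.client.id.{suffix}": client_id,
        f"fs.azure.account.oauth2.client.secret.{suffix}": client_secret,
        f"fs.azure.account.oauth2.client.endpoint.{suffix}":
            f"https://login.microsoftonline.com/{tenant_id}/oauth2/token",
    }


# Placeholder values, for illustration only.
conf = service_principal_conf("mydatalake", "<application-id>",
                              "<client-secret>", "<tenant-id>")
for key in sorted(conf):
    print(key)

# On Databricks you would then apply each entry to the session, for example:
#   for key, value in conf.items():
#       spark.conf.set(key, value)
# and read the secret with dbutils.secrets.get(scope=..., key=...).
```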
Azure Monitor provides the tools you need for keeping track of metrics and sending alerts.
To optimize your performance, use high-speed storage and networking for data ingestion, use files that are at least 256 MB in size, and partition time series data by date so you can process subsets of it.
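The file size and partitioning advice can be sketched as two small helpers: one that lays out time series data in a year/month/day folder structure, and one that flags files below the 256 MB guideline. The names `partition_path` and `undersized_files`, the dataset name, and the sample file sizes are all made up for illustration.

```python
from datetime import datetime

# 256 MB guideline from the text, expressed in bytes (binary megabytes assumed).
MIN_FILE_BYTES = 256 * 1024 * 1024


def partition_path(dataset: str, timestamp: datetime) -> str:
    """Return a folder path that partitions data by year, month, and day."""
    return f"{dataset}/{timestamp:%Y/%m/%d}"


def undersized_files(file_sizes: dict) -> list:
    """Return the names of files smaller than the guideline, sorted by name."""
    return sorted(name for name, size in file_sizes.items()
                  if size < MIN_FILE_BYTES)


# Hypothetical dataset and file listing.
print(partition_path("sales", datetime(2020, 1, 5)))
print(undersized_files({"part-0.parquet": 10_000_000,
                        "part-1.parquet": 300 * 1024 * 1024}))
```

With a layout like this, a job that only needs one day of data can read a single folder instead of scanning the whole dataset.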
Now you know how to get data into Azure Data Lake Storage (ADLS); use six layers of security to protect data in ADLS; use Azure Databricks to process data in ADLS; and monitor and optimize the performance of your data lakes.
To learn more about Azure Data Lake Storage, you can read Microsoft’s documentation. Also, watch for new big data courses on Cloud Academy, because we’re always publishing more.
Please give this course a rating, and if you have any questions or comments, please let us know. Thanks and have fun with Data Lake Storage!
About the Author
Guy launched his first training website in 1995, and he’s been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).