Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
One of the most effective ways to do big data processing on Azure is to store your data in ADLS and then process it using Spark (which is essentially a faster alternative to Hadoop MapReduce) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
Data warehouses have been around for decades, but the term “data lake” was only coined in about 2011. While data warehouses store data in structured, relational tables, data lakes store any kind of data, whether it’s structured or not. For example, you could store everything from documents to images to social media streams.
Data warehouses are generally used for business reporting, while data lakes are more often used for data analytics and exploration. In fact, one common setup is to process data in the data lake and then export it to the data warehouse.
Microsoft actually has two different services for storing unstructured data: Blob Storage and Data Lake Storage. These used to be completely separate services when Data Lake Storage was in its first generation. But Microsoft decided to build Data Lake Storage Gen2 on top of Blob Storage.
They did this to get the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
Hierarchical storage simply means that data is organized into a tree of folders and files similar to what you would see on a Windows or Linux system. In fact, Azure Data Lake Storage Gen2, which I’m going to call ADLS for short, uses a POSIX-style model for directories and permissions, so if you’re familiar with Linux filesystems, ADLS filesystems will feel very familiar.
If you have some experience with Blob Storage, you might be wondering why it’s not considered hierarchical. After all, blobs are often organized in a structure that seems to include folders and subfolders. However, that’s simply a naming convention. You can put slashes in your blob names to simulate a tree structure, but they’re really just files in a flat structure.
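To make the naming-convention point concrete, here is a minimal Python sketch of a flat namespace. It is a toy in-memory model, not the Blob Storage API, and the blob names and helper functions are made up for illustration. The "folders" only exist because we filter and split names on slashes.

```python
# Toy model of a flat blob namespace: each blob is just a name mapped to its contents.
# The names and contents below are hypothetical.
flat_container = {
    "logs/2023/jan.csv": b"",
    "logs/2023/feb.csv": b"",
    "logs/2024/jan.csv": b"",
    "readme.txt": b"",
}


def list_virtual_folder(container, prefix):
    """Return the blob names that fall 'inside' a simulated folder.

    There is no folder object to look up, so every name is checked against the prefix.
    """
    return sorted(name for name in container if name.startswith(prefix))


def virtual_subfolders(container, prefix=""):
    """Derive the immediate 'subfolders' under a prefix by splitting names on '/'."""
    folders = set()
    for name in container:
        if name.startswith(prefix):
            rest = name[len(prefix):]
            if "/" in rest:
                folders.add(rest.split("/", 1)[0])
    return sorted(folders)


print(list_virtual_folder(flat_container, "logs/2023/"))
print(virtual_subfolders(flat_container, "logs/"))
```

Notice that both helpers have to scan every name in the container. Nothing in the data structure itself knows that "logs" is a folder.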
One advantage of a true hierarchical filesystem is performance. It’s designed to perform operations on folders, so it can do so very quickly. For Blob Storage to operate on a simulated folder, it has to perform a separate operation on each file.
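Here is a rough sketch of why a folder rename is cheaper in a true hierarchy. It uses plain Python dictionaries as stand-ins, so it only counts operations in a toy model and says nothing about how Azure implements either service internally. The function names and file names are hypothetical.

```python
def rename_folder_flat(container, old_prefix, new_prefix):
    """Rename a simulated folder in a flat namespace.

    Every blob under the prefix has to be re-keyed individually.
    Returns the number of per-blob operations performed.
    """
    operations = 0
    # Snapshot the matching names first so the dict is not mutated while iterating it.
    for name in [n for n in container if n.startswith(old_prefix)]:
        container[new_prefix + name[len(old_prefix):]] = container.pop(name)
        operations += 1
    return operations


def rename_folder_tree(root, old_name, new_name):
    """Rename a directory in a nested-dict tree.

    Only the directory entry changes, no matter how many files it contains.
    """
    root[new_name] = root.pop(old_name)
    return 1


# The same 1,000 files, stored flat and stored as a tree.
flat = {f"raw/file{i}.csv": i for i in range(1000)}
tree = {"raw": {f"file{i}.csv": i for i in range(1000)}}

flat_ops = rename_folder_flat(flat, "raw/", "archive/")
tree_ops = rename_folder_tree(tree, "raw", "archive")
print(flat_ops, tree_ops)  # prints: 1000 1
```

In this model, the flat rename grows with the number of files, while the tree rename stays at a single operation.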
Another advantage is fine-grained security. Blob Storage can only restrict access at the container level rather than at the individual blob level, whereas ADLS can set permissions at the individual file or directory level. In case you’re wondering what a directory is, that’s the POSIX term for a folder. They’re the same thing, and you’ll see both terms in Microsoft’s ADLS documentation.
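If you haven't worked with POSIX permissions before, this small Python sketch shows the basic idea of checking access per file or directory. It models only the classic owner, group, and other permission bits. ADLS access control also supports ACL entries for named users and groups, which this sketch leaves out. The `Entry` class, the `is_allowed` function, and the user and group names are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    """A file or directory with POSIX-style ownership and a 9-character mode string."""
    owner: str
    group: str
    mode: str  # e.g. "rwxr-x---" means owner rwx, group r-x, other ---


def is_allowed(entry, user, user_groups, permission):
    """Check whether `user` has `permission` ('r', 'w', or 'x') on `entry`.

    Classic POSIX rule: use the owner bits if the user owns the entry,
    otherwise the group bits if the user is in the owning group,
    otherwise the 'other' bits.
    """
    if permission not in ("r", "w", "x"):
        raise ValueError("permission must be 'r', 'w', or 'x'")
    if user == entry.owner:
        bits = entry.mode[0:3]
    elif entry.group in user_groups:
        bits = entry.mode[3:6]
    else:
        bits = entry.mode[6:9]
    return permission in bits


# Two entries in the same filesystem with different permissions.
reports = Entry(owner="alice", group="analysts", mode="rwxr-x---")
salaries = Entry(owner="alice", group="hr", mode="rw-------")

print(is_allowed(reports, "bob", {"analysts"}, "r"))   # True: bob is in the owning group
print(is_allowed(salaries, "bob", {"analysts"}, "r"))  # False: bob falls under 'other'
```

The point is that the two entries carry their own permissions, so one user can read `reports` but not `salaries`, even though both live in the same filesystem.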
I should also point out that the ADLS equivalent of a container is a filesystem, so if you see a reference to a container in ADLS, it just means a filesystem.
The biggest advantage of Data Lake Storage having a hierarchical, POSIX-style filesystem is that it can act as a replacement for the Hadoop Distributed File System, or HDFS. This means that ADLS can seamlessly integrate with the huge ecosystem of Hadoop software. One of the most effective ways to do big data processing on Azure is to store your data in ADLS and then process it using Spark (which is essentially a faster alternative to Hadoop MapReduce) on Azure Databricks.
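In practice, Hadoop-compatible tools such as Spark address data in ADLS with an `abfss://` URI that names the filesystem, the storage account, and the path. Here is a minimal Python sketch that builds and takes apart such a URI. The helper functions, the filesystem name `raw`, the account name `examplelake`, and the file path are assumptions for illustration, not values from this course.

```python
from urllib.parse import urlsplit


def abfss_uri(filesystem, account, path=""):
    """Build an ABFS URI of the form abfss://<filesystem>@<account>.dfs.core.windows.net/<path>."""
    return f"abfss://{filesystem}@{account}.dfs.core.windows.net/{path.lstrip('/')}"


def split_abfss_uri(uri):
    """Split an abfss:// URI back into (filesystem, account, path)."""
    parts = urlsplit(uri)
    if parts.scheme != "abfss":
        raise ValueError("expected an abfss:// URI")
    # The filesystem sits in the 'user' position of the URI, before the '@'.
    account = parts.hostname.split(".", 1)[0]
    return parts.username, account, parts.path.lstrip("/")


# Hypothetical filesystem, account, and path.
uri = abfss_uri("raw", "examplelake", "/sales/2024/orders.csv")
print(uri)
print(split_abfss_uri(uri))
```

In a Databricks notebook, a URI like this is what you would typically pass to a reader such as `spark.read.csv`, assuming the cluster has already been given credentials for the storage account.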
And that’s it for the overview.
About the Author
Guy launched his first training website in 1995, and he’s been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).