Monitoring and Optimization
Azure Data Lake Storage Gen2 (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
Data Lake Storage Gen2 is built on top of Blob Storage. This gives you the best of both worlds. Blob Storage provides great features like high availability and lifecycle management at a very low cost. Data Lake Storage provides additional features, including hierarchical storage, fine-grained security, and compatibility with Hadoop.
The most effective way to do big data processing on Azure is to store your data in ADLS and then process it using Spark (which is essentially a faster alternative to Hadoop MapReduce) on Azure Databricks.
In this course, you will follow hands-on examples to import data into ADLS and then securely access it and analyze it using Azure Databricks. You will also learn how to monitor and optimize your Data Lake Storage.
Learning Objectives
- Get data into Azure Data Lake Storage (ADLS)
- Use six layers of security to protect data in ADLS
- Use Azure Databricks to process data in ADLS
- Monitor and optimize the performance of your data lakes
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Experience with Azure Databricks
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake-gen2.
Once you have a significant amount of activity in your Data Lake Storage, you’ll want to monitor it to make sure it is working as expected. It’s also important to know how to optimize its performance.
Azure Monitor provides great tools for keeping track of metrics and sending alerts. I’m not going to go into them in great detail because Azure Monitor works essentially the same way for all services. I’ll just show you some highlights.
If you scroll down to the Monitoring section in the menu on the left, you’ll see standard Azure Monitor options like Alerts and Metrics. It also has Insights, which is something that’s not available for all services.
Insights shows a collection of graphs about various aspects of how your storage account is operating. These two in the middle are pretty important. The one on the left shows the availability. This is a very boring graph because it’s just a straight line at 100%. Boring is definitely good in this case.
The one on the right shows how much space you’ve used in this storage account. Right now, it’s only showing the last four hours, but we can change that up here. I’ll change it to 48 hours. Check out the used capacity graph now. It looks like a series of steps, doesn’t it? That’s actually pretty typical for this graph because most organizations use more and more storage as time goes on.
You can also drill down and get more details in a particular area, such as Failures, by clicking one of the buttons up here.
Now, let’s have a look at Alerts. To create a new one, click here. This storage account is already selected as the resource. The condition is where you’ll see details that are unique to storage accounts. You can choose to watch metrics like used capacity and availability or activity log entries like Storage Account Failover.
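As a rough illustration of what such a condition expresses, the sketch below evaluates a threshold rule over a window of metric samples in plain Python. It does not call Azure Monitor. The `AlertRule` class, the sample values, and the thresholds are all hypothetical; only the metric names `UsedCapacity` and `Availability` correspond to metrics a storage account exposes.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Sequence


@dataclass
class AlertRule:
    """Hypothetical stand-in for a metric alert condition (not an Azure SDK type)."""
    metric: str                                     # e.g. "UsedCapacity" or "Availability"
    aggregate: Callable[[Sequence[float]], float]   # how samples in the window are combined
    operator: str                                   # "gt" or "lt"
    threshold: float

    def fires(self, samples: Sequence[float]) -> bool:
        """Return True when the aggregated samples cross the threshold."""
        value = self.aggregate(samples)
        if self.operator == "gt":
            return value > self.threshold
        if self.operator == "lt":
            return value < self.threshold
        raise ValueError(f"unknown operator: {self.operator}")


# Assumed example rules: alert when average availability drops below 99.9%,
# or when used capacity exceeds 1 TiB.
availability_rule = AlertRule("Availability", mean, "lt", 99.9)
capacity_rule = AlertRule("UsedCapacity", max, "gt", 1024 ** 4)

availability_fired = availability_rule.fires([100.0, 100.0, 99.2, 100.0])
capacity_fired = capacity_rule.fires([5e11, 6e11, 7e11])
```

With these made-up samples, the availability rule fires because the window average dips below the threshold, while the capacity rule stays quiet.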
All right, that’s it for monitoring. Now let’s move on to optimization, starting with optimizing upload performance.
When you’re transferring large amounts of data from your local infrastructure to the Azure cloud, there are three potential bottlenecks. First is the speed of your storage. Ideally, your local data should be on SSDs rather than spinning disks, and on storage arrays rather than individual disks.
Second, you should have a high-speed internal network. In particular, the network interface cards on your local machines should be as fast as possible.
Third, the network connection between your local infrastructure and the Azure cloud should be fast. If it’s a major bottleneck, then consider using a dedicated link with Azure ExpressRoute.
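Because the data passes through all three stages in sequence, the slowest one sets the overall rate. Here is a small sketch of that arithmetic. The throughput figures and the function names are made up for illustration; none of the numbers come from the course.

```python
def effective_throughput_mbps(storage_mbps: float, lan_mbps: float, wan_mbps: float) -> float:
    """The end-to-end rate is capped by the slowest stage in the pipeline."""
    return min(storage_mbps, lan_mbps, wan_mbps)


def transfer_hours(data_gb: float, throughput_mbps: float) -> float:
    """Hours needed to move data_gb gigabytes at throughput_mbps megabits per second."""
    megabits = data_gb * 8 * 1000   # 1 GB = 8,000 megabits (decimal units)
    return megabits / throughput_mbps / 3600


# Hypothetical figures: storage array at 8 Gbps, 10 Gbps NICs, 1 Gbps internet link.
rate = effective_throughput_mbps(8000, 10000, 1000)
hours = transfer_hours(10_000, rate)   # 10 TB of data
```

With these assumed figures, the link to Azure is the bottleneck, so faster disks or network cards would not shorten the transfer, but a faster connection would.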
If your data source is also in Azure, then put it in the same region as the data lake, if possible.
Finally, configure your data ingestion tools for maximum parallelization.
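One way to picture parallelization is a thread pool that uploads several files at once. This is a minimal sketch: `upload_file` is a hypothetical placeholder for whatever call your ingestion tool or SDK provides, and the worker count of 8 is an arbitrary assumption that you would tune against your own network and storage.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Iterable, List


def upload_file(path: str) -> str:
    """Hypothetical placeholder: replace with a real upload call."""
    return f"uploaded:{path}"


def upload_in_parallel(paths: Iterable[str],
                       upload: Callable[[str], str] = upload_file,
                       max_workers: int = 8) -> List[str]:
    """Run upload() for many files concurrently; results keep the input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(upload, paths))


# Twenty made-up file names, uploaded up to eight at a time.
results = upload_in_parallel([f"data/part-{i:04d}.csv" for i in range(20)])
```

Threads suit this kind of work because uploads spend most of their time waiting on the network rather than using the CPU.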
The next area to look at is how the datasets in your data lake are structured. When your data is being processed, there’s a per-file overhead, so if you have lots of small files, it can impact the performance of the job. If possible, your files should be at least 256 MB in size.
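A common response to the small-file problem is to combine files into larger ones before processing. The sketch below only plans the grouping: it packs files into batches that each reach at least a target size. The 256 MB target comes from the guidance above. The function name and the sample sizes are illustrative assumptions, and the actual merge would be done by your processing engine.

```python
from typing import Dict, List

TARGET_BYTES = 256 * 1024 * 1024  # 256 MB minimum suggested above


def plan_compaction(file_sizes: Dict[str, int],
                    target: int = TARGET_BYTES) -> List[List[str]]:
    """Group files, in name order, into batches of at least `target` bytes.

    A trailing remainder smaller than the target is folded into the last batch
    (or kept as its own batch if it is the only one).
    """
    batches: List[List[str]] = []
    current: List[str] = []
    current_size = 0
    for name in sorted(file_sizes):
        current.append(name)
        current_size += file_sizes[name]
        if current_size >= target:
            batches.append(current)
            current, current_size = [], 0
    if current:  # leftover files that did not reach the target
        if batches:
            batches[-1].extend(current)
        else:
            batches.append(current)
    return batches


# Ten hypothetical 64 MB files: every four of them reach 256 MB.
sizes = {f"part-{i:02d}.json": 64 * 1024 * 1024 for i in range(10)}
plan = plan_compaction(sizes)
```

Here the ten small files collapse into two batches, one of four files and one of six, so a job would pay the per-file overhead twice instead of ten times.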
If you’re processing time-series data, such as server metrics or stock market prices, then you can make your queries run faster if you partition the data by time period. That way, your queries can read in only the specific subset of the time series that’s needed. For example, you could have a folder structure that looks like this.
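As a sketch of what such a layout could look like, assume a year/month/day hierarchy under a hypothetical dataset folder named `server_metrics`. The helper names are also made up for illustration. The code shows how a query over a date range maps to a handful of folders instead of the whole dataset.

```python
from datetime import date, timedelta
from typing import List


def partition_path(dataset: str, day: date) -> str:
    """Build a folder path of the form dataset/YYYY/MM/DD."""
    return f"{dataset}/{day:%Y/%m/%d}"


def paths_for_range(dataset: str, start: date, end: date) -> List[str]:
    """List only the daily folders a query over [start, end] needs to read."""
    days = (end - start).days
    return [partition_path(dataset, start + timedelta(days=n)) for n in range(days + 1)]


# A four-day query touches four folders, no matter how many years of data exist.
folders = paths_for_range("server_metrics", date(2020, 2, 27), date(2020, 3, 1))
```

Putting the coarsest unit (year) first keeps related days together, so a job can be pointed at a whole month or year by passing a shorter path.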
And that’s it for monitoring and optimization.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).