Delta Lake on Azure Databricks
Delta Lake is an open-source storage layer that’s included in Azure Databricks. It supports structured and unstructured data, ACID transactions, and batch and stream processing. This course provides an overview of Delta Lake, including some history of earlier data solutions and why you might choose Delta Lake instead. You'll learn how to use and optimize Delta Lake for your own workloads.
- Understand what Delta Lake is and what it's used for
- Learn how to optimize Delta Lake
This course is intended for anyone who wants to learn how to use Delta Lake on Azure Databricks.
To get the most from this course, you should already have some knowledge of Apache Spark and Azure Databricks. If you’re not familiar with those, then you should take our Running Spark on Azure Databricks course. It would also be helpful to have some experience with SQL.
You’ve probably heard of a data lake, but you may be wondering what a delta lake is. To explain that, I’ll need to take you through a little bit of history. A few decades ago, relational databases were in widespread use for transaction processing. They supported what are known as ACID transactions. ACID stands for atomicity, consistency, isolation, and durability. These are properties that ensure data integrity, which is very important if you’re dealing with, for example, financial transactions.
Then organizations wanted to do reporting on the data in these systems, so data warehouses were created to gather data from lots of databases into one central system that was optimized for running queries on large amounts of structured data.
Then in the 2000s, organizations started generating huge amounts of unstructured data as well. This data didn’t fit the paradigm of data warehouses, so the first data lakes were created in 2010. Data lakes can handle both structured and unstructured data.
But, of course, there were eventually problems with data lakes, too. One of the biggest problems was a lack of data integrity because data lakes didn’t support ACID transactions. Another problem was that it was difficult to handle both batch and streaming data processing jobs in the same system.
Batch processing is normally performed on a large batch of data all at once, which can take a long time. In contrast, stream processing is performed continuously as data streams in. Thus, batch processing provides historical data that’s accurate but old, and stream processing provides data that’s incomplete but available immediately.
One attempt to combine batch and streaming data is called the lambda architecture. It uses separate pipelines for batch and stream processing. These pipelines are usually implemented using different tools. Then each system is queried separately, and the data is combined into a unified report by yet another system. As you can see, one of the biggest problems with the lambda architecture is complexity.
To solve these and many other problems, the lakehouse architecture was developed. It’s an attempt to combine the best elements of data warehouses and data lakes. It supports both structured and unstructured data, it supports ACID transactions to ensure consistency, and it supports both batch and stream processing.
Okay, now I can finally tell you about Delta Lake. It’s an open-source storage layer that makes it much easier to build a lakehouse architecture. It supports the three features I just mentioned, along with many other useful features. And it’s compatible with Apache Spark, which means it’ll work seamlessly with Azure Databricks. In fact, it’s already included in Azure Databricks, so you don’t need to install anything.
The way it supports both batch and streaming is by having a table format that can accommodate both types of data. That is, you can load the table with historical batch data and then stream real-time data into it as well. Both types of data can be queried from the same table.
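Here’s a minimal Databricks SQL sketch of that idea. The table name `events` and the source path are hypothetical, and the streaming side would typically be a separate Structured Streaming job appending to the same table:

```sql
-- Create a Delta table. Delta is the default table format on
-- Azure Databricks, but USING DELTA makes it explicit.
CREATE TABLE IF NOT EXISTS events (
  event_id   BIGINT,
  event_time TIMESTAMP,
  payload    STRING
) USING DELTA;

-- Batch load: append historical data in a single ACID transaction.
-- (The source path is a hypothetical example.)
INSERT INTO events
SELECT event_id, event_time, payload
FROM parquet.`/mnt/raw/historical_events`;

-- A Structured Streaming job can append real-time rows to this same
-- table, and a query like this sees both batch and streamed data:
SELECT COUNT(*) FROM events;
```

The point is that readers don’t need to know or care how a given row arrived; batch and streaming writes land in one table with the same transactional guarantees.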
One very useful way of organizing your data is to have three types of tables, which are often called Bronze, Silver, and Gold. Bronze tables contain raw, unprocessed data. This data is then cleaned and processed into a more useful form. This refined data is stored in Silver tables. Then the refined data is aggregated into a form that would be suitable for business reporting, and it gets stored in Gold tables.
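As a hedged sketch of that pipeline in Databricks SQL (the `clicks_bronze`, `clicks_silver`, and `clicks_gold` table names and columns are hypothetical examples, not a prescribed schema):

```sql
-- Bronze: raw clickstream data, loaded as-is.
CREATE TABLE IF NOT EXISTS clicks_bronze (
  user_id STRING,
  ts      STRING,   -- raw timestamp, still a string at this stage
  url     STRING
) USING DELTA;

-- Silver: cleaned and deduplicated version of the raw data.
CREATE OR REPLACE TABLE clicks_silver USING DELTA AS
SELECT DISTINCT
  user_id,
  CAST(ts AS TIMESTAMP) AS click_time,
  url
FROM clicks_bronze
WHERE user_id IS NOT NULL;

-- Gold: aggregated for business reporting.
CREATE OR REPLACE TABLE clicks_gold USING DELTA AS
SELECT
  DATE(click_time) AS click_date,
  COUNT(*)         AS click_count
FROM clicks_silver
GROUP BY DATE(click_time);
```

In practice these stages would usually run as scheduled or streaming jobs rather than one-off statements, but the shape is the same: each stage reads the previous table and writes a more refined one.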
I’m sure you can see why it would be useful to organize your data this way, but there are a couple of points about Delta Lake storage that make it even better. First, every transaction at every stage is an ACID transaction, so you can be sure of the integrity of the data in all three sets of tables. Second, you can query the data at any stage. So, for example, if you need to get the latest real-time data, then you can query the Bronze tables. Or if you want to look for something in the data that isn’t on the business reports, then you can query the Silver tables.
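To make that concrete, here are a couple of hedged example queries against the kind of tables just described (`clicks_bronze` and `clicks_silver` are hypothetical stand-ins for your own Bronze and Silver tables):

```sql
-- Need the latest real-time data? Query the Bronze table directly,
-- before any cleaning or aggregation has happened.
SELECT *
FROM clicks_bronze
ORDER BY ts DESC
LIMIT 10;

-- Looking for something that isn't on the business reports?
-- Query the refined data in the Silver table.
SELECT user_id, COUNT(*) AS clicks
FROM clicks_silver
GROUP BY user_id
ORDER BY clicks DESC;
```

Because every stage is an ordinary Delta table, nothing special is needed to query any of them; it’s the same SQL at each layer.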
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).