Optimizing Delta Lake
Difficulty: Intermediate
Duration: 13m
Students: 2021
Ratings: 5/5
Description

Delta Lake is an open-source storage layer that’s included in Azure Databricks. It supports structured and unstructured data, ACID transactions, and batch and stream processing. This course provides an overview of Delta Lake, including some history of earlier data solutions and why you might choose Delta Lake instead. You'll learn how to use and optimize Delta Lake for your own workloads.

Learning Objectives

  • Understand what Delta Lake is and what it's used for
  • Learn how to optimize Delta Lake

Intended Audience

This course is intended for anyone who wants to learn how to use Delta Lake on Azure Databricks.

Prerequisites

To get the most from this course, you should already have some knowledge of Apache Spark and Azure Databricks. If you’re not familiar with those, then you should take our Running Spark on Azure Databricks course. It would also be helpful to have some experience with SQL.

Transcript

There are two important ways to make queries on a Delta table run faster. The first is to partition the data properly. Ideally, you want to distribute a table’s data across partitions in such a way that queries will read data from many partitions rather than from just one or two. Reading data from many partitions at the same time takes advantage of parallelism to speed up the query.

To distribute a table properly, you need to choose a good column to act as the partition key. A bad choice would be a column with high cardinality, meaning one that has a large number of unique values. A classic example of this is a user ID column, which would have a different value for every single row because each user needs to have a unique user ID. Using this column as a partition key would result in too many partitions.

The most commonly used type of column for partitioning is a date column. There will typically be many rows for each date, and the number of rows per date should be fairly similar.

Another way to choose a partition key is to look at the amount of data that would end up in each partition. If using a particular column as a partition key will result in at least 1 gigabyte of data per partition, then that column could potentially be a good choice.
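
For example, creating a Delta table that’s partitioned by a date column looks roughly like this in Spark SQL. The events table and its columns are just made up for illustration:

    CREATE TABLE events (
      event_id   STRING,
      user_id    STRING,
      event_date DATE
    )
    USING DELTA
    PARTITIONED BY (event_date);  -- one set of files per date value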

The second important way to make queries faster is to combine smaller files together into larger files. When you write data to a Delta table, it’ll store that data in one or more new files in each partition. This happens each time you write data to that table, so over time, you’ll end up with a large number of files in each partition. This will slow down reads because the query engine has to open so many files.

To compact these numerous small files into fewer large files, you can use the OPTIMIZE command. Alternatively, you can enable the Auto Optimize option on a table. This will automatically optimize the table after every write. So why wouldn’t you always use Auto Optimize instead of manually running the OPTIMIZE command all the time? Well, because there are also disadvantages to optimizing tables. First, optimizing a table uses a lot of resources, so it’s an expensive operation. Second, it slows down queries while it’s running. Of course, this defeats the purpose of optimizing, so you may want to schedule OPTIMIZE to run when you’re not executing queries that require a fast response time. Typically, OPTIMIZE is run once a day.
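
Here’s a rough sketch of both approaches, using the same hypothetical events table. The Auto Optimize table properties shown here are the ones Databricks documents, but double-check the documentation for your runtime:

    -- Compact the small files in the table into larger ones on demand
    OPTIMIZE events;

    -- Or enable Auto Optimize so files are compacted as part of each write
    ALTER TABLE events SET TBLPROPERTIES (
      'delta.autoOptimize.optimizeWrite' = 'true',
      'delta.autoOptimize.autoCompact'   = 'true'
    );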

There’s yet another thing you have to keep in mind when you optimize your tables. The OPTIMIZE operation doesn’t delete the smaller files, so it leaves all of them lying around. Over time, those files will take up a significant amount of space, so it would be best to get rid of them. You can do that by running the VACUUM command.
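
Running it on the hypothetical events table is a one-liner:

    -- Delete data files that the table no longer references
    -- (only files older than the retention period are removed)
    VACUUM events;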

But, once again, there’s a potential disadvantage. Delta Lake needs those files when you want to read data from a previous version of the table. That’s why you need to set the deletedFileRetentionDuration for the table. This setting tells the VACUUM command which files it should delete and which files it shouldn’t. For example, if you set it to 30 days, then VACUUM won’t delete any files that were removed from the table within the last 30 days. That way, you’ll be able to time travel to a version of the table that’s up to 30 days old. You’ll also need to set the logRetentionDuration to at least 30 days because you need both the transaction logs and the data files in order to time travel.
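
Setting both retention periods to 30 days would look something like this on the same hypothetical table:

    ALTER TABLE events SET TBLPROPERTIES (
      'delta.deletedFileRetentionDuration' = 'interval 30 days',  -- how long VACUUM keeps removed data files
      'delta.logRetentionDuration'         = 'interval 30 days'   -- how long transaction log entries are kept
    );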

And that’s it for optimizing Delta Lake.

About the Author
Students: 217313
Courses: 98
Learning Paths: 164

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).