Delta Lake on Azure Databricks
Delta Lake is an open-source storage layer that’s included in Azure Databricks. It supports structured and unstructured data, ACID transactions, and batch and stream processing. This course provides an overview of Delta Lake, including some history of earlier data solutions and why you might choose Delta Lake instead. You'll learn how to use and optimize Delta Lake for your own workloads.
- Understand what Delta Lake is and what it's used for
- Learn how to optimize Delta Lake
This course is intended for anyone who wants to learn how to use Delta Lake on Azure Databricks.
To get the most from this course, you should already have some knowledge of Apache Spark and Azure Databricks. If you’re not familiar with those, then you should take our Running Spark on Azure Databricks course. It would also be helpful to have some experience with SQL.
Let’s do a quick review of what you’ve learned.
Delta Lake is an open-source storage layer that’s included in Azure Databricks. It supports structured and unstructured data, ACID transactions, and batch and stream processing.
One useful way of organizing your data in a delta lake is to have three types of tables, which are often called bronze, silver, and gold. Bronze tables contain raw, unprocessed data. Silver tables contain refined data. Gold tables contain aggregated data.
To use Delta Lake, specify “delta” as the file format in your Apache Spark code.
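As a rough illustration of that point, here is a minimal PySpark-style sketch. The function names and the path in the usage note are hypothetical and not from the course; `df` is assumed to be a Spark DataFrame and `spark` an active SparkSession, both of which a Databricks notebook normally provides.

```python
# Hypothetical helper names, for illustration only.
# `df` is assumed to be a Spark DataFrame and `spark` an active SparkSession.

def write_events_as_delta(df, path):
    """Save a DataFrame in Delta format, replacing any data already at `path`."""
    df.write.format("delta").mode("overwrite").save(path)


def read_events_from_delta(spark, path):
    """Load the Delta table stored at `path` as a DataFrame."""
    return spark.read.format("delta").load(path)
```

In a notebook you might call `write_events_as_delta(df, "/tmp/delta/events")` and then `read_events_from_delta(spark, "/tmp/delta/events")`. The only Delta-specific part is the string passed to `format`.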
Delta Lake supports upserts with the “merge” command.
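An upsert could look roughly like the sketch below, which builds a `MERGE INTO` statement and hands it to `spark.sql`. The table names, the key column, and the helper names are assumptions made for the example. It also assumes the source and target share the same columns, so that the `UPDATE SET *` and `INSERT *` shorthand applies.

```python
# Hypothetical helpers and table names, for illustration only.

def build_merge_sql(target, source, key):
    """Build a MERGE statement that updates matching rows and inserts new ones."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, target, source, key):
    """Run the upsert; `spark` is assumed to be an active SparkSession."""
    return spark.sql(build_merge_sql(target, source, key))
```

For example, `upsert(spark, "customers", "customer_updates", "customer_id")` would update customers that already exist in the target table and insert the rest.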
It supports time travel, that is, querying an older version of a table, with the “TIMESTAMP AS OF” and “VERSION AS OF” clauses. Use the RESTORE command to revert a table to a previous version. By default, Delta tables keep the commit history for only 30 days, so if you want to see versions further back than that, you’ll have to increase both the logRetentionDuration and the deletedFileRetentionDuration table properties.
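The statements involved might look like the following sketch, which only assembles the SQL text so it can be inspected before running it. The helper name, table name, version number, timestamp, and 90-day retention value are all assumptions chosen for the example. The property names are written with the `delta.` prefix that table properties use.

```python
# Hypothetical helper, for illustration only. Each value is a SQL string
# that could be passed to spark.sql(...) in a notebook.

def time_travel_statements(table, version, timestamp, retention_days):
    """Return example time travel, restore, and retention statements."""
    interval = f"interval {retention_days} days"
    return {
        # Query the table as it was at a given version number.
        "by_version": f"SELECT * FROM {table} VERSION AS OF {version}",
        # Query the table as it was at a given point in time.
        "by_timestamp": f"SELECT * FROM {table} TIMESTAMP AS OF '{timestamp}'",
        # Revert the table to an earlier version.
        "restore": f"RESTORE TABLE {table} TO VERSION AS OF {version}",
        # Keep both the commit log and the removed data files for longer.
        "extend_retention": (
            f"ALTER TABLE {table} SET TBLPROPERTIES ("
            f"'delta.logRetentionDuration' = '{interval}', "
            f"'delta.deletedFileRetentionDuration' = '{interval}')"
        ),
    }
```

Calling `time_travel_statements("events", 5, "2021-06-01", 90)["restore"]` yields `RESTORE TABLE events TO VERSION AS OF 5`. Both retention properties are set together because the commit log and the underlying data files are each needed to read an old version.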
Two ways to make queries on a Delta table run faster are partitioning and optimizing. The most commonly used type of column for partitioning is a date column. The OPTIMIZE command compacts smaller files into larger files. To remove the files that are no longer part of the table, use the VACUUM command. Set the deletedFileRetentionDuration for the table to prevent VACUUM from deleting files that you need for time travel.
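A sketch of those steps follows, again with hypothetical helper, table, and column names. The 168-hour retention in the VACUUM statement is an assumption that matches a seven-day window; choose a value that covers however far back you need to time travel.

```python
# Hypothetical helpers, for illustration only.
# `df` is assumed to be a Spark DataFrame.

def write_partitioned_delta(df, path, partition_column):
    """Write a Delta table partitioned by one column, such as a date column."""
    (df.write.format("delta")
        .partitionBy(partition_column)
        .mode("overwrite")
        .save(path))


def maintenance_statements(table, retain_hours=168):
    """Return SQL to compact small files, then remove files no longer in the table."""
    return [
        f"OPTIMIZE {table}",
        f"VACUUM {table} RETAIN {retain_hours} HOURS",
    ]
```

In a notebook, each string returned by `maintenance_statements("events")` could be passed to `spark.sql`. Running OPTIMIZE before VACUUM means the small files that compaction replaced become eligible for cleanup once the retention window has passed.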
If you want to try Delta Lake out for yourself, I recommend importing the notebooks at this URL into your Azure Databricks workspace. You can find the URL in the transcript below.
Please give this course a rating, and if you have any questions or comments, please let us know. Thanks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).