Delta Lake on Azure Databricks
The course is part of these learning paths
Delta Lake is an open-source storage layer that’s included in Azure Databricks. It supports structured and unstructured data, ACID transactions, and batch and stream processing. This course provides an overview of Delta Lake, including some history of earlier data solutions and why you might choose Delta Lake instead. You'll learn how to use and optimize Delta Lake for your own workloads.
- Understand what Delta Lake is and what it's used for
- Learn how to optimize Delta Lake
This course is intended for anyone who wants to learn how to use Delta Lake on Azure Databricks.
To get the most from this course, you should already have some knowledge of Apache Spark and Azure Databricks. If you’re not familiar with those, then you should take our Running Spark on Azure Databricks course. It would also be helpful to have some experience with SQL.
One of the great things about Delta Lake is how easy it is to use. For the most part, all you have to do is specify “delta” as the file format in your Apache Spark code. The Delta file format consists of versioned Parquet files plus a transaction log. Parquet is a much more efficient file format than CSV or JSON, so it makes sense that the Delta format would use it.
To create a Delta table with a SQL statement, you just need to use “delta” as the file format instead of “parquet” or “csv” or “json”. Similarly, to create a Delta table with a Python statement, use “delta” for the format.
To read and write streaming data, again, you just need to specify “delta” as the format in your Structured Streaming code. For example, suppose you need to transfer streaming data from a bronze table to a silver table. To read the stream from the bronze table, you’d use the readStream method with the delta format. Then to write it to the silver table, you’d use the writeStream method.
There are a couple of extra details in this command. First, the outputMode is set to “append”. That, of course, means that the new data will be added to the table rather than overwriting any existing data in the table.
Second, there’s an option to set the checkpoint location. This is the directory where it will store the current state of the streaming job. The advantage of doing this is that if the streaming job gets interrupted for some reason, you can restart it, and it’ll pick up from where it left off instead of starting from scratch. This is a feature of Spark’s Structured Streaming engine.
If you want to see which streams are currently running, you can use “spark.streams.active”.
Now let’s get into a couple of features that are unique to Delta Lake. First is the ability to do upserts. When you do an upsert, it checks to see whether a previous version of the row already exists. If it doesn’t, then the row is added. If it does, then the row is updated. Normally, data lakes don’t support this capability, but Delta Lake does.
The command to do an upsert in Delta Lake is not called upsert, though. It’s called “merge”. Here’s the syntax if you’re using SQL. In this example, rows of users with new addresses are in a table called “updates”, and they get merged into a table called “users”. “WHEN MATCHED” means there’s already a row for a particular user in the users table, so the address for that user is overwritten with the new address. “WHEN NOT MATCHED” means there isn’t already a row for a particular user in the users table, so the new row is added.
You can even use the MERGE command to delete rows. This is different from an upsert, of course, but you still use the MERGE command to do it. For example, if you have a list of users who would like to be removed from your database, you can use this command to remove them from a table called “users”.
Another great Delta Lake feature is called time travel. Every time a delta table is updated, a new version is created, but the old version is still kept. Then if you need to read data from a previous version, you can specify either the version number or a timestamp. For example, to retrieve the June 13th version of a particular user’s record, you can use this SQL statement. The only part that’s different is the “TIMESTAMP AS OF” clause. Note that I’ve only put a date in this timestamp, but you could put a date and a time in it if you needed to be more precise. If you wanted to use a version number instead of a timestamp, then you’d use the “VERSION AS OF” clause instead. To get the version number, use the “DESCRIBE HISTORY” command.
If you need to revert a table back to a previous version, you can easily do that using the RESTORE command.
Bear in mind that, by default, Delta tables only keep the commit history for 30 days. So if you want to see versions further back than that, you’ll have to change both the logRetentionDuration and the deletedFileRetentionDuration for the table. I’ll talk about the need for the second setting in the next lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).