Storage & Compression
The course is part of this learning path
SQL is the language for interacting with databases, and like all computer languages, there are many ways to achieve the right result. And like other software languages, some ways will be better and more efficient than others. What sets SQL and database programming apart from most other computer programming is the vast amount of data that is regularly dealt with, and how other factors that are not immediately obvious can impact the SQL query code you write.
This course looks at ways that you can see how the SQL Server engine views your queries and how it intends to execute them. Using that information, we look at other factors like table and index design that can positively or negatively impact your query execution.
For any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Understand what impacts query performance when using SQL
- Learn how the SQL Server engine processes queries
- Use execution plans to see how a query is executed
- Instruct the query engine
- Understand how different query types can be used
- Learn how to design tables to make queries more efficient
- Use database file configuration to improve performance
- Understand how database compression can impact performance
This course is aimed at database administrators, data engineers, or anyone who wants to learn how to improve database query execution.
To get the most out of this course, you should be familiar with SQL query syntax, SQL data types, and SQL table structure.
So far, we've looked at how tables are designed and how indexes can improve performance, but we haven't looked at how the data is physically stored in files on hard drives. Unless you've been involved with very large databases you're probably most used to seeing two files on disk when looking at an SQL Server database, an MDF file for the data and an LDF log file.
By default, both of those files are placed in the same directory when creating a new database. This is just the default behavior and there is nothing stopping you from creating additional data files for your tables. If your environment only has one physical drive then the reason to split your database into multiple files will most likely be in the context of easier maintenance, especially if your database is very large. However, if you have multiple drives in your environment you can get some performance increase from splitting your data across those drives.
SQL Server is optimized for parallel execution and simultaneous reading or writing to disk will definitely improve performance. You could think of it as a form of low-tech RAID. In addition to database file partitioning there is the concept of table partitioning which makes use of the multiple database files to improve performance and reduce contention.
The next thing I want to do is try various combinations of database and table partitioning with some of the index strategies we've already looked at to see how I can further improve query performance. I am going to start with a new database that will have the standard MDF and LDF files as well as eight additional NDF files spread across two physical drives.
You can create the file groups as I've done using the create database script and the filegroup specification, or you can add files and filegroups through the database's property page in SQL Server Management Studio. Filegroups can be added in the filegroups page and then you can assign a file to that filegroup in the files page via the filegroup drop-down list.
Assigning a table to a filegroup is reasonably straightforward just specify the filegroup name at the end of a table's or index's create statement. Where things become a little more interesting and complicated is partitioning a single table over multiple filegroups. It is the allocation of records that is dynamic and there needs to be a function that tells SQL Server where you want records to be placed.
The first thing you need to do is create a partition function that takes the data type of the field that you want to partition your data on as a parameter, then you specify the values that form the boundaries of those partitions. As you are only specifying the boundary value you need to also specify what happens if an inserted value is the same as the boundary value.
This is definitely a situation where you cannot sit on the fence. It is the keyword left or right after range that specifies where a boundary value should go. For example, if I logged a date value of 1st January 2007 it will go to the left of that boundary, so into partition 1. The partition scheme maps the filegroups to the partition boundaries. The table create statement now uses the partition scheme name with the field to be partitioned on as the table's location.
Obviously, as time goes on and data is accumulated, more partitions will be needed. There is absolutely no problem with adding partitions and modifying the partition function as the data grows. You can use the alter partition function you split and merge ranges. You can still use table partitioning without multiple filegroups like with an Azure SQL database, by directing all partitions to the primary filegroup.
While horizontal partitioning involves special features and functions vertical partitioning is the process of splitting the columns of one table into multiple tables. As with horizontal partitioning, these tables can be located in separate files on different hard drives thereby benefiting from parallel disk access. This model does have high maintenance costs and that all insert and delete queries and some update queries will need to target multiple tables instead of one.
Also, select statements may need to target multiple tables. If an auto-increment identity key is used, then this will also require management. In a normal operational situation, you would let the partition function take care of the record inserts. I found that inserting 13 ½ million records did take a long time, so I investigated other ways of doing it.
The alter table switch to partition statement did seem interesting with the promise of instantaneously putting data from a staging table directly into an empty table partition. It turns out that the preparation to use the partition switch statement really negates the speed of it, but it has a useful application in instantly replacing, or populating production data from the staging table.
The staging table needs to be in the same partition as the one that it is switching to, and that staging table needs to have a constraint with check that enforces the partition function. To insert records that have been logged between 01/01/2007 and 01/09/2008 which corresponds to partition 2, I need to set up a table in partition 2 that has the same primary index as the destination table, with a constraint.
So, what does this look like? Well, we have the staging tables that have been created in the partition that their data relates to, they have the same primary clustered index key as the destination table. And they all have the constraints on the logged field to match the partition function. Here we have the insert statements which are populating the staging tables using the index that we've seen before on the logged field. And finally, the actual switch statements.
This does seem like a lot of preparation, and you could insert directly into the destination table using multiple insert statements and the partition field in your predicate. Breaking the insert statements up corresponding to the partition schema turns out to be just as quick, but in a production environment, you may have some time when your destination table is temporarily unavailable due to the insert operation.
This is the problem that the switch to partition statement overcomes. If this was something you are going to do regularly then you could build dynamic SQL statements using the system partition tables. This query would allow you to dynamically create your staging tables with constraints, and dynamically build the corresponding insert statements, finishing off each cursor loop switching in the partition data.
For the normalized tables I took a quick and dirty approach. First of all, I needed to define another partition function for the Tranheader table even though the partition specification is the same. The date field in the Tranheader table is a datetime2. I also needed to construct a partition function for the TranLine table using the TranHeaderID.
As you can imagine, dealing with auto-increment identity fields is somewhat problematic when it comes to partition functions, and I definitely would not recommend doing this in a production environment, but I did what I had to do, so I could compare the partition performance of the two different data structures.
Here are the actual insert statements which are going straight into the partition destination table, but using the same boundary aligned queries.
Let's see how our query is run against our horizontally partitioned tables. Starting with our flat SalesData table using a rowstore index we get 18 seconds, so we can say, there is no appreciable performance increase from horizontal partitioning alone. You can right-click and view the properties of each of these indexes under storage and see where they are located.
The primary key is located in the partition scheme whereas the non-clustered index is located on the primary partition. Through experimenting, I did find that having the non-clustered index located in one partition is opposed to spreading it across multiple partitions with the related rows was marginally faster.
When we look at the normalized tables, again we see that the horizontal partitioning has made little difference in speed, it's perhaps a fraction faster, but not enough to warrant the effort. Before we move on I'll just have a look at the relative table sizes, and I can tell you that nothing much is changed, the sales data flat table structure is still just over 3 gigabites, while the normalized tables are still at 1 .5 gigabites.
The next question is how do partition tables work with column store indexes. Well, I will remove the row store indexes and recreate the columnstore index and try it out. I will need to change the partition for the column store index to use the partition function. Having created the column store index let's run the query.
Okay, so a partitioned flat table structure with column store index performs almost half as well as a column store index without partitioning. Let's try the normalized table structure with a column store index. 1.4 seconds is a good result. The best result I could get was by creating a temporary table that combined all the product characteristics called mobileitems, which got me down to 1.25 seconds.
One last thing to check is the relative dataset sizes, and we can see here that sales data remains at around 2 gigabytes, whereas the normalized structure is now under 960 megabytes. I'm not sure how you could interpret this, whether the normalized structure is twice as efficient because it's twice as fast, or four times more efficient because it's twice as fast and half the size.
Before we leave table partitioning, let's have a quick look at how a vertical partition might work with the SalesData table. I've created two tables SaleDataV1 and SaleDataV2 that are located on partition 1 and partition 2 respectively. The two tables are joined by a single ID integer field, and I've created a non-clustered index on SalesDataV1 keyed on the same fields that we have used before, logged, salesperson, and total before Ttx. Under a traditional rowstore index this query takes around 17 seconds to complete.
Now, if I change my row store index to a column store index and rerun the query, we get a significant improvement, almost tenfold as it completes and under two seconds. However, the overall dataset size still remains at just over 2 gigabytes.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.