Storage & Compression
SQL is the language for interacting with databases, and like all computer languages, it offers many ways to achieve the same result. And as with other languages, some ways are better and more efficient than others. What sets SQL and database programming apart from most other computer programming is the vast amount of data that is regularly dealt with, and how factors that are not immediately obvious can impact the query code you write.
This course looks at ways that you can see how the SQL Server engine views your queries and how it intends to execute them. Using that information, we look at other factors like table and index design that can positively or negatively impact your query execution.
For any feedback relating to this course, please contact us at email@example.com.
- Understand what impacts query performance when using SQL
- Learn how the SQL Server engine processes queries
- Use execution plans to see how a query is executed
- Instruct the query engine using hints
- Understand how different query types can be used
- Learn how to design tables to make queries more efficient
- Use database file configuration to improve performance
- Understand how database compression can impact performance
This course is aimed at database administrators, data engineers, or anyone who wants to learn how to improve database query execution.
To get the most out of this course, you should be familiar with SQL query syntax, SQL data types, and SQL table structure.
Indexes, or indices if you want to be traditional and, some would say, correct, come in a variety of types. There are four types of index I want to look at: clustered, non-clustered, filtered, and columnstore.
Now you may have noticed the word heap in brackets next to the table scan in the plan graphic. A heap, when referring to a table, means that the rows are in no particular order, generally defaulting to the order they were inserted in. With a clustered index, the table physically stores the rows in a specified order.
Unlike other indexes, a clustered index is not separate from the table, and you can only have one clustered index per table since the records can only be stored in one order at a time. Clustered indexes are commonly associated with primary keys and are typically unique, but this is not a requirement. They are often a unique identifier with no intrinsic meaning, but again, that is not always the case; they can also be compound indexes.
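As a quick sketch, creating a table as a heap and then giving it a clustered index might look like this. The table and index names here are hypothetical, not taken from the demo database:

```sql
-- Create a heap: no clustered index, rows stored in no particular order
CREATE TABLE dbo.CustomersDemo (
    CustomerID INT NOT NULL,
    Code       VARCHAR(18) NOT NULL,
    FirstName  NVARCHAR(50) NULL,
    LastName   NVARCHAR(50) NULL
);

-- Adding a clustered index physically orders the rows by CustomerID
CREATE UNIQUE CLUSTERED INDEX CIX_CustomersDemo_CustomerID
    ON dbo.CustomersDemo (CustomerID);
```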
A clustered index is good when you are mainly reading from the table and are typically searching or reading on that index. You typically won't find clustered indexes on tables where there is a lot of change or movement, including inserts on the fields that make up a clustered index.
Constantly having to rearrange the physical rows of the table when new records are inserted, deleted, or updated is generally inefficient and too much overhead. This is another reason why unique identifiers like an identity key are used for clustered indexes, as each newly inserted record will automatically be in the correct order. When doing bulk inserts of large numbers of records into a table with a clustered index, it is often, but not always, more efficient to remove the clustered index, insert the records, and then reapply it.
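The drop-insert-reapply pattern just described could be sketched like this, with illustrative table, index, and column names:

```sql
-- Drop the clustered index so the bulk insert goes into a heap
DROP INDEX CIX_SalesData ON dbo.SalesData;

-- Perform the bulk insert without the row-ordering overhead
INSERT INTO dbo.SalesData (TransNo, Branch, Station, Line)
SELECT TransNo, Branch, Station, Line
FROM dbo.TransLines;

-- Reapply the clustered index once, after all rows are in place
CREATE CLUSTERED INDEX CIX_SalesData
    ON dbo.SalesData (TransNo, Branch, Station, Line);
```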
A non-clustered index is the most common type of index, if only because you can have almost as many as you like on a table. The index is a physically separate entity from the table, which means that the more indexes you have, the more disk space they require and the more resources they consume during inserts or updates, as every index that references the fields in question also needs to be updated.
Not surprisingly, when a table has a clustered index, that is, it is not a heap, the non-clustered index will access the table via the clustered index, the default order of the rows. An index in a database is similar to an index in a book: it is separate from the content of the book, but it tells you where the information you want is.
Looking at this example from our earlier query plan where I forced the query to use the Customer primary key, we can see that a lot of the work is actually referencing back to the table heap to find the row related to the index.
Just to carry on with the book analogy, SQL Server stores its data on pages, with each page being eight kilobytes in size. In the case of the Customers table, where the average record size is around 230 bytes, you can fit roughly 35 records onto a page.
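If you want to check these numbers yourself, the sys.dm_db_index_physical_stats DMV reports average record size and page count per object. A sketch, assuming the table is dbo.Customers in the current database:

```sql
-- Average record size and page count for every index (or heap) on Customers
SELECT index_id,
       avg_record_size_in_bytes,
       page_count
FROM sys.dm_db_index_physical_stats(
         DB_ID(),                     -- current database
         OBJECT_ID('dbo.Customers'),  -- table to inspect
         NULL, NULL,                  -- all indexes, all partitions
         'DETAILED');                 -- mode needed for record-size statistics
```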
In the case of the primary key index, we have the varchar Code field plus the row ID pointer to the related record on the Customers heap. The eight-byte row ID is comprised of a two-byte pointer to the file ID, a four-byte pointer to the page, and a two-byte pointer to the record's position on the page, called the slot.
When I said a non-clustered index accesses a table with a clustered index through the clustered index, this means that the row ID field in the non-clustered index is replaced by the clustered index value.
A couple of other things I want to point out: using varchar in an index means that the specified size is allocated for each index entry, as you can't have variable-length fields in an index. This is definitely something to bear in mind when dealing with large tables and highly variable data in your varchar field. In this case, we have a maximum length which is 50% bigger than the average, that is, 18 versus 12 bytes.
Another thing I want to draw your attention to is that the average record size of the Customers table is around 230 bytes, whereas the estimated row size in the query plan graphic said 43 bytes. This is quite a discrepancy, even if it's a known one.
Say you decide you want a clustered index on three varchar columns, of lengths 25, 10, and 30, that together uniquely identify a data row. This would have a substantially negative impact on the size of all non-clustered indexes on this table. Each non-clustered index would carry the extra baggage of 65 minus eight bytes, that is, 57 extra bytes per entry, as the eight-byte row ID field is replaced with the 65-byte clustered index value.
You can have a non-clustered index with included columns. This means that additional columns that you are interested in, but you don't need the data ordered by, can be included in the index. By doing this, SQL Server does not need to go to the table heap to get the additional fields.
Essentially, included-column indexes can be likened to mini tables. This can definitely improve performance, but at the cost of essentially duplicating data and consuming disk space. Logic would dictate that there must be a point where including too many additional columns will work against performance by reducing the number of index rows that can fit onto a page.
Depending on the size of the table and the variety of queries executed against it, in terms of the columns to be pulled back, choosing the included columns should be a carefully considered task. Looking at our Customers table example again, we would need to include FirstName and LastName in our index to satisfy our query requirement. This would negate the need to go to the Customers heap for those columns, but would substantially increase the index record size, reducing the number of rows that could fit onto the index page to less than a 3 to 1 advantage over the Customers heap.
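An index with included columns for this scenario could be sketched like this; the index name and key column are assumptions for illustration:

```sql
-- Key column Code orders the index; FirstName and LastName ride along
-- at the leaf level, so the query never needs to visit the Customers heap
CREATE NONCLUSTERED INDEX IX_Customers_Code_Names
    ON dbo.Customers (Code)
    INCLUDE (FirstName, LastName);
```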
The query plan's suggested missing index on the TransLines table includes the columns from the output list of the TransLines heap's properties. It's quite easy to understand why the optimizing engine came up with this particular index as an optimization. TransLines has the most records and it is the only table in the query with the predicate, so this is where the biggest impact would be made.
A non-clustered index can also be augmented with a filter. This means your index only contains a subset of the possible index values. In the example query, where the only where clause is UPC not equal to 'TPAY2', we could make an index that excludes all 'TPAY2' values. This is an example of a divide-and-conquer strategy: your index is reduced to the subset of values you are interested in, thereby increasing seek and scan speed.
Now you may be thinking this seems a little contrived, but this query is a good example of where a filtered index would be useful. The data is mobile phone and accessory sales. TPAY2 is the subsidy component paid by the telco company when a customer signs up to a term contract. In terms of hardware and service sales revenue, this line item is irrelevant, so is a good candidate for a filtered index.
Another common use of filtered indexes is to filter out null values. There are some considerations, or limitations, when it comes to implementing a filtered index. A filtered index cannot be created on a view, only supports simple comparisons, and does not support the like operator.
For complex filtering where multiple tables may be involved, a view is the preferred candidate. If the column that an index is filtered on appears in the query's result set or in an expression of the query's where clause, it needs to be one of the key or included columns of the index; otherwise, the filter column can be any column from the underlying table.
CLR data types can't be used in a filtered expression, and an implicit or explicit data conversion on the left-hand side of the filtered index operator will result in an error. All these limitations relate to the index's performance: too much work by the query engine would negate the benefit of the index.
Creating a filtered index is straightforward, and basically involves a simple where clause tacked on the end of the create non-clustered index statement. Let's go back to our insert statement and apply some of our index learnings.
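A sketch of the filtered index for the TransLines example; the index name is an assumption:

```sql
-- Filtered index: rows for the telco subsidy line item (TPAY2) are
-- excluded, so seeks and scans touch far fewer index pages
CREATE NONCLUSTERED INDEX IX_TransLines_UPC_Filtered
    ON dbo.TransLines (UPC)
    WHERE UPC <> 'TPAY2';
```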
First of all, I'll just get rid of these hints. Having created the filtered index on TransLines, the estimated plan has become much simpler: we no longer have a table scan on TransLines. It has been replaced by an index scan, but that index only contains products we are interested in. There are still table scans of Customers, TransHeaders, and Items.
Now the statement takes about 4.9 seconds to execute, which is not the fastest that we've seen so far. Next, I will create a TransHeaders customer index that includes the columns we need for our result set. We can see now the table scan has been replaced by an index scan of TransHeaders. When I execute that, there is no increase in speed. But what about if I create a non-clustered index on Customers? That makes no difference at all. I've also created a non-clustered index on the Items table, including the columns I need to retrieve for my insert statement, but the execution plan is not using it, so I've had to force it with a hint.
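Forcing a particular index with a table hint looks like this; the index name and column list here are assumptions for illustration:

```sql
-- The WITH (INDEX(...)) table hint overrides the optimizer's index choice
SELECT i.UPC, i.Description
FROM dbo.Items AS i WITH (INDEX (IX_Items_UPC_Included))
WHERE i.UPC <> 'TPAY2';
```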
Looking at the plan now, we can see that the query optimizer estimates 90% of SQL Server's effort will be in inserting the records in the destination table, so on the face of it, it looks as if all these indexes have achieved the outcome we were after. However, the overall execution time does not bear this out.
Let's just reset by dropping all those indexes and running the query again, and the result is pretty much the same. To be honest, a little disappointing. It looks like telling SQL Server to use the UPC index on TransLines gives us the fastest execution, even though it is a more complex plan with more operational steps. I would definitely take these results with a grain of salt, due to the relatively small table size combined with the fact that the differences we are talking about are less than 20% over a five-second duration.
Before we move on, let's look at the effect of a clustered index on record insertion. I'll create a unique compound primary key of the transaction number, branch, station, and line, and now rerun the insert statement. 9.3 seconds, so over twice as long to do the insert, which is in line with expectations.
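The compound key just described could be created like this; the constraint, table, and column names are illustrative:

```sql
-- A compound clustered primary key: rows are physically ordered by
-- transaction number, then branch, station, and line
ALTER TABLE dbo.SalesData
    ADD CONSTRAINT PK_SalesData
    PRIMARY KEY CLUSTERED (TransNo, Branch, Station, Line);
```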
Just out of curiosity, I'll replace that compound primary key with a single integer ID field to see what difference that makes. Okay, so less than a tenth of a second over the non-indexed table insert; that is impressive and does confirm the theory. It is not just clustered indexes that impact insert performance: naturally enough, all indexes will cause inserts to slow down, as they have to be updated while the insert is taking place.
I've created a non-clustered index on the SalesData table of branch, transaction number, and station. Then I've run the insert query eight times to give me a total of thirteen and a half million records. With that index in place, it took one minute and 13 seconds to insert those records. Removing the index meant the insert operation took 27 seconds, and then reapplying the index once the records had been inserted took a further three seconds. That is a significant, and hard to ignore, speed difference.
Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile, and desktop technologies, good quality, reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.