Storage & Compression
The course is part of this learning path
SQL is the language for interacting with databases, and like all computer languages, there are many ways to achieve the right result. And like other software languages, some ways will be better and more efficient than others. What sets SQL and database programming apart from most other computer programming is the vast amount of data that is regularly dealt with, and how other factors that are not immediately obvious can impact the SQL query code you write.
This course looks at ways that you can see how the SQL Server engine views your queries and how it intends to execute them. Using that information, we look at other factors like table and index design that can positively or negatively impact your query execution.
For any feedback relating to this course, please contact us at email@example.com.
- Understand what impacts query performance when using SQL
- Learn how the SQL Server engine processes queries
- Use execution plans to see how a query is executed
- Instruct the query engine
- Understand how different query types can be used
- Learn how to design tables to make queries more efficient
- Use database file configuration to improve performance
- Understand how database compression can impact performance
This course is aimed at database administrators, data engineers, or anyone who wants to learn how to improve database query execution.
To get the most out of this course, you should be familiar with SQL query syntax, SQL data types, and SQL table structure.
Before we look at the column store index type, I want to look at table design and how a more efficient design will take up less disc space and can improve query performance. Good table design is built on the concept of normalization. And the main theme of normalization is the elimination of data duplication. This is the main reason why databases have many tables, as opposed to just one big table.
An extreme example of data duplication would be having all of a customer's data, name, address, and personal details, repeated with every transaction that the customer has with an organization. Operating like this would not only be incredibly wasteful of desk space and consequently IO and CPU resources, and querying the data, but would be a complete nightmare to maintain.
The process of normalization goes through several stages, most commonly referred to as first normal form, second normal form, and third normal form. The process of normalization is progressive, in that to be in second normal form, data must already be in first normal form and to be in third normal form, it must already be in second normal form. There is a fourth and fifth normal form, but there are of little practical use.
Let's have a look at how we can improve the storage of our sales data. As I said before, first, second, and third normal forms are all about having the minimum amount of data while maintaining the correct relationship between all the data elements. Here we have our flat sales data structure. What we're going to do is transform this data into a much more efficient structure, while at the same time, modifying some of the data types where appropriate.
Let's start by pulling it apart and categorizing the data into logical entities. A sales transaction occurs at a branch, at a cash register called a station. More than one item can be purchased at a time, or per transaction, and the purchaser is a customer. Branch, TransNo, and Station, uniquely identify a transaction. Those fields in combination with line, uniquely identify a transaction detail line.
Each detail line is describing the sale of one type of item that is identified with a unique product code, UPC. Each transaction is processed by a salesperson and has a purchaser in the form of a customer. And each customer can make multiple purchases. Here we have our logical entities, with the sales data columns appropriately assigned.
I've also drawn in the multiplicity relationships between entities, using the crow's foot notation. The name crow's foot comes from the symbol depicting the many or multiple relationships, as we can see between items and transaction detail, between transactions and transaction detail and customers and transactions. The circle, or O, means the relationship is optional, while the bar across the relationship line means it is mandatory.
In the case of items and transaction detail, an item can appear in multiple transaction detail records, but it doesn't have to, that is an item can exist, but never gets sold. The same is true of a user, they can exist, but may not ever make a sale. In real life, it would be odd to have a sale without selling any items, but from a database point of view, this is legitimate.
On the other hand, while customers can make multiple purchases, it is the act of buying that creates a customer. A customer cannot exist without a sale, in all cases, except for sales person and users. Each table now has an ID column that can be easily and concisely referenced from other tables. This gets rid of the bar chart column references, in the case of transaction detail, that has the most records, the duplication of branch, transno and and station key. This level of normalization equates to second normal form.
Before we move on from the logical design, I want to briefly touch on third normal form. I realize eliminate transitive dependencies might not mean a lot, but the items categories are a good example of this. The item categories are hierarchical. At the top, we have departments, then sub-departments and then classes. It is redundant to have department and sub-department as well as class, in the same record or row as the item UPC.
Every item belongs to a class and if we know the class, then we know the subdepartment and from that we know the department. By associating the UPC only with the class and removing subdepartment and departments, we eliminate duplication and repetition. These are the new tables to house the sales data.
Part of the multi-table design is using ID integers to associate the tables. Using auto-generated integers as the primary key, has a number of benefits. The integer type is very compact when compared to varchar, and removes the need of the duplication of branch, transno and station, in the train line table.
I've also changed the money type to numeric, but not because of space considerations, but due to accuracy when doing calculations. The money type is SQL Server specific and can be prone to producing erroneous results, due to rounding errors.
Now that I've created and populated the new normalized tables from the flat sales data table, we can have a look at the comparative sizes. If nothing else, there has been a dramatic reduction in disc space, used to store the same data. The new normalized data structure's around half the size of the original sales data table. What's going to be interesting, is to see how this new structure impacts query performance. And that's what I'm gonna look at next.
What I have here is a query that will hit the sales data table, for every month starting in 2006, and go up until the end of 2019. It's grabbing all the invoice details for each sales person, where the transaction total is over $800 for items that belong to the mobile department category. At 13 and a half million rows, this is a significant amount of data that should enable us to get a good handle on our query optimizations.
I run the query first, without implementing any of the optimizations suggested by the optimization engine, just to get a baseline. That took 43.4 seconds. Let's have a look at what the optimizing engine suggests, in terms of additional indexes. Based on the where conditions of logged being in a particular date range, the fact that we want valid sales people, and we're only looking for invoices over a certain amount, the suggested index is logged sales person and title before tax.
Because we are also outputting branch, transno, station, customer name, UPC, department, subdepartment, quantity, and sub before tax, these fields are included. I'll just create that index. Before I run the query, let's have a look at the size of the sales data.
PK underscore sales data refers to the actual table of just under two gigabytes. And we can see the newly created index at 1.2 gigabytes. Okay, let's execute that and fast forward. 18.7 seconds is a significant improvement, but I want to see if the optimizer has any other suggestions. It does, it wants me to create another index, very similar to the first one, but using department as the initial key field.
I'm guessing this is to do with the joint on departments table. I'll create that index and then just go back, to see if there are any other suggestions, the optimization engine wants to make. There aren't, so let's run that. Almost 51 seconds is not good. In fact, that would definitely be classed as a regressed query. We can put this performance degradation down to the sheer volume of data.
Both of the recommended indexes are very wide and very similar. Doing an additional index scan, where there are relatively few index records per page, has had a negative impact.
Moving on to our normalized data. Let's see how that goes without implementing any suggested optimizations. It's the same query, but obviously involving more tables, but overall a smaller amount of data. Straightaway, we can see that despite the greater complexity of the query due to the number of table joins, this is slightly faster at 39.4 seconds, than our flat structure.
The query optimizer does make a suggestion that is very similar to the first one it made for our flat data, which is logged, salesperson, and total before tax, which are the columns involved in our where predicate. And as before, we see a similar performance improvement with the query executing in 14.2 seconds.
Going back to look at the estimated plan, again we can see that another index has been suggested. Unlike the previous queries, second suggestion, this one is completely different, using the foreign key relationship between train header and train line, to join the two tables, and including completely different fields.
Due to the table normalization and the linking fields, with the exception of sales person, which has a fixed chart column, all the other columns and the indexes are integers or numeric, making those indexes far more compact and thereby fitting more of them on a page. Having created that additional index, let's run the query again. Now that's what I'm talking about. 1.8 seconds as a massive improvement in anyone's language. And that is right out of the box.
Before we move on from this, let's just compare the size of the two data sets. Remember, we had around three gigabytes for the flat sales data structure, while the normalize data is about half the size. If there was any doubt about the importance of well-structured tables, this exercise settles that, with halving of the data in size and a tenfold increase in query speed.
It also demonstrates how the SQL Server engine works better with normalized tables, rather than a flat structure. It is a little disturbing that the performance regressed as much as it did, with the application of the second index for the flat table structure. After all, it's generally accepted that a less normalized flat structure is preferred, for data retrieval and reporting. As it turns out, SQL Server does have a feature in the form of column store indexes, that greatly speed up such data access.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.