Optimizing a Power BI Data Model
The course is part of this learning path
In Optimizing a Power BI Data Model we start by investigating the underlying structure and operation of Power BI's VertiPaq database engine. Armed with this knowledge, we investigate various strategies for optimizing performance. Not to give the game away, most of the strategies involve reducing the size of the data model and loading pre-summarized data. We also look at tools within Power BI and Power Query Editor for analyzing report and query performance as well as internal error tracking and diagnosis.
- Remove unnecessary rows and tailor data sets for report audiences
- Change data types to improve storage and split columns into multiple columns to reduce the overall number of unique values
- Remove redundant or duplicate data and replace columns with measures where possible
- Use Performance Analyzer to highlight report bottlenecks
- Create aggregations for imported and direct query data models
- Configure and use Power BI's and Power Query's diagnostic tools
This course is intended for anyone who wants to improve the speed and responsiveness of their reports by improving the underlying data model.
We would highly recommend taking the Developing a Power BI Data Model and Using DAX to Build Measures in Power BI courses first, as they go into depth about how to implement several of the topics discussed in this course.
I've taken that first dataset and saved it as MobileSalesDates so we can see the impact of creating a custom date table and removing the system-generated date hierarchy tables, and improving the cardinality of various date fields within the model. Our starting point is a data model of almost 65 MB. I'm not going to go into creating a custom date table and removing the hierarchies, as this has been well covered in other courses.
However, I will go back to DAX Studio and look at the customer table and see that the added and updated date-time fields have relatively high cardinality and are second only to the unique code field in terms of the contribution to the overall table size. This is not unexpected as added and updated record date time to the minutes and seconds, so a lot of unique values. In contrast, we look at the SalesSummary InvoiceDate field, while also the date-time data type has no time component. Back in Power BI, we can see that InvoiceDate doesn't have any times, and when we look at the customer table, the added value is recorded to the minute while the updated value is recorded to the second.
Now I'll create my own custom date table called FiscalYear using the same DAX code as used in a previous course. I'll mark it as the date table and link it to date fields and other tables. As I link FiscalYear to various tables with system-generated date hierarchies, those local date tables disappear. While this is considered best practice and can significantly impact your data model size, depending on the number of date fields in your model, in this case, where the data model is small and has few date fields, the impact has been minimal.
Splitting a date-time field into two fields, one date field and one-time field, can reduce the model size and, more importantly, reduce cardinality and improve performance due to fewer unique values. A quick back-of-the-envelope calculation demonstrates the rationale behind this approach. If you had a date-time column with values for every minute of the year, that would be 365 days multiplied by 1440 minutes per day, giving you 525,600 unique values. Loading into memory and looking through half a million values of one field is a lot more work than looking through 365 values in a date field and then 1440 values in a time field.
Perhaps the best way to think about it is to suppose the Power BI Vertipaq engine has no special indexing technology, which it does, and the worst-case scenario had you looking for a transaction that happened at 11:59 on the 31st of December. The computer will have to make 525,600 DateTime comparisons versus 365 date and 1440 time comparisons for a total of 1805 comparisons. That is a massive difference in anyone's language.
If you are storing date-time down to the second, there would be 31 ½ million unique values in a year, instead of 365 unique day values and 86,400 unique time values.
I can't create the new columns in Power BI with the calculated column function as the new calculated columns will depend on the existing date-time columns, and as soon as I delete the old columns, the new date and time columns will cease to work. This operation has to be done through the Power Query editor. In Power Query editor, go to the add column tab and click on custom column. I'll give my new column the name of AddedDate and use the date from function to extract the date from the Added field. Doing this in Power Query means that when we refresh the data from the source, we can still access the original Added column, which will no longer be part of the Power BI data model. The new column is created as a text field, so I need to change the data type to date. I'll create another column called AddedTime and use the time from function with the original Added column to extract the time.
Now the AddedTime column has been created, I'll change its data type to time and delete the original Added column. When it comes to the Updated field, it's a case of rinse and repeat as far as the date component goes, so Date.From and change the data type to date. I want to handle time slightly differently for Updated as it is currently storing time down to the second, which is unnecessary and greatly increases cardinality. Time is stored as a fraction of a day.
One way to truncate to the nearest minute is to multiply the hour component by 60 and then add the minutes of the time component before dividing by 1440, the number of minutes in a day. Once again, our new column has been created as text, so I'll change the data type to time. Once it has taken effect, we can see by comparing Updated and UpdatedTime that we have successfully extracted the time component truncated to the minute. Finally, I'll remove the original updated column.
I'll quickly go through the same process for the Updated field in the Items table. Lastly, and most importantly, I'll click close and apply. Back in Power BI, the customer and item tables get refreshed. I have to re-establish the connections between the FiscalYear date field and the newly created date columns of customer and item. I'll save the data model, and we can see there has been a reduction in the model's size of around 7%, so not massive but significant.
But to get a better idea of the impact on performance, go back to DAX Studio and look at the cardinality of those date fields. Updated and Added were two of the top three fields in customer in terms of cardinality and size. Together with UpdatedDate and UpdatedTime account for 0.42% of the entire data model, whereas Updated, the date-time field previously accounted for 5.14% of the entire data model. Put in those terms, that is a massive reduction in size and the number of unique values.
You can improve the cardinality of numeric fields, that is, floating-point numbers, by changing the datatype from decimal to fixed decimal where appropriate. And it is not just date-time fields that can be pulled apart for better cardinality. You could apply the same methodology to email columns by splitting on the @ symbol or addresses by pulling out the house number and the street type. The strategy you choose is very dependent on the nature of your data.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.