In Optimizing a Power BI Data Model, we start by investigating the underlying structure and operation of Power BI's VertiPaq database engine. Armed with this knowledge, we explore various strategies for optimizing performance. Without giving the game away, most of these strategies involve reducing the size of the data model and loading pre-summarized data. We also look at the tools within Power BI and Power Query Editor for analyzing report and query performance, as well as internal error tracking and diagnosis.
Learning Objectives
- Remove unnecessary rows and tailor data sets for report audiences
- Change data types to improve storage and split columns into multiple columns to reduce the overall number of unique values
- Remove redundant or duplicate data and replace columns with measures where possible
- Use Performance Analyzer to highlight report bottlenecks
- Create aggregations for import and DirectQuery data models
- Configure and use Power BI's and Power Query's diagnostic tools
Intended Audience
This course is intended for anyone who wants to improve the speed and responsiveness of their reports by improving the underlying data model.
Prerequisites
We would highly recommend taking the Developing a Power BI Data Model and Using DAX to Build Measures in Power BI courses first, as they go into depth about how to implement several of the topics discussed in this course.
Microsoft says, "We recommend that you design models with exactly the right number of columns based on the known reporting requirements." The key phrase here is "known reporting requirements." You could start with a stripped-down, minimalist data model, but then, of course, you run into the issue of constantly having to update that model as new reporting requirements come in from users exploring and drilling into the data. Having said that, it is less problematic to add columns at a later date than to delete columns from a mature model, where you run the risk of removing a column that has dependent measures or calculated columns.
Going back to the data model, where we've already removed superfluous rows and changed the data types of high-cardinality columns, there is the potential to remove more columns and replace others with measures. It really boils down to what level of resolution, or granularity, users need in their reports. If you don't need to drill down to individual transactions, you can remove high-cardinality columns and columns that add no analytical value. Let's head back to DAX Studio to get a better idea of the contribution various columns make to our data model's size and cardinality. Before I start deleting columns, let's have a look at our dataset size.
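If you want to sanity-check which columns are the biggest contributors before deleting anything, a simple distinct-count query run in DAX Studio gives the same cardinality picture as the VertiPaq metrics view. The sketch below assumes the table and column names used in this walkthrough (SalesSummary, Receipt, SerialNumber, PhoneNo, HexESN); substitute the names from your own model.

```dax
// Quick cardinality check, runnable in DAX Studio against the open model.
// High distinct counts relative to the row count flag expensive columns.
EVALUATE
ROW (
    "Rows in SalesSummary", COUNTROWS ( SalesSummary ),
    "Receipt distinct", DISTINCTCOUNT ( SalesSummary[Receipt] ),
    "SerialNumber distinct", DISTINCTCOUNT ( SalesSummary[SerialNumber] ),
    "PhoneNo distinct", DISTINCTCOUNT ( SalesSummary[PhoneNo] ),
    "HexESN distinct", DISTINCTCOUNT ( SalesSummary[HexESN] )
)
```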
I've made a copy of the model called MobileSalesRemoveColumns, and it's currently sitting at 60.5 MB. Within SalesSummary, the fact table, the Receipt column, which is a string printed on each invoice, has the highest cardinality and takes up the most room, contributing 22% of the overall data model size. Receipt is concatenated from the branch, station, and trans-no columns, so it duplicates existing data, making it redundant. If we don't need to go down to the transaction level, then we won't require the SerialNumber, PhoneNo, or HexESN columns. The GPValue column, which represents margin, can be calculated from other fields within the table, so it could be replaced with a measure. In the Customer table, we need the code field that links to SalesSummary, but do we need to know a customer's exact street address in our reporting dataset? Probably not. Address2, which roughly equates to suburb, and the city column will be sufficient for geographical analysis.
Let's start by replacing GPValue in SalesSummary with a margin measure before removing the surplus columns. I'll head back to Power BI and create a new measure called Margin. I'm not going to go into the details of the margin calculation, as it is industry- and company-specific. As you can see, it is based on the connection type, represented by the Contype column, and I've used a SWITCH statement to execute the correct calculation for each connection type. Yes, many of the connection types use the same LineTotal minus cost calculation, but separating them out makes the measure easier to read and maintain. Having set up the measure, I can go to my report and replace the GPValue column with the Margin measure. Now I can remove the columns that add no analytical value to the data model: Receipt, an amalgamation of three other fields; SerialNumber, the unique ID of a sold item; PhoneNo, the telephone number associated with an individual connection; and GPValue, which we have just replaced.
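For reference, here is a minimal sketch of what a SWITCH-based margin measure can look like. The actual calculation in the course is industry- and company-specific and isn't shown, so the connection-type values and the Cost column name below are hypothetical placeholders; the only details confirmed in the walkthrough are that the measure switches on Contype and that most branches reduce to LineTotal minus cost.

```dax
Margin =
SUMX (
    SalesSummary,
    SWITCH (
        SalesSummary[Contype],
        // Hypothetical connection types -- replace with the real Contype codes.
        "Prepay",    SalesSummary[LineTotal] - SalesSummary[Cost],
        "OnAccount", SalesSummary[LineTotal] - SalesSummary[Cost],
        "Upgrade",   SalesSummary[LineTotal] - SalesSummary[Cost],
        // Fallback for any other connection type.
        SalesSummary[LineTotal] - SalesSummary[Cost]
    )
)
```

Because the measure is evaluated at query time rather than stored, it adds nothing to the model's size, which is exactly why swapping the GPValue column for a measure shrinks the dataset.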
From Customer, I'll remove Address1, the street name and property number. By removing transaction-level data and replacing one column with a measure, we have more than halved the size of the dataset. Let's go back to DAX Studio and look at the dataset metrics. As far as SalesSummary goes, CustCode now makes up the largest proportion of the dataset, with the maximum cardinality. I'd forgotten to remove HexESN, the hexadecimal representation of the serial number, so there is more space-saving to be had, and I can get rid of the Customer table's second code field, as code already provides the relationship with SalesSummary. While I'm at it, I might as well delete phone1, as over a third of customer records have a unique phone number. I could go through and delete quite a few more columns, but we can see from the contribution that fields like phone2 and fax make to the overall dataset size that there's not much to be gained. Let's have one last look at the dataset size, and we can see there has been a massive reduction from the original 74 MB down to around 22 MB.
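As an aside, if you want to verify a claim like "over a third of customer records have a unique phone number" before dropping a column, the same distinct-count approach works. The Customer table and Phone1 column names below follow the transcript; adjust them to your own model.

```dax
// Compare the distinct count of Phone1 to the total number of customer rows.
// A ratio approaching 1 means the column is nearly unique and compresses poorly.
EVALUATE
ROW (
    "Customer rows", COUNTROWS ( Customer ),
    "Phone1 distinct", DISTINCTCOUNT ( Customer[Phone1] )
)
```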
Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile, and desktop technologies, good-quality, reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.