Optimizing a Power BI Data Model
The course is part of this learning path
In Optimizing a Power BI Data Model we start by investigating the underlying structure and operation of Power BI's VertiPaq database engine. Armed with this knowledge, we investigate various strategies for optimizing performance. Not to give the game away, most of the strategies involve reducing the size of the data model and loading pre-summarized data. We also look at tools within Power BI and Power Query Editor for analyzing report and query performance as well as internal error tracking and diagnosis.
- Remove unnecessary rows and tailor data sets for report audiences
- Change data types to improve storage and split columns into multiple columns to reduce the overall number of unique values
- Remove redundant or duplicate data and replace columns with measures where possible
- Use Performance Analyzer to highlight report bottlenecks
- Create aggregations for imported and direct query data models
- Configure and use Power BI's and Power Query's diagnostic tools
This course is intended for anyone who wants to improve the speed and responsiveness of their reports by improving the underlying data model.
We would highly recommend taking the Developing a Power BI Data Model and Using DAX to Build Measures in Power BI courses first, as they go into depth about how to implement several of the topics discussed in this course.
Aggregations are exactly what they sound like. Instead of pulling all information from your data source that might be down to the transactional level, the data is summed or aggregated at the source. The main rationale behind aggregation is to reduce the size of the dataset, which in turn reduces the model size and speeds up data refreshing. This is most helpful when dealing with very large datasets but can be implemented on smaller datasets that you know will grow to be large in the future.
Granularity or resolution is the main consideration when it comes to aggregation. That is, by what dimension do you want to aggregate. In most cases, data is aggregated by date, where we lose the time component from a transaction, and all transactions fall into the same day bucket. This is not a hard and fast rule and depends on your data source and reporting requirements. You can aggregate by geography, so city or state, or perhaps by product type.
Currently, we're pulling just over 1.6 million rows into our SalesSummary fact table. My current reporting requirements are gross sales and margin over time by the type of branch. If I have access to the data source or have the ability to have a DBA create a view, I could use an SQL statement like this to retrieve my SalesSummary data. This will reduce the fact table from 1.6 million to 32 ½ thousand rows. If you don't have access to the ETL processes that load your data model, you can implement aggregation through Power BI.
Before we set up the aggregation in Power BI, I'll remove tables from the data model that are surplus to our aggregation requirements
Once that's done let's open up Power Query editor. There are a couple of ways you can implement an aggregation, and it really comes down to preference, as both of them will end up with the same result. I can create a new query by going to New source and selecting SQL Server, and under advanced options, use the same SQL statement as you would if querying the database directly. This will give us a new query that I'll rename as SalesQuery. Next, I'll go through and change the data types from decimal to fixed decimal. Alternately, we can edit the current table. Because I have previously removed the GPValue column, which I now need, I'll start by removing it from the remove columns function.
Next select choose columns and select the columns you want in your aggregation table. That will be branch, LineTotal, GPValue, cost, and invoice date. If you want to see the SQL statement is been created, right-click on the last step under applied steps and select view native query. Those of the fields we want, now we need to implement the aggregation by clicking on the group by button. The basic setting only allows us to group by one field, but clicking on advanced let's add multiple fields to group by. I want to group by invoice date and then branch, as branch links to the region table, which holds the breakdown of the report. Now it's a case of adding the aggregated fields.
When you specify the new column name, make sure that it doesn't exist in the current table within the data model. I can create a column called gross sales as that is just a renamed field within the report, but I can't create a column called margin as that is the name of a measure that currently exists. All these aggregations are sums, and once that's loaded, I'll change the datatypes from decimal to fixed decimal and invoice date from date time to date.
Once we've finished in Power Query editor, click close and apply. As you can see, some of the existing measures have stopped working as their input columns no longer exist, so I'll get rid of them. Having done that, I can rename columns to better describe what they represent. Now I'll go back to the reports and update them to use the new sales query and SalesSummary tables. As we can see, both methods deliver the same result, which is to be expected.
I'll quickly change margin and gross sales from fixed decimal to whole numbers. Just for my own interest, I'm going to add back in the formatted LineTotal and formatted margin measures that I earlier used for formatting the data and rerun the performance analyzer. Wow, when I ran this report on the full dataset with the formatting measures, the DAX query took almost 1.7 seconds to execute. Now it is taking 14 ms. This really illustrates the benefits of using pre-aggregated data.
Aggregation also speeds up performance when dealing with direct query datasets. Here I have the same three tables from the previous data model without the FiscalYear table. And I'm going to create a Power BI aggregation table from my SalesSummary direct query fact table.
I've set up a table in the SQL Server data source called SalesAggregation, which is a sum of cost, LineTotal, GPValue, grouped by branch. It's a simple aggregation that has reduced 1.6 million or so rows to 45. Once I've imported the aggregated direct query table into my data model, I can set up aggregation by right-clicking on the table and choosing manage aggregations. The manage aggregations setup window has the table you want to aggregate your data into at the top, so the aggregation table.
Underneath that is a section where you map the columns from the aggregation table to the source table. In this case, I will group by branch and map to SalesSummary as my detail table with the detail column as branch. Next, I will sum cost from the cost column in SalesSummary, then aggregate LineTotal into the aggregated gross sales column. However, when I try to do this, the LineTotal column from SalesSummary is grayed out. This is because I have changed the data type within power Bi from fixed decimal to whole number for LineTotal within SalesSummary. Gross sales and margin in SalesAggregation are fixed decimal.
The data types don't match, so I'll change gross sales and margin within SalesAggrgation to also be a whole number. Now I can go through and complete the aggregation column mappings. Once I've finished, the SalesAggrgation table becomes hidden full report users. This is all part of the Power Bi smarts. When users report off SalesSummary, if an associated aggregation table fulfills the reporting needs, Power BI will use that invisibly behind the scenes. To get the most out of our aggregation table, we need to change the storage mode from direct query to import. An imported table will be in local memory, so is much faster to access than hitting the original data source. We can see this demonstrated in the data view where only the SalesAggrgation will display its locally stored data.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.