Profiling the Data
Troubleshooting in Power Query
In this course, we’ll review the Power BI Desktop interface. Then, we’ll show you how to load data into Power BI Desktop and how to save your file. We will also explain data profiling and look at the various data profiling options in Power Query like column quality, column value distribution and column profiling, and the benefits of using these.
Lastly, we will look at how to resolve cell-level errors, empties, and inconsistencies in Power Query. This section will cover how to replace errors, replace values, remove rows, and how to identify the root cause of the error using Power Query. The demos in this course will provide you with practical examples that will help you to troubleshoot when encountering issues while loading data into Power BI.
- Understand how to load data into Power BI and how to optimise functionality and size by using data profiling
- Understand how to resolve errors, empties, and data inconsistencies in Power Query
- Anyone who would like to learn about importing data into Power BI
- Anyone who needs to resolve cell-level errors or empties in a Power BI model or who would like to understand data profiling to improve the functionality of their model
- Some basic knowledge of, or experience in, working with large datasets
- Some experience with Power BI (not mandatory)
The files used in this course can be found in the following GitHub repo: https://github.com/cloudacademy/loading-data-power-bi
In this section, we will look at column value distribution. We will gain an understanding of distinct and unique items in a column, and we will also look at cardinality and why this is important. When I enable column distribution in Power Query, this will show me how many distinct and unique items each column has, as well as the distribution of these, which you can see in the form of a column chart under each header.
The distinct number refers to how many different values there are in a column, once duplicates are excluded. The unique number shows me how many values occur exactly once. This helps us to understand the cardinality of a column. We ideally want to achieve low cardinality. The more distinct and unique items we have in a column, the higher the cardinality, whereas low cardinality is achieved, when we have less distinct items, but more of them.
I'm going to show you an example to explain cardinality. Suppose we have 10 rows in our Promogroups column, have a look at column one, it only includes cheesecake and Bavarian. Cheesecake appears four times, Bavarian appears six times. Therefore, there are two distinct items, cheesecake and Bavarian, and as neither of them appear only once, there are zero unique items.
In column two, we have many different items like, crunchy cheesecake, chilled cheesecake, and iced cake. Each item appears once only, therefore, the distinct items are 10, and the unique items are also 10. From this example, column one is low cardinality, as there are only two distinct items and no unique items, but column two is a high cardinality column, as there are 10 distinct items and 10 unique items. The only time we would have the same number of distinct items and unique items in a column is when all items in the column are unique.
Column distribution helps us to understand the cardinality of a column. We want to reduce the number of high cardinality columns in our data set, so when a column has high cardinality, we should consider whether we really need this column or whether it could be removed. Removing high cardinality columns helps with performance and file size. Let's return to our original PowerBI model. We select column distribution in the view section of Power Query.
If I review the Promogroups column, we can see the number of distinct items are 25, the number of unique items are five. The column charts show the shape of the data. You can see whether the distribution of values is uniform, or if some values appear more frequently than others. There's also the option to copy the data profiling information. Right click select copy value distribution, have a quick look at the options here. You can choose to keep duplicates or errors, remove duplicates, empties, or errors, or replace errors. We will have a look at these again later.
Now, once you've copied the value distribution, you can paste it into Excel. Here, you will see the different promogroups and how many times each one appears in the column. For example, ice cream, one liter appears 13 times, pound cakes appears six times, and fruit appears six times. If I highlight the number of items, I count 25 distinct items. When I scroll to the bottom, I will see that croissant choc, four pack, crumble, original cheesecakes, tiramisu and deep dish cheesecakes, all appear only once in the column. And therefore, it means there are five unique items in this column.
In Power Query, if I look at the Product ID column, this reflects 78 distinct and 78 unique items. This is, therefore, a high cardinality column, but you would expect each product to have a different product ID. And therefore, the product ID would be required to identify the product, so this column should probably not be removed. When I look at the SalesChannel column, there's only one distinct item, and it all seems to be retail branded. Even though this is a low cardinality column, when we only have one distinct item, we might also want to consider removing this column, as it does not seem to add any value.
Note that the distribution of values is based on the current query only, and currently, the promogroup's table is selected. If you go to the bottom of the screen, it is important to also note that the column profiling is based on the top 1,000 rows. This means it is just doing a sample of the rows. When you have a larger table that exceeds 1,000 rows, you might want to extend this to include the entire table. To do this, you can select column profiling based on the entire data set.
This table is not very large, so it makes no difference. But when I go to the financials, this is more than 1,000 rows. I will do the same and select column profiling based on entire data set. As this is a large table, it could take awhile to refresh this information over the entire column. Once updated, the column profiling will include all 28,000 plus rows.
Bianca is a chartered accountant and finance business partner who works with Power BI regularly to create useful, interactive dashboards to analyze financial metrics. She has worked as a lecturer and as a financial analyst in FMCG companies assisting sales and marketing teams with reviewing and understanding their financial results and forecasts, and identifying risks and opportunities for improvement. Bianca enjoys using technology to automate and simplify financial metrics.