The course is part of this learning path
Assessing Column Data looks at the tools and features in Power BI Desktop and Power Query Editor for assessing and profiling data within a dataset's columns. We cover essential characteristics such as column data types through to distribution and statistical properties of a column's data. This course shows you how to use the various graphical tools in Power Query Editor to visualize data distribution, including uniqueness and distinct values. These tools provide access to filtering functions and detection of empty column values and potential errors due to data type mismatches.
This course also looks at factors that impact correctly importing table relationships into a Power BI data model and pitfalls to look for when assessing and profiling column data.
- Learn how to carry out column profiling and assess table relationships in Power BI and Power Query Editor
This course is intended for anyone who wants to use Power BI for data analysis and wants to learn how to assess the characteristics of their data before doing so.
To get the most out of this course, you should have some experience with Power BI.
I'm going to start by importing tables from the point of sale database. So that's B_Users, branches, the location of the stores, classes, one of the Items category tables. Also, customers and departments, another Items category table. Then items which are the products being sold. Subdepartments, another Items classification table, and finally TransHeaders, the sales transactions, and Translines, the detail lines of each sales transaction. There are just under 2 million sales transactions, so this will take a little time to import. I'll just fast forward through the import process. One of the things that I find interesting once these tables have been imported is that there are no relationships in the model between the tables. This is partly due to the lack of foreign keys defined in the database, except for a compound foreign key relationship between TransHeaders and Translines. Compound keys or indexes are not best practice for reporting-oriented databases, so they aren't supported in Power BI. Within the model pane, you can establish the relationships manually by dragging one table field to the related field in another. I'll also take the opportunity to remove the default summaries from the numeric dimension fields. As an aside, it would be good to have some default global setting to turn off or on the summary behavior when data is imported, and judging from what I've seen in various forums; I'm not alone. I'm just going to fast forward through setting up relationships between tables.
Now that I've cleaned that up a little bit, let's go and look at the data, specifically the TransHeaders table. We can click on the little down arrow on the right of the column header, and that will display a list of unique values within that column giving us the ability to filter rows based on selected values. Receipt is a string made up of the branch, station, and the Transno field concatenated together and printed on a sales invoice, so you would expect that to be unique as those three fields make up the unique primary key of the table. However, when I click on the down arrow of the receipt column, we can see that there are fewer distinct values for receipt than the number of rows in the table at the bottom left in the status bar. So even at this early stage, we can see that there are some issues with our data. The LinesNo field says how many detail lines are associated with this sales transaction, that is, the number of TransLines records. Looking at the values in this field, we can see that some sales transactions have zero detail lines associated with them. Whether zero lines are due to the sales being voided or some data issue, we can filter out those sales by unchecking zero value records. Once again, if we looked down to the lower left, we being told that there are 93 distinct values in the LinesNo field and 92 distinct filtered values after removing the zeros. I'll put those zero line records back because I want to show you an alternative way to filter them shortly. The logged field is the date and time the sales transaction was recorded, and we could change its data type from DateTime to date here to make it more useful for reporting.
That is some of the basic data or field information we can view in Power BI Desktop. For a deeper analysis, we need to head over to the Power Query editor. Once again, we're looking at the TransHeaders table, and there's a couple of things I would like to draw your attention to. Firstly on the left side of the column headers, the fields data type is shown. 123 means a whole number or integer, 1.2 means a numeric or decimal value, ABC is a string or text value, and the calendar and clock icon means a date or time value. The cross and tick icon means the field holds Boolean, or true and false values, while naturally enough, the $ symbol represents the money datatype. Secondly, under the column heading, there is a thin turquoise line. When I run the cursor over that line, we get a little pop-up window with three summary figures. The valid number is the percentage of records that are neither empty nor in error. One thing to bear in mind is that by default, we are only looking at the first thousand rows within the query editor. At the bottom right of the pop-up, the ellipsis button lets us perform some filtering actions that relate to duplicates, empty rows, and errors. Click on the View tab and then check column quality to more easily view those stats at a glance. All the columns in TransHeaders look like there are hundred percent valid with no errors and no empty rows, so I'll have a look at the branches table. Here, we can see a couple of columns with empty rows, which impact the validity score, which is also depicted in the amount of turquoise shown immediately under the column header. The errors are related to data type mismatches within the field or column.
As this data has been pulled from SQL Server, which is strongly data typed, we will not see that kind of error here. Back to TransHeaders, and let's look at the data value distribution within columns by checking column distribution. This displays a histogram graphic for each column depicting data distribution. The Receipt column says a thousand distinct and a thousand unique values. You'll recall from the original Power BI data view that I said we have some issues with the uniqueness of this column. This is something to keep in mind when viewing column statistics within Power Query Editor. We are only pulling the top thousand rows by default, which may not reveal issues in large datasets. I'll go down to the bottom left and change column profiling to based on the entire data set. This will take a little while to refresh. Once it has, we can see there is a difference between the number of distinct and the number of unique rows in the receipt column. For a change of scenery, let's pop over to the Items table and get even more data statistics by turning on column profiling by checking column profile under the view tab. At the bottom, we get basic column statistics and a graphical representation of the value distribution. Back to TransHeaders, remember I talked about the LinesNo representing the number of detail lines in a sales transaction and filtered out records with zero lines? You also can apply filters through the value distribution histogram. Clicking on a bar displays a pop-up with the value, the number of instances of that value, and the percentage of the dataset. There are two filter shortcuts at the bottom of the pop-up; equals and does not equal, and extra filter and replace functions are available through the ellipsis menu.
Now I'd like to filter out the zero value LinesNo records, but unfortunately, there is nothing for me to grab hold of in the histogram. I'll have to use the column header filter as I did previously to get rid of those records. Having filtered those records out, the value distribution graph reflects the new LinesNo values. I'm just going to set column profiling back to the top 1000 rows as dealing with the entire dataset is a little unwieldy. While I'm here, I'll change branch and station from decimal data types to whole numbers and Logged from DateTime to date. It is a bit weird that branch and station are displayed as decimals, as both are Smallint fields in the database. You may have noticed the column profile graph does not stay in view when changing the number of rows profiling is based on, so it is merely a case of selecting and re-selecting column profile to get it to reappear. The logged field data distribution hasn't updated since changing the data type, so I'll click on another table and back to get it to refresh.
Now I want to show some error values, and I'll do that by loading a CSV file, which of course, has no built-in data type information. Back in Power BI desktop, I'll load Specials.CSV and then head back to Power Query Editor. When there is an error, the colored line under the column header shows some red, with the rest of it showing as diagonal lines. Interestingly, when there are errors, the whole column is seen as invalid; it doesn't read it as a percentage of valid values versus error values. I can go to the CSV file, change the invalid values back to valid ones, and refresh the data.
The last thing I want to show you is the same data, but this time cleaned up and in a structure more suitable for reporting and analysis with the appropriate foreign key is in place. This is also an SQL Server database, but this time the sales transactions and detail lines have been flattened out into one table called SalesSummary. This time when I connect to the data source and specify the tables, in this case, all of them, the data model structure imports with the correct relationships already defined as in the database. After a quick rearrangement, we get a model looking like the database diagram. Be aware that if your source data doesn't have the correct table relationships defined, Power BI will try to intuit them based on field names. This may or may not work for you. If your tables' primary key is called Id, then you'll end up with primary rather than foreign key relationships.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.