1. Home
  2. Training Library
  3. Microsoft Azure
  4. Courses
  5. Advanced Analysis with Power BI

Outlier Detection Demo

Contents

keyboard_tab

The course is part of this learning path

Start course
Overview
Difficulty
Intermediate
Duration
1h 4m
Students
79
Ratings
5/5
starstarstarstarstar
Description

Advanced Analysis with Power BI examines various methods for teasing out insights from data using statistical methodologies and presenting significant findings in visually compelling formats. The course starts with basic statistics such as standard deviation and then progresses to AI and machine learning analysis where Power BI does all the heavy lighting allowing the user to investigate and dynamically explore significant findings.

Learning Objectives

  • How to use Z-scores to display outliers and use the Outlier Detection visualization from Microsoft
  • How to use Power BI's Anomaly Detection and Fluctuation Analysis functionality
  • Use time-series forecasting to predict future data points with varying degrees of certainty
  • Use groups to classify categorical data and bins to categorize continuous data
  • Learn about Key Influencers
  • Use the Decomposition tree to drill down into a metric manually using known factors or let AI functionality determine which factors are the major contributors
  • Use the power of Azure's AI and machine learning to analyze text for positive and negative sentiment, keywords and phrases, and image tagging

Intended Audience

This course is intended for anyone who wants to discover insights hidden in their data.

Prerequisites

  • Have a basic understanding of statistics, like knowing the difference between a mean and median, a normal distribution, and conceptually how standard deviation is related to that
  • Know how to connect a data source, load data, and generally use the Power BI Desktop and Power Query Editor environments
  • AI Insights demonstration requires a PowerBi.com premium account
Transcript

You can detect outliers in your dataset by visual analysis, statistical analysis, or a combination of both. When it comes to detecting outliers with statistical confidence, selecting the most appropriate method for your data distribution is critical. As you may know, many datasets have a normal or bell-curved shape distribution, where most data points are close to the average value. As you move away from the average value in both directions, there are fewer data points. In normally distributed data, the distance from the average value is measured in standard deviations. Outliers are values in the tails of the bell curve described by X number of standard deviations from the average expressed as Z-scores. First off, I want to show you how to implement ZScores in your data and then use that metric to visually display outliers. 

For this demonstration, I will use excess mortality data from the Economist magazine. As an aside, excess deaths over and above predicted or expected mortality can give us a better idea of the true impact of covid-19. One of the issues that have cropped up in several countries is the misdiagnosis of the cause of death or attributing the death to covid-19 when a person may have died with it instead of from it. Anyway, after a fair bit of data wrangling, I've got the excess death data into a spreadsheet, and I've added a column of the fully vaccinated proportion of the population for each country.

In Power BI, I'll import the data from the spreadsheet by selecting sheet six and clicking transform data. The first thing I'll do is rename my dataset to excess mortality. The ZScore formula is the data value minus the average of all data values divided by the standard deviation. To make the comparison between countries valid, I'll be using excess deaths per hundred thousand. To get an average figure for that column, I'll right-click on the column header and select "Add as New Query." In the List tools transform page click the Statistics button and choose average. By default, it has the same name as the column, so I'll just rename that to AvgExcessDeathsPer100K. 

Back in the dataset, I'll right-click again on the column header and choose Add as New Query, but this time select standard deviation as the statistic. Once again, I'll rename that by pre-pending STD DEV to the front of the name. The next step is to create the Z score column. Back in the dataset view, select add Custom column. Unsurprisingly, I'll call the new column ZScore and enter the formula (ExcessDeathsPer100K minus average ExcessDeathsPer100K) divided by StdDevExcessDeathsPer100K. Before clicking close and apply, I'll also change the data type to decimal number. In the Power BI data model view, we can see our new dataset with the ZScore column, along with the average and standard deviation statistics. Let's head over to the report view, where I'll drop a scatterplot visual on the report page. Before getting started, I'll just remove all the default summaries from the excess mortality dataset.

I'm going to use the vaccinated field to separate the data points on the scatterplot, so I'll drop that onto the X-axis. Next, I'll drop ExcessDeathsPer100K onto the y-axis. Straightaway, we can see potential outlier values at the top left-hand corner of the scatterplot. From a visualization point of view, we would like to highlight the outliers when they exceed a particular ZScore value. When looking for outliers, ZScores above three is a good starting point. We can set the size of the data points to the ZScore. When I dropped ZScore onto the size parameter, it defaults to count, which is one for all the data points, so I'll need to change that to average, where the average value for one data point is the ZScore value. Well, sure enough, it is no surprise that the data points at the top left, which look like outliers, have the largest circle, but this still doesn't help us. By hovering over the most distant outlier candidate, we can see that it has a ZScore of 3.46, but it would be good if the outliers jumped off the page for us.

We can create a measure that we can use to highlight outliers. The measure is outliercolor, and I'll set the ZScore cut-off or limit to 3 and then use an if statement to say if the value is greater than or equal to 3, then make the data point red, otherwise make it blue, using RGB hexadecimal values. Of course, ZScores can be negative, so I just want to check that I'm taking that into account in my calculated column formula. In Power Query, I'll go to the added column step and check the formula. Ah, I need to wrap the ZScore calculation in an absolute function. Right, apply and close and back to the scatterplot. Expand data colors in the format pane and click on the FX formula button next to the default color. In the data colors dialogue, select field value from the format style drop-down, then select the outliercolor measure as the field to be based on. And voilà, our outlier with a score of 3.46, is now highlighted in red. That's great, except that if we want to change the ZScore cut-off parameter, we need to go into the measure each time. It would be better if we had some visual control to dynamically change how strict we want to be with our outlier classification.

Under modeling, click new what-if parameter. I'll call the parameter ZScore cut-off and make the data type decimal, with a minimum value of one and a maximum of 4. Set the default to 3, and I'll make the increment value 0.25. Make sure you leave the add slicer to this page checkbox ticked. Next, I'll replace the 3 inside the outliercolor measure with a reference to the ZScore cut-off value. Hmmm, the default value of 3 doesn't appear to work. Anyway, when I move the slider, the number of data points highlighted in red changes as I vary the ZScore limit. Having gone through the exercise of detecting outliers using a ZScore, there is an outlier detection visualization from Microsoft available in the Power BI marketplace. 

Let's see what that looks like. Click the 3-dot ellipsis at the bottom in the visualization's pane and select get more visuals. Once the marketplace window comes up, type outliers in the search box and select outliers' detection. This is a comprehensive plug-in with multiple visualizations and outlier detection methods, but the installation is a bit fiddly. First and foremost, outliers' detection uses R statistical packages to perform the calculations, so you will need R installed on your computer when using Power BI desktop. And not just any version of R. At the bottom of the page, the supported versions are specified. 

I could only get this visualization to work by using Microsoft R Open 3.2.2, whereas the current major version of MRO is 4. You can find older versions of Microsoft R Open in the release history. This plug-in also says that it will download the necessary R packages, but I found this only happens if you click download sample, as opposed to just clicking the add button when installing for the first time. If you have R already installed on your computer, you can go into options under options and settings, and within R scripting, you can select the version of R you want to use from the detected R home directories drop-down. Alternatively, click on the how to install R hyperlink to get detailed instructions about installing the statistical package. 

I'll start by recreating the ZScore scatterplot by selecting the outlier detection visual, dropping ExcessDeathsPer100K onto the variables parameter, and vaccinated onto the ID parameter. Straightaway, we can see we have an outlier detected at the top left of the scatterplot as we did earlier. If you go into the format pane, you can adjust some of the parameters like the algorithm to use, which is obviously ZScore in this case, and the number of sigmas that defaults to 3 and indicates when a value is deemed an outlier. If I change it to 2 sigmas, we can see that it matches the earlier scatterplot. I'll adjust the opacity to bring it more in line with the earlier example.

As I said, there are multiple visualizations and algorithms you can use. Some of which are better suited to skewed or data that isn't normally distributed. I'll change from scatterplot to box plot and the algorithm from ZScore to Tukey with an interquartile range of 1.5. You can also use the local outlier factor algorithm and a density visualization, which leaves no doubt about the outlier values. You can use Cook's distance, a least-squares regression analysis requiring an independent variable. In this case, I'll use the vaccinated column.

Whether you choose to use a predefined tool for detecting outliers or create your own statistical analysis, it is clear that implementing the appropriate graphical visualization will make outlier data points really pop.

About the Author
Avatar
Hallam Webber
Software Architect
Students
19585
Courses
46
Learning Paths
7

Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a  Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.