Advanced Analysis with Power BI
The course is part of this learning path
Advanced Analysis with Power BI examines various methods for teasing out insights from data using statistical methodologies and presenting significant findings in visually compelling formats. The course starts with basic statistics such as standard deviation and then progresses to AI and machine learning analysis where Power BI does all the heavy lighting allowing the user to investigate and dynamically explore significant findings.
- How to use Z-scores to display outliers and use the Outlier Detection visualization from Microsoft
- How to use Power BI's Anomaly Detection and Fluctuation Analysis functionality
- Use time-series forecasting to predict future data points with varying degrees of certainty
- Use groups to classify categorical data and bins to categorize continuous data
- Learn about Key Influencers
- Use the Decomposition tree to drill down into a metric manually using known factors or let AI functionality determine which factors are the major contributors
- Use the power of Azure's AI and machine learning to analyze text for positive and negative sentiment, keywords and phrases, and image tagging
This course is intended for anyone who wants to discover insights hidden in their data.
- Have a basic understanding of statistics, like knowing the difference between a mean and median, a normal distribution, and conceptually how standard deviation is related to that
- Know how to connect a data source, load data, and generally use the Power BI Desktop and Power Query Editor environments
- AI Insights demonstration requires a PowerBi.com premium account
Power BI has an AI feature called anomaly detection in the same vein as outlier detection. You can think of anomaly detection as a special case of outlier detection that currently only works with time-series data in a line graph that displays a single metric. While anomaly detection is no longer in preview mode, it is still buggy with not all its features working correctly. Here I have a visualization of Tour to France average speeds for all the editions since World War II.
Over the years, professional cyclists' speed has gradually increased due to better training methods, nutrition, and faster equipment. However, from year to year, the major factor affecting speed is the nature of the course, which is always different each year. The two primary contributors are the length of the course and the number of meters of climbing, i.e., how hilly the course is. In my efforts to make anomaly detection work, I've increased the average speed of the 1971 edition and dramatically lowered the course length along with the meters climbed.
As I said, the X-axis must be a time-series using a date or date-time data type with the axis type set to continuous. Find anomalies are found in the analytics pane, and if you don't have your graph correctly set up, there will be a warning symbol on the right with the function disabled. Click the add button to add anomaly detection. As you can see, the 1971 data point in the line graph has been highlighted as an anomaly.
As the name implies, the sensitivity setting will determine the range of acceptable values before a data point is considered an anomaly. The higher the sensitivity, the smaller the range with more detected anomalies.
Below sensitivity, we have an "Explain by" section where you can drop other fields that Power BI can display as contributing factors to the anomaly. As we can see from the two column charts below, meters climbed, and total distance in 1971 are contributing factors. The anomaly magic and I say that in quotes, is supposed to happen when you click on the anomaly indicated in the line graph. An anomalies pane opens up, giving a brief description of the value being outside the expected range, and below under "Possible explanations," we expect to see the percentage that each "Explained by" factor contributes to the anomaly. This is where anomaly detection currently goes off the rails.
I originally tried to make this work using Power BI desktop with several datasets I downloaded from Microsoft. I even tried with the PBIX report file from a link in their anomaly detection documentation, but still no joy. After trying on a couple of computers without success, I've done this demonstration in Power BI.com. As of January 2022, this is still a problem for many within the Power BI community. In Power BI desktop, anomaly detection alternates between not finding any explanations and giving an unspecified error.
Apart from the functionality not working as advertised, you have an extensive range of formatting options for displaying anomalies. You can change the anomaly's shape, size, and color along with the color and opacity of the expected range. Currently, anomaly detection seems like form over function, but as they say in cycling if you can't go good, at least look good.
In a similar vein to anomaly detection, Power BI has an analyze fluctuations feature that can be accessed directly from the visualization. In Power BI desktop, I'll create another chart of just average speed over year. I have changed the data model by renaming classified climbs as elevation meters and then recreating the classified climbs column as the number of large hills in each tour edition. If I right-click on a data point in my time-series chart, the context menu has an analyze item. The submenu has an "explain the increase" menu item, along with explain the anomaly. Unfortunately, accessing explain the anomaly this way still doesn't work. Explain the increase or decrease doesn't work with the first data point, but it does work with all data points thereafter, whether they are an anomaly or not.
When you do ask Power BI to explain the fluctuation, you get a pop-up window with graphics for factors that Power BI believes are correlated with the change, regardless of which fields, if any, have been dropped into the explain by field. The initial explanation for the difference in the analyzed metric is depicted using a waterfall graph. To be honest, I'm not sure what this graph is trying to tell me. The blue columns represent the average speed between 1970 and 1971, going from 35 to 43Kph.
I know that number of stages is a determining factor in the average speed as that's how I set the data up, but we went from 23 stages in 1970 to 15 stages in 1971. It looks as if this analysis only works with categorical data. So, in this case, you cannot think of 15 and 23 as numbers but as labels for short and long. I say this because elevation meters represent the biggest change in correlated data, going from 18,000 or so in 1970 two 553 in 1971. Elevation meters are continuous data but do not feature in the increase analysis. You can view the analysis data with a variety of charts, from scatterplot to stacked bar and ribbon. You can analyze any data point in your chart. It doesn't have to be a detected anomaly, and it's easy enough to add to your report by clicking the plus button in the top right of the chart.
The analyze the increase, or decrease function does not work with all chart types or data. Specifically, the following scenarios are not supported:
- TopN filters
- Include and exclude filters
- Measure filters
- Non-numeric measures
- Show value as
- Filtered measures
- Categorical columns on X-axis unless it defines a sort by column that is scalar. If using a hierarchy, then every column in the active hierarchy has to match this condition
- Row-level or object-level security enabled data models.
Hallam is a software architect with over 20 years experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, good quality reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.