Practical Machine Learning
This course explores probability and statistics, including various mathematical approaches and some different interpretations of probability. The course starts off with an introduction to probability, before moving on to cover Bayesian probability, Frequentist probability, statistics, probability distributions and the normal distribution.
- We have just talked about probability density functions, and I have said that the area under the curve of those functions is a probability. But there are other characteristics of such distributions you may want to look at, in particular, measures of centrality and measures of dispersion. And I want to look at those two topics in the context of a normal distribution. So, to motivate both of these measures, let's take an example. Suppose I'm looking to compare two bakeries. Let's say I'm having a wedding, and I want to know the quality of the food that I might serve, maybe some cakes that I'm looking at. So, Bakery A produces muffins which are rated by my potential guests as having an average quality, let's say, of seven. And Bakery B has an average quality of six. Now it seems, on the face of it, that I might prefer A, I might prefer the mean of seven. Let's draw out some potential values we might have here. So, in A, okay, we've got seven, but let's say we've also got two, three, four, five. These are ratings, right? Then we have five, seven, seven, eight, eight, nine. So let's suppose their average is about seven. In B, what we have is six, six, six, 6.9, 6.1, 5.9. I'd say this average is about six. What we're seeing here, of course, is that though the mean of A is seven, the individual ratings are all over the place, and though the mean of B is six, it's very consistently six. And in fact, once we draw this, we'll be able to see that there might actually be a reason to prefer the stability of B over the high variability of A. A can produce very low-quality products, and we don't want to have a guest who's upset. We might prefer that every guest gets a reasonably good quality overall. Let's draw these. So in the case of these two distributions, here is A. Let's draw A in red.
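As a quick numerical check of the two rating lists above, here is a minimal Python sketch using the standard library's `statistics` module. The ratings are illustrative values in the spirit of the example (the exact lists are assumptions, and note they average a little below seven and a little above six):

```python
import statistics

# Illustrative ratings, roughly as in the example:
# Bakery A is spread widely, Bakery B is tightly clustered.
bakery_a = [2, 3, 4, 5, 5, 7, 7, 8, 8, 9]
bakery_b = [6, 6, 6, 6.9, 6.1, 5.9]

print(statistics.mean(bakery_a))   # 5.8  - roughly "about seven" territory
print(statistics.mean(bakery_b))   # 6.15 - "about six"
print(statistics.stdev(bakery_a))  # ~2.35 - large spread: ratings all over the place
print(statistics.stdev(bakery_b))  # ~0.37 - small spread: very consistent
```

Even though A's mean is higher, its much larger standard deviation is exactly the variability the example warns about.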
And A's gonna be a big, wide distribution around seven. And B we'll draw in purple, a very narrow distribution around six. There's B. The vertical axis here is the frequency. I'll just draw it small, 'cause it keeps the plane clear. Now what we're seeing here are the centers of these distributions: the center of A, which we're measuring here with a mean of seven, and the center of B, which we're measuring here with a mean of six. But the dispersion or spread of these distributions is quite different. B is very narrow, and A is very wide. And possibly, for our particular use of this comparison, that is to say, to verify the quality of food, we would prefer never to get down into the low regions below three, say, and therefore, we might prefer B because it guarantees us a better quality of food overall. So what I'm getting at here is that when we are characterizing distributions and comparing distributions, there are several characteristics of them that we want to look at. There are centrality characteristics, which tell you roughly where the middle of the distribution is, and there are dispersion characteristics. I want to go through a few more centrality ones. We've seen the mean here. I want to go through a couple of dispersion ones as well. So, what are the centrality measures? The key ones are mean, median, and mode. And the dispersion measures, just for the sake of getting them out in front, are the interquartile range (IQR) and the standard deviation. So let's define the centrality terms, and let's explain how they play out in another example. So, starting with the mode, the definition here is just the most common value. The definition of the median is that it's the value at the halfway position when all values are ordered. What you do is you order your values, your salaries, say, from lowest to highest, and you find the person who is halfway in your ordering. And then the mean. The mean is an equal weighting of all values. So the mean is kinda like a seesaw.
What you do is you put a light value on one end of the seesaw, and another light value on the other end, then a heavy value comes on, and you balance that with other values. And I'll show you how that plays out in just a second. But the mean is the only centrality measure here which actually takes account of the values themselves. The median is about rank position, in sort order, and the mode is about the frequency of a value. It's not about its size, just about how common it is. So the mean is a kind of weighting, because it treats the values that you have, for your salaries, say, as weights: you're gonna weigh them against each other somehow. I'll show you how that plays out now. So let's look at our salary distribution. Suppose I tell you that, for a particular organization, the most common salary is 50,000 pounds. Now, it seems like that's a good place to work, right? The most common salary is 50,000 pounds. Let's put it over here, 50. Most common, that means it's gonna be the highest point. But maybe the second most common is 10,000 pounds, and 12, and so on. So you have this huge lump of people over here, and it'd be even sharper than that. And then you have a very narrow lump of people at 50. So maybe there are like 60 people on 50,000, but maybe there are hundreds of people with salaries between 10 and 13. So it doesn't look like such an attractive organization now, because it just turned out that the most common salary happened to be 50. And it could even be fewer than 60 people. You can make this very, very narrow. So maybe this is only 30 people. But for each of these individual salaries down here, let's say we had 25 people. So this is 25 people, that's 25 people, that's 25 people, that's 25 people.
So there are hundreds in this bracket, but 50 happens to have the most. So the mode measures the most common value, but isn't giving you the full picture of the distribution. It's one measure of centrality, but isn't the full picture. So, in this case, if we looked at the median value, that would mean ranking all of these people: 25 people over here on 10, 25 people on 11, 25 on 12, and so on. And what you would find, if you ordered all those numbers, so you have 10 repeated 25 times, then 11 repeated 25 times, and so on, and drew a median line at the halfway person, is that the halfway person would be over here. In other words, 50% of the people would be above the line, and 50% of the people would be below the line, and we would get here a median of something more like 11,000. And the median here, I think, seems to be a better representation of this dataset. Medians are often the best representation, the best single number to give as a measure of centrality, because they're insensitive to outliers. Let's now look at a distribution which has the same median, mode, and mean. So where the mean is, let's say, 50,000, and so is the mode, and so is the median. What does that look like? Well, it looks like the normal distribution. Let me show you. So, the highest point, the mode, is at 50,000, and half the people have less than it, and half the people have more than it. And if we look at the area, which is where the mean, as an equal weighting, comes in, we would find that it would balance on 50. Let me show you. I'll have to draw it a little freehand. Sorry, not great. That will do. So here, what we're seeing is that the highest point, which is the mode, is the same point where the median line falls. And it's the same point where the area on one side of the line balances with the area on the other side.
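The salary example just described can be sketched in Python. The counts below are illustrative assumptions (25 people at each of the low salaries, 30 people at 50,000), chosen only to match the shape described above:

```python
import statistics

# Illustrative salaries in thousands of pounds:
# 25 people at each of 10, 11, 12, 13, and 30 people at 50.
salaries = [10] * 25 + [11] * 25 + [12] * 25 + [13] * 25 + [50] * 30

print(statistics.mode(salaries))    # 50   - most common, but misleading
print(statistics.median(salaries))  # 12.0 - half earn less, half earn more
print(statistics.mean(salaries))    # ~20.4 - pulled upwards by the 50s
```

With these particular counts the median lands on 12 rather than the 11 sketched on the board, but the point is the same: the mode sits far from where most people actually are, while the median stays down in the big bracket and is insensitive to the outlying lump at 50.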
In other words, if we consider this whole region here to be a mass, a weight, then it would balance at this point, which is the mean, the balancing point. Now, in the final bit of this section, I'd just like to look at dispersion and the normal distribution. So, another essential aspect of the normal distribution is how wide it is, how dispersed it is. As I said at the beginning, I can tell you a centrality measure, like the mean, and still kind of wrong-foot you, 'cause if I tell you the mean of one bakery is higher than the mean of another, you might think, well, that bakery is better, but it doesn't tell you the range of the quality of their output, the variability of the output. That's the dispersion. And there are two common ways of measuring it. One's the interquartile range. What you do is you take the value at the 25th percentile position, 25% of the way in. Let's say the value down here is a rating of four. And you take the value at the 75th percentile position, 75% of the way in, and let's say that is eight. The interquartile range is the value at the 75th percentile minus the value at the 25th percentile, and here that would be eight minus four, which is four. That's quite a wide range, right? So you get a mean of, let's say, seven, with the middle half of the ratings going all the way from eight down to four. The other measure here is the standard deviation. The standard deviation is a little tricky to explain precisely, but roughly, it's how far away, on average, your data is from the mean. So what you do is you take each point, and you measure the distance to the mean, for every point. Then you ask, what is the average of these distances? I've drawn those distances horizontally, and the average in red. Let's say the average of these distances is here. So this is a long distance, and this is a short distance.
There are more short distances than long distances, so the line falls a little way towards the short distances. So these are the lines of the standard deviation. I'd say here the standard deviation is three, so we would often write seven plus or minus three. We will look at computing the standard deviation, the IQR, and the other centrality measures in Python, in the applied session. I just wanna get across here the visual idea, the intuition, the concept. So the idea behind the standard deviation is, roughly speaking: if I were to pull out a muffin from a batch, what should I expect on average? Seven. And what range should I expect most muffins to be in? Well, plus or minus three. So, if the muffins are actually normally distributed, it will always be the case that about 2/3 of the muffins fall in the range four to 10, that is to say, seven plus or minus three. So what the standard deviation is giving you is a range into which roughly 2/3 of your observations will fall. And two standard deviations, so seven plus or minus six as your rating, would cover about 95% of your observations. And three standard deviations would be about 99.7%. So the standard deviation gives you a way of getting a sense of the spread. And we will look at those percentages, that 2/3 (more precisely, about 68%), that 95%, that 99.7%, in an applied session, where we can actually draw the lines on and look at the areas. For now, let's leave it there. The normal distribution is this special kind of distribution where the mean, mode, and median are the same, and whose dispersion is characterized by the standard deviation.
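Since the applied session will compute these in Python, here is a minimal preview sketch using only the standard library: it simulates normally distributed muffin ratings with mean 7 and standard deviation 3 (the figures from the example; the sample size and seed are arbitrary choices), then computes the IQR and checks the one-standard-deviation coverage:

```python
import random
import statistics

# Simulate muffin quality ratings: normal with mean 7, standard deviation 3.
random.seed(0)
ratings = [random.gauss(7, 3) for _ in range(100_000)]

mean = statistics.mean(ratings)   # close to 7
sd = statistics.stdev(ratings)    # close to 3

# Interquartile range: 75th percentile minus 25th percentile.
q1, _, q3 = statistics.quantiles(ratings, n=4)
iqr = q3 - q1                     # about 4 for a normal with sd = 3

# Empirical rule: roughly 68% (about 2/3) of values fall within one
# standard deviation of the mean, ~95% within two, ~99.7% within three.
within_1sd = sum(abs(r - mean) <= sd for r in ratings) / len(ratings)

print(round(mean, 2), round(sd, 2), round(iqr, 2), round(within_1sd, 3))
```

The fraction within one standard deviation comes out near 0.68, matching the "about 2/3" rule of thumb, and the IQR of roughly 4 is another way of seeing the same spread.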
QA is the UK's biggest training provider of virtual and online classes in technology, project management and leadership.