Introduction to Statistics



Data is information and, just as there are lots of different types of information, there are different types of data. In these videos, you'll learn more about types of data and the ways in which you can store them. 



It's probably a good idea to have a short discussion of statistics, to give some background to this. In statistics, the thing of real interest for machine learning is inferential statistics, but before we come on to what that is, let's talk about descriptive statistics first. Descriptive statistics are all the characteristics of data that you can use to summarise it, and not to predict or infer anything. To summarise. These are the things we are often familiar with: mean, mode, median and numbers of that kind, along with range and standard deviation. You may already know something of what these mean, but for the sake of a little revision, let's go in reverse and start with the range, which is the difference between the highest value and the lowest value.

So, take salaries as an example. If we're looking at an organisation and ask what its range of salaries is, we may find they go from, let's say, £10,000 a year to £100,000 a year, for the sake of simplicity.

Suppose we ask what the median salary is. The median salary is what you get when you rank all the salaries in order, go by position to the middle of that ordering, halfway along, and report what that middle person earns. What that means is that about 50% of the population should have a lower salary and about 50% should have a higher salary, because you ordered them that way. So, in the middle of this organisation, let's say you have £40,000.

Now let's have a look at the mode, which is the most common salary by count. What you do is go through all of the unique salaries and count how often each one occurs, and the most common is the mode. Suppose I tell you that the mode is £10,000.

The mean may be defined in lots of ways, but it is a balancing point. The mean is the only characteristic here which takes into account the values of the data themselves. The median is by rank, by position. You don't even look at how much people are earning. You just report the middle one. The mode, you just count them. You don't actually look at what they are earning, you just count. The mean is different. As you may know from the formula for the mean, you add all the salaries up and divide by how many there are, so you are involving all of the people there. So, what the mean is doing is giving you a balanced sense.

It's trying to weight every salary equally, but suppose there's one really heavy item, let's say a person who earns a million, and lots of people who earn very little, let's say £10,000. Then the mean will be pulled up towards the million, well above the £10,000 that most people earn, because the mean takes into account how big the values are and balances between them. It is the balancing point between all of the values. So, that's descriptive statistics.
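
As a rough illustration of the four summaries, here is a minimal Python sketch. The salary figures are invented so that they match the numbers used in the talk (a range from £10,000 to £100,000, a median of £40,000 and a mode of £10,000), and the helper name `describe` is hypothetical.

```python
from statistics import mean, median, mode

def describe(salaries):
    """Return the four summaries discussed above for a list of salaries."""
    return {
        "range": max(salaries) - min(salaries),  # highest minus lowest
        "median": median(salaries),              # middle value after sorting
        "mode": mode(salaries),                  # most frequently occurring value
        "mean": mean(salaries),                  # sum divided by count
    }

# Hypothetical salaries in pounds per year, invented for illustration.
salaries = [10_000, 10_000, 10_000, 25_000, 40_000, 55_000, 60_000, 80_000, 100_000]
summary = describe(salaries)

# Add one very high earner and summarise again.
with_outlier = describe(salaries + [1_000_000])

print(summary)
print(with_outlier)
```

With these made-up numbers, the single £1,000,000 salary moves the mean from about £43,000 to £139,000, while the median only moves from £40,000 to £47,500 and the mode stays at £10,000.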

Let's talk about the inferential kind. The inferential kind is the one we've been considering above, which is where we have some model, some relationship, some function which connects something we know to something we do not know, and it computes such a thing by combining what we know with parameters. What we would say here is y = ax + b. So, a is some number, b is some number, and if we can find just the right numbers for a and b, we will be able to calculate or estimate a target, y, say the profit, from x, say a person's age. And so, in our world of machine learning anyway, inference is estimating the value of parameters. If a and b are the true numbers, we try to estimate a and estimate b. It's a little unusual to put little hat symbols on the parameters, because the parameters are understood to be estimates anyway. What might the estimate be? Well, I might say one half times the age, plus one, so a is estimated as one half and b as one. That can give us an estimate for the profit, and so the job of inferential statistics is to come up with good values for these a and b, and hopefully to give you some measure of how wrong you might be as well.
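
To make "finding good values for a and b" concrete, here is a minimal sketch that estimates them by ordinary least squares. The talk does not say which estimation method is used, so least squares is an assumption here. The ages and profits are invented, and the function name `fit_line` is hypothetical.

```python
def fit_line(xs, ys):
    """Estimate a and b in y = a*x + b by ordinary least squares."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Spread of x around its mean, and how x and y vary together.
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx               # estimated slope
    b_hat = y_bar - a_hat * x_bar   # estimated intercept
    return a_hat, b_hat

# Invented data: roughly profit = 0.5 * age + 1, plus a little noise.
ages = [20, 30, 40, 50, 60]
profits = [11.5, 15.5, 21.0, 26.5, 30.5]

a_hat, b_hat = fit_line(ages, profits)

# Use the estimated parameters to predict the profit for a new age.
predicted_profit = a_hat * 35 + b_hat
print(a_hat, b_hat, predicted_profit)
```

On this made-up data the estimates come out near 0.49 and 1.4, close to but not exactly the one half and one used to generate it, which is the sense in which the parameters are estimates.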

So, inference is all about guesswork, really, but the key thing about being statistically competent is that your guesses come with errors. You are able to evaluate how well you have done and how likely you are to go wrong. That's the whole point of statistics, in some ways: to give you some sense of certainty and uncertainty, to characterise your uncertainty, and to give you good, or the best you can get, estimates in the face of all kinds of uncertainty, all kinds of things you might not know. Now, with inferential statistics comes the need for data, because data is how the machine or the practitioner will compute estimates for the parameters. For this we need large amounts of data, and that brings into the conversation the question of big data.
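
One common way to put a number on how wrong an estimate might be is its standard error. The sketch below assumes a straight-line model y = ax + b fitted by ordinary least squares, under the usual textbook assumptions. The talk does not name a specific method, so this is only one possible illustration. The data is invented and the function name `fit_line_with_error` is hypothetical.

```python
from math import sqrt

def fit_line_with_error(xs, ys):
    """Least-squares fit of y = a*x + b, plus the standard error of a."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    a_hat = sxy / sxx
    b_hat = y_bar - a_hat * x_bar
    # Residuals: how far each observed y is from the fitted line.
    residuals = [y - (a_hat * x + b_hat) for x, y in zip(xs, ys)]
    # Residual variance uses n - 2 because two parameters were estimated.
    residual_variance = sum(r ** 2 for r in residuals) / (n - 2)
    # Standard error of the slope: smaller when x is more spread out.
    se_a = sqrt(residual_variance / sxx)
    return a_hat, b_hat, se_a

# Invented data for illustration only.
ages = [20, 30, 40, 50, 60]
profits = [11.5, 15.5, 21.0, 26.5, 30.5]

a_hat, b_hat, se_a = fit_line_with_error(ages, profits)
print(f"a is about {a_hat:.2f}, give or take {se_a:.3f}")
```

Because the spread term `sxx` grows as more observations are added, the standard error generally shrinks with more data, which is one reason inference leads naturally to the question of how much data you have.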
