This course delves into the theory behind the topics of statistics, distributions, and standardization, all of which give you a solid foundation upon which the field of data science is built. We look at a variety of aspects of the field of statistics and how to use statistical tools to analyze and interpret data. You will then be walked through the NumPy library and how it can be used in a variety of real-world situations.
- Understand the different types of data and the relationships between them
- Understand the different ways of finding the average of a set of data
- Know which statistical tools are available for analyzing data
- Grasp the impact that the distributions of data have on data analysis
- Learn about standardization and its use cases
- Explore the NumPy library and its computational and statistical tools
This course is intended for IT professionals looking to learn more about data analytics and the NumPy library.
To get the most from this course, you should already have some basic statistics knowledge as well as some programming experience.
Hello, and welcome back. We're now going to look at distributions. We'll start with probability distributions and the normal distribution, and then we'll speak a little about the central limit theorem and the standard normal distribution.
So, probability distributions. As with data, we have two types of distribution: discrete distributions and continuous distributions. In both cases, we have functions that describe the probability of getting each outcome. For a discrete distribution, the probability of getting any outcome can be described explicitly by a probability function. For a continuous distribution, you have to use a different kind of function.
When we have discrete outcomes, we have something called a probability mass function. A classic example of this is dice rolls. On this slide, you can see we have two standard dice with six faces. Whenever you roll them, there are more ways of getting a total of seven than of any other number, which is why our distribution looks the way it does here. We start with the number of ways of getting each outcome, and if we transform this over to probability space, we can simply describe it as the likelihood of getting each value whenever we roll two fair dice. This represents a discrete probability distribution. For example, the probability of getting a total of six when you roll the dice is 0.14, which we can read from the graph on the right-hand side. It's a mass function: it gives the actual probability of getting a value, and we can obtain it from our frequencies.
The other type of function we can work with is the probability density function, which describes continuous data. It works slightly differently. What is the probability of someone being seven years old?
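Before answering that question, the two-dice example above can be made concrete with a minimal NumPy sketch. It counts the number of ways of reaching each total and then divides by the number of equally likely outcomes to move into probability space. The variable names are illustrative and not taken from the course materials.

```python
import numpy as np

# All 36 equally likely outcomes of rolling two fair six-sided dice.
faces = np.arange(1, 7)
totals = (faces[:, None] + faces[None, :]).ravel()

# Number of ways of reaching each total (2 through 12).
values, ways = np.unique(totals, return_counts=True)

# Probability mass function: counts divided by the number of outcomes.
pmf = ways / ways.sum()

for value, count, prob in zip(values, ways, pmf):
    print(f"total={value:2d}  ways={count}  probability={prob:.3f}")
```

The total of seven has the most ways (six of the 36 outcomes), and the total of six comes out at 5/36, which rounds to the 0.14 read from the slide.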
The probability of someone being exactly seven years old is zero. Because this is a probability density function, the probability of any exact value in a continuous distribution, or continuous domain, is always zero. If you think about it, what is the probability of someone being 7.000, with an infinite number of zeros after the decimal point, years old? There's no chance at all; it's infinitesimal. We obtain our probabilities by getting the area under the curve. If I wanted the area under the curve between seven and seven, again, that's zero. But we could obtain the probability of someone being between seven and eight years old, that is, all the values from 7.0 up to 7.9999999, and the way we would do that is by using an integral. We would integrate between seven and eight to obtain the probability of falling within the year seven, of someone being seven years old. That's why we describe it as a density: we have to work out areas under curves.
So let's continue looking at statistics. We've been talking a bit about probability distributions and probability mass functions. Now, thinking about standard deviation: when we look at the normal distribution, we know that about 34% of observations lie within one standard deviation above the mean and about 34% within one standard deviation below it, so overall about 68% are within one standard deviation of the mean. This is something we know about Gaussian distributions, and it's part of the reason why it's incredibly helpful to model data as being Gaussian, if we can.
So what does this mean to us? Gaussian distributions are simply continuous probability distributions. What this means is that the probability of a value being above the mean is 50%, the probability of a value being below the mean is 50%, and the probability of it being within our domain is 100%, so the total area under this curve has to be one.
Thinking about sampling, as we sample, what effect does sample size have on standard deviation?
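Before turning to that question, the area-under-the-curve idea above can be checked numerically. The following is a minimal sketch using the standard normal density and a simple trapezoid rule; the helper names `normal_pdf` and `area_under_pdf` are made up for this illustration and are not NumPy functions.

```python
import numpy as np

def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution with the given mean and standard deviation."""
    z = (x - mean) / std
    return np.exp(-0.5 * z**2) / (std * np.sqrt(2.0 * np.pi))

def area_under_pdf(lower, upper, mean=0.0, std=1.0, n_points=100_001):
    """Approximate the integral of the density from lower to upper (trapezoid rule)."""
    x = np.linspace(lower, upper, n_points)
    y = normal_pdf(x, mean, std)
    return float(np.sum(0.5 * (y[1:] + y[:-1]) * np.diff(x)))

within_one_sd = area_under_pdf(-1.0, 1.0)   # roughly 68%
above_mean_one_sd = area_under_pdf(0.0, 1.0)  # roughly 34%
single_point = area_under_pdf(1.0, 1.0)     # an exact value has zero width
total = area_under_pdf(-8.0, 8.0)           # tails beyond 8 standard deviations are negligible

print(f"P(-1 < Z < 1) ~ {within_one_sd:.4f}")
print(f"P( 0 < Z < 1) ~ {above_mean_one_sd:.4f}")
print(f"P(Z = 1)      = {single_point}")
print(f"total area    ~ {total:.4f}")
```

The same approach applies to any interval, such as the seven-to-eight range in the age example, once a density for the quantity has been chosen.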
What tends to decrease as the sample size increases is the standard deviation of the sample mean, known as the standard error. As we take more and more data into a sample, especially if it comes from a normal distribution, the sample mean settles more closely around the population mean. It's more likely for you to get values around the mean than it is for you to get values further away from the mean, so with a larger sample the extreme values have less pull on the average. The standard deviation of the individual values does not shrink in the same way; it settles towards the population standard deviation. So as we increase the sample size, it is the spread of our estimate of the mean that gets smaller.
A lot of things tend to be normally distributed, although often not as many things as we'd like. The normal distribution can be described purely in terms of the mean and standard deviation: the function that generates it depends on those two values alone. So things like height, test scores in a large class, and errors in measurements: why do they tend to be normally distributed? Let's have a look.
Let's look at this using the example of the height of an adult male. The average height of an adult male is going to be around maybe five foot nine, five foot eight, shall we say? We specifically choose one gender, and let's say that we've got adult males over 18 years old, and we're going to assert that their height is normally distributed, whatever the parameters of the distribution are. So how many factors are involved in determining someone's height? First of all, we could say that genetics play a role in the height of a person, and the environment also has an influence on height, but we could also list things like air pressure, diet, and food, and we could go on and on listing all the variables that are involved in determining someone's height. So there are a lot of variables, and we're going to assume that most of them are independent, that determine the overall height of someone.
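Returning briefly to the sample-size question at the start of this passage, a small simulation can help separate two quantities that are easy to conflate: the standard deviation of the values inside a sample, and the spread of the sample mean across repeated samples. This is a sketch under assumed values; the population mean of 175 and standard deviation of 7 (roughly heights in centimetres) are hypothetical and chosen only for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=0)  # arbitrary seed, for repeatability only

# Hypothetical population parameters (not from the course).
population_mean, population_std = 175.0, 7.0
n_repeats = 2_000  # number of independent samples drawn at each sample size

results = {}
for sample_size in (10, 100, 1_000):
    samples = rng.normal(population_mean, population_std,
                         size=(n_repeats, sample_size))
    # Average standard deviation of the values within a sample.
    typical_sample_std = samples.std(axis=1, ddof=1).mean()
    # Standard deviation of the sample means across the repeated samples.
    spread_of_sample_means = samples.mean(axis=1).std(ddof=1)
    results[sample_size] = (typical_sample_std, spread_of_sample_means)
    print(f"n={sample_size:5d}  sample std ~ {typical_sample_std:.2f}  "
          f"spread of sample means ~ {spread_of_sample_means:.2f}")
```

In runs like this, the within-sample standard deviation stays close to the population value, while the spread of the sample means shrinks roughly in proportion to one over the square root of the sample size.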
So the cumulative effect of this is that when you have lots and lots of random variables determining some value, you're going to get, more often than not, a Gaussian distribution. There are many combinations of these factors involved in determining your height, which means that most people are likely to end up near the middle. But not everyone is going to end up there. The person who is of average height may have had good genetics and grown up in a good environment, but maybe not had access to good food or something like that. The person who is a tiny bit taller than they are will have had access to good food, but maybe didn't have as good genetics, or whatever strand of genes they may have had. The more extreme you get within this distribution, the more factors have gone "right", in inverted commas, for you, making you taller. But because it's essentially a random process, there is a bulk around which most observations will gather, and that is the average height of a human man. So generally, if something is the result of lots of random independent variables, we will end up with a normal distribution.
As well as things like human height, scores on a test given to a class are another example. Again, there are many factors that determine your score on a test, for example sleep, hours spent studying, genetics, how much you enjoy the subject, et cetera. All these various factors come into play to produce some sort of average performance, and then you have people above the average and below the average. So that's why the Gaussian distribution pops up so often; it's the normal distribution.
Now, let's ask ourselves the question, are all samples the same? This graphic here shows that samples are not the same. This is how we lead into the central limit theorem.
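Before moving on to the central limit theorem, the many-independent-factors argument above can be sketched as a simulation. Each simulated person's outcome is the sum of many small, independent factors, and each factor on its own is deliberately not normal. The number of factors and their uniform range are arbitrary assumptions made for this illustration, not a model of real height.

```python
import numpy as np

rng = np.random.default_rng(seed=1)  # arbitrary seed, for repeatability only

n_people = 50_000
n_factors = 50  # hypothetical number of independent influences

# Each factor is uniform on [-1, 1]: flat, with no bell shape at all.
factors = rng.uniform(-1.0, 1.0, size=(n_people, n_factors))

# The outcome is the cumulative effect of all the factors.
outcome = factors.sum(axis=1)

def fraction_within(values, n_std):
    """Fraction of values lying within n_std standard deviations of their mean."""
    return float(np.mean(np.abs(values - values.mean()) < n_std * values.std()))

single_factor_1sd = fraction_within(factors[:, 0], 1)
outcome_1sd = fraction_within(outcome, 1)
outcome_2sd = fraction_within(outcome, 2)

print(f"single factor within 1 sd: {single_factor_1sd:.3f}")
print(f"summed outcome within 1 sd: {outcome_1sd:.3f}")
print(f"summed outcome within 2 sd: {outcome_2sd:.3f}")
```

A single uniform factor puts roughly 58% of its values within one standard deviation, whereas the summed outcome lands close to the 68% and 95% figures associated with a Gaussian.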
Every time you take a sample, it is reasonable to assume that you aren't going to get exactly the same estimate of a population parameter, the sample mean for example. In our example here, we've taken three samples, and we've got three completely different central statistics. But what we do have is this thing called the central limit theorem, the definition of which is shown here on the screen: given a sufficiently large sample size from a population with a finite level of variance, the mean of all samples from the same population will be approximately equal to the mean of the population.
Hopefully that makes sense, but all it says is this: if you take enough samples of sufficient size from a population with a finite level of variance, the mean of all these sample means is going to tend towards the population mean. In fact, as you take more and more samples, the distribution of sample means is going to be Gaussian; it's going to be a bell curve, it's going to be normally distributed. So this fundamental theorem of statistics allows us to estimate population parameters by taking enough samples of sufficient size. That's how we can infer population statistics.
If you have an exponential distribution, if you have a U-shaped distribution, et cetera, whatever the shape, if you continue to take samples and continue to take their means, you will get a distribution centered around the population parameter. If you continually take samples from a population, you'll end up with a structure of sample means that will tend to be normally distributed around the population parameter. This is the general idea: you sample, you sample, you sample, and it is simply a fact of life that you will get this kind of structure.
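The behaviour described above can be sketched with a short simulation. It draws repeated samples from an exponential population, which is strongly skewed, and looks at how the sample means are distributed. The scale of 2.0, the sample size of 50, and the number of samples are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=2)  # arbitrary seed, for repeatability only

# Exponential population: for this distribution the mean and the
# standard deviation are both equal to the scale parameter.
population_mean = 2.0
sample_size = 50
n_samples = 20_000

samples = rng.exponential(scale=population_mean, size=(n_samples, sample_size))
sample_means = samples.mean(axis=1)

centre = sample_means.mean()
spread = sample_means.std(ddof=1)
expected_spread = population_mean / np.sqrt(sample_size)  # sigma / sqrt(n)

# For a bell curve, about 68% of values lie within one standard deviation.
within_one_sd = float(np.mean(np.abs(sample_means - centre) < spread))

print(f"mean of sample means   ~ {centre:.3f} (population mean {population_mean})")
print(f"spread of sample means ~ {spread:.3f} (sigma/sqrt(n) = {expected_spread:.3f})")
print(f"fraction within 1 sd   ~ {within_one_sd:.3f}")
```

Even though the individual values are far from normal, the sample means cluster around the population mean with a roughly bell-shaped spread, and plotting a histogram of `sample_means` would show the same thing visually.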
About the Author
Delivering training and developing courseware for multiple aspects across Data Science curriculum, constantly updating and adapting to new trends and methods.