This Course delves into the theory behind statistics, distributions, and standardization, all of which give you a solid foundation upon which the field of data science is built. We look at a variety of aspects of statistics and how to use statistical tools to analyze and interpret data. You will then be walked through the NumPy library and how it can be used in a variety of real-world situations.
Learning Objectives
- Understand the different types of data and the relationships between them
- Understand the different ways of finding the average of a set of data
- Know which statistical tools are available for analyzing data
- Grasp the impact that the distributions of data have on data analysis
- Learn about standardization and its use cases
- Explore the NumPy library and its computational and statistical tools
Intended Audience
This Course is intended for IT professionals looking to learn more about data analytics and the NumPy library.
Pre-requisites
To get the most from this Course, you should already have some basic statistics knowledge as well as some programming experience.
So now we're going to talk about statistics. We're going to start off with fairly rudimentary summary statistics. We want to talk about what a variable is, thinking about measures of central tendency, and then moving on to measures of dispersion and things like that. So this should all be reasonably familiar to us, but it's worth thinking about it in a bit more detail. And then we're going to start looking at things like distributions of data, such as probability distributions. So what do we mean by variable? Variables for us, statistically speaking, are the characteristics of something, they are qualities. They are, for example, a country, a population, a phone call. All these types of things that are just general categories of data, these are what we call variables. The data themselves are the values the variables take. So in the mathematical sense, variables are more numeric, we think of them as numbers. In statistics, a variable can be anything that can vary, any category that can vary. So we have columns given by variables, observations given by rows, and then each entry is a datum, or data point. So we have two main broad branches of variable, and then we can further distinguish between them. We've got quantitative data and qualitative data. Quantitative is numeric, something that can be measured on a scale, and qualitative is categorical. And categorical data sometimes can be ordered and sometimes it can't be. We have nominal data, which are things like gender or eye color, categories that have no inherent ordering within them. We have ordinal data, which is categorical data that does have an inherent order, for example, high, medium, and low. Now, obviously, some of this is up to the statistician to decide whether they want to order that data or not. These are the two main types. We need to visualize these and work with them in different ways.
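As a quick illustration of rows as observations and columns as variables, here is a minimal sketch using NumPy (the library covered later in this Course). The dataset is entirely made up: the country names, satisfaction levels, and population figures are hypothetical, chosen only to show one nominal, one ordinal, and one quantitative variable side by side.

```python
import numpy as np

# Hypothetical mini-dataset: each row is an observation, each field a variable.
# "country" is nominal (no order), "satisfaction" is ordinal (low < medium < high),
# and "population" is quantitative.
people = np.array(
    [("UK", "high", 67_000_000),
     ("France", "medium", 68_000_000),
     ("Iceland", "low", 370_000)],
    dtype=[("country", "U10"), ("satisfaction", "U6"), ("population", "i8")],
)

print(people["country"])            # a nominal variable: one column of the dataset
print(people["population"].mean())  # quantitative variables support arithmetic
```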
Thinking about numerical data, following the pattern of the slide, there are going to be two branches of numeric data. What kinds of numeric data can we have? We have continuous data and we have discrete data. Continuous examples, such as age, height, and temperature, can take any value within a range of numbers, which means there is an infinite number of possible values they can take within any interval in that data. And then our discrete data, we can think of that more as a set: we have discrete actual data points that something can be. So the number of heads in 100 coin flips, we're going to have integers there. Or the number of children in a family, again, specific integers of numeric data. So these are the types or the forms of variables. We also have dependent and independent variables, and these describe whether something does or does not depend on certain factors. Generally in our modeling, we are trying to find out whether some variable depends upon a collection of independent variables, and potentially some overlapping, related variables. And they all have these various names: explanatory and explained, predictor and response, regressor and regressand. We tend to plot our dependent variable on the Y axis against our X, because we always want to visualize the change in Y with respect to a certain X or Xs. So in detail, this simply means that an explanatory variable might affect an explained variable. That's what most modeling aims to pick up on, specifically in machine learning. Machine learning is function approximation, and the function that we are trying to approximate is the function relating the explanatory variables to the explained variable. And then there's always that warning that correlation doesn't imply causation. I'm sure you've heard that many times. Most things in the world are correlated. If you were to map the correlation between the rain falling on the ground and my feet hitting the pavement when I walk, you'd likely find quite a high correlation, or a reasonable correlation. But that doesn't mean that, if I start walking faster, it rains heavier, even though it feels like it. And if it starts raining harder, I'm not going to start speeding down the pavement. If you're interested in this topic, then I would definitely recommend, in your spare time, having a look at the spurious correlations website, which basically has a lot of graphs relating two completely unrelated things together, like deaths of people getting tangled in bedsheets with per capita cheese consumption, et cetera. So when it comes to describing the relationships between variables, and trying to see if there are relationships between variables, we have plots available to us, such as this scatter plot.
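As a rough sketch of how such a scatter plot might be produced in code, here is a hypothetical hours-studied versus GPA dataset built with NumPy and plotted with matplotlib. The numbers are invented, and the impossible GPA value is planted deliberately to mirror the kind of data error discussed next.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(1)

# Hypothetical data: weekly study hours and GPA for 100 students.
hours = rng.uniform(0, 40, size=100)
gpa = np.clip(2.0 + 0.03 * hours + rng.normal(0, 0.5, size=100), 0, 4)
gpa[10] = 4.6  # an impossible GPA -- almost certainly a data-entry error

# A scatter plot reveals density, trend, and suspicious values all at once.
plt.scatter(hours, gpa)
plt.xlabel("Study hours per week")
plt.ylabel("GPA")
plt.show()

# The correlation coefficient quantifies the (linear) relationship,
# but on its own it says nothing about causation.
print(np.corrcoef(hours, gpa)[0, 1])
```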
So we can get lots of information from this graphic in terms of data density, and the fact that most people don't study for more than 30 hours a week. You've got some mega-studiers over here. In terms of relationship, it doesn't look like there is a strong correlation between these values, but that's not the be-all and end-all. I would argue there appears to be some relationship, but we don't have enough to support a link. We can also see that there's something wrong with this data: it's not possible to get a GPA higher than four, so this outlier is highly likely to be due to a data error. So scatter plots are fantastically versatile in terms of what they allow you to pick up on. We have outliers, we've got erroneous values, and you can also get information about the trend simply by plotting in two-dimensional space. Let's now think about the branches of statistics that are of interest to us. We have descriptive statistics and inferential statistics. Inferential statistics tend to be more the domain of the data scientist, and descriptive statistics are more the traditional mean, median, and mode, describing the data that you have, like historic data you are going to dive into and obtain information from. Inferential statistics is using machine learning algorithms, using sample data to extrapolate things about populations, and making predictions. All of that is inference. We all carry out statistical inference when we cook, because we try little spoonfuls of whatever we're cooking, and based upon that little spoonful, we will have an idea about the whole dish. That's sampling, using inference to get an idea of the dish as a whole from the tiny little bit that we're tasting. That's inference, inferential statistics. But what could go wrong when you're sampling a little bit of your soup, for example? If you've been putting seasonings in, salt and everything like that, and you take a sample from that area and it's really salty, there's a chance that you might have accidentally taken a very salty sample which doesn't actually tell you what your whole dish, your population, looks like. So it's always worth considering that when sampling.
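To make the soup-tasting analogy concrete, here is a minimal sketch of sampling with NumPy. The "population" of saltiness measurements is entirely hypothetical, and the sample sizes are arbitrary; the point is only that a single spoonful can mislead, while a larger, well-mixed sample usually lands closer to the true value.

```python
import numpy as np

rng = np.random.default_rng(4)

# A hypothetical "population": a saltiness measurement for every spoonful in the pot.
population = rng.normal(loc=5.0, scale=1.0, size=100_000)

# Tasting one spoonful versus tasting several from different places in the pot.
single_spoonful = rng.choice(population, size=1)
well_stirred_sample = rng.choice(population, size=50, replace=False)

print(population.mean())           # the true value we are trying to infer
print(single_spoonful.mean())      # one spoonful can easily be misleading
print(well_stirred_sample.mean())  # a bigger, well-mixed sample is usually closer
```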
So we're going to start with descriptive statistics, and then we'll move on later in the course to inferential statistics. So looking at the numbers on screen, what is the average of these numbers? What's the correct answer to this question? Really, the correct answer here is: which average? Which average are we talking about? With these numbers, which have been chosen specifically, each average gives a different figure. But which average is the most appropriate? The mode gives us five, the median gives us four, and the mean gives us three-point-something. An average is a way of describing a central tendency. What these do is take entire collections of data, columns of data, and squash them down into a single number that is representative of some notion of where the center of our data is. Now, each central tendency statistic describes a different kind of center point and a different kind of expected value. As for the median, here we've got a very helpful slide describing how we get the median, and the median is the central value in terms of ordering when we order our data. What the median tells us is the point at which we have as many observations below as we do above that center point. If we have extreme values in our data and we use the median, it isn't going to be skewed by those extreme values, or at most it is only very minimally skewed, simply by the addition of another data point. And then we have the mode. The mode gives us the most frequent value in our data. It gives us the expected value simply in terms of frequency. If we had an extreme value and we computed the mode, it wouldn't change. The mode is, we can argue, the least sensitive to extreme values, because the mode will only ever change if the most frequent value changes. So if you end up with an extreme value, you would need to have as many extreme values as you do the most frequent value for the mode to change, and your entire distribution would have shifted at that point anyway. In that sense, the mode is essentially unaffected by extreme values. And then we have the mean, an idea of the fair share of our data. We can think of it as the balancing point of a distribution.
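As a sketch, here is how those three averages can be computed with NumPy. The numbers below are hypothetical, chosen only so that the mode comes out as five, the median as four, and the mean as a three-point-something figure, echoing the kind of example on the slide.

```python
import numpy as np

data = np.array([1, 2, 4, 5, 5])  # hypothetical values, not the slide's exact numbers

mean = data.mean()          # 3.4
median = np.median(data)    # 4.0

# NumPy has no built-in mode; counting unique values gives the most frequent one.
values, counts = np.unique(data, return_counts=True)
mode = values[counts.argmax()]  # 5

print(mean, median, mode)
```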
So we compute it like this. If we have an extreme value and we're working with the mean, is the mean sensitive to that extreme value? Yes, it is, the mean is very sensitive to that extreme value. It's, in fact, the least reliable statistic when it comes to extreme values. And we can see this in these two examples here. On the left-hand side, we have a relatively non-skewed distribution, and we have the mean, median, and mode statistics: the median is 4.5, the mode is five, and the mean is 4.8. And when we add in this outlier, our median remains at 4.5, our mode remains at five, but our mean ends up going to 20. So the point being made is simply that any extreme value is going to completely transform the mean, and so it may no longer represent a central point. So which central value should we use, then? In each of these cases on the screen, which central value is the most appropriate? In the first one, we could use either the median or the mode, since they won't be skewed by the figure of one billion. For the five people in the room, the median is the best average figure. So the median is the most robust in the sense that it gives an idea of what the general center of the data looks like. In the next one, we have 50 students, 30 are 30 years old and the rest are 29 or 31. So the mode is useful in this case, particularly because the ages here are discrete data. Then we have five workers here who are earning increasing amounts of money. The mean for this would be approximately 220,000, which is completely out of line with what this data actually looks like. So for this one, again, the median is the obvious choice. And again, we wouldn't use the mode because we don't have a specific most frequent value. Next, we have four workers all earning these amounts, and then we have an owner earning, again, a million. Here, the median, the mean, or the mode would work, depending on what we're trying to pick up on. And then, for the final one, this is the only one for which the mean is actually the most representative value, because we have our data distributed in a way that keeps it all so close together; it's normally distributed, we can say. The rules are that, for nominal data, so for our unordered data, we want to use the mode, because you can't order nominal data. For ordinal data, the median is often the best statistic. For non-skewed interval or ratio data, it's best to use the mean. And for skewed interval or ratio data, the median tends to be the best. So we've answered the first five of these questions. Now let's look at the sixth point. If data has outliers, which measure of central tendency is most appropriate? That depends on the data type, but it would normally be the median or the mode. In a normally distributed dataset, that is, the classic bell curve, the mode, median, and mean are all the same. And for any dataset, which measure of central tendency has only one value? Well, that would be the mean, because with the mode, there can be more than one most frequent value, and with the median, there can be two middle values, whereas the mean is always going to give you one specific value.
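The sensitivity argument can be demonstrated in a few lines of NumPy. The data below are hypothetical, but adding a single extreme value shifts the mean dramatically while leaving the median and the mode essentially untouched.

```python
import numpy as np

def mode(a):
    """Most frequent value (the first one, if there are ties)."""
    values, counts = np.unique(a, return_counts=True)
    return values[counts.argmax()]

data = np.array([2, 3, 4, 5, 5, 5, 6, 7])  # hypothetical, roughly symmetric data
with_outlier = np.append(data, 120)        # one extreme value added

for name, a in [("original", data), ("with outlier", with_outlier)]:
    print(name, a.mean(), np.median(a), mode(a))
# The mean jumps from about 4.6 to about 17; the median and mode stay at 5.
```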
So now let's take a look at the mean. This is the mean, this is what it looks like, this is the formula to compute it. To start with, let's look at the sigma notation. At the bottom of a sigma we would usually have i equals whatever our start point is going to be. The value to the right of the sigma tells us which values we are going to sum up from our data, and the top tells us which index value we're going to go up to. So this simply says, for every xi in the column of xs, we're going to add them all up from the start to the end, where the length of our data, the number of rows in our column, or however we want to think of it, is equal to n. So we're just adding up. And we denote statistics relating to the population and to the sample using different letters. When we are looking at population statistics, we use Greek letters and we call them parameters. So the population parameter, namely the mean, is given by the Greek letter mu, whereas the sample mean, which is a statistic denoted by Roman letters, is given by x bar. Okay, so sample statistics are denoted by Roman letters, and population parameters by Greek letters. This helps us distinguish between them, and we should note that the only differences between these two computations are what the result is called, and that we have a capital N on the bottom here and a lower case n on the bottom here, simply giving the idea that the sample is going to be smaller than the population. What don't we get an idea of from central tendency statistics? One thing to note is that we don't get an idea of the spread of the data from central tendency statistics, so no idea of the variation or distribution. And how might we obtain insight into this? Well, in terms of visual measures, we have frequency distributions, scatter plots, bar plots, histograms, all that kind of thing. They're massively useful in exploratory data analysis. Frequency distributions are simply how often something occurs. Once we've got that, we get an idea of the overall shape of our distribution.
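Written out explicitly, the two formulas described above are, in the usual notation:

```latex
% Population mean: a parameter, Greek letter, capital N observations
\mu = \frac{1}{N}\sum_{i=1}^{N} x_i

% Sample mean: a statistic, Roman letters, lower-case n observations
\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i
```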
One way that we can visualize this is by using a dot plot, where the darker the dot is, the more data is there. So this data is skewed left, and it appears most of the data is centered around the 3.5-ish mark. As a point of interest, the mean is centered around here. This is the balancing point of our data: if this were a plank of wood, this is where it would balance, given all these data points placed on it. This gives us a limited idea of what our data looks like. We also have a stacked dot plot, which is very much a rudimentary histogram. It gives us more insight again into our distribution, so we can see that it's somewhat normally distributed, and again, there's a left skew. The higher the bars, the more points we have in a place. But we're just doing counting here. We have bar plots for our categorical variables. Now, a bar plot is not a histogram. Bar plots are specifically to do with categorical variables that can't necessarily be ordered, so male and female are non-ordered values. On the right-hand side, we have a relative frequency bar plot; on the left-hand side, we have the absolute frequency bar plot. The transformation doesn't change the shape whatsoever, it simply changes the units that we have on the left and the right-hand side. Now, the difference between a bar plot and a histogram is that histograms are essentially bar plots for continuous data. But we have to make a decision when we look at histograms in terms of bin width. Bin width can completely change the way that we view the data, and bin width can tell us what the distribution of our data actually is. So we just bucket between, for example here, zero and 10, count all the values that fall within this range, and then we have a plot representing the number of elements we have within that bin. So we can choose any bin size. But which of these histograms are actually useful? How many bins is too many, and how few is too few? These histograms could all be useful, but in terms of revealing the distribution to us, I would argue that the top right and bottom left are the most useful. But useful information could still be obtained from the other two. For example, the bottom right tells you about bunchings at certain points within your distribution, and the top left gives you an idea that most of your data is below the 50 mark. They're all somewhat useful, and bin size does control what you're going to read about your distribution, or what someone else sees. We can have many different kinds of distributions, many different shapes of plots, and depending on the distribution, you will have certain kinds of treatments available for your data. Out of these histograms here, is the mean a good metric of an expected value for the U-shaped histogram?
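Here is a small sketch of the bin-width effect, assuming some hypothetical normally distributed data and matplotlib for the plotting; the bin counts are arbitrary, chosen only to show how different the same data can look.

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(2)
data = rng.normal(loc=50, scale=15, size=500)  # hypothetical continuous data

# The same data with three different bin counts can look very different.
fig, axes = plt.subplots(1, 3, figsize=(12, 3))
for ax, bins in zip(axes, [2, 20, 200]):
    ax.hist(data, bins=bins)
    ax.set_title(f"{bins} bins")
plt.show()
```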
So if we think about it in terms of expected value, the mean is going to be in the middle. And what's the middle? The middle of this distribution is the least frequent value, so we can think of it almost as the least likely value in this sense. So this gives an idea of the shape of our distribution. We are also interested in how spread out our distribution is, so we have things like standard deviation and variance, or things such as the range and the interquartile range, and these give an idea of the spread around some central value. So we have the range, maximum minus minimum, that's easy enough. It gives you an idea of the overall width within which all of our data falls. Then we have the interquartile range. This is a slide discussing the interquartile range; it has a lot of information on it, but generally, the interquartile range simply means the range of values between the 25th and 75th percentiles, so where the middle 50%, the bulk of your data, is. In this example here, we take the 25th percentile, which is 3.5, and the 75th percentile, which is 7.5, and then our interquartile range is given by subtracting the 25th from the 75th. Simple enough. It tells us the range within which the central 50% of the data falls. To visualize the dispersion of our data, we can use box plots, which we can also produce in Python. They tell us where the central bulk of our data is, they tell us where the median is, and they give us an idea of our outlier boundaries as well. How do we compute outliers with a box plot? Well, for our upper boundary, it's quartile three plus 1.5 times the interquartile range, and for our lower boundary, it's quartile one minus 1.5 times the interquartile range. That's one metric for outliers; you can also use things like standard deviations for picking up on your outliers. And what are outliers? Outliers are often simply aspects of your dataset, and you have to account for them. Sometimes they can be erroneous. It's almost better when they are the result of error, because then you can just get rid of them or apply some sort of sensible treatment. It's worse when they're actually part of your data.
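Here is a minimal sketch of the interquartile range and the box-plot outlier fences computed with NumPy; the data values are made up, with one deliberately extreme point to trip the upper fence.

```python
import numpy as np

data = np.array([1, 2, 3.5, 4, 5, 6, 7.5, 8, 9, 40])  # hypothetical data, one extreme value

q1, q3 = np.percentile(data, [25, 75])
iqr = q3 - q1                       # interquartile range: spread of the middle 50%

lower_fence = q1 - 1.5 * iqr        # the box-plot outlier boundaries
upper_fence = q3 + 1.5 * iqr
outliers = data[(data < lower_fence) | (data > upper_fence)]

print(iqr, lower_fence, upper_fence, outliers)  # 40 falls outside the upper fence
```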
So variance gives us an idea of how far away our values are from the mean. We have x bar, which is the sample mean. What we are doing is summing all of the values in the dataset with the sample mean taken away from them, squared. It's squared because that gets rid of the negatives. And then we divide by n minus one, so we divide by the number of data points minus one. We subtract one from the denominator because we are fixing one of our values, we are fixing the mean, which reduces a degree of freedom within our data, so we have to subtract one from the denominator. It's called Bessel's correction, and it gives us higher variance values for samples than for populations, because when we have a population parameter, we don't actually need to subtract one. So if I were computing the population variance, then we would simply have the total number of data points on the denominator. But with the sample, because we're fixing a value, we have to take one away, because we're reducing the degrees of freedom. So this calculation for our data is simply running through, taking a data point and the mean, subtracting one from the other, squaring it, and then adding them all up. The issue with interpretability when it comes to variance is that the resulting value is in squared units compared to whatever we've computed the variance on. So if you had your data in units of meters, the variance is in terms of meters squared. The way that we can handle that is by taking the square root of the variance. So we should note that s squared is the name that we're giving to the sample variance. If we take the square root of s squared, then we get s, and s is the value of our standard deviation. It is simply the square root of our variance calculation, and that brings us back down into the world of the actual units. What we end up with is a metric of the average deviation away from, in our example here, the mean. That's what a standard deviation gives us.
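In NumPy, Bessel's correction is controlled by the ddof (delta degrees of freedom) argument; here is a quick sketch with a hypothetical sample.

```python
import numpy as np

sample = np.array([4.0, 6.0, 7.0, 9.0, 10.0])  # hypothetical sample values

# NumPy's default (ddof=0) divides by N, the population formula;
# Bessel's correction (ddof=1) divides by n - 1, giving a slightly larger value for samples.
print(np.var(sample))          # population-style variance, divides by N
print(np.var(sample, ddof=1))  # sample variance, divides by n - 1
print(np.std(sample, ddof=1))  # sample standard deviation, back in the original units
```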
So the standard deviation is a massively useful statistic. It becomes very useful when we are working on regression problems, traditional statistical regression, and it becomes very useful for us when we are doing hypothesis testing as well. And it's something that generally just informs us about our data. As for the formula, with the standard deviation we again have a population parameter denoted by a Greek letter, which is sigma, and a sample statistic denoted by a Roman letter, s, for our sample standard deviation here. Now, the central limit theorem is a fundamental result of statistics which says that, no matter what the distribution is, if you take enough samples and compute the sample mean of each, for example, then the resulting distribution of sample means will itself be normally distributed, no matter how crazy the distribution is that you are sampling from, as long as it has a finite variance.
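Here is a rough sketch of the central limit theorem in action, assuming a heavily skewed exponential distribution and arbitrary sample sizes chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)

# Draw many samples from a heavily skewed distribution (exponential),
# compute the mean of each sample, and look at the distribution of those means.
sample_means = rng.exponential(scale=2.0, size=(10_000, 50)).mean(axis=1)

# The original distribution is far from normal, but the sample means cluster
# symmetrically around the true mean (2.0), as the central limit theorem predicts.
print(sample_means.mean(), sample_means.std())
```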