
# Standardization in Data Science


This course delves into the theory behind the topics of statistics, distributions, and standardization, all of which give you a solid foundation upon which the field of data science is built. We look at a variety of aspects of the field of statistics and how to use statistical tools to analyze and interpret data. You will then be walked through the NumPy library and how it can be used in a variety of real-world situations.

**Learning Objectives**

- Understand the different types of data and the relationships between them
- Understand the different ways of finding the average of a set of data
- Know which statistical tools are available for analyzing data
- Grasp the impact that the distributions of data have on data analysis
- Learn about standardization and its use cases
- Explore the NumPy library and its computational and statistical tools

**Intended Audience**

This course is intended for IT professionals looking to learn more about data analytics and the NumPy library.

**Prerequisites**

To get the most from this course, you should already have some basic statistics knowledge as well as some programming experience.

Hello and welcome back. Let's talk about standardization and why it's a useful thing. We can say x is approximately normal, or x follows a normal distribution, by using tilde notation and then specifying the parameters mu and sigma, the population mean and the population standard deviation. We can also obtain something called the standard normal distribution, where we move our data so it's centered around zero and one unit is one standard deviation away from the mean, and we can do this using a simple transformation. A z-score is often what we use to standardize our data.

A massive benefit of standardization is that it makes our data comparable. Another huge benefit is this: we start with a very specific distribution, a distribution of data that is specifically to do with the problem at hand. You might be trying to model, for example, the happiness of people, and your data has a certain distribution. It's got a certain mean. It's got a certain standard deviation, and that means that the areas under the graph are entirely unique for different portions of your data. But that also means that we don't really know anything about that distribution. Everything has to be computed for the first time when you're working with that dataset.

However, if you have normally distributed data and you perform a standardization using something like a z-score, we know lots and lots of things about the standard normal distribution. It has known area values under certain segments of it. We know where most of the data is going to lie. We know how to obtain, very quickly and easily, probability values of data being above or below certain points. It's a widely studied distribution, centered at a known point with known areas. So this distribution of exam scores is centered at 23.
One standard deviation above appears to be 30, and one below is 16, but it would be quite hard for us to get an idea of the area under the curve between 23 and 30. If we perform a standardization, then we know the areas under every single segment of this curve. It's massively useful, and it gets rid of the units.

Why does it get rid of the units? Simply by the nature of the calculation for standardization. The z-score is calculated as the value minus the mean, divided by the standard deviation. If you take something with units away from something else with the same units, what you end up with is still in those units. Then when you divide by the standard deviation, which is in those units again, the units cancel out and you end up in a dimension-free space. That's how standardization works. So going from this exam score, where the units are points or whatever we want to mark the exam in, we've now transformed this into a space where the only unit is the number of standard deviations away from the mean. So this here is one standard deviation away, and this is two standard deviations away from the mean. We've quantified our entire dataset simply in terms of the number of standard deviations away from a point, and because of that, we can compare it with anything. We can run hypothesis tests on it, and so on and so forth.

So we have known probability values for segments of our data. For this one here, we know that above zero we have 50% of our population. We know that 95% of the data falls between minus 1.96 and 1.96 standard deviations away from the mean. Now for this one, we've had to work out a very specific figure. We've had to compute it: 95% of the data falls between 9.1 and 36.9. Every single distribution that we take from a dataset will be completely different, but if we standardize, we end up with a world that we understand, that we know everything about. So this is very useful.
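To make the transformation concrete, here is a minimal sketch using Python's standard-library `statistics.NormalDist`. It assumes the exam scores are normal with mean 23 and standard deviation 7 (read off the 16 / 23 / 30 marks in the example); the helper name `z_score` is illustrative only. Because 7 is a rounded value, the 95% interval it produces is close to, but not exactly, the 9.1 to 36.9 quoted in the lecture.

```python
from statistics import NormalDist

# Assumed parameters, read off the lecture's exam-score example:
# mean 23, with 30 and 16 sitting one standard deviation either side.
exam = NormalDist(mu=23.0, sigma=7.0)
standard = NormalDist()  # standard normal: mean 0, standard deviation 1

def z_score(x, dist):
    """(value - mean) / standard deviation; points / points is unit-free."""
    return (x - dist.mean) / dist.stdev

z_30 = z_score(30.0, exam)
print(z_30)  # 1.0

# The area between 23 and 30 on the exam curve equals the area
# between z = 0 and z = 1 on the standard normal curve.
area_raw = exam.cdf(30.0) - exam.cdf(23.0)
area_std = standard.cdf(1.0) - standard.cdf(0.0)
print(round(area_raw, 4), round(area_std, 4))  # 0.3413 0.3413

# Half of the population lies above z = 0.
print(1.0 - standard.cdf(0.0))  # 0.5

# Central 95%: the z cut-off, then mapped back to exam points.
z_95 = standard.inv_cdf(0.975)
low = exam.mean - z_95 * exam.stdev
high = exam.mean + z_95 * exam.stdev
print(round(z_95, 2), round(low, 2), round(high, 2))  # 1.96 9.28 36.72
```

The point of the sketch is that the raw-score areas have to be computed for this particular mean and standard deviation, whereas the z-score areas are the same fixed numbers for every normally distributed dataset.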
We have an example of why this is a good thing in standardized marking of exam scores. Scores are standardized so that, instead of failing people who fall below an arbitrary threshold of something like 50%, you fail people who are a sufficient number of standard deviations below the body of people who took the exam. In this case here, we're failing two people, whereas with an arbitrary threshold we would fail a lot more. So this is how they standardize marks. There are pros and cons to all of these approaches, but this is, I would say, a more statistically rigorous way of evaluating performance.
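The contrast between the two marking rules can be sketched with NumPy. Everything below is an assumption for illustration: the cohort is made-up data, not the scores shown in the lecture, and the cut-off of 1.5 standard deviations below the mean is a hypothetical choice, since the lecture does not say what counts as "sufficient".

```python
import numpy as np

# Hypothetical cohort of exam scores (percent); not data from the lecture.
scores = np.array([10, 14, 40, 42, 44, 46, 48, 52, 56, 58], dtype=float)

FIXED_PASS_MARK = 50.0  # the "arbitrary threshold" approach
Z_CUTOFF = -1.5         # assumed cut-off, in standard deviations

# Standardize against the cohort itself. np.std defaults to the
# population standard deviation (ddof=0).
z = (scores - scores.mean()) / scores.std()

fails_fixed = scores < FIXED_PASS_MARK
fails_standardized = z < Z_CUTOFF

print(int(fails_fixed.sum()))         # 7
print(int(fails_standardized.sum()))  # 2
print(scores[fails_standardized])     # [10. 14.]
```

With this cohort, a fixed 50% pass mark fails seven of the ten candidates, while the standardized rule fails only the two who sit well below everyone else.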
