Feature Preprocessing
2h 4m

Machine learning is a branch of artificial intelligence that deals with learning patterns and rules from training data. In this course from Cloud Academy, you will learn all about its structure and history. Its origins date back to the middle of the last century, but in the last decade, companies have taken advantage of the resource for their products. This revolution of machine learning has been enabled by three factors.

First, memory storage has become economical and accessible. Second, computing power has also become readily available. Third, sensors, phones, and web applications have produced a lot of data, which has contributed to training these machine learning models. This course will guide you through the basic principles, foundations, and best practices of machine learning. It is advisable to be able to understand and explain these basics before diving into deep learning and neural nets. This course is made up of 10 lectures and two accompanying exercises with solutions. This Cloud Academy course is part of the wider Data and Machine Learning learning path.

Learning Objectives

  • Learn about the foundations and history of machine learning
  • Learn and understand the principles of memory storage, computing power, and phone/web applications

Intended Audience

It is recommended to complete the Introduction to Data and Machine Learning course before taking this course.


The datasets and code used throughout this course can be found in the course's GitHub repo.



Hey guys, welcome back. In this video, I want to talk about two techniques that fall into the domain of preprocessing. The first one is called one-hot encoding, and it's useful when we have more than two classes. So let's load some data, weight and height here, and use the function get_dummies on the gender column. Okay, so let's see what happens. This function creates a new data frame with as many columns as there are unique values in the column that was passed. And for each column it will give zero or one, depending on whether the value in that row matches that column's category. So, to be explicit here, let's see how many values there were in the gender column. By now you should know that if I call the method unique, it tells me all the unique values in the gender column. And these are only male and female. So get_dummies creates two columns. I've set the prefix to be gender; if I don't do that, it will just call them female and male. Since I've set the prefix to be gender, the columns are named with the gender prefix followed by female and the gender prefix followed by male. And since the first rows are all males, they all have a one in the male column and a zero in the female column. This is useful when we have more than two categories as labels, as we will see later on. The other thing I want to mention is that neural networks work much better with features whose scale is close to one, or between zero and one. So in order to reach that, we can rescale our features, and there are various ways of doing it. We can rescale our features by hand; for example, we can express the height in feet instead of inches.
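The one-hot encoding step described above can be sketched roughly as follows. The small data frame and its column names (Gender, Height, Weight) are hypothetical stand-ins for the course dataset, which is not reproduced here, and dtype=int is passed only so that the output shows zeros and ones regardless of the pandas version.

```python
import pandas as pd

# Hypothetical stand-in for the course's weight/height data
# (column names and values are assumptions, not the real dataset).
df = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Height": [73.8, 68.8, 63.5, 61.9],    # inches
    "Weight": [241.9, 162.3, 131.0, 102.1],  # pounds
})

# List the distinct categories in the column.
categories = df["Gender"].unique()
print(categories)

# One column per category, holding 1 where the row matches it and 0 otherwise.
# prefix="Gender" names the columns Gender_Female and Gender_Male.
dummies = pd.get_dummies(df["Gender"], prefix="Gender", dtype=int)
print(dummies)
```

With more than two categories the same call simply produces more columns, one per unique value.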

We do that by dividing by the number of inches in a foot. Or we can rescale the weight to hundreds of pounds. This will give us ranges for features that are comparable to one, so the weight will vary between 0.61 and 2.70 and the height will vary between 4.5 and about 6.5. By the way, this is good enough, but if we want to rescale our features to be precisely between zero and one, we can also use the MinMaxScaler from scikit-learn's preprocessing module. And the way it works is we first define it. So, we instantiate the MinMaxScaler and then call the method fit_transform on each of the columns that we want to transform. If we do this, the MinMaxScaler rescales our features to be exactly between zero and one. So, the minimum value is zero now, and the maximum value is one. Note that I've stored those in separate columns. Okay, and I'm using the describe function here. Another way of normalizing the features is to use standard normalization. What this does is it scales the data so that the mean of the data is zero and the standard deviation is one. So, see now in the last column, the weight standard scaler, the mean here is zero, whereas before it had some other value, and the standard deviation is one. So, let's plot the histograms of these features to compare them, and you can see that the histograms have exactly the same shape, but the scale of the data has changed.
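As a minimal sketch of the three rescaling approaches just described, assuming a small made-up data frame in place of the course data (the values and the new column names are illustrative only):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Hypothetical heights (inches) and weights (pounds); not the course dataset.
df = pd.DataFrame({
    "Height": [55.0, 62.0, 66.0, 70.0, 80.0],
    "Weight": [65.0, 120.0, 160.0, 190.0, 270.0],
})

# 1) Rescale by hand: inches -> feet, pounds -> hundreds of pounds.
df["Height_feet"] = df["Height"] / 12.0
df["Weight_100lbs"] = df["Weight"] / 100.0

# 2) Min-max scaling: maps each column onto the range [0, 1].
mms = MinMaxScaler()
df["Height_mms"] = mms.fit_transform(df[["Height"]]).ravel()
df["Weight_mms"] = mms.fit_transform(df[["Weight"]]).ravel()

# 3) Standard scaling: subtracts the mean and divides by the standard deviation.
ss = StandardScaler()
df["Height_ss"] = ss.fit_transform(df[["Height"]]).ravel()
df["Weight_ss"] = ss.fit_transform(df[["Weight"]]).ravel()

# describe() reports the sample standard deviation (ddof=1), while
# StandardScaler divides by the population one (ddof=0), so the "std" row
# can read slightly above 1 on small samples.
print(df.describe().round(2))
```

Each scaler is refit per column here, so the fitted object only remembers the last column it saw; keeping one scaler per feature is the safer choice if the same transformation has to be applied to new data later.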

So, we see that the height, the raw feature, has values between about 55 and 80 inches. But when we rescale it to feet, we have heights between 4.5 and 6.5 feet, which is a totally reasonable scale. And in the MinMaxScaler version it is exactly between zero and one. Whereas in the standard scaler version of the feature, it has the mean centered on zero and a standard deviation of approximately one. Okay, so all these methods are valid; the important thing is that we should give our model features that are more or less close to one in size. So the raw height in inches would not be a great feature for a more complex neural network model; we should rescale it first. Thank you for watching and see you in the next video.

About the Author

I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.