Machine Learning Pipelines with Scikit-Learn
This course is the first in a two-part series covering how to build machine learning pipelines using scikit-learn, a machine learning library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Understand the different preprocessing methods in scikit-learn
- Perform preprocessing in a machine learning pipeline
- Understand the importance of preprocessing
- Understand the pros and cons of transforming the original data within a machine learning pipeline
- Deal with categorical variables inside a pipeline
- Manage the imputation of missing values
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should be familiar with Python, as well as with the basics of machine learning. It's recommended that you take our Introduction to Machine Learning Concepts course before taking this one.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Congratulations, you've reached the end of this course. We've gone through quite a few things here, so let's have a quick recap of what you've learned. We covered the family of transformers that are typically used to preprocess data, and that therefore appear in a machine learning pipeline before a model is fit.
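As a minimal sketch of that idea, the following chains a scaler and a classifier so the preprocessing step runs automatically before the model fit. The dataset and estimator choices here are illustrative, not from the course material:

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_iris(return_X_y=True)

# The scaler is fit on the training data and applied
# before the classifier sees it, all in one object.
pipe = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
pipe.fit(X, y)
print(pipe.score(X, y))
```

Because the transformer lives inside the pipeline, calling `fit` handles both the preprocessing and the model training in a single step.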
We learned how to apply the StandardScaler, which is a common baseline for preprocessing heterogeneous variables. We covered alternative techniques, such as the RobustScaler, and discussed when each should be used. We applied techniques for encoding categorical variables into dummy variables, and saw how to select the columns to encode via the make_column_transformer function.
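To illustrate column-wise preprocessing, here is a small sketch combining both scalers with one-hot encoding via `make_column_transformer`. The toy dataframe and column names are invented for the example:

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.preprocessing import OneHotEncoder, RobustScaler, StandardScaler

# Hypothetical data: "income" contains an outlier, so RobustScaler
# (which uses the median and IQR) is a sensible choice for it.
df = pd.DataFrame({
    "age": [25, 32, 47, 51],
    "income": [40_000, 52_000, 250_000, 61_000],
    "city": ["Rome", "Milan", "Rome", "Turin"],
})

ct = make_column_transformer(
    (StandardScaler(), ["age"]),
    (RobustScaler(), ["income"]),
    (OneHotEncoder(), ["city"]),     # 3 cities -> 3 dummy columns
)
X = ct.fit_transform(df)
print(X.shape)  # 1 scaled age + 1 scaled income + 3 dummies = 5 columns
```

Each transformer is applied only to the columns listed next to it, which is what lets you mix scaling and encoding in a single preprocessing step.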
Finally, we covered the imputer classes, which can be used to impute missing values in the data. In particular, we covered the simple univariate imputer and two multivariate techniques, namely the iterative imputer and the KNN imputer. Those two multivariate methods should be used when you assume there might be some dependence among the variables in your dataframe. I hope you enjoyed this course and found it useful. If you have any feedback on it at all, please feel free to reach out to us at firstname.lastname@example.org. Thanks for watching.
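The three imputers mentioned above can be sketched on a tiny array with missing values; the data here is made up for illustration. Note that `IterativeImputer` is still experimental in scikit-learn and requires an explicit enabling import:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer, KNNImputer, SimpleImputer

X = np.array([[1.0, 2.0],
              [3.0, np.nan],
              [5.0, 6.0],
              [np.nan, 8.0]])

# Univariate: each column filled independently with its own mean.
simple = SimpleImputer(strategy="mean").fit_transform(X)

# Multivariate: each feature is modeled from the others,
# useful when you assume dependence among variables.
iterative = IterativeImputer(random_state=0).fit_transform(X)
knn = KNNImputer(n_neighbors=2).fit_transform(X)

print(simple)
```

For the univariate imputer, the missing entry in the first column is replaced by the mean of the observed values 1, 3, and 5, i.e. 3.0, regardless of what the second column contains.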
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.