The Bias-Variance Trade Off
1h 9m

This course is the second in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.

Learning Objectives

  • Explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model
  • Understand the concept of the bias-variance trade-off and regularized ML models
  • Explore linear models for classification and how to evaluate them 
  • Learn how to choose a model and fit that model to a dataset

Intended Audience

This course is intended for anyone interested in machine learning with Python.


To get the most out of this course, you should have first taken Part One of this two-part series.


The resources related to this course can be found in the following GitHub repo:


Welcome back. In this lecture, we're going to cover a very important aspect of any machine learning pipeline, and that's the bias-variance trade-off. If you remember, in the last lecture we fitted a linear regression model, and we saw that a linear model performs pretty well on the Boston dataset.

Linear models typically work very well when the dataset has more rows than columns, where the columns are also known as features. However, there are many cases where datasets are wide, meaning that the number of features is greater than the number of rows. When this happens, the ordinary least squares (OLS) model typically does not perform very well, because it has enough free parameters to fit the training data exactly, noise included. Therefore we need different models, and one of those is the regularized model.
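To make the point about more features than rows concrete, here is a minimal sketch on synthetic data. The dataset, its sizes, and the coefficient values are made up for illustration and are not the course's dataset; the only claim is that OLS can fit the training half almost perfectly while doing much worse on held-out rows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Hypothetical "wide" dataset: 60 rows, 200 features, only 5 of which matter.
n_rows, n_features = 60, 200
X = rng.normal(size=(n_rows, n_features))
true_coef = np.zeros(n_features)
true_coef[:5] = [3.0, -2.0, 1.5, 1.0, -1.0]
y = X @ true_coef + rng.normal(scale=1.0, size=n_rows)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0
)

# With more features than training rows, OLS can interpolate the training data.
ols = LinearRegression().fit(X_train, y_train)
train_r2 = ols.score(X_train, y_train)
test_r2 = ols.score(X_test, y_test)

print(f"train R^2: {train_r2:.3f}")
print(f"test  R^2: {test_r2:.3f}")
```

The gap between the two scores is the symptom to look for: a near-perfect training fit tells you very little when the model has more parameters than observations.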

But before introducing a regularized model, we have to understand the bias-variance trade-off. Bias is defined here as the difference between the model's average prediction and the true value that we aim to predict. A model with high bias pays very little attention to the training data and oversimplifies the problem. This typically leads to high error on both the training and test data. Biased models tend to be very simple, which is what we arrive at under extreme regularization.

To give you a little more context, think of a very simple model, such as a constant line fitted to your data. This oversimplifies reality, and it doesn't learn anything about the underlying data and its patterns. Variance, by contrast, is the variability of a model's prediction: how much the prediction changes when the model is trained on a different sample of data. A model with very high variance pays a lot of attention to the training data and does not generalize to data it hasn't seen before, namely the new data you want predictions for. As a result, such models perform pretty well on the training data, but not so well on the test data, where their error rates can be very high. When you find yourself in a scenario like this, you are typically facing overfitting.

So, if you remember, as I said before, when we fit an OLS model, we minimize an objective function that is based on the mean squared error. So, why do we choose the mean squared error in OLS? Well, we have two explanations for this. First, we have a mathematical explanation. The mean squared error of an estimator can always be written as the sum of its variance and its squared bias (for predictions of noisy data there is also an irreducible noise term). When the true relationship is linear and the usual assumptions hold, the OLS estimator is unbiased, so its bias is equal to zero. Therefore, all of the reducible error of OLS comes from the variance, and OLS simply looks for the parameters that minimize the sum of squared errors on the training data. The second explanation is a sort of interpretability.
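Written out, the decomposition mentioned above is usually stated with the bias squared. The notation here is an assumption, since the lecture does not fix one: the data follow y = f(x) + ε with zero-mean noise of variance σ², the fitted model is f̂, and the expectation is taken over training samples and noise.

```latex
\mathbb{E}\big[(y - \hat{f}(x))^2\big]
  = \underbrace{\big(\mathbb{E}[\hat{f}(x)] - f(x)\big)^2}_{\text{bias}^2}
  + \underbrace{\mathbb{E}\big[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2\big]}_{\text{variance}}
  + \underbrace{\sigma^2}_{\text{irreducible noise}}
```

When the first term is zero, as for OLS under a correctly specified linear model, the only part of the error the model can still reduce is the variance.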

Think about the bias in terms of accuracy. Models with high bias are not accurate at all, right? Because they're not going to learn anything about the training data, their predictions are systematically off. And you can think of the variance as a sort of precision. The variability around a prediction depends on the model, and you are going to get different values when the same model is trained on different data. Therefore, you can interpret variance as a sort of precision measure for your model.
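One way to see both quantities is to refit the same model on many fresh training samples and look at its predictions at a single point. The sketch below does that under assumptions that are not from the lecture: a sine curve as the true function, a noise level of 0.3, and two stand-in models (a constant-mean predictor and a fully grown decision tree).

```python
import numpy as np
from sklearn.dummy import DummyRegressor
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(42)


def true_function(x):
    # Assumed ground truth, chosen only for illustration.
    return np.sin(2 * np.pi * x)


x_query = np.array([[0.25]])  # point at which predictions are compared
noise_sd = 0.3
n_repeats, n_train = 500, 40


def bias_and_variance(make_model):
    """Estimate squared bias and variance of the prediction at x_query."""
    preds = np.empty(n_repeats)
    for i in range(n_repeats):
        # Draw a fresh training sample each time.
        x = rng.uniform(0, 1, size=(n_train, 1))
        y = true_function(x).ravel() + rng.normal(scale=noise_sd, size=n_train)
        preds[i] = make_model().fit(x, y).predict(x_query)[0]
    bias_sq = (preds.mean() - true_function(x_query)[0, 0]) ** 2
    variance = preds.var()
    return bias_sq, variance


const_bias_sq, const_var = bias_and_variance(
    lambda: DummyRegressor(strategy="mean")
)
tree_bias_sq, tree_var = bias_and_variance(
    lambda: DecisionTreeRegressor(random_state=0)
)

print(f"constant model: bias^2={const_bias_sq:.3f}, variance={const_var:.3f}")
print(f"deep tree:      bias^2={tree_bias_sq:.3f}, variance={tree_var:.3f}")
```

The constant model barely moves from one sample to the next but is far from the truth on average, while the tree is close on average but jumps around with every new sample.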

Okay. So, given this, we can distinguish between two cases, underfitting and overfitting, that might happen when we fit a supervised learning task, namely a regression or a classification. Underfitting happens when a model is not able to capture the underlying patterns of the data. These models usually have high bias and low variance. It happens when we have too little data to build an accurate model, or when we try to build a linear model with nonlinear data, that is, when you fit a linear model and the underlying relationship is not linear.
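Here is a small sketch of the linear-model-on-nonlinear-data case. The quadratic relationship and the noise level are assumptions chosen for illustration. Note that the underfit model scores poorly even on the data it was trained on, which is what distinguishes underfitting from overfitting.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(1)

# Assumed nonlinear relationship: y depends on x squared, plus noise.
x = rng.uniform(-3, 3, size=(200, 1))
y = x.ravel() ** 2 + rng.normal(scale=0.5, size=200)

# A straight line cannot follow a parabola, however much data it sees.
linear = LinearRegression().fit(x, y)

# Adding a squared feature gives the same linear estimator enough flexibility.
quadratic = make_pipeline(
    PolynomialFeatures(degree=2), LinearRegression()
).fit(x, y)

linear_r2 = linear.score(x, y)
quadratic_r2 = quadratic.score(x, y)

print(f"linear    training R^2: {linear_r2:.3f}")
print(f"quadratic training R^2: {quadratic_r2:.3f}")
```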

On the other hand, if our model has a large number of parameters, it is going to have high variance, which typically leads to overfitting. Overfitting happens when our model captures the noise along with the underlying pattern in the data. It happens when we train our model a lot on a noisy dataset, and these models typically have low bias and high variance. These are models that are typically more complex than OLS. Think about a decision tree, which is prone to overfitting.
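The decision tree example can be sketched as follows. The data are synthetic (a noisy sine curve, an assumption for illustration), and `max_depth=3` is an arbitrary choice for the restricted tree, not a recommended setting.

```python
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(7)

# Assumed noisy dataset: a sine curve plus Gaussian noise.
x = rng.uniform(0, 1, size=(300, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.4, size=300)
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.5, random_state=0
)

# An unrestricted tree keeps splitting until it reproduces every training point.
deep_tree = DecisionTreeRegressor(random_state=0).fit(x_train, y_train)
# Limiting the depth forces the tree to ignore some of the noise.
shallow_tree = DecisionTreeRegressor(max_depth=3, random_state=0).fit(
    x_train, y_train
)

deep_train_mse = mean_squared_error(y_train, deep_tree.predict(x_train))
deep_test_mse = mean_squared_error(y_test, deep_tree.predict(x_test))
shallow_train_mse = mean_squared_error(y_train, shallow_tree.predict(x_train))
shallow_test_mse = mean_squared_error(y_test, shallow_tree.predict(x_test))

print(f"deep tree:    train MSE={deep_train_mse:.3f}, test MSE={deep_test_mse:.3f}")
print(f"shallow tree: train MSE={shallow_train_mse:.3f}, test MSE={shallow_test_mse:.3f}")
```

A training error of essentially zero next to a much larger test error is the signature of a model that has memorized the noise. The shallow tree gives up some training accuracy and will often, though not always, do better on the test half.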

So the take-home message is that we have to find a good balance between overfitting and underfitting. The optimal fit is one where the model is not overly influenced by the noisy data and the outliers, but instead follows the general pattern of the data. The same holds for classification: we have to find a threshold that gives us a decision boundary separating the red points well from the blue ones. But obviously, there can be areas that are described by these points, and the trade-off in complexity is why there is a trade-off between bias and variance.
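In practice, one common way to look for that balance is to sweep a complexity setting and compare training error with cross-validated error. The sketch below uses a k-nearest-neighbors regressor as a stand-in model, which is an assumption and not something covered in this lecture: a small `n_neighbors` is the complex, high-variance end, and a large one is the simple, high-bias end. The data and the grid of values are also made up for illustration.

```python
import numpy as np
from sklearn.model_selection import validation_curve
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(5)

# Assumed noisy dataset: a sine curve plus Gaussian noise.
x = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * x).ravel() + rng.normal(scale=0.3, size=200)

# Complexity sweep: k=1 memorizes the data, k=160 is close to a constant.
k_values = [1, 3, 5, 10, 20, 40, 80, 160]
train_scores, val_scores = validation_curve(
    KNeighborsRegressor(),
    x,
    y,
    param_name="n_neighbors",
    param_range=k_values,
    cv=5,
    scoring="neg_mean_squared_error",
)

# Scores are negated MSE, so flip the sign and average over the folds.
train_mse = -train_scores.mean(axis=1)
val_mse = -val_scores.mean(axis=1)
best_k = k_values[int(np.argmin(val_mse))]

for k, tr, va in zip(k_values, train_mse, val_mse):
    print(f"k={k:>3}: train MSE={tr:.3f}, validation MSE={va:.3f}")
print(f"lowest validation error at k={best_k}")
```

Training error keeps improving as the model gets more complex, so it cannot be used to pick the setting. The validation error is the one that turns back up at both extremes.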

An algorithm cannot be more complex and less complex at the same time, so we have to find a good balance. One possibility for extending a linear regression model is to add some bias to it in exchange for lower variance, and that's why we typically use regularized models. In the next lecture, we're going to discover how a regularized model works, and in particular we're going to focus on one member of that family, which is the ridge estimator. I'll see you there.

About the Author

Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.

He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.