Machine Learning Pipelines with Scikit-Learn
This course is the second in a two-part series that covers how to build machine learning pipelines using scikit-learn, a library for the Python programming language. This is a hands-on course containing demonstrations that you can follow along with to build your own machine learning models.
- Explore supervised-learning techniques used to train a model in scikit-learn by using a simple regression model
- Understand the concept of the bias-variance trade-off and regularized ML models
- Explore linear models for classification and how to evaluate them
- Learn how to choose a model and fit that model to a dataset
This course is intended for anyone interested in machine learning with Python.
To get the most out of this course, you should have first taken Part One of this two-part series.
The resources related to this course can be found in the following GitHub repo: https://github.com/cloudacademy/ca-machine-learning-with-scikit-learn
Welcome back. In this lecture, we are going to cover an important family of linear models that are used for classification. In particular, we are going to focus on the logistic regression model.
Now, suppose we wish to classify our customers as either fraudulent or non-fraudulent. We could think of using a linear model such as linear regression and try to find which set of features has the greatest impact on the estimate. This basically translates to finding the vector of betas that minimizes the number of misclassifications. But what do we mean by misclassification? Well, in terms of loss function, regression models typically minimize the sum of squared errors. If the target value is one, as it can be in binary classification, such a quadratic loss penalizes any large deviation from one, even when the prediction falls on the correct side. Hence this loss is not ideal for classification: being numerically close to the target value doesn't mean anything in a classification problem, since we are only interested in whether each example is assigned to the correct class or not.
So, we could use a zero-one loss, which counts the number of misclassifications. For example, if we have predicted correctly, the loss is zero; otherwise it is one. Unfortunately, such a function is hard to minimize, so we might want to use a smoother version of it. This is what logistic regression does, with a loss called the log loss function. And the cool thing about this model is that it transforms any continuous input into a value between zero and one, thanks to the sigmoid function.
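To make the difference between the two losses concrete, here is a minimal sketch in plain Python. The labels and predicted probabilities are made up for illustration, and the function names are hypothetical, not taken from the course notebook.

```python
import math


def zero_one_loss(y_true, y_pred):
    """Count how many predicted labels differ from the true labels."""
    return sum(1 for t, p in zip(y_true, y_pred) if t != p)


def log_loss(y_true, p_pos, eps=1e-15):
    """Average negative log-likelihood of the true labels.

    p_pos holds the predicted probability of the positive class.
    """
    total = 0.0
    for t, p in zip(y_true, p_pos):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(t * math.log(p) + (1 - t) * math.log(1 - p))
    return total / len(y_true)


# Hypothetical labels and predicted probabilities, for illustration only.
y_true = [1, 0, 1, 0]
p_pos = [0.9, 0.2, 0.4, 0.6]

# Turn probabilities into hard labels with a 0.5 threshold.
y_pred = [1 if p >= 0.5 else 0 for p in p_pos]

print(zero_one_loss(y_true, y_pred))  # number of mistakes
print(log_loss(y_true, p_pos))        # smooth penalty based on probabilities
```

The zero-one loss only changes when a prediction crosses the threshold, whereas the log loss changes smoothly with the predicted probability, which is what makes it easier to minimize.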
So just to give you an idea, if the outcome is binary, meaning it can assume only two values, zero or one, we pass the raw model output through the sigmoid function and minimize the log loss on the training data set. Minimizing the log loss amounts to maximizing the probability that the model assigns to the correct class of each training example. The sigmoid output is the probability that the example belongs to the positive class, and, by the way, the probability of belonging to the negative class is easily obtained as one minus that value, since the two probabilities must sum to one.
So this is an example of a logistic function. On the x-axis we have the raw model output, which ranges from minus infinity to infinity, and on the y-axis we have a probability between zero and one that the observation belongs to the positive class. When we read this plot, we are implicitly setting a classification threshold at 0.5: if the raw model output is positive, the probability is above 0.5 and we set y equal to 1. Otherwise, if the raw model output is negative, we classify that example as zero, namely as belonging to the negative class.
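The link between the raw output, the probability, and the 0.5 threshold can be sketched in a few lines of plain Python. The helper names below are hypothetical and only illustrate the idea described above.

```python
import math


def sigmoid(z):
    """Map a raw model output z to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))


def classify(z, threshold=0.5):
    """Return 1 when the positive-class probability reaches the threshold."""
    return 1 if sigmoid(z) >= threshold else 0


for z in (-2.0, 0.0, 2.0):
    p_positive = sigmoid(z)
    p_negative = 1.0 - p_positive  # the two probabilities sum to one
    print(z, p_positive, p_negative, classify(z))
```

A raw output of exactly zero maps to a probability of 0.5, which is why the sign of the raw output decides the class when the threshold is 0.5.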
So let's move to a Jupyter Notebook. As you see here, we are going to use a diabets.csv file, so we import that file and we split the data into features and target. The features are nothing more than a set of variables related to some medical observations. So here, for example, we have glucose, insulin, BMI, et cetera as columns, whereas the target is just a binary label of 1 or 0, with 1 meaning that diabetes was observed and 0 meaning that it wasn't.
We also import the train_test_split function and split the data into a training set and a test set. Fitting a logistic regression model in scikit-learn is pretty easy. It is a member of the linear_model submodule, so we import the LogisticRegression class from there. We initialize a model and, in particular, we set the max_iter argument to 10,000. This argument sets the maximum number of iterations the solver is allowed to take to converge, that is, to complete the minimization process.
Then we fit the model on the training data, predict on the test set, and store the result as y_pred. So, y_pred is nothing more than a vector of zeros and ones. To give you an idea, let's suppose we are interested in the first row of X_test, the one at position iloc zero.
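The split, fit, and predict steps can be sketched as follows. Since the lecture's CSV file is not reproduced here, this sketch assumes a synthetic dataset generated with make_classification as a stand-in; the sample size, number of features, test_size, and random_state values are illustrative choices, not the notebook's.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the lecture's dataset (an assumption, not the real file).
X, y = make_classification(n_samples=500, n_features=8, random_state=0)

# Hold out part of the data for testing (test_size is an illustrative choice).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# max_iter caps the number of solver iterations, as in the lecture.
logreg = LogisticRegression(max_iter=10000)
logreg.fit(X_train, y_train)

# y_pred is a vector of zeros and ones, one entry per test example.
y_pred = logreg.predict(X_test)
print(y_pred[:10])
```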
So for this set of features we have predicted a zero. Why is that? Well, if we take the coefficients coming from our estimation, multiply them with this row of the matrix X_test, and add the intercept, since we are in a linear model, we get a negative value for the raw model output. And if you remember, a negative raw model output implies that the example is classified as belonging to the negative class, given our classification threshold of 0.5. And that's why we see a zero here.
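Here is a minimal sketch of that computation, again assuming synthetic data in place of the lecture's dataset. It rebuilds the raw model output from coef_ and intercept_ and compares it with scikit-learn's decision_function; the variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the lecture's dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
model = LogisticRegression(max_iter=10000).fit(X, y)

# Raw model output computed by hand: features times coefficients plus intercept.
raw_manual = X @ model.coef_.ravel() + model.intercept_[0]

# The same quantity as returned by scikit-learn.
raw_sklearn = model.decision_function(X)

# A positive raw output means class 1, a negative one means class 0.
labels_from_raw = (raw_manual > 0).astype(int)

print(raw_manual[0], raw_sklearn[0], labels_from_raw[0], model.predict(X[:1])[0])
```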
Now if, instead of index zero, you take the observation with index nine, you get a positive value for the raw model output, and therefore this example is predicted as one, that is, it belongs to the positive class. If instead we want to get the raw probabilities, we can employ the predict_proba method, which once again takes X_test and returns the class probabilities for each example.
So let's print just the first five. It returns a two-dimensional array in which each row contains two values. The one on the left is the probability that the example belongs to the negative class, whereas the one on the right is the probability that the example belongs to the positive class. Now, if you remember, we initialized the logistic regression model with the argument max_iter equal to 10000.
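As a small sketch of what predict_proba returns, the snippet below uses synthetic data as a stand-in for the lecture's dataset and checks that, for a binary problem like this one, the right-hand column matches the sigmoid of the raw model output. Variable names are hypothetical.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the lecture's dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=1)
model = LogisticRegression(max_iter=10000).fit(X, y)

# One row per example; columns follow the order of model.classes_.
proba = model.predict_proba(X[:5])
print(model.classes_)
print(proba)

# Positive-class probability recomputed from the raw output via the sigmoid.
raw = model.decision_function(X[:5])
p_pos = 1.0 / (1.0 + np.exp(-raw))
print(p_pos)
```

Each row sums to one, so the left column is simply one minus the right column.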
This is obviously a parameter of the model, but logistic regression has other parameters, such as the C parameter, which controls regularization. C is the inverse of the regularization strength, so smaller values mean stronger regularization. By default, logistic regression applies an L2 regularization, like the one we applied in the Ridge regression model. This is controlled by the penalty argument, which is set to L2 by default. Let's set C equal to one. And finally, we have max_iter, and we set this to 10000.
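To see the effect of C, here is a hedged sketch that fits two models on synthetic stand-in data, one with a small C and one with a large C, and compares the size of the fitted coefficients. The specific C values (0.01 and 100) are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (an assumption, not the lecture's dataset).
X, y = make_classification(n_samples=300, n_features=5, random_state=0)

# Small C: strong L2 regularization. Large C: weak L2 regularization.
strong = LogisticRegression(C=0.01, max_iter=10000).fit(X, y)
weak = LogisticRegression(C=100.0, max_iter=10000).fit(X, y)

# Stronger regularization shrinks the coefficients toward zero.
norm_strong = np.linalg.norm(strong.coef_)
norm_weak = np.linalg.norm(weak.coef_)
print(norm_strong, norm_weak)
```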
So, the natural question now is, can I use grid search to look for those parameters? And the answer is yes, so let's go ahead and do that now. To do that, we are going to define a c_space using the NumPy function logspace. We define it as ranging from ten to the minus five to ten to the eight, with 15 different values. So we import NumPy as np as well, then we initialize a logistic model, and we also import the GridSearchCV class from the model_selection submodule. And then we are going to pass it logreg and param_grid, which we have to define, obviously.
So, param_grid here is going to be a dictionary made up of C, which stands for the C parameter, with c_space as the value. We also define the penalty, and we pass as values either L1 or L2. So, we pass param_grid, we set the cv argument equal to five, and we store the result in the logreg_cv variable. And we fit the model to the training data. There is a parenthesis missing. Also, we haven't imported GridSearchCV, so here we go.
Now in order to get the best parameters, we can access the best_params_ attribute, and we see that C is equal to about 31 and the penalty is L2, which is also the default. And then we can ask, what's the best score observed? So we use best_score_, and it's 0.77, which is the mean cross-validated accuracy coming from the grid search. The best model, which we store in the best_model variable, comes from the best_estimator_ attribute, and therefore we can use this best estimator to predict on X_test, and here we get some predicted values. And here is the result.
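The grid search described above can be sketched as follows, under a few stated assumptions. The data is synthetic, so the best parameters and score will differ from the lecture's. The solver is set explicitly to "liblinear" because it supports both the L1 and L2 penalties; the lecture does not say which solver the notebook uses, so that choice is an assumption of this sketch.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in data (an assumption, not the lecture's dataset).
X, y = make_classification(n_samples=300, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0
)

# 15 values of C, evenly spaced on a log scale from 1e-5 to 1e8.
c_space = np.logspace(-5, 8, 15)
param_grid = {"C": c_space, "penalty": ["l1", "l2"]}

# solver="liblinear" is an assumption: it handles both penalties in the grid.
logreg = LogisticRegression(solver="liblinear", max_iter=10000)

# 5-fold cross-validation over every combination in param_grid.
logreg_cv = GridSearchCV(logreg, param_grid, cv=5)
logreg_cv.fit(X_train, y_train)

print(logreg_cv.best_params_)  # best combination of C and penalty
print(logreg_cv.best_score_)   # mean cross-validated accuracy of that combination

# The refitted best model can be used directly for prediction.
best_model = logreg_cv.best_estimator_
y_pred = best_model.predict(X_test)
```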
So, the natural question now would be, how can I evaluate a classification model? Well, we can use at least two different methods. The general one is to use the confusion matrix, which we import from the metrics submodule, not from model selection. The confusion_matrix function requires y_test and y_pred, and produces a matrix that must be read as follows.
In scikit-learn, the rows correspond to the actual classes and the columns to the predicted classes. The diagonal contains the true negatives and the true positives, whereas the top-right entry holds the false positives and the bottom-left entry holds the false negatives. Just to give you an idea, an example is said to be a true positive if it was predicted as positive and belongs to the positive class. It's a false positive if it was predicted as positive but actually belongs to the negative class. It's a false negative if it was predicted as negative but actually belongs to the positive class, and finally it's a true negative if it was predicted as negative and belongs to the negative class.
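A tiny hand-made example may help with reading the matrix. The labels below are invented for illustration; they are not the lecture's predictions.

```python
from sklearn.metrics import confusion_matrix

# Hypothetical true labels and predictions, for illustration only.
y_test = [0, 0, 0, 0, 1, 1, 1, 0]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0]

# Rows are actual classes, columns are predicted classes:
# [[TN, FP],
#  [FN, TP]]
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Flatten the 2x2 matrix to pull out the four counts by name.
tn, fp, fn, tp = cm.ravel()
print(tn, fp, fn, tp)
```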
In general, we want to minimize the number of false negatives and maximize the number of true positives, but that depends on the use case under investigation. From the confusion matrix we can extract other information, such as precision and recall. Precision is the share of predicted positives that are actually positive, and recall is the share of actual positives that the model finds. We can compute them directly using scikit-learn with the classification report. The classification report is pretty useful, since with a single line of code it gives you the whole model performance with respect to the two classes.
So let me print this, so that it's more readable. It gives you a table of precision, recall, and F1 score for the negative and positive classes. Essentially, there is a trade-off between precision and recall, meaning that you typically can't maximize both at the same time. As you move the classification threshold, a higher precision usually comes with a lower recall, and vice versa.
To deal with this, you can use the F1 score, which is the harmonic mean of the two measures. But the bottom line here is that we've got quite good results. The precision for the zero class is 0.82 and the precision for the positive class is 0.66. And there are many other measures used for binary classification.
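As a sketch of how these numbers relate, the snippet below computes precision, recall, and F1 for the positive class on invented labels and prints the classification report. The labels are hypothetical and unrelated to the lecture's results.

```python
from sklearn.metrics import (
    classification_report,
    f1_score,
    precision_score,
    recall_score,
)

# Hypothetical true labels and predictions, for illustration only.
y_test = [0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 1, 1, 1, 0, 0]

# Scores for the positive class (label 1).
precision = precision_score(y_test, y_pred)  # TP / (TP + FP)
recall = recall_score(y_test, y_pred)        # TP / (TP + FN)
f1 = f1_score(y_test, y_pred)

# F1 is the harmonic mean of precision and recall.
harmonic_mean = 2 * precision * recall / (precision + recall)
print(precision, recall, f1, harmonic_mean)

# Per-class precision, recall, and F1 in one readable table.
print(classification_report(y_test, y_pred))
```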
Among those, it is worth mentioning the ROC curve, which plots the true positive rate against the false positive rate for different thresholds, and the ROC AUC, which is the area under that curve, but we are not going to investigate that in this course. You can check that out in the official documentation if you're interested. Please find the link in the transcript below. That concludes this lecture. I'll see you in the next one, where we'll wrap up the course.
Andrea is a Data Scientist at Cloud Academy. He is passionate about statistical modeling and machine learning algorithms, especially for solving business tasks.
He holds a PhD in Statistics, and he has published in several peer-reviewed academic journals. He is also the author of the book Applied Machine Learning with Python.