Exercise 2: Solution
Machine learning is a branch of artificial intelligence that deals with learning patterns and rules from training data. In this course from Cloud Academy, you will learn all about its structure and history. Its origins date back to the middle of the last century, but in the last decade, companies have taken advantage of the resource for their products. This revolution of machine learning has been enabled by three factors.
First, memory storage has become economical and accessible. Second, computing power has also become readily available. Third, sensors, phones, and web applications have produced a lot of data, which has contributed to training these machine learning models. This course will guide you through the basic principles, foundations and best practices of machine learning. It is advisable to be able to understand and explain these basics before diving into deep learning and neural nets. This course is made up of 10 lectures and two accompanying exercises with solutions. This Cloud Academy course is part of the wider Data and Machine Learning learning path.
- Learn about the foundations and history of machine learning
- Learn and understand the principles of memory storage, computing power, and phone/web applications.
It is recommended to complete the Introduction to Data and Machine Learning course before taking this course.
The dataset used in exercise 2 of this course can be found at the following link: https://www.kaggle.com/liujiaqi/hr-comma-sepcsv/version/1
Hey guys, welcome back. Let's look at exercise two. It's a bit more challenging than exercise one. We are asked to predict whether or not an employee will leave, based on a bunch of features: how satisfied the employee was, the last evaluation, the number of projects, et cetera. There were a bunch of guiding steps: loading the file and establishing a benchmark. This is really important, guys. Whenever you build a machine learning model, you should ask yourself: what's the dumbest thing I could do? What's the easiest model I can build? That gives you your benchmark. Then we check for rescaling: we plot the histogram of each feature and decide on the rescaling method. We convert the categorical features into binary dummy columns. And then the usual train/test split, play around with the learning rate, check the confusion matrix, precision and recall, and check whether we still get the same results with a 5-fold cross-validation. All right! Let's do this. So first of all, we load the data and look at it. We have a bunch of numerical features and a few categorical features: sales and salary. Okay, so we check the info and we see that everything is numeric except sales and salary, which are strings, as we already saw. We can also run describe to check the ranges. We see that the minimum values are all pretty low, with the exception of average monthly hours. And the same goes for the maximum values: they are in the unit range, with the exception of the hours and the time spent at the company, which seems reasonable.
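The loading and inspection steps described above can be sketched roughly as follows. The CSV file is not part of this page, so a few made-up rows stand in for it, and the column names are assumptions based on the features mentioned in the transcript.

```python
import io
import pandas as pd

# Hypothetical stand-in rows. In the exercise the data would come from the
# Kaggle CSV instead (file name and column names are assumptions here).
csv_text = """satisfaction_level,last_evaluation,number_project,average_montly_hours,time_spend_company,left,sales,salary
0.38,0.53,2,157,3,1,sales,low
0.80,0.86,5,262,6,1,sales,medium
0.11,0.88,7,272,4,1,technical,medium
0.72,0.87,5,223,5,0,support,low
0.37,0.52,2,159,3,0,product_mng,high
"""
df = pd.read_csv(io.StringIO(csv_text))

# info() prints the dtype of every column, which shows which ones are strings.
df.info()

# describe() summarizes the numeric columns; min and max give the ranges.
summary = df.describe()
print(summary.loc[["min", "max"]])

# Columns that are not numeric are the categorical candidates for dummies.
categorical_cols = df.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

On the real file, the same three calls are enough to spot which columns need rescaling and which need to be converted to dummies.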
So maybe we have to normalize the average monthly hours. Let's look at the benchmark first. So left is the column that we're gonna be using as the target. It's a binary column where one indicates that people left, so let's check how many people stayed and how many people left. So, almost 24% of the people have left. That means predicting that everybody stayed would yield an accuracy of 76%. In other words, our benchmark for predicting who leaves is 76%: we have to be better than that in order for the model to make any useful recommendation. Okay. We talked about the average monthly hours, and that we may need to rescale them. Let's plot the histogram. And yeah, it's between a hundred and 300. So we're going to rescale them by just dividing by a hundred. And if we plot the histogram of this new column, average monthly hours 100, it's gonna be between one and three.
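As a rough illustration of the benchmark and the rescaling step, here is a minimal sketch on made-up values. The toy numbers are assumptions; on the real data the majority class share is the 76% mentioned above.

```python
import pandas as pd

# Hypothetical toy target: 1 means the employee left, 0 means they stayed.
left = pd.Series([0, 0, 0, 1, 0, 0, 1, 0, 0, 0])

# Share of each class. Always predicting the majority class gives an
# accuracy equal to the largest share, which is the benchmark to beat.
class_share = left.value_counts(normalize=True)
benchmark_accuracy = class_share.max()
print(class_share)
print("benchmark accuracy:", benchmark_accuracy)

# Hypothetical monthly hours, in the hundreds like the real column.
hours = pd.Series([157, 262, 272, 223, 159])

# Dividing by 100 keeps the shape of the distribution and brings the
# values to the same order of magnitude as the other features.
hours_100 = hours / 100.0
print(hours_100.min(), hours_100.max())
```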
Okay, so it's between one and three, same shape. Great. Time spent at the company: the distribution is between two and 10, so it's probably years. Not gonna rescale that, I'll leave it as it is. And, dummies. So we've seen pd.get_dummies before. It transforms categorical columns into binary dummy columns. So, the dummies are these zero/one columns where you have sales product manager, sales sales, sales support, sales technical, and then salary high, low, medium. Okay, so we have a bunch of binary features now, plus the other features. These are all the columns we originally had. So, what we're gonna do is use this concatenation function to concatenate the features we care about and the dummies along the horizontal axis, which means we're going to concatenate along the columns. And then we take the values of this, and those are gonna be our features. And left is gonna be our target. So, we have almost 15,000 points. It's a bigger data set. And 20 features and one variable to predict. So we do the usual train/test split with X and y, and build a model. Notice that this time we're predicting a binary variable, left, which is either zero or one. So it's a classification problem and we need to use a logistic regression.
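The dummy encoding, concatenation and split described above might look roughly like this. The rows and column names are made up for illustration, and the split parameters (test_size, random_state) are assumptions, not necessarily the ones used in the video.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical miniature version of the data set.
df = pd.DataFrame({
    "satisfaction_level": [0.38, 0.80, 0.11, 0.72, 0.37, 0.41, 0.10, 0.92],
    "average_montly_hours_100": [1.57, 2.62, 2.72, 2.23, 1.59, 1.53, 2.47, 2.59],
    "sales": ["sales", "sales", "technical", "support",
              "sales", "technical", "support", "sales"],
    "salary": ["low", "medium", "medium", "low", "high", "low", "low", "high"],
    "left": [1, 1, 1, 0, 0, 0, 1, 0],
})

# One zero/one column per category value, e.g. sales_support, salary_low.
# dtype=float keeps the final feature matrix purely numeric.
dummies = pd.get_dummies(df[["sales", "salary"]], dtype=float)

# axis=1 concatenates along the columns (the horizontal axis).
numeric_cols = ["satisfaction_level", "average_montly_hours_100"]
X = pd.concat([df[numeric_cols], dummies], axis=1).values
y = df["left"].values

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X.shape, X_train.shape, X_test.shape)
```

With the full data set the same steps give the 20 feature columns mentioned above; here the toy frame yields 2 numeric columns plus 6 dummies.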
So, the only thing we're gonna change is that the input dimension is 20, because we have 20 features, and we introduce the activation function, which is gonna be a sigmoid. Last, we are gonna be using the binary cross-entropy as our loss function. It's still a shallow model with just inputs and outputs, one activation function, and that's it. Okay, so we've built our model now and, just for completeness, we can check its summary. So the model summary says we have one layer and it has 21 parameters. Why 21? Well, we have 20 coefficients for our 20 features plus one bias term, right? So it's 21 total. Okay, we can fit the model; it's gonna fit for 10 epochs. We see that at each epoch the model is learning, but the loss is not really going down, is it? In fact, there is very little improvement. If we're not satisfied with that, we can just run another 10 epochs, so I'll just rerun the same cell. It's good that we're monitoring the loss to see if it improves, but it doesn't seem to be improving that much. So, after these 10 epochs we just call it a day and see what's going on. So, predict_classes will take the test features and give us zero/one predictions.
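To make the structure of this model concrete, here is a minimal NumPy sketch of a single sigmoid unit trained with binary cross-entropy. It is not the Keras code from the video: the synthetic data, the plain full-batch gradient descent, and the learning rate are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
n_features = 20

# Hypothetical synthetic data standing in for the 20 employee features.
X = rng.normal(size=(200, n_features))
true_w = rng.normal(size=n_features)
y = (X @ true_w > 0).astype(float)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    # Clip to avoid log(0) when a probability saturates.
    y_prob = np.clip(y_prob, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_prob) + (1 - y_true) * np.log(1 - y_prob))

# One weight per feature plus one bias term: 20 + 1 = 21 parameters.
weights = np.zeros(n_features)
bias = 0.0
n_parameters = weights.size + 1

learning_rate = 0.1  # assumed value
losses = []
for epoch in range(10):
    y_prob = sigmoid(X @ weights + bias)
    losses.append(binary_cross_entropy(y, y_prob))
    # Gradient of the mean binary cross-entropy for a sigmoid output.
    error = y_prob - y
    weights -= learning_rate * (X.T @ error) / len(y)
    bias -= learning_rate * error.mean()

# Thresholding the probability at 0.5 gives zero/one class predictions.
y_pred = (sigmoid(X @ weights + bias) > 0.5).astype(int)
print("parameters:", n_parameters)
print("first and last loss:", losses[0], losses[-1])
```

Monitoring the list of losses epoch by epoch is the same check as watching the loss printed during fitting.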
It's similar to what we did with the greater than 0.5 threshold in the previous exercise. We can now build a confusion matrix between our test predictions and our true values, which are the test labels. So we import the confusion matrix and we use again the pretty confusion matrix function that we've already used before. When we do the pretty confusion matrix of y test and y test prediction, we will know how many people stayed and how many people left, and what the predictions were for these two classes. So let's look at it. Okay, so pretty much, our model is saying that everybody stayed, and it's predicting a lot fewer people to leave. So we have a lot of false negatives, a few false positives, and a very small number of true positives. So let's look at the classification report. And we see that it's kind of in line with what we expect. The class of people who left is smaller, and precision and recall for that class are actually terrible. That's probably due to the fact that our model is not powerful enough to actually deal with this data set.
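A minimal sketch of this evaluation step with scikit-learn follows. The labels and predictions are made up, and pretty_confusion_matrix here is a hypothetical reimplementation of the course helper, whose actual code is not shown in the transcript.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report, confusion_matrix

# Hypothetical test labels and predictions (1 = left, 0 = stayed).
y_test = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])
y_test_pred = np.array([0, 0, 0, 0, 0, 0, 1, 0, 0, 1])

def pretty_confusion_matrix(y_true, y_pred, labels=("Stay", "Leave")):
    """Hypothetical helper: wrap the confusion matrix in a labeled table."""
    cm = confusion_matrix(y_true, y_pred)
    pred_labels = ["Predicted " + label for label in labels]
    return pd.DataFrame(cm, index=list(labels), columns=pred_labels)

# Rows are the true classes, columns are the predicted classes.
cm_table = pretty_confusion_matrix(y_test, y_test_pred)
print(cm_table)

# Per-class precision, recall and F1 score.
print(classification_report(y_test, y_test_pred))
```

In this table the "Leave" row under "Predicted Stay" counts the false negatives, which is the cell that dominates when a model predicts that almost everybody stays.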
But, given what it is, we now have to decide what to do. The first thing we're gonna do is try cross-validation. So, to be sure that the model is actually this bad, the first check we can do is run it multiple times on different slices of the data and see if it still doesn't perform that well. So, we've done this before. We import KerasClassifier and define a function that will build our model. So the KerasClassifier will take the build logistic regression model function and return a scikit-learn compatible version of it. And then we do a 5-fold cross-validation where we store the five scores into this scores object. So this will take some time, but at some point it will finish executing. So, while that's finishing, I'll show you what the next line says. The next line takes the mean and the standard deviation of the scores and formats them with four digits after the decimal point. And it prints them in a string that says, "The cross validation accuracy is the mean plus or minus the standard deviation." So it finished executing and we get this cross-validation accuracy of 0.7475 plus or minus 0.005. So let's check the scores.
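The cross-validation check can be sketched as below. To keep the example self-contained, scikit-learn's LogisticRegression is swapped in for the wrapped Keras model and the data is synthetic; both are assumptions, and only the 5-fold scoring and the formatting of the result follow the description above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import KFold, cross_val_score

# Hypothetical synthetic data with 20 features, standing in for X and y.
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Stand-in for the scikit-learn compatible wrapper around the Keras model.
model = LogisticRegression(max_iter=1000)

# Five folds: each fold is held out once while the model trains on the rest.
cv = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X, y, cv=cv)

# Mean and standard deviation with four digits after the decimal point.
message = "The cross validation accuracy is {:0.4f} +/- {:0.4f}".format(
    scores.mean(), scores.std()
)
print(message)
print(scores)
```

Comparing the mean score, together with its spread, against the benchmark accuracy is what tells you whether the model adds anything over always predicting the majority class.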
They're all pretty much the same. We got 77 here, 79 here, 76 here, well, this is actually 80. There's a very low score in this fold and a higher score in this one. So, there are some fluctuations. This was a very bad run. This was a somewhat better run. But the question is, is the model good enough for our boss? Well, it doesn't really perform any better than the benchmark. I mean, not significantly better anyway. No, it's not better than the benchmark. So, we're left with the question of what we need to do next. Do we need to go out and get more data? Use a more powerful model? It will take us a few chapters to actually answer that question. But I think it's an interesting point that we've got to. Machine learning always works through iterations. So you try an idea and see if that gives you a better score than the one you wanted to beat. If so, you're done; if not, you gotta go back to the drawing board and think of something else. So I hope you had fun doing this, and see you in the next video, with some more content on neural networks.
I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at University of Padua and Université de Paris VI and graduated from Singularity University summer program of 2011.