Cross Validation
2h 4m

Machine learning is a branch of artificial intelligence that deals with learning patterns and rules from training data. In this course from Cloud Academy, you will learn all about its structure and history. Its origins date back to the middle of the last century, but in the last decade, companies have taken advantage of the resource for their products. This revolution of machine learning has been enabled by three factors.

First, memory storage has become economical and accessible. Second, computing power has also become readily available. Third, sensors, phones, and web applications have produced a lot of data, which has contributed to training these machine learning models. This course will guide you through the basic principles, foundations, and best practices of machine learning. It is advisable to be able to understand and explain these basics before diving into deep learning and neural nets. This course is made up of 10 lectures and two accompanying exercises with solutions. This Cloud Academy course is part of the wider Data and Machine Learning learning path.

Learning Objectives

  • Learn about the foundations and history of machine learning
  • Learn and understand the principles of memory storage, computing power, and phone/web applications

Intended Audience

It is recommended to complete the Introduction to Data and Machine Learning course before taking this course.


The datasets and code used throughout this course can be found in the GitHub repo here.



Hello and welcome to this video on cross validation. In this video, you will learn what cross validation is, how it improves on a single train/test split, and a few different ways to perform it. A single train/test split is not the most efficient way to use our dataset and assess the out-of-sample error. Even if we took great care in randomly splitting our data, that's only one of many possible ways in which we could perform a split. What if we performed several different train/test splits, checked the test score in each of them, and finally averaged the scores? Not only would we have a more precise estimate of the true accuracy, but we could also calculate the standard deviation of the scores and therefore know the uncertainty on the accuracy itself.

There are many ways to perform cross validation. The most common one is called k-fold cross validation. In k-fold cross validation, the whole dataset is split into k equally sized, randomly sampled, disjoint subsets. Then k rounds of train/test split are performed, and in each round one of the subsets is used for testing while the others are aggregated back to form a training set. In this way, we obtain k estimates of the model score, each calculated from a test set that does not overlap with any of the test sets used in the other rounds. We can then calculate the average accuracy over the rounds, and we can also calculate the standard deviation of that accuracy.

These advantages are not free. To perform cross validation we have to train the model several times, which takes longer and consumes more computing resources than training the model on a single train/test split. On the other hand, we obtain a better estimate of our out-of-sample error, which makes us more confident about how well our model generalizes.
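The k-fold procedure described above can be sketched with the Python standard library alone. This is a minimal illustration, not the code from the course repo: the helper names (`k_fold_indices`, `fit_threshold`, `accuracy`, `cross_validate`), the synthetic one-dimensional dataset, and the toy threshold model are all assumptions made for the example.

```python
import random
import statistics


def k_fold_indices(n_samples, k, seed=0):
    """Split range(n_samples) into k disjoint, randomly sampled folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    return [indices[i::k] for i in range(k)]


def fit_threshold(xs, ys):
    """Toy model (assumption): midpoint between the mean x of each class."""
    mean_true = statistics.mean(x for x, y in zip(xs, ys) if y)
    mean_false = statistics.mean(x for x, y in zip(xs, ys) if not y)
    return (mean_true + mean_false) / 2


def accuracy(threshold, xs, ys):
    """Fraction of samples where 'x above threshold' matches the label."""
    hits = sum((x > threshold) == y for x, y in zip(xs, ys))
    return hits / len(xs)


def cross_validate(xs, ys, k=5, seed=0):
    """Return one test accuracy per fold."""
    folds = k_fold_indices(len(xs), k, seed)
    scores = []
    for i, test_idx in enumerate(folds):
        # All folds except fold i are aggregated into the training set.
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        threshold = fit_threshold([xs[j] for j in train_idx],
                                  [ys[j] for j in train_idx])
        scores.append(accuracy(threshold,
                               [xs[j] for j in test_idx],
                               [ys[j] for j in test_idx]))
    return scores


# Synthetic data (assumption): two overlapping Gaussian classes.
rng = random.Random(42)
xs = [rng.gauss(0.0, 1.0) for _ in range(100)] + [rng.gauss(2.0, 1.0) for _ in range(100)]
ys = [False] * 100 + [True] * 100

scores = cross_validate(xs, ys, k=5)
print(f"accuracy: {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```

Each of the five scores comes from a test fold that shares no samples with the other test folds, and the printed standard deviation gives a rough idea of how much the accuracy estimate varies from split to split.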

The good news is that this is a completely parallel problem. Since each fold is a totally independent training process, we can parallelize cross validation over folds, either by distributing each fold to a different process on the same computer or across different computers on the same network.

K-fold is just one of many ways to perform cross validation. Here are a couple of other methods. Stratified k-fold is similar to k-fold, but it makes sure that the proportions of labels are preserved in the folds. Let's consider, for example, a binary classification where 40% of the data is labeled True and 60% is labeled False. A stratified split means that each of the folds will also contain 40% True labels and 60% False labels. In other words, we keep the ratio of labels when we do our k splits in the cross validation. Notice that this has nothing to do with the ratio of train to test sizes. We could still choose 20% of the data for testing, and a stratified split would mean that 40% of that 20% is composed of data with a label of True.

Finally, it is worth mentioning leave one label out cross validation and leave p labels out cross validation, or LOLO and LPLO. These are useful when there are subgroups in our data. Imagine you wanted to build a model to recognize whether a user is running or not from the accelerometer data on their phone.
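As an aside on the stratified split described above, here is a minimal sketch using only the Python standard library. The function name `stratified_k_fold_indices` and the round-robin assignment are assumptions chosen for illustration; the dataset mirrors the 40% True / 60% False example from the video.

```python
import random
from collections import defaultdict


def stratified_k_fold_indices(labels, k, seed=0):
    """Assign sample indices to k folds while keeping label proportions."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for index, label in enumerate(labels):
        by_label[label].append(index)

    folds = [[] for _ in range(k)]
    for label_indices in by_label.values():
        rng.shuffle(label_indices)
        # Deal each label's samples round-robin so every fold gets an equal share.
        for position, index in enumerate(label_indices):
            folds[position % k].append(index)
    return folds


# 100 samples: 40% labeled True, 60% labeled False.
labels = [True] * 40 + [False] * 60
folds = stratified_k_fold_indices(labels, k=5)

for fold in folds:
    share_true = sum(labels[i] for i in fold) / len(fold)
    print(len(fold), share_true)
```

With k = 5, each fold holds 20% of the data (20 samples), and 8 of those 20 are labeled True, so every fold keeps the 40/60 ratio. When the label counts are not divisible by k, this simple round-robin only preserves the ratio approximately.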

Your training dataset probably contains multiple recordings of different physical activities from different users. The binary target you're trying to predict indicates the physical activity state of a person: is she running, yes or no? However, there's an additional column in your table which indicates the user ID connected to that particular data point. If you perform a simple cross validation, both your training and test sets would likely end up containing records from all the users. If you train the model in this way, you could obtain a good test score but have no idea how well the model would perform on data from a completely new user. In this case, it would be better to split the data by user, assigning data from some users to training and data from other users to testing. Then, if the test score is good, you can be fairly sure that the model will also perform well with new users. Another way to say this is that we want the model to be able to generalize across users, not only to unseen data from a previously seen user. This kind of cross validation is called leave one label out if we leave only one user out for testing, or leave p labels out if we leave more than one user out for testing. Notice that in this case, the word label does not refer to the target variable of our classification model but to the user ID used to split the data.

In conclusion, in this video you have learned about cross validation, which is a better way to estimate how well your model is able to generalize. We've discussed its computing and time cost and how to use parallelization to speed up the cross validation calculation. Finally, it is important to choose the appropriate cross validation method depending on the problem you are trying to solve. So thank you for watching and see you in the next video.
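The leave one label out and leave p labels out schemes discussed in this video can be sketched as follows. This is an illustrative example only: the function name `leave_p_groups_out` and the user IDs are hypothetical, and the sketch only produces the index splits, leaving model training aside.

```python
from itertools import combinations


def leave_p_groups_out(groups, p=1):
    """Yield (train_idx, test_idx) with p whole groups held out per round."""
    unique_groups = sorted(set(groups))
    for held_out in combinations(unique_groups, p):
        test_idx = [i for i, g in enumerate(groups) if g in held_out]
        train_idx = [i for i, g in enumerate(groups) if g not in held_out]
        yield train_idx, test_idx


# Hypothetical user IDs, one per accelerometer record.
user_ids = ["ann", "ann", "bob", "bob", "bob", "cat", "cat"]

for train_idx, test_idx in leave_p_groups_out(user_ids, p=1):
    train_users = {user_ids[i] for i in train_idx}
    test_users = {user_ids[i] for i in test_idx}
    # No user appears on both sides of the split.
    assert train_users.isdisjoint(test_users)
    print("test users:", sorted(test_users), "test rows:", test_idx)
```

With `p=1` there is one round per user, and every record from the held-out user goes to the test set. With `p=2` the same function holds out every pair of users in turn, which corresponds to leave p labels out.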

About the Author

I am a Data Science consultant and trainer. With Catalit, I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends, I train people in machine learning, deep learning, and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y-Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program of 2011.