2h 4m

Machine learning is a branch of artificial intelligence that deals with learning patterns and rules from training data. In this course from Cloud Academy, you will learn all about its structure and history. Its origins date back to the middle of the last century, but in the last decade, companies have taken advantage of the resource for their products. This revolution of machine learning has been enabled by three factors.

First, memory storage has become economical and accessible. Second, computing power has also become readily available. Third, sensors, phones, and web applications have produced a lot of data, which has contributed to training these machine learning models. This course will guide you through the basic principles, foundations, and best practices of machine learning. It is advisable to be able to understand and explain these basics before diving into deep learning and neural nets. This course is made up of 10 lectures and two accompanying exercises with solutions. This Cloud Academy course is part of the wider Data and Machine Learning learning path.

Learning Objectives

  • Learn about the foundations and history of machine learning
  • Learn and understand the principles of memory storage, computing power, and phone/web applications

Intended Audience

It is recommended to complete the Introduction to Data and Machine Learning course before taking this course.


The datasets and code used throughout this course can be found in the GitHub repo here.



Hello and welcome to this video on classification. In the previous lessons, we learned about linear regression and how it can be used to predict a continuous target variable. We learned about formulating a hypothesis that depends on parameters and about minimizing a cost to find the optimal values of those parameters. We can apply the same framework to cases where the target variable is discrete rather than continuous. All we need to do is adapt the hypothesis and the cost function.

In this video, you will learn how to adapt the hypothesis for a classification problem, and you will learn about logistic regression, which is a technique for solving classification problems. You will learn to define a cost that works for classification, and you will learn about accuracy, which is the score we will use for classification.

Let's start with an example. Let's imagine we are predicting the purchase behavior of our website users. For instance, let's say we are building a model to predict whether a user is going to buy a product, based on how many seconds he or she spends on the product page. As in the regression case, we will have one feature, the time spent on the page, and one label, whether the user bought the product or not. In this case, however, the outcome variable is binary: the user either buys or doesn't buy the product.

So how can we build a model with a binary outcome? We describe many techniques for building a classification model in a separate course on machine learning, where we cover, for example, K Nearest Neighbors, Decision Trees, Support Vector Machines and Naive Bayes. In this course we will focus on one particular method, logistic regression. Despite having "regression" in its name, this is actually a classification technique. Logistic regression models the probability of the outcome variable with the logistic curve. In the example of the product purchase, we can formulate the hypothesis as Y hat equals a function F of B plus X times W.

Here the function F is called the sigmoid, and it is given by the formula: F of its argument Z equals one over one plus the exponential of minus Z. I know this is a bit math-sy, but don't worry. All you need to know is that the graph of the sigmoid looks like the figure on the right.

Now that we have defined the hypothesis, we need to define a cost. We cannot use the mean squared error as we did in the linear regression case, because in the classification case the mean squared error is not convex, which would make it hard to find the global minimum. A better cost in this case is the log loss, or cross-entropy cost. This is defined as follows.

Let's start by defining the cost for a single point as the sum of two terms. Since the labels Y I can only be zero or one, only one of these two terms will be present for each data point. Another way to read this expression is to say that the cost is equal to the negative logarithm of one minus the predicted probability when Y I is equal to zero, and to the negative logarithm of the predicted probability when Y I, the label, is equal to one.

Let's look at each term individually. Let's start from the second term. Remember that Y hat, the probability, contains the sigmoid function. So its negative logarithm evaluates to the logarithm of one plus E to the minus X, where X here stands for the argument of the sigmoid. If X is really big, this quantity goes to zero.

On the other hand, if X is large and negative, this quantity goes to infinity in a linear way. In other words, when the label is one, we expect Y hat, the probability from our model, to approach one, which happens for large values of X in the sigmoid curve. So if X is positive and large, we make our cost very small, while if X is negative, we make the cost larger and larger. The same logic applies to the first term, when the label is zero. The contribution of this term to the cost will be low when X is pushed towards negative values, which makes Y hat approach zero in this case.

Now that we have defined a cost for a single point, we can define the total cost as the average of the costs for the individual points. This cost function goes by the name of average cross-entropy or binary log loss. We have defined a hypothesis and a cost for our classification problem. Now we can go ahead and look for the parameters that minimize this cost, in a similar way to what we did for the linear regression case.

One final point: notice that our logistic regression model predicts a probability. If we want a binary prediction, we need to decide how to convert this probability into a binary outcome. One way to do this is to set a threshold. For example, we could say that all points predicted to be one with probability greater than 0.5 are set to one, and all others are set to zero.
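The hypothesis, the per-point cost, the average cost, and the threshold rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the course's own code: the function names, the sample times and labels, and the values of W and B are all made up for the example, and the parameters are picked by hand rather than found by minimizing the cost.

```python
import math


def sigmoid(z):
    """Sigmoid: 1 / (1 + exp(-z)). Maps any real number into (0, 1).

    Note: math.exp overflows for very negative z; fine for a small sketch.
    """
    return 1.0 / (1.0 + math.exp(-z))


def predict_proba(x, w, b):
    """Hypothesis for one feature: y_hat = sigmoid(b + x * w)."""
    return sigmoid(b + x * w)


def point_cost(y, y_hat):
    """Log loss for one point: -y*log(y_hat) - (1 - y)*log(1 - y_hat).

    Because y is 0 or 1, only one of the two terms is non-zero.
    Assumes 0 < y_hat < 1 so that both logarithms are defined.
    """
    return -(y * math.log(y_hat) + (1 - y) * math.log(1 - y_hat))


def average_cost(xs, ys, w, b):
    """Average cross-entropy (binary log loss) over all points."""
    costs = [point_cost(y, predict_proba(x, w, b)) for x, y in zip(xs, ys)]
    return sum(costs) / len(costs)


def predict_label(x, w, b, threshold=0.5):
    """Turn the predicted probability into a 0/1 prediction."""
    return 1 if predict_proba(x, w, b) > threshold else 0


# Hypothetical data: seconds on the product page, and whether the user bought.
times = [10.0, 30.0, 60.0, 90.0, 120.0]
bought = [0, 0, 1, 1, 1]

# Hypothetical hand-picked parameters (not fitted).
w, b = 0.05, -2.5

print(average_cost(times, bought, w, b))
print([predict_label(t, w, b) for t in times])  # [0, 0, 1, 1, 1]
```

With these made-up parameters the sigmoid argument is negative for the two shortest visits and positive for the three longest, so the 0.5 threshold splits them accordingly.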

With this definition we can also calculate a score for our model. This is the accuracy score, and it is defined as the number of correct predictions over the total number of points. So, for example, in this table we have three correct predictions out of a total of five attempts. This corresponds to an accuracy score of 60%. As in the regression case, we can compare the accuracy on the training set with the accuracy on the test set, and judge how well our classification model generalizes to unseen data.

In conclusion, in this video we learned that classification problems can be handled in a similar way to regression problems, by asking the model to predict the probability that a data point belongs to a certain class. We have learned how to use the sigmoid function to map the numbers predicted by a linear function onto the interval between zero and one, the range of probabilities, and we've learned to define the log loss as the preferred cost for binary classification. Finally, we've learned about accuracy, which is the score we will use to judge how good a classification model is. So, thank you for watching and see you in the next video.
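The accuracy score from this video can be sketched as follows. The table shown in the video is not reproduced in this transcript, so the labels and predictions below are hypothetical values chosen only to match the "three correct out of five" example, and the function name is an assumption made for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    correct = sum(1 for t, p in zip(y_true, y_pred) if t == p)
    return correct / len(y_true)


# Hypothetical labels and thresholded predictions: 3 of the 5 agree.
y_true = [1, 0, 1, 1, 0]
y_pred = [1, 0, 0, 1, 1]

print(accuracy(y_true, y_pred))  # 0.6
```

The same function can be applied separately to training-set and test-set predictions to compare the two scores.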

About the Author

I am a Data Science consultant and trainer. With Catalit I help companies acquire skills and knowledge in data science and harness machine learning and deep learning to reach their goals. With Data Weekends I train people in machine learning, deep learning and big data analytics. I served as lead instructor in Data Science at General Assembly and The Data Incubator, and I was Chief Data Officer and co-founder at Spire, a Y Combinator-backed startup that invented the first consumer wearable device capable of continuously tracking respiration and activity. I earned a joint PhD in biophysics at the University of Padua and Université de Paris VI and graduated from the Singularity University summer program of 2011.