Contents
Introduction
Machine Learning Concepts & Models
Case Studies
This course explores the core concepts of machine learning, the models available, and how to train them. We’ll take a deeper look at what it means to train a machine learning model, as well as the data and methods required to do so. We’ll also provide an overview of the most common models you’re likely to encounter, and take a practical approach to understanding when and how to use them to solve business problems.
In the second half of this course, you will be guided through a series of case studies that will show you how to apply the concepts covered in this course to real-life examples.
If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.
Learning Objectives
- Understand the key concepts and models related to machine learning
- Learn how to use training data sets with machine learning models
- Learn how to choose the best machine learning model to suit your requirements
- Understand how machine learning concepts can be applied to real-world scenarios in property prices, health, animal classification, and marketing activities
Intended Audience
This course is intended for anyone who is:
- Interested in understanding machine learning models on a deeper level
- Looking to enrich their understanding of machine learning and how to use it to solve complex problems
- Looking to build a foundation for continued learning in the machine learning space and data science in general
Prerequisites
To get the most out of this course, you should have a general understanding of data concepts as well as some familiarity with cloud providers and their managed services, especially Amazon or Google. Some experience in data or development is preferable but not essential.
As you start to build your mental picture of what goes into a model, what comes out of a model, and how to build your own, remember: you're trying to represent something. You're trying to build a system of inputs and outputs that simulates or recreates a phenomenon somewhere else. Very importantly, however, you don't need to fully understand the system. You're going to make some assumptions about what is possible, what the inputs are, and what the outputs are, and the machine learning system will weight these inputs in order to make better predictions. You don't need a series of if-else statements or rules; what you need to do is make assumptions about what controls the system in real life and what factors go into it.
For example, when using machine learning for weather forecasting, we don't fully understand how much each variable affects the outcome; otherwise, we'd be able to predict the weather for all time with 100% certainty. Instead, we know that geography and topography affect it, and we know that the motion of celestial bodies, such as the position of the sun and the moon, affects the tides, which in turn affect it. All of these go together into a machine learning model, which then creates predictions, with varying degrees of accuracy, about the upcoming weather.
You actually see this in the news every day: when the forecast says a 60% chance of rain, that is a machine learning algorithm's confidence that it is going to rain that day. And while we're on the subject of confidence, it's an important takeaway that many times a machine learning model won't say for sure whether something will or won't happen; instead, it gives you varying degrees of confidence. Maybe it's a 20% chance of hail, a 60% chance of rain, a 20% chance that it's overcast, and a 0% chance that it's going to be sunny. Certain machine learning models let you get different levels of confidence for different outcomes.
So when thinking about a model, think about it in terms of inputs and the potential scenarios that occur, as measured in your outputs with their associated confidence levels.
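To make that concrete, here is a minimal sketch of a classifier that returns a confidence for each possible outcome rather than a single yes/no answer. It assumes scikit-learn and uses made-up humidity and pressure readings; the feature names and numbers are purely illustrative, not a real forecasting model.

```python
# A minimal sketch, assuming scikit-learn and made-up weather features.
from sklearn.linear_model import LogisticRegression

# Hypothetical inputs: [humidity %, pressure hPa]; outputs: 0 = clear, 1 = rain
X_train = [[90, 1002], [85, 1005], [40, 1020], [35, 1025], [70, 1010], [30, 1030]]
y_train = [1, 1, 0, 0, 1, 0]

model = LogisticRegression()
model.fit(X_train, y_train)

# predict_proba returns one confidence value per class for each new input,
# e.g. something like [[0.3, 0.7]] -> roughly a 70% chance of rain (illustrative)
print(model.predict_proba([[80, 1008]]))
```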
So of course, the question on everyone's mind at this point is probably: how do these abstract pieces of code, logic, inferences, and other wizardry make these predictions? Well, outside of a few areas of machine learning such as deep learning, the process typically starts with selecting a model that is a good representation of the relationship you wish to represent.
For example, if I want to model a child's height over time, I might pick a linear model based on age. So I have a theory: the only thing that affects a child's height is age, and as they get older, they get taller; therefore there is a linear relationship between height and age. Note that this isn't strictly true, and there are a lot of other factors, but if I'm going to start down the path of building a machine learning relationship, this may be a good starting point. I have a theoretical input, age, and a theoretical output, height, and I assume the relationship is linear.
So with the model in mind, we're going to train the model. What this means is that we are going to programmatically feed it data of known ages and heights in order for it to begin to build the relationship. Under the surface, a lot of things happen, and as you move up, especially through the upper level-three and lower level-four content, you'll begin to build more of an understanding of how the training works and the process the system goes through. But for now, just know that when you train a model, you are basically saying: here are the known inputs and here are the expected outputs, please attempt to build a relationship between them.
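As a rough sketch of what that training step can look like in code (assuming scikit-learn and invented age/height numbers, not the actual lab materials):

```python
# A minimal sketch, assuming scikit-learn and invented age/height data.
from sklearn.linear_model import LinearRegression

ages = [[2], [4], [6], [8], [10], [12]]   # known inputs: age in years
heights = [86, 102, 115, 128, 138, 149]   # expected outputs: height in cm

model = LinearRegression()
model.fit(ages, heights)                  # "build a relationship between them"

print(model.predict([[7]]))               # predicted height for a 7-year-old
```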
After training, it's important to evaluate your model. What this means is that we're going to give it a new set of data, say 100 different children and their heights, and see how well it predicts them. This can be called the model's fit or the model's accuracy. Very, very importantly, make sure you use different data for evaluation than you used for training. If you evaluate a model on its own training data, you can run into a scenario called overfitting, where the model might be near 100% accurate at predicting its training data but completely worthless for predicting anything outside of it.
If you only have one data set, a good ratio to keep in mind is something along the lines of two thirds of your data kept aside for training and one third kept aside for evaluation. As you go through different models, be sure to cycle which data points are in which set, just to help avoid any biases or unknown errors in your data. But very importantly, after training, you need to evaluate the model to see if it's a good fit. And finally, after you've evaluated the predictions, it's time to refine the model.
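Here is one way the two-thirds / one-third split and the cycling of data points could look in practice. This is only a sketch, assuming scikit-learn's train_test_split and cross_val_score helpers and synthetic numbers standing in for real measurements:

```python
# A sketch of the two-thirds / one-third split and of cycling the data,
# assuming scikit-learn utilities and synthetic numbers.
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split, cross_val_score

ages = [[a] for a in range(2, 14)]
heights = [80 + 5.5 * a[0] for a in ages]   # synthetic heights in cm

# Keep one third aside for evaluation, train on the other two thirds
X_train, X_test, y_train, y_test = train_test_split(ages, heights, test_size=1/3)

model = LinearRegression().fit(X_train, y_train)
print(model.score(X_test, y_test))          # fit measured on data the model never saw

# Cross-validation cycles which points land in the training vs. evaluation set
print(cross_val_score(LinearRegression(), ages, heights, cv=3))
```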
Basically, if the predictions are way off, we may need to go back and verify whether we need to change the underlying assumptions or the model type itself. Perhaps there's another variable controlling height, such as nutrition, or perhaps we got the initial model wrong and it's not a linear model but an exponential one. At this point, we are either going to change how we are inputting to and outputting from the model, or we are going to change the type of model itself, or potentially we've decided the model is good enough for our purposes and we are going to push it into production.
For those of you looking for a practical hands-on example, keep an eye out for the labs at the end of this class. These will walk you through a Python-based code notebook, building and evaluating a real-world model, so you'll be able to see directly how you would do this in a professional or school environment.
So for those of you sitting at home, you might be thinking: out of all the machine learning models in the world, how do I know which one is right for my phenomenon, especially if it's a complex system or we're not really sure what shape the data is before evaluation? To answer that question, let's go through some of the terminology and lingo used in the machine learning space, so you can pick the right model out of a very large field of model types and the variants within those types.
So let's start with a simple example. Take the data set in front of you, where you can see that we have an input on the x axis and the measured output on the y axis. What would be a good fit for this model? Is it linear, nonlinear, exponential? Here you can see three different models and how they predict that data.
On the left, you can see what's called an underfit. In this case, the model does not accurately represent how the input and the output are related. On the flip side, on the right, you see what's called an overfit. This is what happens, or tends to happen, when you use your training data as your test data: you perfectly predict specific scenarios, such as when the input is one, the answer is 19.
Now this might not have any relation to the real world, but in your training set and your evaluation set it's true all the time, therefore the model assumes it. And in the middle, you see a good fit. The line moves through the data naturally; it doesn't make hard-coded assumptions or have jagged edges, and it doesn't vary from the data very much.
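One simple way to see all three behaviors on the same data is to fit polynomials of increasing degree. The sketch below assumes NumPy and made-up, roughly linear data; note that the training error keeps shrinking as the degree grows, even though the high-degree model is the one that would generalize worst to new data:

```python
# A sketch of under-, good, and overfitting on the same made-up data,
# assuming NumPy: the same points fit with polynomials of increasing degree.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0, 10, 20)
y = 3 * x + 2 + rng.normal(0, 2, size=x.size)   # roughly linear data with noise

for degree in (0, 1, 9):                        # underfit, reasonable fit, overfit
    coeffs = np.polyfit(x, y, degree)
    pred = np.polyval(coeffs, x)
    rmse = np.sqrt(np.mean((y - pred) ** 2))
    print(f"degree {degree}: training RMSE = {rmse:.2f}")
```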
Now, as we progress through the subject of machine learning, there are ways to quantify the fit. One of these is called the coefficient of determination, often referenced as R squared or R2. You might also hear about the coefficient of variation, which uses the root mean square error, or RMSE, to represent how well the model fits.
So if you ever see the expressions RMSE, CV, R squared, COD, or CD, just know that they are mathematical ways of expressing how well the model fits the data. Very importantly, too, since all of these ways of evaluating model fit are programmatic, that is how some of the tools around AutoML, or third-party commercial tools such as Databricks or DataRobot, begin to choose the right model for you and take some of the legwork out of it.
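For reference, here is a small sketch of how those fit metrics can be computed, assuming scikit-learn's metrics module and invented actual/predicted values; the CV(RMSE) line is simply the RMSE divided by the mean of the actual values:

```python
# A sketch of the fit metrics above, assuming scikit-learn's metrics module
# and invented actual vs. predicted values.
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

actual = [102, 115, 128, 138, 149]
predicted = [100, 117, 126, 140, 147]

r2 = r2_score(actual, predicted)                        # coefficient of determination (R squared)
rmse = np.sqrt(mean_squared_error(actual, predicted))   # root mean square error
cv_rmse = rmse / np.mean(actual)                        # coefficient of variation of the RMSE

print(f"R^2 = {r2:.3f}, RMSE = {rmse:.2f}, CV(RMSE) = {cv_rmse:.3f}")
```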
However, assuming we're not using an automated tool, humans will also look at these model fits and determine which model is best. Your ability to quickly determine and select the right model is really how data scientists start to differentiate themselves from the general population: they are experts at finding the right model and understanding the complex relationships that make the whole machine learning process smoother, more efficient, and more accurate.
Now, when we talk about training a model, what do we really mean? In other classes, we've shown how to use a model and talked about how giving a model more data makes it more accurate, but as we start having to pick our own model, understanding how a model gets trained and what data is available to us can also influence which model we actually pick.
So instead of just saying we're going to use training data, let's break down our terms and understand what goes into training data, how we classify our training data, and how we can feed it to the machine learning algorithm so it can start learning.
Lectures
- Course Introduction
- Explaining Concepts
- Understanding Training Data Sets
- How to Choose?
- Case Study: Home Prices
- Case Study: Heart Disease
- Case Study: Animal Classification
- Case Study: Targeted Marketing
Calculated Systems was founded by experts in Hadoop, Google Cloud, and AWS. Calculated Systems enables code-free capture, mapping, and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education in these complex technologies.