Understanding Training Data Sets

This course explores the core concepts of machine learning, the models available, and how to train them. We’ll take a deeper look at what it means to train a machine learning model, as well as the data and methods required to do so. We’ll also provide an overview of the most common models you’re likely to encounter, and take a practical approach to understand when and how to use them to solve business problems.

In the second half of this course, you will be guided through a series of case studies that will show you how to apply the concepts covered in this course to real-life examples.

If you have any feedback relating to this course, feel free to contact us at

Learning Objectives

  • Understand the key concepts and models related to machine learning
  • Learn how to use training data sets with machine learning models
  • Learn how to choose the best machine learning model to suit your requirements
  • Understand how machine learning concepts can be applied to real-world scenarios in property prices, health, animal classification, and marketing activities

Intended Audience

This course is intended for anyone who is:

  • Interested in understanding machine learning models on a deeper level
  • Looking to enrich their understanding of machine learning and how to use it to solve complex problems
  • Looking to build a foundation for continued learning in the machine learning space and data science in general


To get the most out of this course, you should have a general understanding of data concepts as well as some familiarity with cloud providers and their managed services, especially Amazon or Google. Some experience in data or development is preferable but not essential.


When discussing training data, the two most important terms that you'll need to know, hear, or use are feature and label. A feature, sometimes called a variable or predictor, is a property of the training data that's used as an input to the model.

On the flip side, a label is the target of a model. Now, features and labels can be different types of data and different things. Features might be numeric and linear, increasing in value from zero to 10, while a label could be numeric or it could be Boolean, such as true or false.

Or maybe you have Boolean features and numeric labels. So don't assume that one type of data has to be on one side or the other. Just know that a feature is what goes into your model and a label is what comes out of your model, or what you want to come out of your model.

So for example, if we wanna go back to our previous theory and model, that age and height are directly related, we have two features now: gender and age. These in turn are labeled with height on the right. So here, since we think gender and age affect your height, they are the features, and on the right is the label, AKA the expected output. When these are put together, we call this a labeled data set.
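As a rough sketch of what that labeled data set could look like in code, here is one way to hold the rows and separate the features from the label. The row values, the `height_cm` column name, and the helper function are made up for illustration and are not from the course.

```python
# Hypothetical rows: the values are invented purely for illustration.
labeled_data = [
    {"gender": "female", "age": 8, "height_cm": 128.0},
    {"gender": "male", "age": 12, "height_cm": 149.0},
    {"gender": "female", "age": 16, "height_cm": 162.0},
]

FEATURES = ["gender", "age"]  # what goes into the model
LABEL = "height_cm"           # what we want to come out of the model


def split_features_and_label(rows, feature_names, label_name):
    """Return (X, y): one tuple of feature values per row, plus the labels."""
    X = [tuple(row[name] for name in feature_names) for row in rows]
    y = [row[label_name] for row in rows]
    return X, y


X, y = split_features_and_label(labeled_data, FEATURES, LABEL)
print(X)
print(y)
```

Each entry in `X` lines up with the entry at the same position in `y`, which is what makes the data set "labeled".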

Labeled data sets are incredibly important because they can be used in two ways. You can use one as a labeled training data set, in which you feed the machine learning model both the features and the labels. Or you can use it as a labeled evaluation set, or test set, in which you feed the machine learning model only the features and then, using one of those previously mentioned fit measures such as coefficient of determination or coefficient of variance, compare the labels to the machine's output.

So a labeled data set allows you to both train and evaluate the model. A slight tangent, but an important point to understand as you become more and more familiar with data science and data engineering, is that features and labels aren't strictly separate.
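Here is a minimal sketch of that train-then-evaluate idea, assuming a single numeric feature (age) and a straight-line model fitted by least squares. The data points and function names are hypothetical. The model sees features and labels during training, sees only features during evaluation, and the held-back labels are then compared to its output using the coefficient of determination.

```python
def fit_line(xs, ys):
    """Ordinary least squares for one feature: returns (slope, intercept)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def r_squared(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_res = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_tot = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_res / ss_tot


# Hypothetical (age, height_cm) pairs, invented for illustration.
train = [(4, 102), (6, 115), (8, 128), (10, 138), (12, 149)]
test = [(5, 109), (9, 133), (11, 144)]

# Training: the model is given both features and labels.
slope, intercept = fit_line([a for a, _ in train], [h for _, h in train])

# Evaluation: the model is given only the features...
predictions = [slope * age + intercept for age, _ in test]

# ...and the held-back labels are compared to its output.
score = r_squared([h for _, h in test], predictions)
print(f"R^2 on the test set: {score:.3f}")
```

A score close to 1 means the predictions track the held-back labels closely.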

What is a label for one machine learning problem might be a feature for another. For example, gender plus height might be used as an age predictor in a different model. Just know that when you're assigning labels and features, the assignment is specific to that model. But importantly, once you have an organized, well-formatted, cleaned data set, you can make this switch between feature and label quickly.

We actually have a whole other course on data engineering. So check that out as well if you wanna know more about how to prepare, clean and ready your data. And finally, one last point on features: there are really two types, numeric and categorical.
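To make the switch concrete, here is a small sketch, assuming the cleaned data is a list of rows with named columns. The rows and the helper name are hypothetical. The same data set serves both problems, and only the choice of label column changes.

```python
# Hypothetical cleaned rows, invented for illustration.
rows = [
    {"gender": "female", "age": 8, "height_cm": 128.0},
    {"gender": "male", "age": 12, "height_cm": 149.0},
    {"gender": "female", "age": 16, "height_cm": 162.0},
]


def as_features_and_label(rows, label_name):
    """Treat one column as the label and every other column as a feature."""
    feature_names = [name for name in rows[0] if name != label_name]
    X = [[row[name] for name in feature_names] for row in rows]
    y = [row[label_name] for row in rows]
    return feature_names, X, y


# Problem 1: predict height from gender and age.
height_features, height_X, height_y = as_features_and_label(rows, "height_cm")

# Problem 2: predict age from gender and height, using the same rows.
age_features, age_X, age_y = as_features_and_label(rows, "age")

print(height_features, "->", "height_cm")
print(age_features, "->", "age")
```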

Numeric means that there's a number associated with it, such as age, weight, or score. Categorical features, on the other hand, can be unordered, such as left-handed versus right-handed, or gender such as male or female, in which case there are different categories but no distinct hierarchy. Or they can be ordered, in which case there is a hierarchy, such as ratings being negative, neutral, or positive.
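Most models need numbers as input, so categorical features usually get encoded first. The course doesn't prescribe a method, so treat the following as one common convention rather than the required one: one-hot encoding for unordered categories and a rank number for ordered ones. The helper names and the example row are hypothetical.

```python
def one_hot(value, categories):
    """Unordered category: one 0/1 column per category, no implied ranking."""
    return [1 if value == category else 0 for category in categories]


def ordinal(value, ordered_categories):
    """Ordered category: position in the hierarchy, lowest first."""
    return ordered_categories.index(value)


HANDEDNESS = ["left", "right"]                      # unordered
RATING_ORDER = ["negative", "neutral", "positive"]  # ordered, low to high

# Hypothetical row with one numeric and two categorical features.
row = {"age": 34, "handedness": "left", "rating": "positive"}

encoded = (
    [row["age"]]                                # numeric: used as-is
    + one_hot(row["handedness"], HANDEDNESS)    # becomes [1, 0]
    + [ordinal(row["rating"], RATING_ORDER)]    # becomes 2
)
print(encoded)  # [34, 1, 0, 2]
```

The one-hot columns avoid suggesting that "right" is somehow greater than "left", while the rank number keeps the negative, neutral, positive ordering.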

Now, before we dive into the specific types of machine learning model, let's take a step back and codify everything we've covered so far. A great way to do this is called CRISP-DM, or Cross-Industry Standard Process for Data Mining. This is a really fun-to-say acronym, and you can tell it was made by committee, but it really does help us visualize and conceptualize the entire process end to end.

The way to read this chart is to start at the top, the 12 o'clock position on a clock and work your way around clockwise. So whenever you're starting a new machine learning initiative or really any data initiative it's important to start with a business understanding.

Now, of course, this doesn't mean that you need to understand the cashflow or the accounting side of it. It just means: can you articulate your problem at a high level? For example, we wanna know how tall people are, or we wanna know whether it is going to rain on Sunday. Simply put, it's a high-level, non-technical understanding of the problem.

As soon as you gain a business understanding, it's important to build your data understanding. In the previous height example, where the business understanding is that we want to predict someone's height, understanding the data that goes in and out of that is key.

So of course the data coming out of it would be the person's height. And then we could start to theorize that the data going into it is maybe age, maybe gender, maybe some other things. But between business understanding and data understanding, we wanna be able to clearly articulate our problem and articulate what types of data or what specific data sets go into affecting it.

You might find that you bounce between your business understanding and your data understanding as you build your total knowledge up, and that's natural. But as soon as you have a good idea of what data points and data sets go into it, you can move on to what's called data preparation.

Data preparation is usually the part that takes the longest, especially in a newer or immature organization or use case. It can take up to 80% of a data scientist's time, depending on which study or report you read. And it includes all the steps from collecting the data, to formatting the data, to maybe labeling the data and getting it into a good usable format.

Maybe it's in a database, maybe it's in a CSV or JSON. This is the role of a data scientist or a data engineer, and it is one of the most difficult parts of data science outside of the model building itself. The next two steps, modeling and evaluation, go hand in hand. This is what we just discussed, where you're putting the data that you just prepared into the model and evaluating its fit.

Notice you might move between evaluation, modeling, and data preparation freely as you start to understand how the data needs to be used and how good your fit is. And the key here, when determining how good your fit is, is that this whole thing might be an iterative process.

You might iterate between data preparation, modeling and evaluation, and then go back depending on your results. But once you have a model you're happy with, you move to deployment, sometimes called pushing a model to production. This is where the team beyond the data science team can start to take advantage of the model, and it can be built into your application or business project, or handed off down the line.

It's very important to note that multiple parts of the cycle might be happening simultaneously within an organization. Perhaps with a company set on predicting the weather, you have a team gathering radar and satellite data for preparation. You have another team trying to model and evaluate it.

You have yet another group running an older model in production, AKA deployed, and you have some analysts working on building a better data understanding of the system. This whole cycle will go around and around with each new release of your model. And it's important to understand that, especially in an older or more mature team, you might be at multiple points at the same time. But overall, when you're just starting on your path, think of the journey as starting with business understanding and ending with deployment before going around again.

So when picking a machine learning model, you'll quickly realize there are dozens, if not hundreds, of potential options. This could very quickly lead to analysis paralysis and not being sure where to start. Fortunately, there are some flow charts we'll show you and ways to categorize machine learning models, which will hopefully simplify the selection process, and with experience you will be able to quickly narrow in on the one that's right for your particular problem.

First off is supervised learning. This is the type of model that uses labeled training data in order to build its predictions. So if you recall, where we wanted to predict the height of a person, we used age and gender as features and height as the label. This means that our training data had both the inputs and outputs clearly mapped, and these supervised learning algorithms take this data and produce the results. We call it supervised learning because the algorithm is given all of the data it needs to build a model, including the expected outputs.

In contrast to supervised learning is what's called unsupervised learning. This uses unlabeled training data to make inferences without having any labels to learn from. Algorithms in this category look for relations in the features in order to build an understanding of how they come together.

A classic example of this would be cluster analysis, in which case we feed the algorithm a series of XY coordinates and it attempts to put them into groupings. It's important to know that the type of problem that goes into unsupervised learning might not fit into supervised learning, and vice versa.

For example, the height prediction model that we used supervised learning for wouldn't really work in an unsupervised learning setting. So unsupervised learning is really good if you have the inputs and you want to discover relationships between those inputs. And thirdly, living kind of between supervised and unsupervised learning, is the concept of semi-supervised learning.
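As a sketch of cluster analysis on XY coordinates, here is a bare-bones k-means, which is one common clustering algorithm (the course doesn't name a specific one). The points are invented, and starting the centroids from the first `k` points is a simplification chosen to keep the example deterministic. Notice that no labels are passed in anywhere.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two (x, y) points."""
    return (p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2


def k_means(points, k, iterations=10):
    """Group (x, y) points into k clusters; returns (assignments, centroids)."""
    centroids = list(points[:k])  # naive start: the first k points
    assignments = []
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        assignments = [
            min(range(k), key=lambda c: squared_distance(p, centroids[c]))
            for p in points
        ]
        # Update step: each centroid moves to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignments) if a == c]
            if members:
                centroids[c] = (
                    sum(x for x, _ in members) / len(members),
                    sum(y for _, y in members) / len(members),
                )
    return assignments, centroids


# Hypothetical XY coordinates forming two loose groups.
points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 8.5), (1.0, 0.5), (8.5, 9.0)]
assignments, centroids = k_means(points, k=2)
print(assignments)
print(centroids)
```

The cluster numbers themselves carry no meaning. The algorithm only says which points belong together.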

If you remember from our previous discussion, labeling and prepping data is oftentimes the most time consuming part of this entire machine learning endeavor. So by doing semi-supervised learning we take time to build labels for some of the data but not all of the data. This is what you should consider if labeling the data and creating good training data is particularly expensive for your problem.

This is typically the case when it requires a skilled human to create the labels but the training data itself is relatively cheap to get. Classic examples of semi-supervised learning would be problems such as audio transcription, translation, and physical problems such as deep sea oil mapping, where you manually analyze some of the data, have the rest mapped by a machine learning model, and then iterate over it again and again, despite the fact that a human hasn't created an entirely labeled data set.
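One simple way to picture this is self-training, sketched below. This is just one semi-supervised technique, picked here for illustration, and the one-dimensional data, the nearest-centroid model, and the `min_margin` threshold are all assumptions. A human labels a couple of points, the model labels the unlabeled points it is confident about, and the loop repeats with the larger labeled set.

```python
def centroids_by_label(labeled):
    """Mean feature value per label, from a list of (value, label) pairs."""
    sums, counts = {}, {}
    for value, label in labeled:
        sums[label] = sums.get(label, 0.0) + value
        counts[label] = counts.get(label, 0) + 1
    return {label: sums[label] / counts[label] for label in sums}


def predict(value, centroids):
    """Return (label, margin); needs at least two labels to compare."""
    ranked = sorted(centroids, key=lambda label: abs(value - centroids[label]))
    best, runner_up = ranked[0], ranked[1]
    margin = abs(value - centroids[runner_up]) - abs(value - centroids[best])
    return best, margin


def self_train(labeled, unlabeled, min_margin=2.0, max_rounds=5):
    """Repeatedly add confident machine-made labels to the labeled set."""
    labeled, remaining = list(labeled), list(unlabeled)
    for _ in range(max_rounds):
        centroids = centroids_by_label(labeled)
        confident, uncertain = [], []
        for value in remaining:
            label, margin = predict(value, centroids)
            if margin >= min_margin:
                confident.append((value, label))
            else:
                uncertain.append(value)
        if not confident:
            break  # nothing new was learned this round
        labeled.extend(confident)
        remaining = uncertain
    return labeled, remaining


# Hypothetical data: two human-made labels and five unlabeled values.
human_labeled = [(1.0, "low"), (9.0, "high")]
unlabeled = [1.5, 2.0, 8.0, 8.5, 5.2]
labeled, still_unlabeled = self_train(human_labeled, unlabeled)
print(labeled)
print(still_unlabeled)
```

Anything the model stays unsure about is left unlabeled, which is a natural place to spend the skilled human's time.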

And the final category that we will discuss in this course is the concept of reinforcement learning. This is a bit different than the previous ones in that it uses trial and error to determine the best course of action and the best way to build the model.

A classic example of this is a game such as chess. The machine is able to make predictions about what it should do, play those moves against a human player, and see how well it did. So over time, through being told it did a good job or a bad job, or in the case of chess being able to measure it, it's able to build a better and better model with each iteration.

Reinforcement learning is a strong choice if you don't have a lot of training data, but you're able to quickly say whether the model is a good predictor or not, based on the current scenario. At this point, you might be thinking: do you really need to know any more to get started? And to put it frankly, the answer is no. At this point in the class you should have enough information to get started on some of the simpler machine learning problems.
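Chess is too big for a short example, so here is the same trial-and-error loop on a much smaller stand-in problem: a made-up game with three possible moves, each of which wins with a different hidden probability. The epsilon-greedy strategy, the probabilities, and the function name are assumptions for illustration. The machine starts with no training data, tries moves, is told whether each one went well, and gradually favors what works.

```python
import random


def run_bandit(reward_probabilities, steps=5000, epsilon=0.1, seed=0):
    """Epsilon-greedy trial and error; returns (estimates, counts) per action."""
    rng = random.Random(seed)
    n_actions = len(reward_probabilities)
    estimates = [0.0] * n_actions  # running average reward for each action
    counts = [0] * n_actions       # how often each action has been tried
    for _ in range(steps):
        if rng.random() < epsilon:
            action = rng.randrange(n_actions)  # explore: try a random move
        else:
            # Exploit: pick the move that has looked best so far.
            action = max(range(n_actions), key=lambda a: estimates[a])
        # Feedback from the environment: 1.0 for a good job, 0.0 for a bad one.
        reward = 1.0 if rng.random() < reward_probabilities[action] else 0.0
        counts[action] += 1
        # Incremental update of the running average for this action.
        estimates[action] += (reward - estimates[action]) / counts[action]
    return estimates, counts


# Hypothetical win probabilities, hidden from the learner.
estimates, counts = run_bandit([0.2, 0.5, 0.8])
best_action = max(range(len(estimates)), key=lambda a: estimates[a])
print("estimated value per action:", [round(e, 2) for e in estimates])
print("best action found:", best_action)
```

The small `epsilon` keeps the learner occasionally trying moves it currently thinks are worse, which is how it avoids getting stuck on an early lucky guess.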

There are actually many tools out there to help you pick the right model and build from here. There are private tools such as Databricks and DataRobot, or, depending on your cloud platform, there are probably tools built right into it. It's important to understand the differences between these tools. Does it help you pick the right model? Does it expect you to program the model? Is it custom built for a specific type of problem? But with the understanding you have here, you're actually ready to get started building your own machine learning use cases from the ground up. However, in order to help you get started and pick the right tool, let's go through some of the most popular tools on the market that are available on the Google and Amazon clouds.

One of the most popular tool sets out there is Google Cloud's AutoML. These are basically pre-built but not pre-trained models that allow you to get started quickly. If your problem falls into one of their predefined buckets, such as video analysis, image analysis, language translation, or sentence syntax and evaluation, this is a great set of tools. However, due to the amount of pre-work that Google has done for you, they can be somewhat inflexible.

In my opinion, one of the best tools to get started with, one that takes away a lot of the upfront infrastructure deployment effort, is Amazon SageMaker. Basically it's a managed IPython or Jupyter Notebook that comes with many pre-installed algorithms. It also helps you auto scale up and down if you start to use a lot of compute.

Now, there are too many algorithms to really go through directly here. There's going to be a list on screen, and we're going to walk through how you can start to pick the right one, but just know that as you go through with Amazon SageMaker, you're expected to understand some Python coding and data preparation yourself.

So check out the courses on that if you want more information, but just know that SageMaker is a phenomenal starting point for building your own models from the ground up. We actually have a course coming up on it where we'll show you how to deploy your notebooks and how to start interacting with them. So keep your eyes open for that also.



About the Author

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.