Level 2: Training Concepts

Welcome to Part Two of an introduction to using Artificial Intelligence and Machine Learning. As we mentioned in part one, this course starts from the ground up and focuses on giving students the tools and materials they need to navigate the topic. There are several labs directly tied to this learning path, which will provide hands-on experience to supplement the academic knowledge provided in the lectures.

In part one we looked at how you can use out-of-the-box machine learning models to meet your needs. In this course, we are going to build on that and look at how you can add your own functionality to these pre-canned models. We look at ML training concepts, release processes, and how ML services are used in a commercial setting. Finally, we take a look at a case study so that you get a feel for how these concepts play out in the real world.

For any feedback relating to this course, please contact us at

Learning Objectives

By the end of this course, you should be ready to take more advanced courses, and have a springboard for handling complex tasks in your day-to-day work, whether that is in a professional, student, or hobbyist environment.

Intended Audience

This course is a multi-part series ideal for those who are interested in understanding machine learning from a 101 perspective, starting from a very basic level and ramping up over time. If you already understand concepts such as how to train a model and run inference with it, you may wish to skip ahead to part two or a more advanced learning path.


It helps if you have a light data engineering or developer background, as several parts of this class, particularly the labs, involve hands-on work and manipulating basic data structures and scripts. The labs all have highly detailed notes to help novice users understand them, but you will be able to expand more easily at your own pace with a good baseline understanding. As we explain the core concepts, there are some prerequisites for this course.

It is recommended that you have a basic familiarity with one of the cloud providers, especially AWS or GCP. Azure, Oracle, and other providers also have machine learning suites but these two are the focus for this class.

If you have an interest in completing the labs for hands-on work, Python is a helpful language to understand.  



Just to reiterate, a model is trained before it can be used for inferencing. I've color-coded the words just so that you can start to see common topics grouped together. There are many ways to train a model, many different paths and completely different approaches, but for our applications, especially at an introductory level, let's focus on one of the most common and easiest ways to begin to train a model. At introductory levels, many applications use what is called labeled training data.

Now don't take this as an always-true rule. For example, in the Google deep learning on YouTube example we discussed in module one, the model self-discovered the concept of cat videos without labeled training data. But for this lesson, let's focus on examples where we're able to train with labeled training data, which covers many of the introductory and easiest-to-access models.

To break down what exactly the training process is, let's begin by discussing what labeled training data is. This is literally how machines learn: through examples. The concept of labeled data means that you have an input, and you are showing the machine learning algorithm what output is expected for it. So to just go over that one more time, you're providing a series of inputs with expected outputs.

Now a label, aka an expected output, could be something as simple as a sentiment score, saying this is an example of a positive statement or a negative statement, or perhaps, in the case of the aforementioned tree leaf example, in which we are labeling oak and maple leaves, it's simply a collection of pictures as the data, and what tree they belong to as the labels. 

The point is, you as an analyst have to create the labels, or you're using already existing labels as a source of truth to provide the expected responses. The process of labeling data is a bit involved and we'll cover that in a moment, but for now, know that whenever you hear "labeled training data," it really means pairs of data: an input and its expected output.
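As a minimal sketch of what "a pair of input and expected output" can look like in practice, here is one way to hold labeled examples in Python. The LabeledExample name and the sample sentences are made up for illustration; they are not part of the course labs.

```python
from typing import NamedTuple


class LabeledExample(NamedTuple):
    """One labeled training example: an input plus its expected output."""
    text: str   # the input shown to the model
    label: str  # the expected output, assigned by a person


# Hypothetical examples, invented for illustration only.
training_data = [
    LabeledExample("The labs were clear and easy to follow", "positive"),
    LabeledExample("The audio quality made it hard to focus", "negative"),
    LabeledExample("Great pacing and very practical examples", "positive"),
    LabeledExample("Too much jargon and not enough explanation", "negative"),
]

# Many training tools expect the inputs and the labels as two parallel lists.
inputs = [example.text for example in training_data]
labels = [example.label for example in training_data]

for example in training_data:
    print(f"{example.label:>8} <- {example.text}")
```

The same shape works for the tree leaf case: the input would be a picture (or a path to one) and the label would be "oak" or "maple".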

Let's revisit the example we used in module one, where you are an engineer responsible for analyzing feedback given to your online training platform. You can see some labeled training data here, used to start creating a customized sentiment model. On the left, you can see the data going in, and then on the right, the expected response.

As you may imagine, a commercially available sentiment model might not be tuned for handling academic and educational feedback, so the statement, "I found this course extremely helpful" is extraordinarily positive in this business and the statement "I was confused by this class" is negative.

Now, depending on the sentiment model, these statements might not necessarily score as very positive or very negative. "Helpful," that's positive, but for our case, it's extremely positive. And "confused by this class" is a phrase that needs to be homed in on, because the feedback could also say, "I was confused, but this class helped me." Understanding how these phrases relate to your specific needs is where level two, and being able to customize models to your needs, really starts to shine.
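To make the idea of learning from labeled pairs concrete, here is a toy sketch that trains a tiny word-counting classifier (a naive Bayes style model with add-one smoothing) on a handful of labeled feedback sentences. This is not how a commercial sentiment service is customized, and it is not the method used in the labs; the function names and the last two training sentences are assumptions made for illustration.

```python
import math
from collections import Counter

# Labeled pairs: (input text, expected output). The first two sentences come
# from the lesson; the last two are hypothetical additions for illustration.
TRAINING_DATA = [
    ("I found this course extremely helpful", "positive"),
    ("I was confused by this class", "negative"),
    ("The instructor was helpful and clear", "positive"),
    ("The labs were confusing and too long", "negative"),
]


def tokenize(text):
    """Lowercase the text and split it on whitespace."""
    return text.lower().split()


def train(examples):
    """Count how often each word appears under each label."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(tokenize(text))
    return word_counts, label_counts


def predict(model, text):
    """Return the label whose training words best match the text."""
    word_counts, label_counts = model
    vocab = {word for counts in word_counts.values() for word in counts}
    total_examples = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Start from how common the label is, then add per-word evidence.
        score = math.log(label_counts[label] / total_examples)
        total_words = sum(counts.values())
        for word in tokenize(text):
            # Add-one smoothing keeps unseen words from zeroing the score.
            score += math.log((counts[word] + 1) / (total_words + len(vocab)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


model = train(TRAINING_DATA)
print(predict(model, "this course was helpful"))
print(predict(model, "I was confused"))
```

Because this sketch only counts words and ignores their order, it has no way to tell "I was confused by this class" apart from "I was confused, but this class helped me." That limitation is exactly why domain-specific labeled examples, and a model capable of using them, matter.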

Also, it's important to know that you do not need, and actually don't particularly want, an exhaustive set of all conditions and permutations. As a rule of thumb, it's better to have more training data than less, but whether or not you are already familiar with it, be cognizant of the concept of over-fitting a model.

Without going into the details, just know that if you attempt to create every single possible sentence and feed it into a sentiment model, you can actually degrade the model's performance. What you should be aiming for is a well-represented set of the conditions the machine should expect, along with their expected outputs. Don't worry about capturing every single potential input, because the machine might just over-learn and not be able to handle variability as well.
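One common way to spot over-learning is to hold back some labeled examples and compare how the model does on data it trained on versus data it has never seen. The sketch below uses a deliberately exaggerated caricature of over-fitting: a hypothetical MemorizingModel that only remembers exact training sentences. Real over-fitting is subtler, and all names and sentences here are illustrative assumptions.

```python
# Hypothetical labeled data, split into examples used for training and
# held-out examples the model never sees during training.
TRAIN = [
    ("I found this course extremely helpful", "positive"),
    ("I was confused by this class", "negative"),
    ("The instructor was helpful and clear", "positive"),
    ("The labs were confusing and too long", "negative"),
]
HELD_OUT = [
    ("This course was extremely helpful", "positive"),
    ("I was confused by the labs", "negative"),
]


class MemorizingModel:
    """Caricature of over-fitting: recalls exact training sentences only."""

    def __init__(self, examples):
        # Map each training sentence directly to its label.
        self.lookup = dict(examples)

    def predict(self, text):
        # Any sentence not seen word-for-word during training is a miss.
        return self.lookup.get(text, "unknown")


def accuracy(model, examples):
    """Fraction of examples where the prediction matches the label."""
    correct = sum(model.predict(text) == label for text, label in examples)
    return correct / len(examples)


memorizer = MemorizingModel(TRAIN)
train_accuracy = accuracy(memorizer, TRAIN)
held_out_accuracy = accuracy(memorizer, HELD_OUT)
print(f"training accuracy: {train_accuracy:.2f}")
print(f"held-out accuracy: {held_out_accuracy:.2f}")
```

A large gap between the two numbers is the warning sign: the model looks perfect on what it has seen but cannot handle even small variations in wording.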

One final note on the subject of labeled training data is that the process of creating the labels could actually take longer than the machine learning itself. This is one of the major blockers of a machine learning project: how do we get enough samples to train the machine? Fortunately, there are many ways to get the training data that you need, so let's go through a few of the most common ones.

Of course, there's bring your own data. This is the case in which you, your company, and your co-workers, or your fellow students in an academic setting, put the labels on the data yourselves. It is an in-house initiative in which you use your own subject matter experts, such as your analysts, researchers, students, interns, and employees, to create the labels. However, there are also many services now available to help you with this.

Some of the cloud providers offer services for this. Amazon has Mechanical Turk, and Google offers a labeling service, billable directly through its cloud platform, in which you create a set of human-readable rules and humans on the other end label your data according to them. This is a fantastic option if your labels don't require highly specialized knowledge when creating a training set.

But keep in mind these services do not necessarily have people who are specialized in your particular area of interest. In my personal opinion, these labeling services are ideal when you're identifying something that is easily recognizable to any average person off the street, or that could be taught to a person within an hour.

Something such as identifying a specific piece of distinctive art, or maybe, on a more everyday level, if you're working for a furniture company, you need to identify an L-shaped desk versus a straight desk. The key is an average person needs to be able to understand it because you're not even guaranteed the same person between tasks.

And finally, due to the massive interest in machine learning, there are data sets that already have labels made, particularly from sites like Kaggle and through Apache-licensed data sets. Some are free, some are licensed for educational use, and some are commercially available for purchase. Companies have actually specialized in aggregating these tremendous data libraries that you're able to purchase and immediately use in your business.

Now, once again, this is avoiding the need to hire a service or do your own labeling, but a company is not going to have highly detailed labels for you. However, a lot of these companies have niches and you might be able to find a fantastic start so you don't have to necessarily start with zero. Just know that there is an extra cost incurred as you are having to purchase data to train your model.

In summary, creating labeled training data is a time-intensive and potentially difficult process. For anyone watching this who has a management or administrative role, be prepared to augment your data science team in the short term to help them get the labeled training data. Once you get through this, and if you do it well, your entire machine learning process will be easier and produce more meaningful results.


About the Author

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.