Preparing Data

Developed with
Calculated Systems


The course is part of these learning paths

  • Start Modelling Data with Amazon SageMaker (course steps: 2, certification: 1, lab steps: 3)
  • Getting Started with Machine Learning Models (course steps: 2, certification: 1, lab steps: 5)
  • AWS Machine Learning – Specialty Certification Preparation (course steps: 39, certification: 14, lab steps: 15)
Overview
Difficulty: Intermediate
Duration: 31m
Students: 412
Ratings: 5/5
Description

This course takes an introductory look at using the SageMaker platform, specifically within the context of preparing data, building and deploying machine learning models.

During this course, you'll gain a practical understanding of the steps required to build and deploy these models, and learn how SageMaker can simplify this process by handling a lot of the heavy lifting on the model management side and the data manipulation side, and by providing other general quality-of-life tools.

If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.

Learning Objectives

  • Obtain a foundational understanding of SageMaker
  • Prepare data for use in a machine learning project
  • Build, train, and deploy a machine learning model using SageMaker

Intended Audience

This course is ideal for data scientists, data engineers, or anyone who wants to get started with Amazon SageMaker.

Prerequisites

To get the most out of this course, you should have some experience with data engineering and machine learning concepts, as well as familiarity with the AWS platform.

Transcript

So finally, let's dive into SageMaker. Enough of the theory, enough of the concepts of how machine learning works. Let's actually just take a hands-on look at the tool. On screen, you're seeing how SageMaker looks once you're logged in through Amazon Web Services. If you don't know how to get here, check out the Services button in the top left and look around for SageMaker; you'll also be able to search for it and click it.

In the middle of this page, you'll see a workflow surprisingly similar to the one we just discussed. Of course, there are some options on the left, but the focus is purely on the middle, where you can see a series of tools that help achieve all of the machine learning workflow steps we covered.

Starting at the left, you see Ground Truth. As we've previously discussed, data preparation really is the foundation of any machine learning project. So much so that Amazon even named its data preparation stage Ground Truth.

Now, if you've ever tried to label or create data sets before, you've probably had some success when the data set was small. If it's a small amount of data that you understand very well, you're often able to just handle it yourself. But as you start to move into the thousands or millions of records, or perhaps there are many images that need custom annotations, this becomes an unmanageable task.

Now you'll have a few choices, of course. Your company could scale and hire some temporary help, but Amazon offers a semi-automated, semi-human-driven solution through Ground Truth. This section of the platform actually allows you to annotate your data sets by creating workflows.

These workflows can either use some of the previously mentioned machine learning models, so in this case Amazon SageMaker as a platform is actually referencing applications, or they can route your data to people who have been trained on how to label data, what Amazon calls its human workforce, for completely human intelligence tasks. So regardless of whether you just want some help yourself, want machine learning help from a pre-trained framework, or want to employ other people, this is a good option for handling any of those tasks.

Clicking into Ground Truth, the first thing you're confronted with is a plethora of options for how you want to label your data. At the top, you see a dropdown that lets you select whether the data is text, image, and so on, and within each type of data, you have different types of labeling you can do. The type of labeling really can make or break your solution.

So, first things first: is it whole image classification or bounded classification? Whole image classification, which you see at the top here as single label or multi-label, simply says that this picture as a whole describes what my machine learning model should detect.

So as you can see at the top left we have it being labeled basketball. And very importantly, we're not narrowing in on any given part of this image to describe basketball. We're not zeroing in on the ball. We're not zeroing in on the hoop or the players. We're just saying this entire image represents basketball.

Very similarly, on the right, we're saying that this entire image describes both humans and vehicles. It can be a little confusing at times, but multi-label classification is not saying look just at the human or look just at the vehicle. It says this entire image, as a single entity, contains both a human and a vehicle.

Now, if you want to start to be more specific and call out part of the image, or maybe this element will be remixed in other images, you need to start to look at bounding boxes and semantic segmentation. What this does is tell the machine learning model to look at what's in this segment, or what is within this box, and apply the labels just to that. Of course, as you move from the top left to the bottom right in this selection, it gets increasingly difficult, with semantic segmentation being down to pixel-level accuracy while whole image classification obviously covers the entire image.
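To make the difference in granularity concrete, here is a minimal Python sketch of the labeling styles just described. The class and function names (`ImageLabels`, `BoundingBox`, `mask_pixels`) and the file names are hypothetical and chosen for illustration only; this is not the format Ground Truth itself produces.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class ImageLabels:
    """Whole image classification: the labels describe the entire picture."""
    image: str
    labels: List[str]  # one entry = single label, several entries = multi-label


@dataclass
class BoundingBox:
    """A label that applies only to a rectangular region of the image."""
    label: str
    left: int
    top: int
    width: int
    height: int

    def contains(self, x: int, y: int) -> bool:
        """Return True if pixel (x, y) falls inside the box."""
        return (self.left <= x < self.left + self.width
                and self.top <= y < self.top + self.height)


def mask_pixels(mask: List[List[int]], class_id: int) -> List[Tuple[int, int]]:
    """Semantic segmentation: list the (row, col) of every pixel with class_id."""
    return [(r, c)
            for r, row in enumerate(mask)
            for c, value in enumerate(row)
            if value == class_id]


# Whole image: nothing in the picture is singled out.
single = ImageLabels("court.jpg", ["basketball"])
multi = ImageLabels("street.jpg", ["human", "vehicle"])

# Bounding box: the label applies only inside the rectangle.
ball = BoundingBox("ball", left=40, top=10, width=20, height=20)

# Segmentation: a tiny 3x4 mask where 0 = background and 1 = ball.
mask = [
    [0, 0, 1, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
ball_pixels = mask_pixels(mask, 1)
```

Notice how much more the annotator has to supply at each step: a list of words for the whole image, four numbers per box, and one value per pixel for segmentation.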

If you were to select text as the task category, you would be presented with very similar options. Once again, you have whole text classification, such as single label or multi-label, and you also have named entity recognition.

Now it's kind of heretical to say this, but in all reality, named entity recognition is very similar to putting a bounding box on your image. You're basically saying this section of the text has a specific label applied to it, whether it's a person, a phrase, or an expression. So single label and multi-label text classification apply to the whole text, while named entity recognition is the ability to isolate phrases, people, and objects out of the text.
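The bounding box analogy can be sketched in a few lines of Python: a whole text label applies to the full string, while an entity is a labeled character range inside it. The names `EntitySpan`, `span_for`, and `extract_entities`, along with the sample sentence and labels, are assumptions made for this illustration and not part of any Ground Truth API.

```python
from typing import List, NamedTuple, Tuple


class EntitySpan(NamedTuple):
    start: int  # index of the first character of the entity
    end: int    # index one past the last character
    label: str


def span_for(text: str, phrase: str, label: str) -> EntitySpan:
    """Build a span for the first occurrence of phrase in text."""
    start = text.index(phrase)  # raises ValueError if the phrase is absent
    return EntitySpan(start, start + len(phrase), label)


def extract_entities(text: str, spans: List[EntitySpan]) -> List[Tuple[str, str]]:
    """Return (substring, label) pairs, one per span."""
    return [(text[s.start:s.end], s.label) for s in spans]


text = "Alice watched the basketball game in Boston."

# Whole text classification: the label describes the entire sentence.
whole_text_labels = ["sports"]

# Named entity recognition: each label is tied to a slice of the text.
spans = [
    span_for(text, "Alice", "person"),
    span_for(text, "Boston", "location"),
]
entities = extract_entities(text, spans)
```

A start and end offset plays the same role for text that the left, top, width, and height values play for an image.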

To quickly move through the last two types: video is very similar to image and text, with the added element of time series. So you can classify the whole clip, label an object within the clip, or, uniquely to video, track an object as it moves through multiple frames, AKA multiple images. As a general rule of thumb, video classification is a bit more time-intensive and difficult than image classification. But of course, you can add a lot more detail when you have time series and can see the object in context.
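As a rough illustration of what tracking adds on top of per-image boxes, the sketch below stores one box per frame for a single object and measures how far it moved. The `Box` layout, the `track` values, and the `displacement` helper are all hypothetical and exist only for this example.

```python
from typing import Dict, Tuple

# (left, top, width, height) in pixels; an assumed layout for this sketch
Box = Tuple[int, int, int, int]

# One tracked object: frame index -> where the object is in that frame
track: Dict[int, Box] = {
    0: (10, 50, 20, 20),
    1: (14, 48, 20, 20),
    2: (18, 46, 20, 20),
}


def displacement(track: Dict[int, Box]) -> Tuple[int, int]:
    """Return how far the box's top-left corner moved from first to last frame."""
    frames = sorted(track)
    first, last = track[frames[0]], track[frames[-1]]
    return (last[0] - first[0], last[1] - first[1])


moved = displacement(track)
```

A single image annotation would only ever hold one of these boxes; the track is what carries the time series context.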

Now do keep in mind that if you have a video, you don't necessarily need to do video machine learning. You're able to isolate specific frames and treat them as image machine learning. Which one to pick is a little beyond the scope of this class, but just know that Amazon Ground Truth can handle both images and video, depending on where that image data is actually coming from.
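One simple way to isolate frames is to sample at a fixed time interval and label the resulting stills as ordinary images. The helper below is a minimal sketch under that assumption; the function name and the example numbers are illustrative, and actually decoding the video would require a separate library.

```python
from typing import List


def sample_frame_indices(total_frames: int, fps: float, every_seconds: float) -> List[int]:
    """Return the indices of frames spaced roughly every_seconds apart."""
    # Never step by less than one frame, even for very short intervals.
    step = max(1, round(fps * every_seconds))
    return list(range(0, total_frames, step))


# A 10-second clip at 30 frames per second, sampled every 2 seconds.
indices = sample_frame_indices(total_frames=300, fps=30, every_seconds=2)
```

Each selected index corresponds to one still image that could then go through an image labeling workflow instead of a video one.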

And finally, if your data doesn't really fall into any of these, or maybe it's too complex for any one given type, you can actually define a custom set of tasks, for either yourself or the human intelligence workers on the other side of the screen, to help you process your data.

About the Author
Calculated Systems
Training Provider
Students: 5734
Labs: 31
Courses: 13
Learning Paths: 17

Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.