Training the Model

This course takes an introductory look at using the SageMaker platform, specifically within the context of preparing data and building and deploying machine learning models.

During this course, you'll gain a practical understanding of the steps required to build and deploy these models, and you'll learn how SageMaker can simplify this process by handling a lot of the heavy lifting on the model management side and the data manipulation side, and by providing other general quality-of-life tools.

If you have any feedback relating to this course, feel free to contact us at

Learning Objectives

  • Obtain a foundational understanding of SageMaker
  • Prepare data for use in a machine learning project
  • Build, train, and deploy a machine learning model using SageMaker

Intended Audience

This course is ideal for data scientists, data engineers, or anyone who wants to get started with Amazon SageMaker.


Prerequisites

To get the most out of this course, you should have some experience with data engineering and machine learning concepts, as well as familiarity with the AWS platform.


After selecting a model, the next step, of course, is training it. And to be perfectly honest, with smaller data sets, you can just train it in your notebook instance. There's no reason that you have to go beyond it. But especially as your data gets bigger, or if you need a more codified, repeatable way to train with many permutations, it makes sense to go to the training part of Amazon's platform.

This is a separate section, once again accessible through the core dashboard. So assuming that you're working with a medium or large sized data set, your notebook won't be enough. By clicking on Training, you're presented with a lot of options, but at its core, what this does is temporarily create a distributed compute cluster to do the training and store the artifacts when done.

If you remember, any resource usage on Amazon starts a meter running, much like a taxi. The training section, by spinning up many expensive resources and then automatically deleting them, allows you to use exactly what you need to get a good chunk of computing done.
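To make the taxi-meter idea concrete, here is a rough sketch of the arithmetic, assuming a simple model where cost is instances times hours times an hourly rate. The function name and the rate used are made up for illustration; real pricing depends on the instance type and region.

```python
def estimate_training_cost(instance_count, hourly_rate_usd, runtime_seconds):
    """Estimate the cost of a temporary training cluster.

    Assumption: every instance is billed at the same hourly rate
    for the full runtime of the job, and nothing else is charged.
    """
    hours = runtime_seconds / 3600
    return round(instance_count * hourly_rate_usd * hours, 2)


# Hypothetical numbers: 4 instances at $1.00 per hour, running for 90 minutes.
cost = estimate_training_cost(
    instance_count=4,
    hourly_rate_usd=1.00,
    runtime_seconds=90 * 60,
)
print(f"Estimated cost: ${cost:.2f}")
```

Because the cluster is deleted when the job ends, the meter stops at the job's runtime rather than running until somebody remembers to shut things down.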

So within the job settings, first things first, you have to set a name and security, but then you also have to select a model. Now, you could use one of their many pre-made models, use one that you've made yourself, or go to their marketplace, which allows you to either use a free one or pay a slight subscription fee to use somebody else's pre-made one. After selecting the model, or maybe going with a subscription from the marketplace, you're asked to describe how your job will scale. Basically, what type of instance do you want to put behind it, how many instances, and how long do you want it to run? This is more of a DevOps question, but even as a data scientist or a data engineer, it really helps to have a base understanding of the right infrastructure to run your jobs on, because it'll allow you to make intelligent decisions about how big an instance to use and what the implications are. After that, you're asked to start tuning hyperparameters.
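The job settings described above can be sketched as a plain dictionary. This is a minimal sketch, not the console's exact behavior: the key names follow the shape of the SageMaker CreateTrainingJob request as I understand it, while the helper name, job name, role ARN, image URI, volume size, and instance choices are all hypothetical placeholders. Nothing here calls AWS.

```python
import json


def build_training_job_settings(job_name, role_arn, training_image,
                                instance_type, instance_count,
                                max_runtime_seconds):
    """Collect the core job settings into one request-shaped dictionary."""
    if instance_count < 1:
        raise ValueError("instance_count must be at least 1")
    if max_runtime_seconds <= 0:
        raise ValueError("max_runtime_seconds must be positive")
    return {
        "TrainingJobName": job_name,
        # Security: the IAM role the training job runs as.
        "RoleArn": role_arn,
        # The selected model: a container image holding the algorithm.
        "AlgorithmSpecification": {
            "TrainingImage": training_image,
            "TrainingInputMode": "File",
        },
        # How the job scales: instance type, count, and disk per instance.
        "ResourceConfig": {
            "InstanceType": instance_type,
            "InstanceCount": instance_count,
            "VolumeSizeInGB": 30,  # assumed value for illustration
        },
        # How long the job is allowed to run before it is stopped.
        "StoppingCondition": {"MaxRuntimeInSeconds": max_runtime_seconds},
    }


# All values below are placeholders, not real resources.
settings = build_training_job_settings(
    job_name="demo-training-job",
    role_arn="arn:aws:iam::123456789012:role/ExampleSageMakerRole",
    training_image="123456789012.dkr.ecr.us-east-1.amazonaws.com/example-algorithm:latest",
    instance_type="ml.m5.xlarge",
    instance_count=2,
    max_runtime_seconds=3600,
)
print(json.dumps(settings, indent=2))
```

Keeping the scaling choices in one place like this makes it easier to rerun the same job with a bigger instance or a longer time limit and compare the results.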

Now, this is very model specific and can get very confusing. Of course, using a notebook to explore the implications of different hyperparameters upfront is extremely helpful. But just know that these parameters define how the model will behave at a high level, such as how many clusters it should attempt to make. And finally, you need to select where the data's coming from, and for this, you're most likely going to select S3, particularly if you did a previous labeling job with Ground Truth or prepared the data in your notebook.
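As a rough illustration of those two settings, the sketch below builds a hyperparameter map and an S3 input channel. It assumes a clustering algorithm where `k` is the number of clusters and `feature_dim` is the number of input features, and it assumes hyperparameter values are passed as strings. The helper names, bucket name, and values are hypothetical.

```python
def build_hyperparameters(**params):
    """Convert hyperparameter values to strings, as a string-to-string map."""
    return {name: str(value) for name, value in params.items()}


def build_s3_channel(channel_name, s3_uri):
    """Describe one input channel whose data lives under an S3 prefix."""
    if not s3_uri.startswith("s3://"):
        raise ValueError("s3_uri must start with 's3://'")
    return {
        "ChannelName": channel_name,
        "DataSource": {
            "S3DataSource": {
                "S3DataType": "S3Prefix",
                "S3Uri": s3_uri,
                # Every training instance gets a full copy of the data.
                "S3DataDistributionType": "FullyReplicated",
            }
        },
    }


# Hypothetical values: ask for 10 clusters over 784 input features.
hyperparameters = build_hyperparameters(k=10, feature_dim=784)
train_channel = build_s3_channel("train", "s3://example-bucket/training-data/")

print(hyperparameters)
print(train_channel["DataSource"]["S3DataSource"]["S3Uri"])
```

Trying a few values of `k` in a notebook first, on a sample of the data, is a cheap way to narrow down what to pass to the full training job.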

However, there are other places it could live. This section's pretty straightforward, especially if you've done the other steps of making good labeled training data. Just make sure the job can access this bucket and, very importantly, that it can access the output path, because that's where the results of the training job will be stored. In other words, when the training job is done and the model is fully described, it's stored in S3 between steps so that it can be imported by other parts of the process.
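The output path can be sketched the same way. This example assumes the common layout where the trained model is written as `model.tar.gz` under the output path, in a folder named after the training job; check your own job's output location rather than relying on this. The helper names, bucket, and job name are hypothetical.

```python
def build_output_config(s3_output_path):
    """Describe where the training job should store its results."""
    if not s3_output_path.startswith("s3://"):
        raise ValueError("s3_output_path must start with 's3://'")
    return {"S3OutputPath": s3_output_path}


def expected_model_artifact_uri(s3_output_path, job_name):
    """Guess where the model artifact lands (assumed layout, see above)."""
    return f"{s3_output_path.rstrip('/')}/{job_name}/output/model.tar.gz"


output_config = build_output_config("s3://example-bucket/training-output/")
artifact_uri = expected_model_artifact_uri(
    output_config["S3OutputPath"], "demo-training-job"
)
print(artifact_uri)
```

Later steps, such as deployment, would read the model from that S3 location, which is why the job's role needs write access to the output path as well as read access to the input bucket.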

About the Author

Calculated Systems was founded by experts in Hadoop, Google Cloud, and AWS. Calculated Systems enables code-free capture, mapping, and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.