Overview of Amazon SageMaker
This course takes an introductory look at using the SageMaker platform, specifically within the context of preparing data, building and deploying machine learning models.
During this course, you'll gain a practical understanding of the steps required to build and deploy these models, and you'll learn how SageMaker can simplify this process by handling a lot of the heavy lifting on the model management side and the data manipulation side, as well as providing other general quality-of-life tools.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Obtain a foundational understanding of SageMaker
- Prepare data for use in a machine learning project
- Build, train, and deploy a machine learning model using SageMaker
This course is ideal for data scientists, data engineers, or anyone who wants to get started with Amazon SageMaker.
To get the most out of this course, you should have some experience with data engineering and machine learning concepts, as well as familiarity with the AWS platform.
Before I get ahead of myself and start to promise that SageMaker will solve all of your life's problems, let's just be very clear what SageMaker is. At its core, SageMaker is a fully managed service that provides the tools to build, train and deploy machine learning models. That's a very concise mission statement for what SageMaker helps you achieve. It has some components in it, such as managed notebooks and tools that help you label data and train models, but at its core, SageMaker should be thought of as where you go when you need to build, train and deploy models.
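To make "build, train and deploy" a little more concrete, here is a minimal sketch of what those three steps can look like in code. It assumes version 2 of the SageMaker Python SDK; the helper names, the instance types, and every argument value (image URI, role ARN, S3 locations) are hypothetical placeholders rather than anything taken from this course.

```python
def estimator_settings(image_uri, role_arn, output_s3_uri):
    """Collect keyword arguments for sagemaker.estimator.Estimator (SDK v2 names).

    All values passed in are placeholders supplied by the caller.
    """
    return {
        "image_uri": image_uri,          # training container image
        "role": role_arn,                # IAM role SageMaker assumes
        "instance_count": 1,             # assumed: a single training instance
        "instance_type": "ml.m5.xlarge", # assumed instance type
        "output_path": output_s3_uri,    # where model artifacts are written
    }


def build_train_deploy(settings, train_s3_uri):
    """Run the three steps; needs the SDK installed and AWS credentials."""
    # Imported lazily so the sketch can be read without the SDK installed.
    from sagemaker.estimator import Estimator

    estimator = Estimator(**settings)        # build: define the training job
    estimator.fit({"train": train_s3_uri})   # train: run it on managed hardware
    # deploy: host the trained model behind a real-time endpoint
    return estimator.deploy(initial_instance_count=1, instance_type="ml.m5.large")
```

Calling build_train_deploy starts real, billable AWS resources, so treat it as an outline of the flow rather than something to run as-is.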
Now, to look at how it fits into the larger Amazon stack, we have this diagram. At the top of this diagram or at the top of the stack, if you wanna be a little more formal, you have some of the application services. These are tools such as Rekognition, Transcribe, Lex, Translate and Comprehend, that are really full service options for developing a very focused machine learning model. They basically provide a heavily pre-trained model that allows you to then sometimes tweak it and fit in some custom identifiers.
If you've been attending the machine learning content course in Cloud Academy, you'll recognize that these application services are more suited to what we call a level one machine learning user. If you haven't attended that class series yet, I strongly advise checking it out. I also taught that one, and in it we really dive into how to actually interact with some of these programs and start to build your own models, whereas this class focuses more on SageMaker as a platform.
So at the top level, just to reiterate, you have some of the straightforward pre-trained machine learning models that maybe allow some tweaking and adjusting. The next layer below application services is platform services. Whereas application services are like going to a bakery and buying a cookie, platform services are more like buying a cake mix with a recipe on the side. It gives you the tools to make your machine learning product, but you need to bring your expertise on how to assemble and how to fine tune it to get the best result.
This layer is where SageMaker lives, along with some other very popular services such as Spark and EMR; perhaps you've also heard of Databricks or DataRobot. Those all live in the platform services range, although with some of their more advanced features they can sometimes drift towards application services.
Once again, to tie this into the machine learning content path, platform services are really what everyone from level two upward would use. Once you start to dabble in making your own models and prepping your own data, you start to really need a platform to get started.
Now below both applications and platforms are frameworks and hardware. Many of the platforms allow you to use these frameworks in order to improve your machine learning product without having to start coding from the ground up. Things such as TensorFlow, MXNet and PyTorch come with the code for models in them, and you just have to start to feed them hyperparameters and data. So it's not like you're starting from the mathematical thesis; you're able to import libraries. Cloud Academy actually has a lot of content on how to use these specific frameworks, so feel free to check that out as well after this class. And also down here are things such as your compute options.
SageMaker and Amazon in general do a really good job of tying together their entire ecosystem. So if you wanna be able to pull in things such as graphics processing units (that's the P3 instances, if you wanna talk about specific servers) or attach extra CPU or memory, Amazon allows you to add hardware in order to help accelerate your solution and change its performance characteristics as you become a more advanced user.
So this is really the stack of how you should think about the machine learning ecosystem on Amazon. It actually really applies to all of the clouds, but the products slotted in here are more AWS specific.
Now, to really understand where SageMaker as a platform sits, let's discuss the machine learning workflow. In reality, all machine learning is, is stepping through these four steps. Now, for those of you who are familiar with the process or are data scientists, you might realize that it's highly reductionist of me to simplify the entire workflow like this, but as a thought exercise, we should think about it in four steps.
So as many of you have probably realized, the first step in making a machine learning model is preparing your data. This broad task includes everything a data engineer would do, such as collecting the data, cleaning it, making it accessible, annotating and transforming it. This step is kind of paradoxical in that it is both the most approachable for people relatively new to machine learning and also the hardest. It's easy to get started by manually cleaning your data, but learning how to do this efficiently, reliably and consistently is actually one of the greatest challenges in machine learning.
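As a small illustration of the cleaning and transforming mentioned above, here is a minimal sketch in plain Python. The records, field names, and cleaning rules are all hypothetical, chosen only to show the shape of the work; a real pipeline would have its own rules.

```python
# Hypothetical raw records; the field names are illustrative only.
raw_rows = [
    {"age": "34", "income": "52000", "label": "yes"},
    {"age": "", "income": "61000", "label": "no"},     # missing age
    {"age": "45", "income": "n/a", "label": "yes"},    # unparseable income
    {"age": "29", "income": "48000", "label": "NO "},  # messy label
]


def to_float(value):
    """Return a float, or None when the value is missing or unparseable."""
    try:
        return float(value)
    except (TypeError, ValueError):
        return None


def clean(rows):
    """Parse numeric fields, normalise labels, and drop incomplete rows."""
    cleaned = []
    for row in rows:
        age = to_float(row.get("age"))
        income = to_float(row.get("income"))
        label = (row.get("label") or "").strip().lower()
        if age is None or income is None or label not in {"yes", "no"}:
            continue  # assumed rule: skip rows we cannot fully parse
        cleaned.append(
            {"age": age, "income": income, "label": 1 if label == "yes" else 0}
        )
    return cleaned


def min_max_scale(rows, field):
    """Rescale one numeric field to the range 0..1 across all rows."""
    values = [row[field] for row in rows]
    lo, hi = min(values), max(values)
    span = hi - lo
    return [0.0 if span == 0 else (v - lo) / span for v in values]


cleaned_rows = clean(raw_rows)
scaled_ages = min_max_scale(cleaned_rows, "age")
```

Doing this by hand for four rows is easy; the hard part described above is making the same rules hold reliably across millions of rows and many data sources.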
Many people say that they spend 60 to 80% of their time in the data preparation step. So don't skip over this thinking that it's just a hurdle on the way to machine learning. Doing a really good job of preparing data and being able to get it together and accessible will make everything else go easier. You can't build a good house on a shaky foundation. So once you've laid your foundation, aka prepared your data, you can start to build your model. And as you've probably seen, there are dozens of different models out there.
We've actually done a short overview of these in some of our other classwork, and Cloud Academy has some deep dives on other ones. Really, what model you use and which framework you pull in is where an experienced data scientist starts to shine. So SageMaker will help you build your model by making the frameworks available and helping you pick the right one.
So once you've built the model or picked it from a framework, you need to start to manage its training, to determine things such as its confusion matrix, its false positive rate and its false negative rate, and generally understand how it runs. In order to do this, you need to set up different training runs and different evaluation conditions, and maybe try running it on different hardware. This is the stage in which you actually need to understand how your model will work in the real world and whether you've made the right choices.
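For anyone who hasn't computed these before, here is a minimal sketch of a confusion matrix and the false positive and false negative rates for a binary classifier. The labels and predictions are made-up values for illustration, with 1 meaning positive and 0 meaning negative.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 or 0)."""
    counts = {"tp": 0, "fp": 0, "tn": 0, "fn": 0}
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            counts["tp"] += 1
        elif actual == 0 and predicted == 1:
            counts["fp"] += 1
        elif actual == 0 and predicted == 0:
            counts["tn"] += 1
        elif actual == 1 and predicted == 0:
            counts["fn"] += 1
    return counts


def error_rates(counts):
    """Return (false positive rate, false negative rate)."""
    negatives = counts["fp"] + counts["tn"]  # all actual negatives
    positives = counts["fn"] + counts["tp"]  # all actual positives
    fpr = counts["fp"] / negatives if negatives else 0.0
    fnr = counts["fn"] / positives if positives else 0.0
    return fpr, fnr


# Made-up evaluation data, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0, 1, 0]

counts = confusion_counts(y_true, y_pred)
fpr, fnr = error_rates(counts)
```

Comparing these numbers across training runs is one simple way to judge whether a change to the model or its hyperparameters actually helped.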
And finally, as if all of this wasn't enough (building a model, prepping the data, training and tuning it), many times one of the hardest parts, and obviously the ongoing part, is deploying the model into a production environment and continuing to monitor it. Many times, data will drift, or maybe the underlying paradigm that the machine learning model governs changes, and you need to be aware of how the model's performance changes over time and of the entire lifecycle management. That's a phrase you've probably heard before, and it's worth remembering.
You need to manage the model's lifecycle. This includes, as we mentioned, detecting drift, but lifecycle management also refers to making sure the underlying hardware scales up and scales down in order to handle load. For those of you interested in this section, this starts to really go towards DevOps and beyond machine learning. So this is a good point where you might need to tap some of your DevOps talent or check out how to size underlying machine learning hardware on Amazon.
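To give a feel for what detecting drift can mean in practice, here is a minimal sketch that flags a feature when its recent average has moved too far from the training-time average. This is a deliberately naive heuristic and an assumption for illustration, not how SageMaker itself monitors models; the threshold of 2.0 standard deviations and the sample values are arbitrary.

```python
from statistics import mean, stdev


def mean_shift(baseline, recent):
    """Shift of the recent mean, measured in baseline standard deviations."""
    base_mean = mean(baseline)
    base_std = stdev(baseline)  # sample standard deviation
    gap = abs(mean(recent) - base_mean)
    if base_std == 0:
        # A constant baseline: any change at all is an unbounded shift.
        return 0.0 if gap == 0 else float("inf")
    return gap / base_std


def has_drifted(baseline, recent, threshold=2.0):
    """Flag drift when the shift exceeds the (assumed) threshold."""
    return mean_shift(baseline, recent) > threshold


# Made-up feature values: training-time data and two production windows.
baseline = [10, 11, 9, 10, 12, 8, 10, 11, 9, 10]
recent_stable = [10, 9, 11, 10]
recent_shifted = [15, 16, 14, 15]

stable_flag = has_drifted(baseline, recent_stable)
shifted_flag = has_drifted(baseline, recent_shifted)
```

A check like this would typically run on a schedule against fresh production data, with a flagged feature prompting a closer look or a retraining run.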
Calculated Systems was founded by experts in Hadoop, Google Cloud and AWS. Calculated Systems enables code-free capture, mapping and transformation of data in the cloud based on Apache NiFi, an open source project originally developed within the NSA. Calculated Systems accelerates time to market for new innovations while maintaining data integrity. With cloud automation tools, deep industry expertise, and experience productionalizing workloads, development cycles are cut down to a fraction of their normal time. The ability to quickly develop large-scale data ingestion and processing decreases the risk companies face in long development cycles. Calculated Systems is one of the industry leaders in Big Data transformation and education of these complex technologies.