Get started with the latest Amazon SageMaker services released at re:Invent in December 2020: Data Wrangler, Pipelines, and Feature Store. We also learn about SageMaker Ground Truth and how it can help us sort and label data.
Get a head start in machine learning by learning how these services can reduce the effort and time required to load and prepare data sets for analysis and modeling. Data scientists often spend 70% or more of their time cleaning, preparing, and wrangling their data into a state where it's suitable for training machine learning algorithms. It's a lot of work, and these new SageMaker services provide an easier way.
Process-wise, it's simple. You select the data you want with SageMaker Data Wrangler's data selection tool, which can set up data sources from pretty much any data repository. Once you've set up a data source, you import the data into SageMaker Studio, and this is where it gets really clever: Data Wrangler has 300-plus data transformation templates, or filters, that you can apply to your data set.
At this point, it's just like clicking a button to apply a filter to improve or transform a photo; it's that easy. This makes it really simple to normalize data by removing spaces, capitalization, trailing commas, and the like. You can also use the templates to transform a dataset, for example making it all sentence case or converting one data type to another, such as varchar to char. And you can do all of these steps without having to write any code. Just like the simplicity of photo filters, you can preview and inspect how the data will look with a template applied, using Data Wrangler's visualization templates.
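To give a feel for what those no-code transforms are doing under the hood, here's a plain-Python sketch of a few typical normalization steps. These helper functions are illustrative only; they are not part of any SageMaker or Data Wrangler API.

```python
# Illustrative cleanup steps of the kind Data Wrangler's built-in
# transforms apply with a click. Function names are our own, not AWS's.

def strip_whitespace(value: str) -> str:
    """Remove leading and trailing spaces."""
    return value.strip()

def drop_trailing_comma(value: str) -> str:
    """Remove a trailing comma left over from a messy export."""
    return value.rstrip(",")

def to_sentence_case(value: str) -> str:
    """Capitalize the first letter and lower-case the rest."""
    return value[:1].upper() + value[1:].lower() if value else value

def normalize(value: str) -> str:
    """Chain the cleanup steps, like stacking filters on a photo."""
    for step in (strip_whitespace, drop_trailing_comma, to_sentence_case):
        value = step(value)
    return value

print(normalize("  NEW ZEALAND,"))  # New zealand
```

The point of the tool is that you get this kind of pipeline by selecting templates in the UI rather than writing and maintaining code like the above.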
So that's data loading and normalization made super simple. And as you get further into the SageMaker workflow, you can see how easy it is to integrate with SageMaker Pipelines, which let you save and reuse machine learning workflows, and with the SageMaker Feature Store, which is basically a way to manage the features for an entire project and share them with the team.
Okay. SageMaker Pipelines enable you to connect the data prepared in Data Wrangler with the modeling process, so you can automate model deployment and management within one SageMaker workflow. The Data Wrangler UI has a nice visual interface that helps explain where things are in the current process and pipeline. While ultimately this makes it easy for you to move data into modeling, it also potentially makes it easy to describe to business stakeholders what's been done and where you're up to in a transformation process.
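The core idea of connecting data prep to modeling can be sketched in a few lines of framework-free Python: each stage consumes the previous stage's output, so wrangled data flows straight into training. This is a conceptual sketch only; the real SageMaker Pipelines SDK defines steps declaratively and runs them on managed infrastructure, and the function names here are our own.

```python
# A minimal, framework-free sketch of the pipeline idea: data-prep,
# feature, and modeling stages chained so each feeds the next.
# All names are illustrative, not SageMaker SDK calls.

def wrangle(raw_rows):
    """Data-prep stage: drop empty rows and strip whitespace."""
    return [row.strip() for row in raw_rows if row.strip()]

def featurize(clean_rows):
    """Feature stage: derive a simple length feature per row."""
    return [(row, len(row)) for row in clean_rows]

def train(features):
    """Modeling stage stub: 'fit' by averaging the feature values."""
    return sum(length for _, length in features) / len(features)

def run_pipeline(raw_rows, steps=(wrangle, featurize, train)):
    """Chain the stages, passing each output on to the next step."""
    result = raw_rows
    for step in steps:
        result = step(result)
    return result

print(run_pipeline(["  cat ", "", "horse  "]))  # 4.0
```

In SageMaker, the equivalent of `steps` is a saved, reusable pipeline definition, which is what makes it shareable across the team and easy to narrate to stakeholders.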
You can save and share these workflows in the SageMaker Feature Store so the wider team can use and reuse them for their own projects. SageMaker Studio itself is provided free of charge. Now, the machines underlying Data Wrangler do attract a charge; however, the AWS free tier gives you 25 hours of the ml.m5.4xlarge instance per month. So that's 25 hours per month free. SageMaker Data Wrangler jobs are billed by the second, which works out to less than a dollar an hour.
There is an AWS SageMaker pricing calculator, so you can get a heads-up on expected costs before you start a massive normalization run. Let's just quickly look at the costs here using Cost Explorer in the AWS Console. We can see the ml.m5.4xlarge notebook came to around $1.29 after a day's usage, so it's a relatively small cost per day; think about $8 per day on average.
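Since jobs are billed by the second, a quick back-of-envelope estimate is easy to script. The hourly rate below is an assumption based on the "less than a dollar an hour" figure above; check the AWS pricing page for the actual ml.m5.4xlarge rate in your region.

```python
# Back-of-envelope estimate for per-second Data Wrangler billing.
# HOURLY_RATE is an assumed figure, not an official AWS price.
HOURLY_RATE = 0.92  # USD per hour, assumption for illustration

def job_cost(seconds: int, hourly_rate: float = HOURLY_RATE) -> float:
    """Cost of a job billed by the second at the given hourly rate."""
    return round(seconds * hourly_rate / 3600, 4)

# A 90-minute normalization run:
print(job_cost(90 * 60))  # 1.38
```

Running a few expected job durations through this before kicking off a large run gives you the same heads-up the pricing calculator does.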
Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.