Setting Up SageMaker to Run Data Wrangler

Get started with the latest Amazon SageMaker services (Data Wrangler, Pipelines, and Feature Store) released at re:Invent in December 2020. We also learn about SageMaker Ground Truth and how it can help us sort and label data.

Get a head start in machine learning by learning how these services can reduce the effort and time required to load and prepare data sets for analysis and modeling. Data scientists often spend 70% or more of their time cleaning, preparing, and wrangling their data into a state where it’s suitable for training machine learning algorithms. It’s a lot of work, and these new SageMaker services provide an easier way.


We start using the Data Wrangler service from SageMaker Studio. The first thing we need to do is set up the right notebook instance for Data Wrangler to run on. So, from SageMaker Studio, select Create notebook instance and give your instance a name. The most important setting is the instance type: you need a minimum of an m5.4xlarge instance (listed as ml.m5.4xlarge in SageMaker) to run the Data Wrangler service.

We have the option to set Elastic Inference, which lets you add inference acceleration to a hosted endpoint for less cost than using a full GPU instance. So, choose that if you want it. We can limit access rights with an IAM role, we can turn off root access, and we can turn encryption on or off.
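The same settings can also be supplied programmatically. The following is a minimal sketch, assuming the boto3 SDK and its SageMaker `create_notebook_instance` call; the helper names (`build_notebook_request`, `create_notebook`), the instance name, and the role ARN are hypothetical and only illustrate how the options described above map onto request parameters.

```python
from typing import Any, Dict, List, Optional

# SageMaker API name for the m5.4xlarge size mentioned in the walkthrough.
MIN_INSTANCE_TYPE = "ml.m5.4xlarge"


def build_notebook_request(
    name: str,
    role_arn: str,
    instance_type: str = MIN_INSTANCE_TYPE,
    root_access: bool = False,
    kms_key_id: Optional[str] = None,
    accelerator_types: Optional[List[str]] = None,
    tags: Optional[Dict[str, str]] = None,
) -> Dict[str, Any]:
    """Build keyword arguments for create_notebook_instance (hypothetical helper)."""
    request: Dict[str, Any] = {
        "NotebookInstanceName": name,
        "InstanceType": instance_type,
        "RoleArn": role_arn,  # IAM role that limits what the notebook can access
        "RootAccess": "Enabled" if root_access else "Disabled",
    }
    if kms_key_id:
        # Encrypt the attached storage volume with this KMS key.
        request["KmsKeyId"] = kms_key_id
    if accelerator_types:
        # Optional Elastic Inference accelerators, e.g. ["ml.eia2.medium"].
        request["AcceleratorTypes"] = list(accelerator_types)
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request


def create_notebook(request: Dict[str, Any]) -> str:
    """Send the request to SageMaker; requires boto3 and AWS credentials."""
    import boto3  # imported lazily so the builder above works without AWS access

    client = boto3.client("sagemaker")
    return client.create_notebook_instance(**request)["NotebookInstanceArn"]
```

For example, `create_notebook(build_notebook_request("wrangler-demo", "arn:aws:iam::123456789012:role/ExampleRole"))` would request an ml.m5.4xlarge instance with root access turned off; both argument values are placeholders.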

We have networking, Git repository, and tag options as well. Once we have our notebook instance running on an m5.4xlarge instance type, we will be able to use the Data Wrangler service. Now, that takes a little while to provision, as you will find with most of the services in SageMaker Studio when you're using them for the first time. Don't be alarmed. It gets much quicker over time.

Okay. We can see now that we've got one notebook instance in service, so we can go ahead and open the JupyterLab environment. First of all, let's have a quick look at the actual Jupyter notebook itself. This is just the raw notebook without the SageMaker interface over the top of it, so it just gives us the basic info. Now, if we open Studio, we get our full SageMaker Studio view.
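If you script the setup rather than watching the console, the provisioning wait described above can be handled with a small polling loop. This is a sketch under stated assumptions: `wait_until_in_service` is a hypothetical helper, and `describe` is expected to behave like boto3's SageMaker `describe_notebook_instance`, returning a dictionary with a `NotebookInstanceStatus` key.

```python
import time
from typing import Any, Callable, Dict


def wait_until_in_service(
    describe: Callable[..., Dict[str, Any]],
    name: str,
    poll_seconds: float = 30.0,
    max_attempts: int = 40,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll the notebook instance status until it reports InService."""
    for _ in range(max_attempts):
        status = describe(NotebookInstanceName=name)["NotebookInstanceStatus"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Notebook instance {name!r} failed to provision")
        # Still Pending (or another transitional state): wait and try again.
        sleep(poll_seconds)
    raise TimeoutError(f"Notebook instance {name!r} was not in service in time")
```

With AWS credentials configured, a call could look like `wait_until_in_service(boto3.client("sagemaker").describe_notebook_instance, "wrangler-demo")`, where the instance name is a placeholder. boto3 also ships a built-in `notebook_instance_in_service` waiter that does the same job; the explicit loop is shown only to make the waiting step visible.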

All right, so here we are. There are a few ways you can start a new data flow: either use the File, New, Flow command, or access it from inside the project manager. A flow is basically a Data Wrangler project, so choosing a new flow is how we create a new data wrangle. Let's go to File, then New, and then Flow. First of all, we need to give it a name, something that we can recognize. We have these four stages: import, prepare, analyze, and export. So, the first thing we want to do is create a data source. Again, it takes a little while to establish the connection to the engine, so be patient the first time you run this. You'll see here that it's not quite available yet. It will be ready in a few minutes. Okay, we're ready to go now.



About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.