SageMaker Studio - Getting Started with Data Wrangler

Get started with the latest Amazon SageMaker services released at re:Invent in December 2020: Data Wrangler, Pipelines, and Feature Store. We also learn about SageMaker Ground Truth and how it can help us sort and label data.

Get a head start in machine learning by learning how these services can reduce the effort and time required for you to load and prepare data sets for analysis and modeling. Data scientists will often spend 70% or more of their time cleaning, preparing, and wrangling their data into a state where it’s suitable for training machine learning algorithms. It’s a lot of work, and these new SageMaker services provide an easier way.


When you look at the console, it's really quite difficult to tell where to get started with these new services. There are some steps that you need to complete in SageMaker Studio before you can start using, or even accessing, the Data Wrangler tool. The first step is to provision SageMaker Studio if you haven't done this already.

Now, if you need to provision SageMaker Studio for the first time, I'll show you how to do that. Otherwise, skip to the next lecture, which is Setting up the SageMaker Data Wrangler. Okay. So let's walk through setting up SageMaker Studio.

Now, to do this, there are two options. You can use the Quick start, or you can set up the account to be run as a team account. If you're just starting out, it's best to use the Quick start. So open the SageMaker console and choose SageMaker Studio from the top left-hand side of the page. On the Studio setup page, under Get started, choose Quick start. Okay, let's create a name for our Studio. We can keep the default name if we want or make up our own. The name can have up to 63 characters, using letters, numbers, and hyphens.
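The naming rule above can be checked before you type a name into the console. This is a minimal sketch in Python, not an official validator: the function name `is_valid_studio_name` is made up for illustration, and the requirement that the name starts and ends with a letter or number is an assumption that goes beyond what is said here.

```python
import re

# Letters, digits, and hyphens only. Requiring a letter or digit at both
# ends is an ASSUMPTION for this sketch; the lecture only states
# "up to 63 characters, using letters, numbers, and hyphens".
_NAME_PATTERN = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]*[A-Za-z0-9])?")
MAX_NAME_LENGTH = 63


def is_valid_studio_name(name: str) -> bool:
    """Return True if the name fits the length and character rules."""
    if not 1 <= len(name) <= MAX_NAME_LENGTH:
        return False
    return _NAME_PATTERN.fullmatch(name) is not None
```

For example, `is_valid_studio_name("my-studio-1")` passes, while a name containing an underscore or running past 63 characters does not.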

Okay. We need to choose a role for SageMaker to execute under. For the execution role, you can either choose one from the role selector, create a new role, or enter the ARN of an existing IAM role. If you choose to create a new role, the Create an IAM role dialog appears, and from here we can set what we want the role to be. Whichever option you choose, the role must have the AmazonSageMakerFullAccess policy attached to it.
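If you would rather create the execution role from code than from the dialog, the same idea can be sketched with boto3. This is only a sketch under assumptions: the helper names and the role name are made up, the IAM client is passed in by the caller (for example `boto3.client("iam")`), and it attaches nothing beyond the AmazonSageMakerFullAccess managed policy mentioned above.

```python
import json

# AWS managed policy the lecture says the role must have attached.
SAGEMAKER_FULL_ACCESS_ARN = "arn:aws:iam::aws:policy/AmazonSageMakerFullAccess"


def sagemaker_trust_policy() -> dict:
    """Trust policy that lets the SageMaker service assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"Service": "sagemaker.amazonaws.com"},
                "Action": "sts:AssumeRole",
            }
        ],
    }


def create_execution_role(iam_client, role_name: str) -> str:
    """Create the role, attach the managed policy, and return the role ARN.

    `iam_client` is expected to behave like boto3.client("iam").
    """
    response = iam_client.create_role(
        RoleName=role_name,
        AssumeRolePolicyDocument=json.dumps(sagemaker_trust_policy()),
    )
    iam_client.attach_role_policy(
        RoleName=role_name,
        PolicyArn=SAGEMAKER_FULL_ACCESS_ARN,
    )
    return response["Role"]["Arn"]
```

Calling `create_execution_role(boto3.client("iam"), "my-studio-role")` with a hypothetical role name would return an ARN you can paste into the custom IAM role ARN field.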

Now, you might find that when you first try to create this, it errors out. If it does, go back in and do it again. You'll notice that the SageMaker full access policy is there if you didn't already have it. The next step is to specify any additional S3 buckets that the role should be able to access. If you don't want to add access to any more buckets, just choose None.

Okay, so now we create the role. Now, as I mentioned, there are two options for the setup. We can do this quick setup, or we can use the standard setup, which is basically for teams and projects. The standard setup gives you a little bit more control over how you provision the Studio, and you can use either AWS SSO authentication or IAM. If you're using the standard setup with SSO, then each member gets a unique sign-in URL that directs them to the Studio, and they sign in with their SSO credentials.

One FYI: if you're using SSO, then the organization account needs to be in the same AWS region as the Studio, okay? So just keep that in mind if you are planning on using SSO. The standard setup gives you a little more granular control than the Quick start. With the standard setup, we can select the VPC we want to run in and the subnets we want to use, and we can limit the network access for Studio to either public internet only or VPC only.

We can set security groups. We also have the option to set encryption using one of our KMS keys if we have one. And we can add tags. Once these are all set up, we hit Submit and then prepare to wait for a little while. Eventually it will get there. Don't worry about the wait, it's well worth it. You'll see the status is ready, the execution role is created, and the authentication method is set. We can see the settings we've got in here, and this is where we can enable projects if we want them. They're very useful for Studio projects, which we'll walk through a little later.
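The standard setup options just described (authentication method, VPC, subnets, network access, security groups, KMS key, and tags) map onto the parameters of the SageMaker `create_domain` API. The following is a rough sketch that only assembles the request; the helper name `build_domain_request` and every ID in the usage note are placeholders, not values from this lecture.

```python
from typing import Dict, List, Optional


def build_domain_request(
    domain_name: str,
    auth_mode: str,
    execution_role_arn: str,
    vpc_id: str,
    subnet_ids: List[str],
    vpc_only: bool = False,
    security_group_ids: Optional[List[str]] = None,
    kms_key_id: Optional[str] = None,
    tags: Optional[Dict[str, str]] = None,
) -> dict:
    """Assemble keyword arguments for a SageMaker create_domain call."""
    if auth_mode not in ("SSO", "IAM"):
        raise ValueError("auth_mode must be 'SSO' or 'IAM'")

    request = {
        "DomainName": domain_name,
        "AuthMode": auth_mode,
        "DefaultUserSettings": {"ExecutionRole": execution_role_arn},
        "VpcId": vpc_id,
        "SubnetIds": list(subnet_ids),
        # "Public only or VPC only" from the setup page.
        "AppNetworkAccessType": "VpcOnly" if vpc_only else "PublicInternetOnly",
    }
    # Optional settings are only included when the caller supplies them.
    if security_group_ids:
        request["DefaultUserSettings"]["SecurityGroups"] = list(security_group_ids)
    if kms_key_id:
        request["KmsKeyId"] = kms_key_id
    if tags:
        request["Tags"] = [{"Key": k, "Value": v} for k, v in tags.items()]
    return request
```

With a client from `boto3.client("sagemaker")`, the result could be passed along as `client.create_domain(**build_domain_request(...))`, using your own VPC, subnet, and role identifiers.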

We can access the Studio now that this is provisioned, but remember, it can be quite a while before it actually starts up. So don't be alarmed if you end up waiting for five minutes while the Studio is provisioned. Now, just a quick word: if you wish to move from the Quick start to the standard setup, with SSO or IAM, you actually need to delete your original SageMaker Studio. To do that, you have to remove all of the applications and all the instances. With the Quick start, the Studio itself is basically listed as a user.
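If you provision from code, the wait can be handled with a simple polling loop around `describe_domain`. This is a sketch under assumptions: the helper name, poll interval, and timeout are arbitrary, and it treats the API status `InService` as the equivalent of the ready status shown in the console.

```python
import time


def wait_for_domain(
    sagemaker_client,
    domain_id: str,
    poll_seconds: float = 15.0,
    timeout_seconds: float = 900.0,
    sleep=time.sleep,
) -> str:
    """Poll describe_domain until the Studio domain is in service.

    `sagemaker_client` is expected to behave like boto3.client("sagemaker").
    `sleep` is injectable so the loop can be exercised without real delays.
    """
    waited = 0.0
    while True:
        status = sagemaker_client.describe_domain(DomainId=domain_id)["Status"]
        if status == "InService":
            return status
        if status == "Failed":
            raise RuntimeError(f"Domain {domain_id} failed to provision")
        if waited >= timeout_seconds:
            raise TimeoutError(f"Domain {domain_id} still {status} after {waited:.0f}s")
        sleep(poll_seconds)
        waited += poll_seconds
```

The generous default timeout reflects the point above: several minutes of waiting is normal and not a sign that something went wrong.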

So once you've removed all of the applications, you can delete the user for the Studio. When you've done that, you get the option to delete the Studio itself, and that removes it completely. Then you can go through and set it up again using the standard setup, choosing either SSO or IAM.
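The deletion order described here (applications first, then the user, then the Studio) can be written down as an ordered plan of SageMaker API calls. This is a sketch only: the helper name is made up, it assumes every app belongs to a user profile, and it builds the plan without executing it, because each deletion is asynchronous and has to finish before the next step will succeed.

```python
from typing import Dict, List, Tuple

Step = Tuple[str, dict]


def plan_studio_teardown(
    domain_id: str,
    apps: List[Dict[str, str]],
    user_profile_names: List[str],
    retain_home_volume: bool = True,
) -> List[Step]:
    """Return (operation, kwargs) pairs in the order they must run.

    `apps` uses the item shape returned by list_apps (AppName, AppType,
    UserProfileName, Status). ASSUMPTION: each app has a UserProfileName.
    """
    steps: List[Step] = []
    # 1. Applications that are not already gone or on their way out.
    for app in apps:
        if app.get("Status") in ("Deleted", "Deleting"):
            continue
        steps.append(("delete_app", {
            "DomainId": domain_id,
            "UserProfileName": app["UserProfileName"],
            "AppType": app["AppType"],
            "AppName": app["AppName"],
        }))
    # 2. User profiles, once their apps have been removed.
    for name in user_profile_names:
        steps.append(("delete_user_profile", {
            "DomainId": domain_id,
            "UserProfileName": name,
        }))
    # 3. The Studio domain itself; "Delete" also removes the home volume.
    retention = "Retain" if retain_home_volume else "Delete"
    steps.append(("delete_domain", {
        "DomainId": domain_id,
        "RetentionPolicy": {"HomeEfsFileSystem": retention},
    }))
    return steps
```

Each step could then be run as `getattr(client, operation)(**kwargs)` against `boto3.client("sagemaker")`, waiting for the previous deletion to complete before issuing the next one.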

About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.