Using Data Wrangler

Get started with the latest Amazon SageMaker services (Data Wrangler, Pipelines, and Feature Store) released at re:Invent in December 2020. We also learn about SageMaker Ground Truth and how it can help us sort and label data.

Get a head start in machine learning by learning how these services can reduce the effort and time required to load and prepare data sets for analysis and modeling. Data scientists often spend 70% or more of their time cleaning, preparing, and wrangling their data into a state where it's suitable for training machine learning algorithms. It's a lot of work, and these new SageMaker services provide an easier way.


Flow. Okay, so that flow is basically the Data Wrangler service; it's called a flow in here. We're given three options: we can use an S3 bucket, an Athena data source, or Redshift as our data source. Now, if we have a remote data source, we can use Athena to connect to that remote service, and it will appear inside our Athena dashboard. We'll have to wait here, because we haven't connected to the database engine yet.

Okay, while we're waiting, let's just have a look at how much this costs to run. The usage on this is just SageMaker, and so far we've had $8.92 from a full day's usage. So here are our costs for the previous day: our notebook instance (an m5 4xlarge) has been the majority of our cost, with the Kernel Gateway attached as a cost as well, plus a couple of other usage utilities. So it's around $10 a day if you're gonna use this full time. The best way to work with SageMaker, of course, is to just spin it up and spin it back down again as you need it.

Here we are back in our Studio, and we've started up a new flow, which is the Data Wrangler. Now that the data source and service have been connected in the background, we can add a data source. So we choose S3 as the first one. We get a preview of the data here, and it's very quick to load: it loads the first 100 rows as a snapshot. It's called sampling, and you can turn it off.

So the sampling just basically pulls back the first 100 rows. If you turn that off, then it won't immediately preload, so if any data you have is significantly heavy or sensitive, you can turn that off so you don't see it. Now at this point, we've got a name, we've got a URI, and we've got the type. It automatically identifies the type of file, which again is super cool.
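Conceptually, that sampling behaves like a capped read: only the first rows come back for the preview. Here's a rough pandas sketch of the same idea, using made-up ELB-style column names (the actual dataset's columns aren't shown in the demo):

```python
import io
import pandas as pd

# Hypothetical in-memory stand-in for the S3 object; in Data Wrangler
# the preview is read straight from the bucket.
csv_data = io.StringIO(
    "request_verb,port,backend_processing_time\n"
    "get,443,0.12\n"
    "post,80,0.34\n"
)

# Sampling is conceptually like capping the rows read for the preview:
preview = pd.read_csv(csv_data, nrows=100)  # at most the first 100 rows
print(preview.shape)
```

Turning sampling off would correspond to dropping the `nrows` cap and reading the whole object.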

Now, what we can do is just import this dataset, and straightaway we're given this visual representation of what we're doing here. So this is our source, and this whole canvas can be arranged however you like; you can make it bigger or smaller. It's a really fantastic little tool. I'm in love with it already, and I've only been using it for the two days it's been live. And then the data types themselves: this is where Data Wrangler becomes really clever, making it very easy for us to start to normalize our data.

So with this click here, we've got the option to add a transformation or an analysis. Let's do the transformation first: change the case, or remove some trailing characters or spaces, all those things that tend to take up a significant amount of time. This service does all that for us from a simple UI, as simple as applying a filter to a photo.

For example, let's just start with string formatting, always the biggest drama, right? Let's make everything uppercase, lowercase, title case, capitalized, swap case even. How cool is that? Padding, too. We can strip characters from left or right, remove zeros. This is so cool, man. I just love this, because string manipulation is one of the most time-consuming things.

So let's set it all to uppercase. We've got our request verb in lowercase, so what we do here is choose the column we want, and we'll make it our request verb. Here we go.
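That uppercase transform amounts to a one-line pandas operation. A minimal sketch, assuming a hypothetical request_verb column like the one chosen in the demo:

```python
import pandas as pd

# Toy frame with a lowercase request_verb column (hypothetical data).
df = pd.DataFrame({"request_verb": ["get", "post", "get"]})

# The "Format string -> Upper case" template is roughly equivalent to:
df["request_verb"] = df["request_verb"].str.upper()
print(df["request_verb"].tolist())
```

The other case options (lowercase, title, capitalize, swap case) map onto `str.lower()`, `str.title()`, `str.capitalize()`, and `str.swapcase()` in the same way.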

Now we can change the name of it if we wish. We can also manipulate the columns themselves underneath some of these other options, but let's just do this first. We get a preview, so we basically get to see what it's gonna do. How's that? Straight away, it's just made that transformation for me. Fantastico. So I can add that, and I can do others.

So I don't have to do just one at a time, you know; I can manage the rows, I can move the rows. Oh, look at this, this is so cool. We can sort them or shuffle them. Through sort, we'll sort by timestamp. How cool is this, so easy? Even if you're doing a lot of your manipulation in Athena first, all of these templates just make it so simple. We can preview that; it's gonna reorder those rows for us. How good is it? All right, so add that.
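The sort step maps onto a pandas `sort_values` call. A sketch with made-up timestamps (the column names here are assumptions, not the demo's actual schema):

```python
import pandas as pd

# Hypothetical log rows, deliberately out of order.
df = pd.DataFrame({
    "timestamp": ["2020-12-03T10:02:00", "2020-12-03T10:00:00", "2020-12-03T10:01:00"],
    "request_verb": ["GET", "POST", "GET"],
})
df["timestamp"] = pd.to_datetime(df["timestamp"])

# "Manage rows -> Sort" by timestamp, roughly:
df = df.sort_values("timestamp").reset_index(drop=True)
print(df["request_verb"].tolist())
```

A shuffle would be the same idea with `df.sample(frac=1)` instead of `sort_values`.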

So those are my two transformations done. Now I can do way more; there's pretty much no limit. There's over 300 of these templates in here, really, really powerful functions, and you can also just write your own in here, which is another benefit: we can use any piece of code we want, and we don't really wanna go back into the notebook to do it ourselves. This is simple to do, and I like it for when we've just got one simple change we may need to make, all that back and forth between data engineer and data scientist where we may have found one or two anomalies in the data. Rather than having to go all the way back and redo it, we can just open this flow again, add the new rule, and then reconnect it all up again. That's where it gets super powerful, as if this wasn't enough.
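Data Wrangler's custom transforms let you write your own pandas code against the dataset; the exact interface may vary, but the kind of one-line anomaly fix described above might look like this (assuming the frame is exposed as `df`, with a hypothetical backend_processing_time column):

```python
import pandas as pd

# Toy frame standing in for the dataset handed to a custom transform.
df = pd.DataFrame({"backend_processing_time": [0.12, -1.0, 0.34]})

# Hypothetical anomaly fix found late in the process: a negative processing
# time can't be real, so null it out instead of rerunning the whole prep.
df.loc[df["backend_processing_time"] < 0, "backend_processing_time"] = None
print(df["backend_processing_time"].isna().sum())
```

The point of the flow model is that this patch lives as one extra step you can re-run, rather than a full round trip back through the notebook.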

So here we are: we've got our transformation here with two steps in it, we've got our data here, and now we can add an analysis to this as well. At this next point, and again, just let me reiterate, the flexibility of this is phenomenal. We can join data sources: outer, left, right, any type of join statement we like, basically. It's very, very easy to do joins inside this.

So it's another very, very powerful thing here. I won't get into it in too much detail; basically, there's not much you can't do. You can concatenate data views as well. So if you are joining two data sources and we want to combine and concatenate them, we can do all that from here, which we'd probably try to do in SQL if we didn't have this ability, so it's gonna save us a lot of time.
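Concatenation is the simpler of the two combine operations: stacking rows from two sources with the same shape. A quick pandas sketch with invented data:

```python
import pandas as pd

# Two hypothetical sources with matching columns.
jan = pd.DataFrame({"port": [443], "requests": [120]})
feb = pd.DataFrame({"port": [80], "requests": [95]})

# Data Wrangler's concatenate node is roughly a row-wise stack:
combined = pd.concat([jan, feb], ignore_index=True)
print(len(combined))
```

This is the `UNION ALL` you'd otherwise write in SQL.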

So if we add the analysis here, this is where we can just quickly look at our data and check what's going on, to see if there's anything that's not quite right. So let's say client response time: we can choose the graph type, and we've got plenty of options here. We'll just let this refresh while we're waiting. You know, all of the common ones: histogram, scatterplot, even quick modeling. We can do that quite quickly from here. There we go.

So it's just given us that quick preview. We can apply this, and it will create this analysis for us, and we can do as many of these, again, as we want. Let's add another one: we'll do a scatter plot, and we'll do it on the ports, what port's used, and the backend processing time. So this is all super simple; you can literally build all this in a matter of minutes. And then using the projects feature, you can share this amongst the team. It's just the best.

Those are our steps: we've got a format string, we've got our manage rows. We've got our analysis here, and we can create a new one, and we'll choose the request processing time, the backend processing time, and the backend port. And you can see we've got all the data down here as well, so it's really easy to work with this plot. So we can create that. Now we've got this one. If we go back to our flow now, we'll just save this first so that we don't lose what we've got.

So we've got two steps, we've got format string, and we can export it from here. How cool is this? Okay, so now this is where we can use the pipeline function. Once you've built this transformation, and we've done a quick analysis to see how it's going to look, we can export it to a pipeline. We can do it as Python code even, or we can export it to the Feature Store. So we'll do that first, and then we're gonna make a pipeline of it as well.

Okay, so first off we need to choose a kernel, so we'll use a Python 3 one. Again, it's just another notebook, but it's all linked together, and it takes out a lot of the toing and froing; it's just telling us what we're doing here, and this is the pipeline. So that's on its way. Back in our flow, we can export this to a pipeline. There it is. So that's our entire transformation all done for us.

What do you think about that? Isn't that just so cool? It gets a little tricky trying to remember where you are, but once you get your head around it, you're fine. So we've created the analysis on that one particular transformation, but not both. That's interesting: we can consider that this analysis is per transformation, which is also quite useful, because that way you can actually check that the transformation you're doing is gonna work in the way that you're expecting. The little icon there tells us that we have added an analysis.

So let's add one here for this one as well. We'll just do a histogram to, of course, check some case statements, so we'll use the string and the request verb. Now it's just telling us that this is the first one, and this is our second one, right? It's a pretty messy transformation, but that's fine. It's really difficult to show, since all we did was string manipulation, but that's all good.

Back in our flow view now, and remember, you can move these around. This is really cool; it basically gives you a view of everything. So once it gets quite complex, you can zoom this in and out, you can move it around, you can realign things. I think this is so neat. You know, we've got multiple data sources. Let's add another data source in here, and maybe we can combine them and look at a few things.

Here we are, we'll take that... nope, can't do gzip. So just a slight FYI: it does not currently support zipped files. I haven't quite worked out how to do that from here yet. Obviously, you would just go into the bucket and unzip the file.

Okay, so it doesn't support text files either; it's only CSV or Parquet. I'm sure we'll see some other import types supported over time. Let's choose an Athena data source. We'll use the catalog here; all we need to do is choose a database. And if we have a remote database, we can set it up via Athena, and we'll be able to connect to it from here.

We'll use the ELB logs, so it's Athena we're using SQL from. And if we run that, we should get a result set. This is the way to connect to a remote data source of any type, basically, and of course, using it from Athena is another easy way to manage and manipulate along the way if you need to. Okay, so we'll import that from Athena. And there we go, look at that.

There's our second data source there, and we can add our transformations onto this. Okay, so it's telling us that we haven't got permission to do that. Alright, so that makes sense; if we had permission, then that join would work. So I'll just change that. We'll leave it in there for now, but we can try a transformation as well. We'll probably get the same error. Actually, if we try a retry there... there we go.

So what's happening here with Athena is that SageMaker doesn't have the correct access rights for Athena at this point, so I'd have to reconfigure that. Okay, so let's get back to doing a little bit of work here. We've got a sample there; let's import that data set. There we go.

Now, if we want to add a join in here, we click our little button here, and that would join these two together. So we choose our second data source. While you're previewing things, you've got to remember to actually enact them, so delete that. Yes, and I'm going to add another import, another dataset. Okay, there we go, that's what we want, so we'll take that raw data. We've got a preview of 100 rows because we've got enable sampling showing here. So now we'll import that data set.

Oh, how do you like that? So remember... okay, I'd been thinking the screen hadn't been working, but it has; it's just that you've got to remember you've got this overall view here. So I navigate: all my samples are down here, and now I've got these two, which are the ones I want. So now I've got my raw data.

Now I wanna do a join on these two, so I choose join. Move you up there. Okay, so that's our left and right of the join, super cool. Then we can configure the join type, and we'll go left, left outer, and here we just need to choose some fields and apply those. So I've now joined these two data sets, which is gonna give me some really interesting results. And I go add. Cool bananas, so there's my join, mixing it up.
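That left outer join is equivalent to a pandas `merge`. A sketch with two toy tables joined on a hypothetical port key (the demo's actual join fields aren't named):

```python
import pandas as pd

# Two hypothetical sources sharing a "port" key.
left = pd.DataFrame({"port": [443, 80], "requests": [120, 95]})
right = pd.DataFrame({"port": [443, 8080], "region": ["us-east-1", "us-west-2"]})

# Left outer join: keep every left row, fill unmatched right columns with NaN.
joined = left.merge(right, on="port", how="left")
print(joined)
```

Swapping `how` for `"right"`, `"outer"`, or `"inner"` covers the other join types the UI offers.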

I can do any transformations I want at any point here, so I might just add a few to this data first. Let's add a transform to this first: we can do strings, text, remove zeros, right? Zeros and nulls, always a massive one. This is so cool. Out of these 300 templates, you will find what you need; once you get familiar with all of them, there's nothing you can't do, basically. So zeros and nulls: we'll handle those as numeric characters.
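Handling zeros and nulls is another transform that reduces to a line or two of pandas. A hedged sketch, again on a hypothetical numeric column:

```python
import pandas as pd

# Toy column containing a null and a zero (hypothetical data).
df = pd.DataFrame({"backend_processing_time": [0.12, None, 0.0]})

# A "handle missing values" style step: replace nulls with a numeric
# sentinel so downstream numeric transforms don't choke on them.
df["backend_processing_time"] = df["backend_processing_time"].fillna(0.0)
print(df["backend_processing_time"].isna().sum())
```

Whether zero is the right fill value depends entirely on what zero means in your data; imputing a mean or median is the same one-liner with a different argument.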

All right, so back to the data flow. We could also do a quick analysis on this, so we're gonna do a histogram, and we'll just do it on perhaps one or two of these. Cool, isn't that neat? Oh, and you can do facets as well. Man, so cool. Not that that's a very good plot, but incredible, incredible power here.

It makes it super easy. All right, so you've got everything at your fingertips here. This is just the best. At this point, we may be ready; we've proven that this is probably gonna model okay, and then it's just a question of exporting it. So we can export the step, we can export it to our Feature Store, so we can save this. And if we wanna come back and redo it, or if we have to make any changes to it, then it's very easy for us to do, right? So that's just the best integration in my view. It just makes it so simple.

So what's going on behind the scenes here? Let me just have a quick look behind the scenes at what the markup is. You know, it's essentially writing down all of these steps for us, so we can import and export them between other projects, or we can manipulate them how we want. So it really is super, super flexible, and a very easy UI to work with as well. I really like this; it makes it very easy to share with the other members of the team and show people what's going on. I can download it, saving it as a flow. I'm wondering if I can export this image of some sort... I'll just snapshot it there. There we go.

This is the easiest and simplest way of doing data manipulation that I think I've ever seen. I highly recommend getting into this tool if you're working with data in any way; it's gonna make your life way easier. And the fact that we can share amongst projects as well, and the pipeline itself, makes it even easier. I think the file types for imports are probably the only thing that, over time, we'll see more added to. But remember, you can use Athena to connect to remote data sources of any sort.

You just need to be thinking CSV or Parquet. I think support for other formats would be the first thing we'd want, but we could easily transform a text file or any other file to CSV in Athena and then have it show up in here as a data source. That's fantastic. We can see our Feature Store projects in here, and we've just been working with no code, basically, but it's creating a new feature group.

So the Feature Store is a really easy way to share flows, transformations, and manipulations across teams. You can basically save it in the Feature Store, and then we can revisit that flow and make any improvements or minor adjustments to it if we need to. And we can do online storage, or offline in our own S3 bucket. Now, the one thing to consider here is that you need to store this in the same region.

So I would recommend using offline storage so you can use your own buckets. We can do our definitions if we need to. Okay, so we're gonna just get a template; these templates save us a lot of time. We've got organizational ones as well, which we'll look at in a minute, but this is a real quick way of creating a simple project. Right, so we'll just give it a name.


Introduction to SageMaker Data Wrangler - Getting Started with Data Wrangler - Setting Up SageMaker to Run Data Wrangler - Introduction to SageMaker Ground Truth - Service and Cost Review

About the Author
Learning Paths

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.