A Conversation about AWS Glue

Overview
Difficulty: Intermediate
Duration: 9m
Students: 25

Description

In this brief course, our AWS experts Will Meadows and Stuart Scott take a moment to talk about a few special considerations for AWS Glue that are helpful for those thinking about taking the AWS Certified Data Analytics - Specialty certification exam.

Learning Objectives

After watching this course, you will know the pricing structure of AWS Glue and what DPUs are. You will also learn how bookmarks work and what their role is within AWS Glue.

Prerequisites

Basic knowledge of AWS Glue and database analytics / big data workloads.

Transcript

- Hi everyone. I was talking with Will, and he had some really interesting points about AWS Glue and the data analytics certification, so I thought it'd be a really good idea to record a quick session with him. So Will, would you mind sharing some of those thoughts with us?

- Sure, so we were going over the data analytics domain and the certification itself, and we noticed there are some key points that could be discussed here that I think would help someone obtain the certification. 'Cause when I was going over the test myself, I was caught by quite a few interesting aspects of the domain that I needed to look into further for my own education. So I figured if I needed to brush up on some of the skills and topics myself, maybe some of our students would also enjoy that and would like to learn more.

- Excellent. I totally agree. So what did you find out?

- Although it may seem obvious after taking the test, it's actually quite important to know about bookmarks within AWS Glue. When working with long-running ETL jobs, it's necessary to accurately note when or where the previous job finished, along with any important state information. Maintaining this position allows you to process only new data instead of rehashing old content, and AWS Glue deals with this problem through the use of job bookmarks. As an example, you might have an ETL job that reads new partitions of a Parquet dataset stored in Amazon S3. With bookmarks, AWS Glue can keep track of which partitions have already been processed, which prevents duplication of data in the data store and saves you time and money and all that good stuff. I'm not sure how relevant this is for the test, but AWS Glue bookmarks are not available for all file types. Glue version 0.9 supports JSON, CSV, Apache Avro, and XML. Version 1.0 and above supports JSON, CSV, Apache Avro, XML, Parquet, and ORC. So the big difference is the last two.
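The idea behind a bookmark can be illustrated with a toy sketch. This is only a conceptual model, not how AWS Glue stores or tracks bookmark state internally; the function name `run_incremental` and the partition names are made up for illustration.

```python
# Toy illustration of the bookmark idea; NOT how AWS Glue implements it.
def run_incremental(partitions, bookmark):
    """Process only partitions that are not yet recorded in `bookmark` (a set).

    Returns the partitions handled in this run and updates the bookmark
    in place so that the next run skips them.
    """
    new_partitions = [p for p in partitions if p not in bookmark]
    for partition in new_partitions:
        # Placeholder for the real ETL work on this partition.
        bookmark.add(partition)
    return new_partitions


bookmark = set()
first = run_incremental(["dt=2024-01-01", "dt=2024-01-02"], bookmark)
second = run_incremental(
    ["dt=2024-01-01", "dt=2024-01-02", "dt=2024-01-03"], bookmark
)
print(first)   # both partitions are new on the first run
print(second)  # only the partition added since the first run
```

The second run touches only the partition that arrived after the first run, which is the behavior the bookmark is there to provide.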

- So how do you tell Glue to use bookmarks?

- Yeah, that's a very good question. So when running a job with Glue, you can specify whether to use a bookmark by passing the --job-bookmark-option argument, which has a few options. You can enable it (job-bookmark-enable), which allows Glue to keep track of previously processed data, as we discussed, and process only new data since the last checkpoint. You can of course disable it (job-bookmark-disable), which tells Glue to process the entire dataset, leaving you responsible for managing the output of previous jobs. There's also a pause option (job-bookmark-pause), which processes new data since the last successful run without updating the bookmark. You can also use pause to process a specific range of data, again without updating the bookmark. For that, pause has two sub-arguments, the from value and the to value, which give you that range. These arguments are used together, meaning you must give Glue both of them in order for it to function.
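As a minimal sketch of the options just described, the helper below builds the job arguments for each bookmark mode. The helper name `bookmark_arguments` is hypothetical, and the sketch assumes the from and to sub-options are passed as `--job-bookmark-from` and `--job-bookmark-to` with run IDs as values; check the AWS Glue job parameter documentation before relying on it.

```python
# Maps a short mode name to the value of the --job-bookmark-option argument.
BOOKMARK_OPTIONS = {
    "enable": "job-bookmark-enable",
    "disable": "job-bookmark-disable",
    "pause": "job-bookmark-pause",
}


def bookmark_arguments(mode, from_value=None, to_value=None):
    """Build a job-arguments dict for the requested bookmark mode (hypothetical helper)."""
    if mode not in BOOKMARK_OPTIONS:
        raise ValueError(f"unknown bookmark mode: {mode!r}")
    # The from and to values only make sense as a pair.
    if (from_value is None) != (to_value is None):
        raise ValueError("from and to values must be supplied together")
    if from_value is not None and mode != "pause":
        raise ValueError("from and to values only apply to the pause option")

    arguments = {"--job-bookmark-option": BOOKMARK_OPTIONS[mode]}
    if from_value is not None:
        # Assumed argument names for the two sub-options.
        arguments["--job-bookmark-from"] = from_value
        arguments["--job-bookmark-to"] = to_value
    return arguments


print(bookmark_arguments("enable"))
print(bookmark_arguments("pause", from_value="jr_first", to_value="jr_last"))
```

A dict like this could then be supplied as the `Arguments` parameter when starting a job run, for example through boto3's `start_job_run`. The run IDs `jr_first` and `jr_last` are placeholders.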

- Okay. So do you have any good examples of using bookmarks in a real world scenario, like in a production environment?

- Well, one of the other nice things about bookmarks is they can be used to help fix your dataset if you mess something up, which is gonna happen. So there are situations where you might need to refresh or backfill your datasets. So for example, you've found a programmatic error that maybe added, I don't know, plus two to all your data points before they were even entered into the database. There might be ways to fix this downstream, but the most straightforward, to me at least, is just to update the source data and then re-run the job. And you can support these scenarios better by rewinding your job bookmark to the last working job where the data was accurate. So maybe only a subset of the whole dataset was contaminated, and you can just reset the bookmark to that position and rerun the dataset from there. But please keep in mind when resetting and rewinding your bookmarks that AWS Glue does not clean the target files. So it's important to create new target files when rewinding the data in order to prevent duplication.
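The rewind described here might be scripted roughly as follows. This is a sketch under assumptions: it relies on the Glue client's `reset_job_bookmark` call accepting `JobName` and an optional `RunId`, and it uses a small recording stand-in instead of a real `boto3.client("glue")` so it can run without AWS access. The job name and run ID are placeholders.

```python
def rewind_bookmark(glue_client, job_name, run_id=None):
    """Reset a job's bookmark, optionally back to the state of a given run."""
    params = {"JobName": job_name}
    if run_id is not None:
        params["RunId"] = run_id
    return glue_client.reset_job_bookmark(**params)


class RecordingClient:
    """Stand-in for boto3.client("glue"); it only records the calls it receives."""

    def __init__(self):
        self.calls = []

    def reset_job_bookmark(self, **params):
        self.calls.append(params)
        return {}  # The real service returns details of the bookmark entry.


client = RecordingClient()
rewind_bookmark(client, "nightly-etl", run_id="jr_last_good_run")
print(client.calls)
```

Because Glue does not clean the target files, a rerun after a reset like this should write to a fresh target location.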

- Awesome, so bookmarks are there to help keep track of what data you have processed and allow you to backtrack if you need to. Is that right?

- Yeah, pretty much.

- So what else did you learn about Glue that's relevant for the certification?

- Well, if you're doing any kind of development with Glue, you really need to know and understand developer endpoints.

- And what are developer endpoints exactly?

- So when building and testing your ETL architectures and the scripts you use within Glue, you will find that running a job can take quite a while. And when you factor in cluster provisioning time and the time it takes for the test itself to complete, you'll notice a lot of idle time. Equally annoying is the process of having to dig through the logs afterwards to figure out what went wrong in the first place, because when you work with your scripts something will inevitably break. These are the normal pains of development. So instead of having to sit through all this idle time, you can use a developer endpoint, which allows you to interact with Glue in a notebook environment. Notebooks are probably familiar to you if you've worked with Jupyter Notebooks or SageMaker Notebooks, but if not, they're an interactive medium that allows you to iteratively build and test your ETL scripts. AWS Glue allows you to use Apache Zeppelin or Jupyter Notebooks, if you've ever heard of Apache Zeppelin. You could start off running the notebooks locally to test your scripts, or you can run them on an EC2 instance and fully connect to your datasets and all the good stuff that goes with it. The process for setting up a developer endpoint with an attached notebook is basically: create an endpoint, spin up a notebook server on an EC2 instance, securely connect your notebook server to the developer endpoint, and then securely connect a web browser to that notebook server. Just as easy as that. If you have any questions about that specifically, here's a link or something I'll put right here.

- Okay. So the big question Will, is there a cost associated with this or is it all free for us to use?

- Oh God, no, it's not free. That's not how AWS works, a lot of the time, but that does lead me to the next thing I wanna talk about, which is the cost structure of Glue, which is oddly important for the test. AWS Glue has a unique pricing structure that is important to be aware of. You are charged for crawling data, the discovery phase, as well as for the ETL process itself, the processing and loading phases. You are also charged for the storage of the metadata within the AWS Glue Data Catalog. The free tier covers the first million objects that are stored and accessed. If you're testing, you won't run over this, which is particularly nice. But just like many AWS services, you only pay for what you use when you're performing the ETL and the crawling. And so the good news is you'll only be charged for the time the job is actually running. You don't have to worry about the start-up or the shutdown time. So that's cool. And that leads me on to the next thing I wanna talk about, the rate at which you're charged, 'cause it's based on the number of DPUs. These are your Data Processing Units that are assigned to your task. AWS states that a single DPU represents four vCPUs and 16 gigabytes of memory. So you can think of those as worker nodes that are processing your data. The more you use, the faster you go through your tasks, but of course the more you're getting charged. And you're currently getting charged, I believe, 44 cents per DPU-hour, billed by the second. So that's fun.
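The DPU pricing arithmetic can be sketched in a few lines. The rate of 0.44 USD per DPU-hour is the figure quoted in the conversation and may not match current pricing; the function name `job_cost` is made up for illustration.

```python
# Rate quoted in the conversation; check the current AWS Glue pricing page.
PRICE_PER_DPU_HOUR = 0.44  # USD


def job_cost(dpus, runtime_seconds, price_per_dpu_hour=PRICE_PER_DPU_HOUR):
    """Estimate the cost of a job billed per second at a per-DPU-hour rate."""
    dpu_hours = dpus * runtime_seconds / 3600
    return dpu_hours * price_per_dpu_hour


# A job using 10 DPUs for 15 minutes consumes 2.5 DPU-hours.
cost = job_cost(dpus=10, runtime_seconds=15 * 60)
print(f"${cost:.2f}")  # prints "$1.10"
```

Doubling the DPUs doubles the hourly rate, so it only saves money if the job finishes in less than half the time.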

- Is that price varied by region or is it kind of a global fixed price?

- I believe it's a global price, but you can check here. I'll say yes or no on whether that's accurate; sometimes you gotta do some double checking. But one thing it does break down by, which is kind of fun, is the type of job you're running: there are minimum and maximum numbers of DPUs you're required to have. Apache Spark, for example, has a minimum of two DPUs, and by default AWS Glue allocates 10 DPUs to each Apache Spark job. For Glue 2.0 and above, there's a one-minute minimum billing duration, while Glue 1.0 and 0.9 have a 10-minute minimum billing duration, with a maximum of 100 DPUs. These numbers are gonna sound very familiar when I talk about Spark Streaming, which has another minimum of two DPUs and then has five DPUs for each Spark Streaming job. And there's another 10-minute minimum billing duration for Spark Streaming, with a max of 100. And now here's where things get squirrely. You thought 10-minute billing durations, 10 DPUs, minimum of two sounds great. Python shell, however, is probably the strangest, because it has a minimum of 0.0625 DPUs, which is the default setting, and a maximum of one. So that's the one strange one you gotta know. Sorry, it was a bit of an information dump, but I think it's relevant. Oh, and talking about the developer endpoints, those require DPUs too: two at minimum, five allocated by default, and you get billed a 10-minute minimum duration for your developer endpoints. There you go.
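To see how the minimum billing durations affect a short job, here is a rough sketch that uses only the minimums mentioned above. The job-type labels and the function name `billed_cost` are invented for this example, the 0.44 USD rate is the one quoted earlier, and Python shell is left out because no minimum duration is given for it in the conversation.

```python
# Minimum billed duration in seconds, as stated in the conversation.
MIN_BILLED_SECONDS = {
    "spark-glue-2.0-plus": 60,
    "spark-glue-0.9-1.0": 600,
    "spark-streaming": 600,
    "dev-endpoint": 600,
}


def billed_cost(job_type, dpus, runtime_seconds, price_per_dpu_hour=0.44):
    """Estimate cost after rounding short runs up to the minimum billed duration."""
    billed_seconds = max(runtime_seconds, MIN_BILLED_SECONDS[job_type])
    return dpus * billed_seconds / 3600 * price_per_dpu_hour


# The same 30-second, 10-DPU Spark job under the two billing minimums.
short_on_glue2 = billed_cost("spark-glue-2.0-plus", dpus=10, runtime_seconds=30)
short_on_glue1 = billed_cost("spark-glue-0.9-1.0", dpus=10, runtime_seconds=30)
print(f"Glue 2.0+: ${short_on_glue2:.4f}")
print(f"Glue 0.9/1.0: ${short_on_glue1:.4f}")
```

Under these assumptions, the 30-second job is billed for one minute on Glue 2.0 and above but for ten minutes on the older versions, so it costs ten times as much there.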

- Excellent. Thank you very much Will. And I really appreciate you running through that with us. Some really interesting and key points there to make note of, and hopefully that will help a lot of other people that are taking the AWS Certified Data Analytics Specialty exam.

- Yeah.

- Yeah. Thank you very much. Is there anything to close off with just before we finish the session?

- I would say, I spouted a lot of numbers there at the end and they're important, but just read them once or twice before you go into the test and I think that'll be good enough for you.

- Excellent. Thanks very much, Will.

- Yeah, cheers.

- Cheers.

About the Author

William Meadows is a passionately curious human currently living in the Bay Area in California. His career has included working with lasers, teaching teenagers how to code, and creating classes about cloud technology that are taught all over the world. His dedication to completing goals and helping others is what brings meaning to his life. In his free time, he enjoys reading Reddit, playing video games, and writing books.