If you have data that needs to be subjected to analytics, then you will likely need to put that data through an extract, transform and load (ETL) process, AWS Glue is a fully managed service designed to do just this. Through a series of simple configurable options, you can select your source data to be processed by AWS Glue allowing you to turn it into cataloged, searchable and queryable data. This course will take you through the fundamentals of AWS Glue to get you started with this service
The objectives of this course are to provide you with and understanding of:
- Serverless ETL
- The knowledge and architecture of a typical ETL project
- The prerequisite setup of AWS parts to use AWS Glue for ETL
- Knowledge of how to use AWS Glue to perform serverless ETL
- How to edit ETL processes created from AWS Glue
This course is ideal for:
- Data warehouse engineers that are looking to learn more about serverless ETL and AWS Glue
- Developers that want to learn more about ETL work using AWS Glue
- Developer leads that want to learn more about the serverless ETL process
- Project managers and owners that want to learn about data preparation
As a prerequisite to this course you should have familiarity with:
- One ore more of the data storage destinations offered by AWS
- Data warehousing principles
- Serverless computing
- Object-orientated programming (Python)
We welcome all feedback and suggestions - please contact us at email@example.com if you are unsure about where to start or if would like help getting started.
Hi, in this demo, I review the basics of AWS Glue as we navigate through the lifecycle and processes needed to move data from AWS S3 to an RDS MySQL database. Before I begin the demo, I want to review a few of the prerequisites for performing the demo on your own. I recommend that you watch the previous video which goes over AWS Glue features. Also, you must set up a security IAM role that will work with AWS Glue and the source and destination data. I also recommend that you have some familiarity and understanding of AWS S3 which is what is used as a data source and RDS MySQL which is what is used as a destination for the database. And that you're familiar with either the Python or Scala languages. In this video, the Python option is used. Note that if you do any serious customizations to transform the data before ultimately loading it to your destination, you will want to know the language Of choice of either Python or Scala. In a way, Glue is a nudge to many ETL developers to learn a programming language for moving data. I
n this demo, I will give a brief overview of tools and show how to create crawler and launch to crawl our source data which is a CSV file in S3. Also create a connection to our destination table and I'll demonstrate what classifiers do for crawling. We'll create an ETL job to extract and transform data. Then we'll run the ETL job to extract, transform and load data to our destination table and then I'll show the data in the destination table. Here's a diagram to show the lifecycle of the demo. Here's our S3 bucket. We're going to use the crawler to then create a database in the data catalog and then we'll run our create and run our ETL job to move data from the S3 source to the MySQL database. After that, you have the option of using Amazon QuickSight or some other BI tool to view visualizations and to analyze your data.
The flights log data is free for you to download from the Bureau of Transportation Statistics and dates all the way back to January 1995 through January 2019. The flights comma-separated file or CSV file that amounts to 25 megabytes has been uploaded to my S3 bucket called flights-npw. AWS Simple Storage Service or AWS S3 is an object storage and can be used to store and protect any amount of structured and unstructured data. I use a smaller file for the purpose to demonstrate what AWS Glue can do to extract, transform and load data even though AWS Glue along with other ETL tools can move huge amounts of data with relative ease and speed. On a side note, there is a feature in S3 which allows you to view data from a file before you consume it somewhere else like in an ETL job. Simply select the file, then click on Select from, choose the format and you can either click on Show file preview or click Next and run the SQL expression. This is a fairly convenient method because you can see and evaluate at a glance the raw data before you load it. So let's go right to AWS Glue.
The first thing we want do is create a crawler. Now, when we create a crawler, in the process, we will create our database and our table which is the data catalog and stores the metadata for our data source or data destination. In this case, we're just creating a crawler for the data source. Click Add crawler, then I'll select my data source which is the CSV file that we downloaded from Bureau of Transportation Statistics again that we uploaded to our S3 bucket here. Click Next. Now we could add another data store if we wanted to, in this case, we're not and then I've already created my IAM role and then we can schedule the crawler to run hourly, daily or on demand. We'll just leave it on demand but sometimes you have new files come into your bucket and you want to crawl more regularly. Capture the schema changes.
Let's add our database. Call it mymetadatafromcrawler. We're not adding a prefix to the table name. And then we're done. And all we need to do is run the crawler. Now, while the crawler is running, it'll produce some logs and these logs will tell us if the crawler has been successful and also allow us to troubleshoot if there are any issues. That's an example of a log where a crawler ran successfully. In addition to running the crawler, we also want to create a connection to our data destination. We know that we're connecting to MySQL database. And we can click Finish. Now, one thing to note about crawlers is that we can add a classifier to our crawler to help define our data source and there're a number of options, these are classifier types. If you choose Grok, then you an choose a classification or data format type and you can provide a Grok pattern. You can also select XML, JSON or CSV and one of the more common data sources is a CSV file but CSV files are different where they may have be comma-separated tab or pipeline separated. In this case, you want to define your CSV file. Our CSV file is very common and we didn't need to use a classifier. We actually used the default classifier. It looks like our crawler has completed.
We have a data connection created and now what we'll do is create a job and the job will be Csv_to_MySql. Again, we'll use the same IAM role that we created prior. We'll keep the defaults. We'll have the script generated by AWS Glue, and Python. And now we can choose our data source. And now we can choose our data target. Now, in this case, we want to create the table on the connection that we created. And our database name is destinationdemo. Let's click on Next. Now you can see that the mapping is here. We can make some edits. We could add a column, could add an empty column if we wanted to. But everything is in place for us to get the data from the data source which is our CSV file, transform it and map it to our target database. Now once we click Save, then it generates the Python code and this is where we can make additional customizations. In fact, if we click on Transform up here, and then it gives us some boiler text for dropping fields in the Python language and it's not very much code. So let's go ahead and run our job.
While our job is running, let me go over crawlers one more time. The crawler that we created, created a database and a table and the table, when we open it up, shows the metadata or the columns that are in our source file and we can view proprieties which come in the form of JSON. We can also edit the schema, so if we want year to be an int, for example, we can change Bigint to an int data type. Update that and save. We can also add a column. We can also compare versions, so for example, if our crawler crawled two different sources, we can see how the schema has changed and compare those to metadata definitions between one and the other. So let's go back to our job. You can see that it's completed. And if we go to our MySQL Workbench which is a free download, and we go to the table, we can see the data that's been populated. Let's see how many rows. So we've loaded 1,783,308 rows. One other item I wanted to discuss is adding a trigger. Now, a trigger can schedule a job that we've created. It can run jobs in a specific order or on demand. So if I want to run one job right after the other, I would want to create a trigger for that. And there's my trigger and if I want to enable that trigger, I select Action and Enable trigger. And as I'd mentioned before, you can actually go in and create your own job following a small tutorial that AWS offers here. And that's the demo. I hope you enjoy using AWS Glue. Thank you.
About the Author
Move a metric, change products or behaviors...with data -that is what excites me. I am passionate about data and have worked to architect and develop data solutions using cloud and on-premise ETL and visualization tools. I am an evangelist for self-service data transformation, insights, and analytics. I love to be agile.
I extend my understanding to the community by giving presentations at Big Data Conferences, Code Camp, and other venues. I also write useful content in the form of white papers, two books on business intelligence, and blog posts.