If you have data that needs to be analyzed, you will likely need to put it through an extract, transform and load (ETL) process. AWS Glue is a fully managed service designed to do just this. Through a series of simple configurable options, you can select source data for AWS Glue to process, turning it into cataloged, searchable and queryable data. This course takes you through the fundamentals of AWS Glue to get you started with the service.
The objectives of this course are to provide you with an understanding of:
- Serverless ETL
- The architecture of a typical ETL project
- The prerequisite setup of AWS components needed to use AWS Glue for ETL
- How to use AWS Glue to perform serverless ETL
- How to edit ETL processes created from AWS Glue
This course is ideal for:
- Data warehouse engineers that are looking to learn more about serverless ETL and AWS Glue
- Developers that want to learn more about ETL work using AWS Glue
- Developer leads that want to learn more about the serverless ETL process
- Project managers and owners that want to learn about data preparation
As a prerequisite to this course you should have familiarity with:
- One or more of the data storage destinations offered by AWS
- Data warehousing principles
- Serverless computing
- Object-oriented programming (Python)
We welcome all feedback and suggestions. Please contact us at firstname.lastname@example.org if you are unsure about where to start or if you would like help getting started.
In this series, I've introduced AWS Glue and compared it to traditional ETL tasks performed in a role such as a business intelligence engineer. I also detailed the most important parts of Glue used to extract, transform and load data from point A, our source data, to point B, a destination file or data repository. AWS Glue performs a lot of this work for you. Remember that the reward is prepared, consumable data for better analysis. The transformation is often the heaviest lifting: combining data from a variety of sources, cleaning it, mapping it correctly to the destination, forming it into a new schema, aggregating or disaggregating, and more.
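The three stages can be illustrated with a minimal sketch in plain Python. This is not AWS Glue code: the CSV contents, the `customer_totals` table, and the function names are all hypothetical, and an in-memory SQLite database stands in for the destination at point B.

```python
import csv
import io
import sqlite3

# Hypothetical source data (point A): a small CSV with untidy values.
SOURCE_CSV = """order_id,customer,amount
1, alice ,10.50
2,BOB,4.25
3,alice,
4,bob,5.75
"""


def extract(text):
    """Extract: read raw rows from the source."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: clean values, drop incomplete rows, aggregate per customer."""
    totals = {}
    for row in rows:
        amount = row["amount"].strip()
        if not amount:  # skip rows with a missing amount
            continue
        customer = row["customer"].strip().lower()  # normalize names
        totals[customer] = totals.get(customer, 0.0) + float(amount)
    # Map to the destination schema: (customer_name, total_amount)
    return sorted(totals.items())


def load(records, conn):
    """Load: write the prepared records to the destination (point B)."""
    conn.execute(
        "CREATE TABLE customer_totals (customer_name TEXT, total_amount REAL)"
    )
    conn.executemany("INSERT INTO customer_totals VALUES (?, ?)", records)
    conn.commit()


conn = sqlite3.connect(":memory:")  # stand-in for a real destination database
load(transform(extract(SOURCE_CSV)), conn)
result = conn.execute(
    "SELECT customer_name, total_amount FROM customer_totals "
    "ORDER BY customer_name"
).fetchall()
print(result)
```

Most of the logic sits in `transform`, which mirrors the point that transformation tends to be where the effort goes.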
I detailed the benefits of using AWS Glue. ETL code in AWS Glue runs serverless. AWS Glue has a crawler that infers schemas for source, working and destination data, and the crawler can run on a schedule to detect changes. AWS Glue also auto-generates ETL scripts in either Python or Scala as a starting point for customization.
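To give a feel for what inferring a schema means, here is a toy sketch in plain Python. It is an illustration only and does not reproduce how the AWS Glue crawler classifies data: the `infer_type` and `infer_schema` helpers, the type names, and the sample rows are assumptions made for this example.

```python
def infer_type(values):
    """Return the narrowest type name that fits every non-empty value."""
    non_empty = [v for v in values if v != ""]
    if not non_empty:
        return "string"
    # Try the narrowest type first, then widen.
    for type_name, caster in (("int", int), ("double", float)):
        try:
            for v in non_empty:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"


def infer_schema(rows):
    """Infer a column-name -> type-name mapping from rows of string values."""
    columns = rows[0].keys()
    return {col: infer_type([row[col] for row in rows]) for col in columns}


# Hypothetical sample rows, as they might be read from a CSV file.
sample_rows = [
    {"order_id": "1", "customer": "alice", "amount": "10.50"},
    {"order_id": "2", "customer": "bob", "amount": "4"},
    {"order_id": "3", "customer": "carol", "amount": ""},
]
schema = infer_schema(sample_rows)
print(schema)
```

Running the same inference again after the source changes is what lets a scheduled run pick up new or altered columns.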
I also provided a demonstration showing how to use AWS Glue to create a crawler that generates metadata for the source file, which facilitates our mapping from source to destination. I showed you how to create a simple job and, after running it, I showed you the transformed data inside a table in a MySQL database. Lastly, you should consider exploring AWS Glue further, and here are some good steps you can take to expand your ability. You can learn Python or Scala. Now that you know how to set up and create an AWS Glue job, I recommend that you learn Python so you can customize your ETL job scripts, which gives you more flexibility.
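The crawler step can also be scripted rather than set up by hand. The sketch below assumes the boto3 SDK and uses its Glue client calls `create_crawler` and `start_crawler`. The crawler name, IAM role ARN, database name, S3 path, and schedule are hypothetical placeholders, and the two helper functions are written for this example, not taken from the course.

```python
def build_crawler_request(name, role_arn, database, s3_path, schedule=None):
    """Build keyword arguments for the Glue create_crawler call."""
    request = {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,
        "Targets": {"S3Targets": [{"Path": s3_path}]},
    }
    if schedule:
        # Glue schedules use a cron expression wrapped in cron(...)
        request["Schedule"] = schedule
    return request


def create_and_start_crawler(request):
    """Send the request to AWS. Requires boto3, credentials, and a real IAM role."""
    import boto3  # imported lazily so the builder above runs without it

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# All values below are placeholders for illustration.
request = build_crawler_request(
    name="sales-source-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="sales_catalog",
    s3_path="s3://example-bucket/sales/",
    schedule="cron(0 2 * * ? *)",  # daily at 02:00 UTC
)
print(request)
```

Calling `create_and_start_crawler(request)` would only succeed in an account where the role and bucket actually exist, so it is left uncalled here.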
A repository in GitHub called AWS Glue ETL Code Samples, as shown with the link on screen, offers examples to help you create relationships between tables in your destination after loading, undo and redo the results of a crawl operation, and more. There are also some examples for Scala. You should practice with data that interests you, and you can find more examples of programmatic data manipulation. I recommend visiting kaggle.com, downloading a variety of data files, uploading them to Amazon S3, and then combining them and drawing correlations. Become familiar with Amazon QuickSight and other visualization tools, and remember that the analysis of data is the end goal after the work of transforming and loading is completed and likely automated.
Power BI Desktop is a powerful visualization tool that can be downloaded for free. Here is also an AWS Big Data blog post on 10 visualizations to try in Amazon QuickSight with sample data.
If you have any feedback on this course, positive or negative, please contact us by sending an email to email@example.com. Your feedback is greatly appreciated. Thank you for your time and good luck with your continued learning of cloud computing. Thank you.
About the Author
Move a metric, change products or behaviors... with data: that is what excites me. I am passionate about data and have worked to architect and develop data solutions using cloud and on-premise ETL and visualization tools. I am an evangelist for self-service data transformation, insights, and analytics. I love to be agile.
I extend my understanding to the community by giving presentations at Big Data Conferences, Code Camp, and other venues. I also write useful content in the form of white papers, two books on business intelligence, and blog posts.