If you have data that needs to be subjected to analytics, then you will likely need to put that data through an extract, transform and load (ETL) process. AWS Glue is a fully managed service designed to do just this. Through a series of simple configurable options, you can select your source data to be processed by AWS Glue, allowing you to turn it into cataloged, searchable and queryable data. This course will take you through the fundamentals of AWS Glue to get you started with this service.
The objectives of this course are to provide you with an understanding of:
- Serverless ETL
- The knowledge and architecture of a typical ETL project
- The prerequisite setup of AWS parts to use AWS Glue for ETL
- Knowledge of how to use AWS Glue to perform serverless ETL
- How to edit ETL processes created from AWS Glue
This course is ideal for:
- Data warehouse engineers that are looking to learn more about serverless ETL and AWS Glue
- Developers that want to learn more about ETL work using AWS Glue
- Developer leads that want to learn more about the serverless ETL process
- Project managers and owners that want to learn about data preparation
As a prerequisite to this course you should have familiarity with:
- One or more of the data storage destinations offered by AWS
- Data warehousing principles
- Serverless computing
- Object-orientated programming (Python)
We welcome all feedback and suggestions - please contact us at firstname.lastname@example.org if you are unsure about where to start or if you would like help getting started.
Hello and welcome to this lecture where I shall be discussing an overview of traditional ETL in comparison to AWS Glue.
Now, ETL stands for extract, transform, load. This is the common paradigm by which data from multiple systems is combined into a single database, data store or warehouse for legacy storage or analytics.
Extraction is the process of retrieving data from one or more sources: online, brick and mortar, legacy data, Salesforce data and many others. After retrieving the data, the ETL process does the compute work that loads it into a staging area and prepares it for the next phase. Transformation is the process of mapping, reformatting, conforming, adding meaning and more to prepare the data in a way that is more easily consumed. One example of this is a transformation and computation in which currency amounts are converted from US dollars to euros. Loading involves successfully inserting the transformed data into the target database, data store or data warehouse. All of this work is processed in what business intelligence developers call an ETL job.
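To make the three phases concrete, here is a minimal sketch of an ETL job in plain Python, built around the currency example above. It is not AWS Glue code. The sample CSV data, the `orders_eur` table name, the function names and the fixed exchange rate of 0.90 are all assumptions chosen for illustration, and an in-memory SQLite database stands in for the target data store.

```python
import csv
import io
import sqlite3

# Assumed, illustrative exchange rate; a real job would look the rate up.
USD_TO_EUR = 0.90

# Hypothetical source data standing in for an extracted file.
SOURCE_CSV = """order_id,amount_usd
1,100.00
2,250.50
3,19.99
"""


def extract(raw_text):
    """Extract: read rows from the source into a staging list."""
    return list(csv.DictReader(io.StringIO(raw_text)))


def transform(rows, rate=USD_TO_EUR):
    """Transform: cast column types and convert USD amounts to EUR."""
    return [
        (int(row["order_id"]), round(float(row["amount_usd"]) * rate, 2))
        for row in rows
    ]


def load(records, connection):
    """Load: insert the transformed records into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS orders_eur "
        "(order_id INTEGER PRIMARY KEY, amount_eur REAL)"
    )
    connection.executemany("INSERT INTO orders_eur VALUES (?, ?)", records)
    connection.commit()


def run_etl_job(raw_text, connection):
    """Run extract, transform and load in order; return the row count."""
    staged = extract(raw_text)
    load(transform(staged), connection)
    return len(staged)


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")  # stand-in for the target data store
    print(run_etl_job(SOURCE_CSV, conn), "rows loaded")
```

Keeping each phase in its own function mirrors the way the phases are described: the staging list sits between extraction and transformation, and only transformed records reach the target.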
The advent of cloud, SaaS, and big data has produced an explosion in the number of new data sources and streams. In addition, there is a larger variety of data sources, a larger volume of data, and the need to move data faster between systems around the world. As a result, there's been a spike in demand for more sophisticated data integration tools that can handle greater volume, velocity, and an increase in variety. Additionally, over time, techniques have evolved to help solve data integration issues and improve the speed of preparing data for decision support. Many of these improvements have been made to the on-premises tools, in addition to making them able to connect with cloud resources. On the other hand, ETL tools have also been developed in the cloud, which in some cases do a better job of handling big data concerns in addition to connecting to cloud resources.
Remember though that if we take a step back, we will acknowledge that the ETL work to prepare data is not the reward. The reward or value comes with the analysis and consumption of the prepared data, as shown in this chart. One of the improvements made to ETL technology is to provide real-time or near-real-time analytics. Another improvement is to simplify and reduce the cost of ETL work. This can be accomplished in part by providing or maintaining the infrastructure on which ETL jobs run. The name for ETL code that runs in the cloud with little or no knowledge of what is provisioned is serverless ETL.
Serverless ETL has the benefit of simplifying the ETL process, allowing data developers more time to focus on preparing data and building out data pipelines rather than on considerations for the on-premises software on which ETL jobs run. Another benefit is a more seamless exchange between heterogeneous data sources and data destinations that are already in the cloud. The following is a table that helps show the differences between serverless ETL and traditional ETL. The differences include the benefits of serverless code that runs on AWS services. AWS Glue provides a UI that allows you to build out the source and destination for the ETL job and auto-generates serverless code for you. The next lecture gives you a thorough review of AWS Glue. More and more, you will likely see source and destination tables reside in the cloud, and it makes sense that ETL as a service will become more popular.
The ETL process does not change very much between on-premises ETL, serverless ETL and ETL as a service. The following diagram shows the same ETL process as shown in the previous slide, but the ETL work and processes are performed in the cloud with AWS Glue. Note the following: we will continue to generate data; there is still the need to profile the source data and determine what to extract, collect and store; companies still want to correctly extract less structured data, transform it and prepare it for loading to a single location such as a data warehouse like Amazon Redshift; and the ultimate goal is to more easily and effectively analyze the data for better decision support.
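As a rough illustration of what profiling the source data can mean in practice, the sketch below counts missing and distinct values per column before deciding what to extract. It is plain Python rather than AWS Glue code, and the sample rows, column names and the `profile` function are hypothetical.

```python
from collections import Counter

# Hypothetical sample of source rows; None marks a missing value.
SAMPLE_ROWS = [
    {"customer": "acme", "country": "US", "amount": "100.00"},
    {"customer": "globex", "country": None, "amount": "250.50"},
    {"customer": "acme", "country": "DE", "amount": None},
]


def profile(rows):
    """Return per-column counts of rows, missing values and distinct values."""
    columns = sorted({key for row in rows for key in row})
    report = {}
    for column in columns:
        values = [row.get(column) for row in rows]
        # Treat None and empty strings as missing.
        present = [value for value in values if value not in (None, "")]
        report[column] = {
            "rows": len(values),
            "missing": len(values) - len(present),
            "distinct": len(set(present)),
            "most_common": Counter(present).most_common(1),
        }
    return report


if __name__ == "__main__":
    for column, stats in profile(SAMPLE_ROWS).items():
        print(column, stats)
```

A report like this helps show which columns are complete enough to extract as-is and which need cleaning during transformation.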
About the Author
Move a metric, change products or behaviors... with data. That is what excites me. I am passionate about data and have worked to architect and develop data solutions using cloud and on-premise ETL and visualization tools. I am an evangelist for self-service data transformation, insights, and analytics. I love to be agile.
I extend my understanding to the community by giving presentations at Big Data Conferences, Code Camp, and other venues. I also write useful content in the form of white papers, two books on business intelligence, and blog posts.