In this course, we will compare Amazon EMR and AWS Glue and cover ways to make ETL processes more automated and repeatable.
Learning Objectives
- What AWS Glue is and how it works
- How AWS Glue compares to Amazon EMR
- How to make ETL processes more automated and repeatable using orchestration services such as AWS Data Pipeline, AWS Glue Workflows, and AWS Step Functions
Intended Audience
- Those who are implementing and managing ETL on AWS
- Those who are looking to take an AWS certification, specifically the AWS Certified Solutions Architect – Associate or the AWS Certified Data Analytics – Specialty certification
Prerequisites
In this course, I will provide introductory information on AWS Glue. However, to get the most from this course, you should already have an understanding of Amazon EMR and Amazon EC2. For more information on these services, please see our existing content covering them.
Hello and welcome to this lecture, where I'll be discussing AWS Glue Studio, one of the tools in the AWS Glue ecosystem. AWS Glue Studio is where you create, submit, and monitor your ETL jobs.
With AWS Glue Studio, every ETL job consists of at least three things, which I'll sketch in code after this list:
- A data source. This could be the Data Catalog, or a service like Amazon S3, Amazon Kinesis, Amazon Redshift, Amazon RDS, Amazon DynamoDB, or another JDBC source.
- A transformation script. Glue takes the data from your source and processes it according to the transformation script you write. You can write these in either Python or Scala.
- A target. Glue exports the output to a target of your choice, such as the Data Catalog, Amazon S3, Amazon Redshift, or a JDBC source.
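To make those three pieces concrete, here is a minimal sketch of what a Glue ETL job script can look like in Python. The database, table, column, and bucket names are all hypothetical; the job reads from the Data Catalog, remaps a couple of columns, and writes Parquet to Amazon S3:

```python
import sys

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.transforms import ApplyMapping
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext

# Standard Glue job boilerplate: resolve arguments and initialize the job
args = getResolvedOptions(sys.argv, ["JOB_NAME"])
glue_context = GlueContext(SparkContext.getOrCreate())
job = Job(glue_context)
job.init(args["JOB_NAME"], args)

# 1. Data source: a table registered in the Data Catalog
#    ("sales_db" and "orders" are hypothetical names)
source = glue_context.create_dynamic_frame.from_catalog(
    database="sales_db", table_name="orders"
)

# 2. Transformation: remap and retype columns with ApplyMapping
mapped = ApplyMapping.apply(
    frame=source,
    mappings=[
        ("order_id", "string", "order_id", "string"),
        ("total", "double", "order_total", "double"),
    ],
)

# 3. Target: write the result to Amazon S3 as Parquet
glue_context.write_dynamic_frame.from_options(
    frame=mapped,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/output/"},
    format="parquet",
)

job.commit()
```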
Let’s look at Glue Studio in the Console. Here I am on the Jobs dashboard of the service. If I want to create a job, you can see there are many options for doing so, but they fall into one of two categories: I can either create a job programmatically or I can use a visual interface.
For example, if I click the "Visual with a blank canvas" option and click Create, I can then create graphical relationships between a source, transformation scripts, and a target destination.
Let’s build one quickly. I can use the Data Catalog as my source. For my transformation script, I’ll use a built-in transform called Rename Field, which renames a key in my dataset to another name. Then, I can output the transformation to an Amazon S3 bucket, and I can additionally choose whether or not to update my Data Catalog.
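The Rename Field step in the visual editor maps to Glue's built-in RenameField transform. As a rough sketch of what the generated script boils down to, with hypothetical column and bucket names, and with `source` and `glue_context` as in the earlier sketch:

```python
from awsglue.transforms import RenameField

# Rename the "cust_id" key to "customer_id" in the DynamicFrame
# read from the Data Catalog ("source" from the earlier sketch)
renamed = RenameField.apply(
    frame=source, old_name="cust_id", new_name="customer_id"
)

# Output the renamed data to an S3 bucket, as in the visual job
glue_context.write_dynamic_frame.from_options(
    frame=renamed,
    connection_type="s3",
    connection_options={"path": "s3://my-example-bucket/renamed/"},
    format="json",
)
```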
While this is a pretty simple ETL job, you can create more complex relationships and graphs between services without any coding, and Glue will generate the Apache Spark code for you behind the scenes. Note that if you want a true no-code tool for creating ETL jobs, this won’t really provide that, as the built-in transforms in Glue Studio are fairly limited; you only have about ten options here. If you’re comfortable with coding, you can also create custom transformation scripts in this interface using Python or Scala.
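As a sketch of what such a custom transform looks like: Glue Studio's custom transform node expects a function that receives a DynamicFrameCollection and returns one. The function name, column name, and filter logic below are all hypothetical:

```python
from awsglue.dynamicframe import DynamicFrame, DynamicFrameCollection

# Glue Studio passes the node's inputs in as a DynamicFrameCollection
# and expects one back; the logic in between is up to you.
def MyTransform(glueContext, dfc) -> DynamicFrameCollection:
    # Grab the single upstream frame and convert it to a Spark DataFrame
    df = dfc.select(list(dfc.keys())[0]).toDF()

    # Hypothetical custom logic: keep only rows with a positive total
    filtered = df.filter(df["order_total"] > 0)

    # Wrap the result back up so downstream nodes can consume it
    result = DynamicFrame.fromDF(filtered, glueContext, "filtered")
    return DynamicFrameCollection({"filtered": result}, glueContext)
```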
However, there are better places to develop your own custom scripts. For example, if I click back, you can see the other options for creating scripts programmatically, such as the Spark script editor, the Python shell script editor, or the built-in Jupyter Notebook interface for creating Python or Scala job scripts.
That’s it for this one - see you next time.
Alana Layton is an experienced technical trainer, technical content developer, and cloud engineer living out of Seattle, Washington. Her career has included teaching about AWS all over the world, creating AWS content that is fun, and working in consulting. She currently holds six AWS certifications. Outside of Cloud Academy, you can find her testing her knowledge in bar trivia, reading, or training for a marathon.