ETL with AWS Glue Studio

This section of the AWS Certified Solutions Architect - Professional learning path introduces the AWS management and governance services relevant to the AWS Certified Solutions Architect - Professional exam. These services are used to help you audit, monitor, and evaluate your AWS infrastructure and resources and form a core component of resilient and performant architectures. 


Learning Objectives

  • Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage multi-account environments with AWS Organizations and Control Tower
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Learn about AWS data transformation tools such as AWS Glue, and data query and visualization services like Amazon Athena and Amazon QuickSight
  • Learn how AWS CloudFormation can be used to represent your infrastructure as code (IaC)
  • Understand SLAs in AWS

Hello and welcome to this lecture where I’ll be discussing AWS Glue Studio, which is one of the tools available in the AWS Glue ecosystem. AWS Glue Studio is where you create, submit and monitor your ETL jobs. 

With AWS Glue Studio, every ETL job consists of at least three things: 

  1. A data source. This could be the Data Catalog, or a service such as Amazon S3, Amazon Kinesis, Amazon Redshift, Amazon RDS, or Amazon DynamoDB, as well as other JDBC sources. 

  2. Then, you need a transformation script. Glue takes the data from your source and processes it according to the transformation script you write. You can write these scripts in either Python or Scala. 

  3. Lastly, you need a target. Glue exports the output to a target of your choice, such as the Data Catalog, Amazon S3, Amazon Redshift, or another JDBC source. 
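To make the three-part structure concrete, here is a minimal local sketch in plain Python. It is a stand-in for illustration only: the record data and field names are invented, and a real Glue job would use the awsglue/PySpark APIs against services like Amazon S3 or Redshift rather than in-memory lists.

```python
# Minimal local sketch of the three parts of an ETL job:
# a source, a transformation, and a target.

def extract():
    # 1. Source: in Glue this could be the Data Catalog, S3,
    #    Kinesis, Redshift, RDS, DynamoDB, or a JDBC source.
    #    Here, a hard-coded list of records stands in.
    return [
        {"id": 1, "name": "alice", "signup": "2023-01-04"},
        {"id": 2, "name": "bob", "signup": "2023-02-11"},
    ]

def transform(records):
    # 2. Transformation script: Glue runs your Python or Scala
    #    logic over the source data. Here we just uppercase names.
    return [{**r, "name": r["name"].upper()} for r in records]

def load(records, target):
    # 3. Target: Glue writes results to the Data Catalog, S3,
    #    Redshift, or a JDBC source; here, an in-memory list.
    target.extend(records)

output = []
load(transform(extract()), output)
print(output[0]["name"])  # ALICE
```

The same source → transform → target shape is what Glue Studio's visual graph represents, with each node in the graph corresponding to one of these stages.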

Let’s look at Glue Studio in the Console. Here I am in the Jobs dashboard of the service. If I want to create a job, you can see there are many options for doing so. However, they fall into one of two categories: I can either create a job programmatically, or I can use a visual interface. 

For example, if I select the Visual with a blank canvas option and click Create, I can then build graphical relationships between a source, transformation scripts, and a target destination.

Let’s build one quickly. I can use the Data Catalog as my source. For my transformation script, I’ll use a built-in transform called Rename Field, which renames a key in my dataset to another name. Then, I can output the transformation to an Amazon S3 bucket, and I can additionally choose whether or not to update my Data Catalog. 
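Conceptually, Rename Field maps one key to another across every record. A rough local sketch of that behavior follows; this is not the Glue API, and the field names are invented for the example:

```python
def rename_field(records, old_key, new_key):
    # Mimics what Glue Studio's built-in Rename Field transform does
    # conceptually: each record's old_key becomes new_key, and all
    # other keys pass through unchanged.
    renamed = []
    for record in records:
        out = {new_key if k == old_key else k: v for k, v in record.items()}
        renamed.append(out)
    return renamed

rows = [{"cust_id": 7, "city": "Austin"}, {"cust_id": 8, "city": "Boise"}]
print(rename_field(rows, "cust_id", "customer_id"))
# [{'customer_id': 7, 'city': 'Austin'}, {'customer_id': 8, 'city': 'Boise'}]
```

In the actual job, Glue applies this kind of per-record mapping at scale via the Apache Spark code it generates behind the scenes.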

While this is a pretty simple ETL job, you can create more complex relationships and graphs between services without coding at all, and Glue will generate the Apache Spark code for you behind the scenes. Note that if you want a true no-code tool for creating ETL jobs, this won’t really provide that, as the built-in transforms in Glue Studio are fairly limited; you only have about 10 options or so here. If you feel comfortable with coding, you can also create custom transformation scripts in this interface using Python or Scala.

However, there are better places to develop your own custom scripts. For example, if I click back, you can see the other options for creating scripts programmatically, such as the Spark script editor, the Python shell script editor, or the built-in Jupyter Notebook interface for writing Python or Scala job scripts.

That’s it for this one - see you next time. 


About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.