AWS Glue Data Catalog Primer

This section covers the AWS management services relevant to the Solutions Architect Associate exam. You use these services to audit, monitor, and evaluate your AWS infrastructure and resources. They form a core component of running resilient and performant architectures.

Learning Objectives

  • Understand the benefits of using Amazon CloudWatch and audit logs to manage your infrastructure
  • Learn how to record and track API requests using AWS CloudTrail
  • Learn what AWS Config is and its components
  • Manage your accounts with AWS Organizations, including single sign-on with AWS SSO (now named AWS IAM Identity Center)
  • Learn how to carry out logging with CloudWatch, CloudTrail, CloudFront, and VPC Flow Logs
  • Understand how to design cost-optimized architectures in AWS
  • Learn about AWS data transformation tools such as AWS Glue, query services like Amazon Athena, and data visualization services like Amazon QuickSight

Historically, AWS Glue was only an extract, transform, and load (ETL) service. Since then, it has turned into a suite of data integration tools. Now, AWS Glue is made up of four different services:

  1. Glue Data Catalog
  2. Glue Studio
  3. Glue DataBrew
  4. Glue Elastic Views

Glue Elastic Views is out of scope for this content, so I won’t be talking about it in this lecture. If you’re interested in Glue Elastic Views, I will link a course specifically for that topic.

In this lecture, I’ll mainly focus on the Glue Data Catalog.

AWS defines the Glue Data Catalog as a central metadata repository. This means that it stores data about your data. This includes information like data format, data location, and schema. Here’s how it works: 

You upload your data to storage like Amazon S3, or to a database like Amazon DynamoDB, Amazon Redshift, or Amazon RDS. From there, you can use a Glue Crawler to connect to your data source, parse through your data, and infer the column names and data types. The Crawler does this by using Classifiers, which actually read the data from your storage. To identify your schema, you can use built-in Classifiers or custom Classifiers that you write yourself.
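To make the Crawler step more concrete, here is a minimal sketch of how you might register and start a Crawler against an S3 path from Python. It assumes the boto3 Glue client, whose `create_crawler` and `start_crawler` calls take the parameters shown. The helper names, crawler name, role ARN, database name, and bucket path are all hypothetical placeholders, and the client is passed in as an argument so the sketch itself has no AWS dependency.

```python
from typing import Any, Dict


def build_crawler_request(name: str, role_arn: str, database: str, s3_path: str) -> Dict[str, Any]:
    """Build the keyword arguments for the Glue CreateCrawler call (hypothetical helper)."""
    return {
        "Name": name,
        "Role": role_arn,          # IAM role the Crawler assumes to read the data
        "DatabaseName": database,  # catalog database that will receive the tables
        "Targets": {"S3Targets": [{"Path": s3_path}]},  # where the source data lives
    }


def create_and_start_crawler(glue: Any, request: Dict[str, Any]) -> str:
    """Register the Crawler, start one run, and return the Crawler name.

    `glue` is assumed to behave like boto3.client("glue").
    """
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])
    return request["Name"]


# All values below are made-up examples, not real resources.
example_request = build_crawler_request(
    name="sales-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueCrawlerRole",
    database="sales_db",
    s3_path="s3://example-bucket/sales/",
)
```

With AWS credentials configured, you would pass `boto3.client("glue")` as the `glue` argument. The Crawler run is asynchronous, so the catalog tables appear only after the run finishes.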

Once the Crawler infers the schema, it creates a new catalog table with information about the schema, the metadata, and where the source data is stored. You can have many tables filled with schema data from multiple sources. These tables are housed in what’s called a database.
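As a rough illustration of the database and table relationship, the sketch below lists every table in one catalog database along with its inferred columns. It assumes a client that behaves like the boto3 Glue client, whose `get_tables` call returns a `TableList` and, when there are more results, a `NextToken`. The function name and the database name in the usage note are hypothetical.

```python
from typing import Any, Dict, List, Tuple


def list_table_schemas(glue: Any, database: str) -> Dict[str, List[Tuple[str, str]]]:
    """Map each table name in a catalog database to its (column name, type) pairs.

    `glue` is assumed to behave like boto3.client("glue").
    """
    schemas: Dict[str, List[Tuple[str, str]]] = {}
    kwargs: Dict[str, Any] = {"DatabaseName": database}
    while True:
        page = glue.get_tables(**kwargs)
        for table in page.get("TableList", []):
            # The schema lives under the table's StorageDescriptor.
            columns = table.get("StorageDescriptor", {}).get("Columns", [])
            schemas[table["Name"]] = [(c["Name"], c.get("Type", "")) for c in columns]
        token = page.get("NextToken")
        if not token:
            return schemas
        kwargs["NextToken"] = token  # request the next page of tables
```

For example, `list_table_schemas(boto3.client("glue"), "sales_db")` would return the schema of each table in a database named `sales_db`, without reading any of the underlying data.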

Note that your data still lives in the location where you originally uploaded it, but now you also have a representation of the schema and metadata for that data in the catalog tables. This means your code doesn’t necessarily need to know where the data is stored and can reference the Data Catalog for this information instead.
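Here is a minimal sketch of that idea: code that knows only a database name and a table name, and asks the Data Catalog where the data lives. It assumes a client that behaves like the boto3 Glue client and its `get_table` call. The function name is hypothetical, and the `classification` parameter is a key that Crawlers typically set on the tables they create, so treat its presence as an assumption.

```python
from typing import Any, Dict, Optional


def describe_data_source(glue: Any, database: str, table: str) -> Dict[str, Optional[str]]:
    """Look up where a table's data is stored, using only catalog names.

    `glue` is assumed to behave like boto3.client("glue").
    """
    response = glue.get_table(DatabaseName=database, Name=table)
    definition = response["Table"]
    return {
        # Storage location recorded in the catalog, for example an S3 path.
        "location": definition["StorageDescriptor"]["Location"],
        # Data format label; may be absent, in which case this is None.
        "classification": definition.get("Parameters", {}).get("classification"),
    }
```

If the data later moves to a different bucket, only the catalog entry needs to change. Callers of `describe_data_source` keep using the same database and table names.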

That’s it for this one. See you soon! 

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to cloud, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016, Stuart was awarded the ‘Expert of the Year Award 2015’ by Experts Exchange for sharing his knowledge of cloud services with the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.