AWS Glue


Distributed Machine Learning Concepts

This training course begins with an introduction to the concepts of Distributed Machine Learning. We'll discuss why and when you should consider training your machine learning model within a distributed environment.

Apache Spark

We’ll introduce you to Apache Spark and how it can be used to perform machine learning both at scale and speed. Apache Spark is an open-source cluster-computing framework.

Amazon Elastic MapReduce

We’ll introduce you to Amazon’s Elastic MapReduce service, or EMR for short. EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data. EMR can be easily configured to host Apache Spark.

Spark MLlib

We’ll introduce you to MLlib which is Spark’s machine learning module. We’ll discuss how MLlib can be used to perform various machine learning tasks. For this course, we'll focus our attention on decision trees as a machine learning method which the MLlib module supports. A decision tree is a type of supervised machine learning algorithm used often for classification problems.

AWS Glue

We’ll introduce you to AWS Glue. AWS Glue is a fully managed extract, transform, and load service, ETL for short. We’ll show you how AWS Glue can be used to prepare our datasets before they are used to train our machine learning models.


Finally, we’ll show you how to use each of these services together to launch an EMR cluster preinstalled and configured with Apache Spark, and then use it to train a machine learning model using a decision tree. This demonstration provides an end-to-end solution with machine learning predictive capabilities.

Intended Audience

The intended audience for this course includes:

  • Data scientists and/or data analysts
  • Anyone interested in learning and performing distributed machine learning, or machine learning at scale
  • Anyone with an interest in Apache Spark and/or Amazon Elastic MapReduce

Learning Objectives

By completing this course, you will: 

  • Understand what Distributed machine learning is and what it offers
  • Understand the benefits of Apache Spark and Elastic MapReduce
  • Understand Spark MLlib as a machine learning framework
  • Create your own distributed machine learning environment consisting of Apache Spark, MLlib, and Elastic MapReduce
  • Understand how to use AWS Glue to perform ETL on your datasets in preparation for training your machine learning model
  • Know how to operate and execute a Zeppelin notebook, resulting in job submission to your Spark cluster
  • Understand what a machine learning Decision Tree is and how to code one using MLlib


The following prerequisites will be helpful for this course:

  • A background in statistics or probability
  • Basic understanding of data analytics
  • General development and coding experience
  • AWS VPC networking and IAM security experience (for the demonstrations)

Course Agenda

The agenda for the remainder of this course is as follows:

  • We’ll discuss what Distributed Machine Learning is and when and why you might consider using it
  • We’ll review the Apache Spark application, and its MLlib machine learning module
  • We’ll review the Elastic MapReduce service
  • We’ll explain what a Decision Tree is - and what types of analytical problems it is suited to
  • We’ll review the basics of using Apache Zeppelin notebooks - which can be used for interactive machine learning sessions
  • We’ll review AWS Glue. We’ll show you how you can use AWS Glue to perform ETL to prepare our datasets for ingestion into a machine learning pipeline.
  • Finally, we’ll present a demonstration of a fully functional distributed machine learning environment implemented using Spark running on top of an EMR cluster


If you have thoughts or suggestions for this course, please contact Cloud Academy at


- [Instructor] Welcome back. In this lecture we will introduce you to the AWS Glue service. AWS Glue is a fully managed extract, transform, and load (ETL) tool that makes it easy for you to prepare your machine learning data, among other uses. Machine learning often requires you to collect and prepare data before it is used to train a machine learning model.

AWS Glue is a fully managed and serverless service. Interestingly, AWS Glue itself is built on top of Apache Spark. Apache Spark provides the underlying engine, which partitions data across multiple nodes to achieve high throughput. The Glue service is composed of three main components:

  • Data catalog. The data catalog is a central metadata repository. The metadata acquired when you use a crawler is centralized and persisted into the data catalog.
  • ETL engine. The ETL engine is used to perform ETL operations on the data sets discovered and registered within the data catalog. The ETL engine will automatically generate Python PySpark code based on the supplied configuration of source and destination data sets. Alternatively, you can handcraft your own ETL scripts from scratch.
  • Job scheduler. The job scheduler is used to trigger your ETL jobs based on supplied configuration. The job scheduler also performs additional tasks, such as job monitoring and retries.

The first thing you would generally do with AWS Glue is to crawl your data sources. AWS Glue can be configured to crawl data sets stored in S3 or in databases via JDBC connections. After the crawler is set up and activated, AWS Glue performs a crawl and derives a data schema, storing this and other associated metadata in the AWS Glue data catalog.
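
As a rough illustration of the crawler setup just described, the request for boto3's Glue `create_crawler` call might be assembled as follows. This is a minimal sketch, not the configuration used in the course: the crawler name, IAM role ARN, database name, and S3 path are hypothetical placeholders, and the function that actually calls AWS is kept separate so the request can be inspected without credentials.

```python
def build_crawler_request(name, role_arn, database, s3_path=None,
                          jdbc_connection=None, jdbc_path=None):
    """Build keyword arguments for the Glue create_crawler API call."""
    targets = {}
    if s3_path:
        # Crawl objects stored under an S3 prefix.
        targets["S3Targets"] = [{"Path": s3_path}]
    if jdbc_connection:
        # Crawl a database through a Glue connection that wraps a JDBC URL.
        jdbc_target = {"ConnectionName": jdbc_connection}
        if jdbc_path:
            jdbc_target["Path"] = jdbc_path
        targets["JdbcTargets"] = [jdbc_target]
    if not targets:
        raise ValueError("a crawler needs at least one S3 or JDBC target")
    return {
        "Name": name,
        "Role": role_arn,
        "DatabaseName": database,  # data catalog database that receives the tables
        "Targets": targets,
    }


def create_and_start_crawler(request):
    """Send the request to AWS (needs boto3 and valid credentials)."""
    import boto3  # imported lazily so the sketch loads without boto3 installed

    glue = boto3.client("glue")
    glue.create_crawler(**request)
    glue.start_crawler(Name=request["Name"])


# Hypothetical values, for illustration only.
request = build_crawler_request(
    name="example-crawler",
    role_arn="arn:aws:iam::123456789012:role/ExampleGlueRole",
    database="example_db",
    s3_path="s3://example-bucket/raw/",
)
```

Once the crawl completes, the discovered tables and their schemas appear in the named data catalog database.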

AWS Glue provides an ETL tool that allows you to create and configure ETL jobs. Using the AWS Glue service console, you can simply specify input and output tables registered in the data catalog. Next, you specify the mappings between the input and output table schemas.
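
In Glue scripts, schema mappings of this kind are usually written as tuples of (source column, source type, target column, target type), which is the form the `ApplyMapping` transform accepts. The sketch below imitates that idea in plain Python on a list of dictionaries so that it can run without Glue. The column names and the `apply_mappings` helper are hypothetical, and the real transform operates on a DynamicFrame and supports many more types.

```python
# Hypothetical mappings: (source column, source type, target column, target type).
MAPPINGS = [
    ("customer_id", "string", "id", "int"),
    ("amount", "string", "amount", "double"),
    ("churned", "string", "label", "int"),
]

# Python stand-ins for a few of the target types.
CASTS = {"int": int, "long": int, "double": float, "string": str}


def apply_mappings(records, mappings):
    """Rename and cast the mapped columns; unmapped columns are dropped."""
    output = []
    for record in records:
        row = {}
        for source, _source_type, target, target_type in mappings:
            if source in record:
                row[target] = CASTS[target_type](record[source])
        output.append(row)
    return output


raw = [{"customer_id": "7", "amount": "71.28", "churned": "1", "notes": "n/a"}]
rows = apply_mappings(raw, MAPPINGS)
```

Here the `notes` column has no mapping, so it does not appear in the output rows.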

AWS Glue will then auto-generate an ETL script using PySpark. PySpark is the Spark Python API. The screenshot here displays an example Glue ETL job. The PySpark script on the right-hand side has been auto-generated based on the initial user-provided configuration. At this stage, you are free to update and refine the specifics of the script.
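
For readers without the screenshot, a Glue PySpark job typically follows the shape sketched below: read a table from the data catalog, apply the mappings, and write the result out. This is an approximation rather than the script shown in the lecture. The database, table, column, and bucket names are hypothetical, and the steps are wrapped in a function with the imports inside it only so that the sketch can be loaded outside the Glue runtime, where the `awsglue` modules are not available. A generated script runs these statements at the top level.

```python
def run_job(argv):
    """Read a catalog table, remap its columns, and write Parquet to S3."""
    # These modules are provided by the AWS Glue runtime.
    from awsglue.context import GlueContext
    from awsglue.job import Job
    from awsglue.transforms import ApplyMapping
    from awsglue.utils import getResolvedOptions
    from pyspark.context import SparkContext

    args = getResolvedOptions(argv, ["JOB_NAME"])
    glue_context = GlueContext(SparkContext.getOrCreate())
    job = Job(glue_context)
    job.init(args["JOB_NAME"], args)

    # Source: a table that a crawler registered in the data catalog.
    source = glue_context.create_dynamic_frame.from_catalog(
        database="example_db",
        table_name="raw_events",
        transformation_ctx="source",
    )

    # Transform: rename and cast columns to match the output schema.
    mapped = ApplyMapping.apply(
        frame=source,
        mappings=[
            ("customer_id", "string", "id", "int"),
            ("amount", "string", "amount", "double"),
        ],
        transformation_ctx="mapped",
    )

    # Sink: write the prepared data set to S3 as Parquet.
    glue_context.write_dynamic_frame.from_options(
        frame=mapped,
        connection_type="s3",
        connection_options={"path": "s3://example-bucket/prepared/"},
        format="parquet",
        transformation_ctx="sink",
    )
    job.commit()


# Inside a Glue job this would be invoked as: run_job(sys.argv)
```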

Later on in our demo, we'll use this to prepare our data before it is used to train our decision tree machine learning model. The last thing you tend to do within AWS Glue is to configure the scheduling of your ETL jobs. You can use the scheduler to trigger jobs to run on a schedule, on completion of another ETL job, or on demand.
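
The three ways of triggering a job correspond to the `SCHEDULED`, `CONDITIONAL`, and `ON_DEMAND` trigger types in the Glue API. The sketch below builds the request for boto3's `create_trigger` call for each type without contacting AWS. The trigger names, job names, and cron expression are hypothetical examples.

```python
def scheduled_trigger(name, job_name, cron):
    """Run job_name on a cron schedule, e.g. '0 2 * * ? *' for 02:00 UTC daily."""
    return {
        "Name": name,
        "Type": "SCHEDULED",
        "Schedule": f"cron({cron})",
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def conditional_trigger(name, job_name, upstream_job):
    """Run job_name after upstream_job finishes successfully."""
    return {
        "Name": name,
        "Type": "CONDITIONAL",
        "Predicate": {
            "Conditions": [
                {
                    "LogicalOperator": "EQUALS",
                    "JobName": upstream_job,
                    "State": "SUCCEEDED",
                }
            ]
        },
        "Actions": [{"JobName": job_name}],
        "StartOnCreation": True,
    }


def on_demand_trigger(name, job_name):
    """Run job_name only when the trigger is started explicitly."""
    return {"Name": name, "Type": "ON_DEMAND", "Actions": [{"JobName": job_name}]}


triggers = [
    scheduled_trigger("nightly-prepare", "prepare-dataset", "0 2 * * ? *"),
    conditional_trigger("after-prepare", "export-dataset", "prepare-dataset"),
    on_demand_trigger("manual-prepare", "prepare-dataset"),
]
# Each dictionary could then be passed as boto3.client("glue").create_trigger(**trigger).
```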

That concludes this lecture on AWS Glue. In the next lecture, we'll provide an interim demonstration of a distributed machine learning architecture that integrates all of the services we've reviewed. Go ahead and close this lecture and we'll see you shortly in the next one.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, and Kubernetes (CKA, CKAD, CKS).