Apache Spark


Distributed Machine Learning Concepts

This training course begins with an introduction to the concepts of Distributed Machine Learning. We'll discuss why and when you should consider training your machine learning model within a distributed environment.

Apache Spark

We’ll introduce you to Apache Spark and how it can be used to perform machine learning both at scale and speed. Apache Spark is an open-source cluster-computing framework.

Amazon Elastic MapReduce

We’ll introduce you to Amazon’s Elastic MapReduce service, or EMR for short. EMR provides a managed Hadoop framework that makes it easy, fast, and cost-effective to process vast amounts of data. EMR can be easily configured to host Apache Spark.

Spark MLlib

We’ll introduce you to MLlib, Spark’s machine learning module. We’ll discuss how MLlib can be used to perform various machine learning tasks. For this course, we'll focus our attention on decision trees, a machine learning method that the MLlib module supports. A decision tree is a type of supervised machine learning algorithm often used for classification problems.

AWS Glue

We’ll introduce you to AWS Glue. AWS Glue is a fully managed extract, transform, and load service, ETL for short. We’ll show you how AWS Glue can be used to prepare our datasets before they are used to train our machine learning models.


Finally, we’ll show you how to use each of the aforementioned services together to launch an EMR cluster configured and pre-installed with Apache Spark for the purpose of training a machine learning model using a decision tree. This demonstration will provide an end-to-end solution that provides machine learning predictive capabilities.

Intended Audience

The intended audience for this course includes:

  • Data scientists and/or data analysts
  • Anyone interested in learning and performing distributed machine learning, or machine learning at scale
  • Anyone with an interest in Apache Spark and/or Amazon Elastic MapReduce

Learning Objectives

By completing this course, you will: 

  • Understand what Distributed machine learning is and what it offers
  • Understand the benefits of Apache Spark and Elastic MapReduce
  • Understand Spark MLlib as a machine learning framework
  • Create your own distributed machine learning environment consisting of Apache Spark, MLlib, and Elastic MapReduce
  • Understand how to use AWS Glue to perform ETL on your datasets in preparation for training your machine learning model
  • Know how to operate and execute a Zeppelin notebook, resulting in job submission to your Spark cluster
  • Understand what a machine learning Decision Tree is and how to code one using MLlib


The following prerequisites will be useful for this course:

  • A background in statistics or probability
  • Basic understanding of data analytics
  • General development and coding experience
  • AWS VPC networking and IAM security experience (for the demonstrations)

Course Agenda

The agenda for the remainder of this course is as follows:

  • We’ll discuss what Distributed Machine Learning is and when and why you might consider using it
  • We’ll review the Apache Spark application, and its MLlib machine learning module
  • We’ll review the Elastic MapReduce service
  • We’ll provide an understanding of what a Decision Tree is - and what types of analytical problems it is suited to
  • We’ll review the basics of using Apache Zeppelin notebooks - which can be used for interactive machine learning sessions
  • We’ll review AWS Glue. We’ll show you how you can use AWS Glue to perform ETL to prepare our datasets for ingestion into a machine learning pipeline.
  • Finally, we’ll present a demonstration of a fully functional distributed machine learning environment implemented using Spark running on top of an EMR cluster


If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.


Welcome back. In this lecture, we'll introduce you to Apache Spark. Apache Spark is an open-source cluster computing framework used to provide lightning-fast distributed computation. Apache Spark can be used to provision a cluster of machines configured as a general-purpose distributed computing engine capable of processing very large amounts of data.

When using Spark, your datasets are partitioned and spread across the Spark cluster, allowing the cluster to process the data in parallel. Apache Spark is designed to process data in memory, and this alone really differentiates it from alternative distributed computing platforms. By storing and processing data in memory, a huge performance boost is gained. Apache Spark itself is built in Scala.
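To make the idea of partitioned, parallel processing concrete, here is a minimal sketch in plain Python. It is a toy model rather than Spark code: the function names are hypothetical, and a local thread pool merely stands in for the machines of a cluster.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(records, num_partitions):
    """Split records into num_partitions slices, assigned round-robin."""
    return [records[i::num_partitions] for i in range(num_partitions)]


def sum_of_squares(part):
    """Work applied independently to a single partition."""
    return sum(x * x for x in part)


def parallel_sum_of_squares(records, num_partitions=4):
    parts = partition(records, num_partitions)
    # Each partition is processed on its own; the threads are an
    # assumption standing in for the nodes of a cluster.
    with ThreadPoolExecutor(max_workers=num_partitions) as pool:
        partials = list(pool.map(sum_of_squares, parts))
    # Combine the per-partition results into the final answer.
    return sum(partials)


if __name__ == "__main__":
    print(parallel_sum_of_squares(list(range(1, 11))))
```

The point of the sketch is the shape of the computation: split the data, apply the same work to every slice independently, then combine the partial results.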

Scala is a type-safe JVM language that incorporates both object-oriented and functional programming into an extremely concise, logical, and extraordinarily powerful language. As we'll see later in this course, Apache Spark can be easily and quickly launched on top of an Amazon EMR cluster through some simple configuration options. With this point in mind, you can start to see why EMR and Spark provide an excellent platform for processing large datasets. Central to the Apache Spark technology stack is the Spark core.

The Spark core is the foundation layer, and provides distributed task dispatching, scheduling, and basic I/O functionality. The Spark core is exposed through an API, as can be seen in this diagram. Interfaces to the API are available in Scala, Python, Java, and R. The Spark platform comes with built-in modules for SQL, streaming, machine learning, and graphs. In this course, we'll focus on the MLlib module that provides us with our distributed machine learning capabilities.

The demo that we give later in this course will involve us using Scala and MLlib to train a decision tree model. Apache Spark is built using a master/slave-style architecture. The master node acts as a central controller and coordinator. This slide illustrates the main software components within the Spark master/slave architecture.

When we launch an EMR cluster with Spark enabled, AWS takes care of the installation of Spark. This is the great thing about running Spark on top of EMR. You get the installation and deployment of Spark done automatically for you. Launching a working Spark environment, hosted in AWS on an EMR cluster, is both simple and quick, allowing you to focus your energy on the data science itself.

Regardless of this, we'll quickly go over each of the main components, to ensure that you have a basic understanding of the deployed Spark topology. The driver is the process where the main method runs. First, it converts your user program into tasks, and after that it schedules the tasks on the executors. The driver program communicates with the cluster manager, to distribute tasks to the executors. The cluster manager, as the name indicates, manages the cluster.

Spark has the ability to work with a multitude of cluster managers, including YARN, Mesos, and a standalone cluster manager. By default, EMR will launch Spark with YARN, as the cluster manager of choice. Worker nodes are the machines where the actual work is performed. Worker nodes host Spark executor processes. An EMR core node is the equivalent of a Spark worker node.

Executors are the individual JVM processes that are launched within a worker. An executor process has the responsibility of running the individual tasks assigned to it for a specific Spark job. Executors are launched at the beginning of a Spark application, and typically run for the entire lifetime of that application. Once a task is completed, the executor sends the results back to the driver.

Executors also provide in-memory caching of local data. A task is a unit of work that the driver program sends to the executor JVM process to be launched. A SparkContext is the entry point to the Spark core for any work you wish to submit to the Spark cluster. It provides you with an access point into the Spark execution environment, and acts as the controller of your Spark application.
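The relationship between the driver, tasks, and executors can be sketched as a small toy model. This is plain Python, not the Spark API: the `Task`, `Executor`, and `Driver` classes are hypothetical, and the round-robin scheduling is an assumption chosen for simplicity. A real cluster manager such as YARN makes far more involved placement decisions.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Task:
    """A unit of work: one function applied to one partition of data."""
    task_id: int
    func: Callable[[List[int]], int]
    data: List[int]


@dataclass
class Executor:
    """Runs tasks and keeps a local cache of the data it has seen."""
    name: str
    cache: Dict[int, List[int]] = field(default_factory=dict)

    def run(self, task: Task) -> int:
        # Keep a local copy of the data, mimicking executor-side caching.
        self.cache[task.task_id] = task.data
        return task.func(task.data)


class Driver:
    """Turns a job into tasks and hands them to executors."""

    def __init__(self, executors: List[Executor]):
        self.executors = executors

    def run_job(self, func, partitions):
        # 1. Convert the user program into one task per partition.
        tasks = [Task(i, func, p) for i, p in enumerate(partitions)]
        # 2. Assign tasks round-robin (an assumption) and collect the
        #    results that each executor sends back.
        results = []
        for task in tasks:
            executor = self.executors[task.task_id % len(self.executors)]
            results.append(executor.run(task))
        return results
```

For example, `Driver([Executor("exec-1"), Executor("exec-2")]).run_job(sum, [[1, 2], [3, 4], [5]])` creates three tasks, spreads them over the two executors, and returns the three partial sums to the caller.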

Apache Spark has several different but related data abstractions. Let's briefly cover off each of the different abstractions, and what their key benefits are. An RDD, or Resilient Distributed Dataset, is the original data abstraction that Spark used, and still uses. An RDD provides a fault tolerant collection of items, that can be computed on in parallel. It is Spark's representation of a dataset, partitioned over a cluster of machines.

An API is provided, which enables you to manipulate it. RDDs can be created from various data formats and sources, such as CSV files, JSON files, and/or databases via JDBC connections. DataFrames were introduced next. DataFrames organize data into named columns, similar to a database table. A key objective of DataFrames is to make dataset processing easier and more accessible to general users.
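To illustrate what "named columns, similar to a database table" buys you, here is a small plain-Python sketch. It does not use Spark; the sample data and the `load_rows`, `select`, and `where` helpers are hypothetical stand-ins for the kind of column-oriented operations a DataFrame API offers.

```python
import csv
import io

# Hypothetical sample data, used only for illustration.
CSV_TEXT = """name,age,city
Ada,36,London
Linus,28,Helsinki
Grace,45,New York
"""


def load_rows(text):
    """Parse CSV text into a list of dicts keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


def select(rows, *columns):
    """Keep only the named columns of every row."""
    return [{c: row[c] for c in columns} for row in rows]


def where(rows, predicate):
    """Keep only the rows for which predicate(row) is true."""
    return [row for row in rows if predicate(row)]


rows = load_rows(CSV_TEXT)
over_30 = where(rows, lambda r: int(r["age"]) > 30)
print(select(over_30, "name", "city"))
```

Because every value is addressed by column name rather than by position, the query reads much like a SQL statement, which is a large part of what makes this style accessible to general users.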

Later on in our demo, we'll leverage Spark DataFrames to access CSV stored data. The last abstraction, the Dataset, builds on the success of the DataFrame by providing an additional type-safe, object-oriented programming interface that gives access to a strongly typed, immutable collection of objects that are mapped to a relational schema. Spark features an advanced Directed Acyclic Graph (DAG) execution engine. Each Spark job created results in a DAG.

The DAG represents a sequence of task stages to be performed on the cluster. DAGs created by Spark can contain any number of stages. As can be seen on the slide, your Spark code gets converted into a DAG, which in turn is passed to the DAG scheduler. The DAG scheduler splits the graph into stages of tasks, and distributes them as task sets to the task scheduler on the cluster manager. Finally, the individual tasks are delivered to an assigned executor process, on a particular worker node.
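The idea of turning a graph of dependent steps into ordered stages can be sketched with Python's standard `graphlib` module (Python 3.9 or later). This is a simplified model, not Spark's scheduler: the step names are hypothetical, and the stages here are simply groups of steps whose dependencies are all satisfied, whereas Spark draws its stage boundaries where data must be shuffled between partitions.

```python
from graphlib import TopologicalSorter


def plan_stages(dependencies):
    """Group the steps of a DAG into ordered stages.

    dependencies maps each step to the set of steps it depends on.
    Steps within one stage do not depend on each other, so they
    could run in parallel.
    """
    sorter = TopologicalSorter(dependencies)
    sorter.prepare()  # raises CycleError if the graph is not acyclic
    stages = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())
        stages.append(ready)
        sorter.done(*ready)
    return stages


# Hypothetical job: two inputs, a cleaning step, a join, then training.
job = {
    "read_csv": set(),
    "read_lookup": set(),
    "clean": {"read_csv"},
    "join": {"clean", "read_lookup"},
    "train_tree": {"join"},
}
print(plan_stages(job))
```

For this job the two read steps land in the first stage, followed by `clean`, `join`, and `train_tree` in stages of their own, because each of those depends on the step before it.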

Later on in this course, when we build our demo, you'll see that all of this internal data management is taken care of for us. As data scientists, we simply create our data scripts and submit them as jobs to Spark. Spark performs all of the above under the hood.

That concludes this lecture. Go ahead and close this, and we'll see you soon in the next lecture, where we introduce you to Spark MLlib. MLlib is Apache Spark's scalable machine learning library.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, and Kubernetes (CKA, CKAD, CKS).