In this course for the Big Data Specialty Certification, we learn how to identify the appropriate data processing technologies needed for big data scenarios. We explore how to design and architect a data processing solution, and we define the operational characteristics of big data processing.
Learning objectives
- Recognize and explain how to identify the appropriate data processing technologies needed for big data scenarios.
- Recognize and explain how to design and architect a data processing solution.
Intended audience
This course is intended for students wanting to extend their knowledge of the data processing options available in AWS.
Prerequisites
While there are no formal prerequisites for this course, students will benefit from having a basic understanding of cloud computing services. If you would like to gain a solid foundation in compute fundamentals, then check out our Compute Fundamentals For AWS course.
This Course Includes
75 minutes of high-definition video.
What You'll Learn
- Course Intro: What to expect from this course
- Amazon Elastic MapReduce Overview: In this lesson, we discuss how EMR allows you to store and process data
- Amazon Elastic MapReduce Architecture: In this lesson, you’ll learn about EMR’s clustered architecture.
- Amazon Elastic MapReduce in Detail: In this lesson, we’ll dig deeper into EMR storage options, resource management, and processing options.
- Amazon Elastic MapReduce Reference Architecture: Best practices for using EMR.
- AWS Lambda Introduction: This lesson will kick off our discussion of Lambda and how it’s used in Big Data scenarios.
- AWS Lambda Overview: This lesson discusses how Lambda allows you to run code for virtually any type of application or backend service with no administration.
- AWS Lambda Architecture: In this lesson, we’ll discuss generic Lambda architecture and Amazon’s serverless service.
- AWS Lambda in Detail: In this lesson, we’ll dig into Events and Service Limits.
- AWS Lambda Reference Architecture: In this lesson, we'll look at a real-life scenario of how Lambda can be used.
So let's have a look at the architecture that is underpinning the Amazon EMR service. Amazon EMR is based on a Clustered architecture, often referred to as a distributed architecture. The core container of the Amazon EMR platform is called a Cluster.
A Cluster is composed of one or more Elastic Compute Cloud (EC2) instances, called Slave Nodes. Slave Nodes are the workers: they execute the code that processes the data. Each Cluster has a Master Node. The Master Node manages the Cluster by running the centralized software components, which coordinate the distribution of data and tasks amongst the Slave Nodes for processing. The Master Node also tracks the status of these tasks and the health of the Cluster. Each Amazon EMR Cluster will only have one Master Node.
There are two types of Slave Nodes: a Core Node and a Task Node. Amazon EMR installs different software components on each node type, giving each node a role in the distributed application. The Core Node is a Slave Node which stores data in the Hadoop Distributed File System, or HDFS, and also runs tasks. A Task Node only runs tasks; it does not store data. Task Nodes are optional when you're creating an EMR Cluster.
When scaling down your EMR environment you cannot remove Core Nodes, but you can remove Task Nodes. This is because Core Nodes hold the data, so if you were to remove them you would lose that data. Task Nodes do not hold data, so they can be used to scale your Cluster's compute power up and down. You are also able to provision a single-node Cluster, and in this scenario the single node performs the roles of both the Master and the Slave.
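To make that distinction concrete, here is a minimal sketch using boto3 (the AWS SDK for Python) that adds, then shrinks, a group of Task Nodes on an existing Cluster. The cluster ID, instance type and counts below are placeholders, not values from the course.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

# Add a group of Task Nodes to an existing Cluster. Task Nodes hold no HDFS
# data, so they can safely be added to increase the Cluster's compute power.
response = emr.add_instance_groups(
    JobFlowId="j-EXAMPLECLUSTERID",       # placeholder cluster ID
    InstanceGroups=[
        {
            "Name": "extra-task-nodes",
            "InstanceRole": "TASK",        # runs tasks only, stores no data
            "InstanceType": "m5.xlarge",
            "InstanceCount": 2,
            "Market": "ON_DEMAND",
        }
    ],
)

# Scaling back down: shrink the task group to zero instances. Core Nodes
# cannot be removed like this, because they hold the HDFS data.
task_group_id = response["InstanceGroupIds"][0]
emr.modify_instance_groups(
    InstanceGroups=[{"InstanceGroupId": task_group_id, "InstanceCount": 0}]
)
```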
As we've just outlined, Amazon EMR is based on a Clustered architecture. Unlike the Clustered architecture provided by services such as Amazon Redshift, where the Cluster components are predetermined and you only need to define how big you want your environment to be, with Amazon EMR you get a number of choices about what you use for each component within the Amazon EMR architecture. The reason for this is that Amazon EMR leverages the Hadoop ecosystem and its related components. In this ecosystem there are currently multiple different open source components you can utilize to do some of the tasks.
I liken it to a cooking recipe. You can often use sugar, honey or molasses in a recipe. They will all sweeten the dish you create, but each will provide a different flavor to the final dish. So it is with Amazon EMR. You are able to use different components at each of the layers, but they all have pros and cons, and they will change how your Amazon EMR solution works or behaves.
The three key layers you need to think about for your EMR Cluster are what you use for storage, processing and access. As you can see, there are a lot of options for the components you can use at each layer. Before we go into each of these options in detail, let's have a quick look at how AWS makes things easier for you. One of the great things about AWS is that they always try to make things easy for you, so when you go to create a new Amazon EMR Cluster in the console there are a number of pre-baked recipes that will install the required applications at each layer for you.
Each of these software configuration options will automatically install the listed combination of applications when you provision your EMR Cluster. As these are open source applications, the version of each application will tend to change on a regular basis, so that is something you need to be cognizant of. An important option to note is that there are two launch modes you can use when creating your EMR environment: a Persistent Cluster and a Transient Cluster.
When you select Cluster as the launch mode, the EMR Cluster is created as a Persistent Cluster and continues running until you decide to terminate it. When you select Step Execution, you define what steps you want the EMR environment to run. Those steps drive the applications that are automatically installed when you create the Cluster. Once those steps are executed, the Cluster is automatically terminated.
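As a hedged illustration of the Step Execution (Transient Cluster) mode, the boto3 sketch below launches a Cluster that installs Hadoop and Spark, runs a single Spark step, and then terminates itself because KeepJobFlowAliveWhenNoSteps is set to False. The release label, bucket, script path and IAM role names are placeholders, not values from the course.

```python
import boto3

emr = boto3.client("emr", region_name="us-east-1")

emr.run_job_flow(
    Name="transient-step-execution-demo",
    ReleaseLabel="emr-5.36.0",                 # placeholder EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",     # the single Master Node
        "SlaveInstanceType": "m5.xlarge",      # Core/Task (Slave) Nodes
        "InstanceCount": 3,
        "KeepJobFlowAliveWhenNoSteps": False,  # transient: terminate after steps
    },
    Steps=[
        {
            "Name": "example-spark-step",
            "ActionOnFailure": "TERMINATE_CLUSTER",
            "HadoopJarStep": {
                "Jar": "command-runner.jar",
                "Args": [
                    "spark-submit",
                    "s3://example-bucket/jobs/wordcount.py",  # placeholder script
                ],
            },
        }
    ],
    JobFlowRole="EMR_EC2_DefaultRole",         # default EMR instance profile
    ServiceRole="EMR_DefaultRole",             # default EMR service role
)
```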
You can also choose which applications you want to automatically install by clicking on the Advanced Options at the top. When you use the Advanced Options, you get to select the applications you want automatically installed within the EMR Cluster. This way you can mix and match your recipes as much as you want to. Let's look at the potential recipe ingredients for your Amazon EMR environment.
The first thing you need to decide is what storage component you want to use. You have three choices: the Hadoop Distributed File System or HDFS, the EMR File System or EMRFS, or a local file system. We will talk through each of these storage options in detail in a few minutes. The Resource Management layer is responsible for managing the Cluster resources and scheduling the jobs for processing data.
Earlier versions of Amazon EMR use MapReduce, while later versions use YARN. The next thing to choose is the processing engine you wish to use. At this time there are four choices: Hadoop MapReduce with Tez, Presto, HBase or Spark.
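To show how the storage choice surfaces once a processing engine such as Spark is in place, here is a minimal PySpark sketch that reads from EMRFS (an s3:// path), then writes to and reads back from the Cluster's HDFS. The bucket and paths are placeholders, not values from the course.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("storage-layer-demo").getOrCreate()

# EMRFS: data persisted in Amazon S3, so it survives Cluster termination.
events = spark.read.json("s3://example-bucket/raw/events/")

# HDFS: data stored on the Core Nodes' disks, lost when the Cluster terminates.
events.write.mode("overwrite").parquet("hdfs:///tmp/events_parquet")

cached = spark.read.parquet("hdfs:///tmp/events_parquet")
print(cached.count())
```

The trade-off the sketch hints at: HDFS is local to the Cluster and fast, while EMRFS keeps the data in S3 so a Transient Cluster can terminate without losing it.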
The last layer to decide on is the applications and tools that will sit on top of the EMR storage and processing engine to allow you to interact with the required data and code. Amazon EMR supports many applications, such as Hive, Pig and the Spark Streaming libraries, to provide capabilities such as using higher-level languages to define processing workloads, leveraging machine learning algorithms, building stream-processing applications, and building data warehouses. We'll look at each of these in detail next.
Shane has been immersed in the world of data, analytics and business intelligence for over 20 years, and for the last few years he has been focusing on how Agile processes and cloud computing technologies can be used to accelerate the delivery of data and content to users.
He is an avid user of the AWS cloud platform to help deliver this capability with increased speed and decreased costs. In fact, it's often hard to shut him up when he is talking about the innovative solutions that AWS can help you create, or how cool the latest AWS feature is.
Shane hails from the far end of the earth, Wellington, New Zealand, a place famous for Hobbits and Kiwifruit. However, you're more likely to see him partaking of a good long black or an even better craft beer.