EMR Architecture
Difficulty
Beginner
Duration
11m
Students
3036
Ratings
4.7/5
Description

This course provides an introduction to the big data processing service known as Amazon Elastic MapReduce, commonly referred to as EMR. You will learn the characteristics of the service and its base architecture.

If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.

Learning Objectives

The objectives of this course are to provide a foundational understanding of Amazon Elastic MapReduce, allowing you to learn what it is, some of its characteristics, and its base architecture.

Intended Audience

This course is ideal for those looking to become data scientists or solutions architects. If you are studying for the AWS Data Analytics - Specialty certification, it also provides useful insight into EMR before you dive deeper into the service.

Prerequisites

To get the most from this course, you should have a basic knowledge of the AWS platform. Some understanding of big data processing would also be beneficial.

Transcript

There are three fundamental architectural components when it comes to Amazon EMR, all of which are types of nodes: Master Nodes, Core Nodes, and Task Nodes.

Firstly, what is a node? An individual node is an instance that forms the basis of the EMR cluster, the actual processing power that allows you to run your EMR jobs to process your data. 

As I explained, the cluster itself contains three different node types. Let’s take a look at each of them individually, starting with the Master node.

There is only ever one Master node per cluster. It is responsible for managing the EMR cluster and runs the processes that coordinate distributed application processing. Examples of these processes include the YARN ResourceManager and the HDFS NameNode services, which allow the Master node to manage resources for the applications using the cluster. The Master node also keeps track of the EMR jobs that have been submitted and contains the Hadoop log files, which can be accessed if you connect locally to the Master node.

The Core node is next in the chain of nodes following the Master, and it is managed by the Master node. More than one Core node can be provisioned, which allows you to implement instance fleets. Instance fleets allow you to provision multiple Core nodes using up to 5 different instance types. Within the instance fleet, you can add or remove instances as required using automatic scaling while the cluster is in use.
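As a rough illustration of the instance fleet idea above, the following Python sketch builds a dictionary in the shape that the boto3 EMR `run_job_flow` call accepts under `Instances["InstanceFleets"]`. The helper name `build_core_fleet`, the fleet name, the instance types, and the capacity value are all placeholders chosen for illustration, not recommendations from the lecture. Nothing in the sketch contacts AWS.

```python
MAX_INSTANCE_TYPES = 5  # limit quoted in the lecture for a single fleet


def build_core_fleet(instance_types, on_demand_capacity):
    """Return a Core fleet definition shaped like a boto3 InstanceFleets entry.

    build_core_fleet is a hypothetical helper, not part of any AWS SDK.
    """
    if not instance_types:
        raise ValueError("a fleet needs at least one instance type")
    if len(instance_types) > MAX_INSTANCE_TYPES:
        raise ValueError("the lecture describes up to 5 instance types per fleet")
    return {
        "Name": "core-fleet",  # placeholder name
        "InstanceFleetType": "CORE",
        "TargetOnDemandCapacity": on_demand_capacity,
        # Each instance type counts as one unit of capacity in this sketch.
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
    }


# Example instance types are assumptions, not values from the lecture.
core_fleet = build_core_fleet(["m5.xlarge", "m5.2xlarge", "r5.xlarge"], 4)
print(core_fleet["InstanceFleetType"], len(core_fleet["InstanceTypeConfigs"]))
```

Checking the number of instance types before calling the service is a design choice for this sketch: it surfaces a configuration mistake locally rather than as an API error.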

Much like the Master node, Core nodes also run a number of processes. These include the DataNode daemon, which manages data storage within the Hadoop Distributed File System (HDFS). The Core nodes also run the Task Tracker daemon, allowing parallel processing to take place.

Out of the three node types available, the Task nodes are the only ones that are optional within your cluster. Utilizing Task nodes allows you to implement a level of parallel compute operations against your data using Hadoop MapReduce tasks or Spark executors. Again, using automatic scaling, you can automatically scale your task nodes within your instance fleet.  
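To summarize the three node types described so far, here is a minimal Python sketch that models a cluster layout and checks it against the rules stated in this lecture: one Master node, Core nodes to host HDFS, and optional Task nodes. The class name `ClusterLayout` and its fields are hypothetical and exist only for this illustration. They are not part of any AWS API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ClusterLayout:
    """Hypothetical model of the node counts in an EMR cluster."""

    master_nodes: int
    core_nodes: int
    task_nodes: int = 0  # Task nodes are the only optional node type

    def validate(self):
        """Return a list of problems; an empty list means the layout fits."""
        problems = []
        if self.master_nodes != 1:
            problems.append("the lecture describes exactly one Master node per cluster")
        if self.core_nodes < 1:
            problems.append("at least one Core node is expected to host HDFS")
        if self.task_nodes < 0:
            problems.append("Task node count cannot be negative")
        return problems


layout = ClusterLayout(master_nodes=1, core_nodes=3)
print(layout.validate())  # an empty list means the layout matches the description
```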

To optimize your costs, using Spot instances within your instance fleet for your task nodes is a very effective way to carry out the processing of your EMR jobs. 
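As a sketch of what a Spot-based Task fleet could look like, the Python below builds a fleet definition in the shape used by the boto3 EMR `run_job_flow` call. The helper name `build_spot_task_fleet`, the instance types, the capacity, and the 20-minute timeout are assumptions made for illustration. The `SWITCH_TO_ON_DEMAND` timeout action is one of the two values the API defines (the other is `TERMINATE_CLUSTER`). The code only builds a dictionary and does not call AWS.

```python
def build_spot_task_fleet(instance_types, spot_capacity, timeout_minutes=20):
    """Return a Task fleet definition that requests only Spot capacity.

    build_spot_task_fleet is a hypothetical helper, not part of any AWS SDK.
    """
    if spot_capacity < 1:
        raise ValueError("request at least one unit of Spot capacity")
    return {
        "Name": "task-fleet",  # placeholder name
        "InstanceFleetType": "TASK",
        "TargetOnDemandCapacity": 0,  # no On-Demand instances in this sketch
        "TargetSpotCapacity": spot_capacity,
        "InstanceTypeConfigs": [
            {"InstanceType": t, "WeightedCapacity": 1} for t in instance_types
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # How long to wait for Spot capacity before applying the action.
                "TimeoutDurationMinutes": timeout_minutes,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }


# Example instance types are assumptions, not values from the lecture.
task_fleet = build_spot_task_fleet(["c5.xlarge", "c5.2xlarge"], 8)
print(task_fleet["TargetSpotCapacity"])
```

Listing more than one instance type gives the service alternatives when Spot capacity for a single type is unavailable.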

More information on how Spot instances can be used for EMR cost optimization across your tasks can be found in this webinar, with guest speaker Chad Shmutzer, a Principal Developer Advocate from AWS.

When you provision your cluster, you need to think about the amount of data you want to store on the core nodes, which run HDFS. For processing, you can add as many task nodes as you want, and if you over-provision your cluster, you can reduce the number of task nodes on the fly without interrupting the running jobs.
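The storage sizing mentioned above can be approximated with simple arithmetic. The Python sketch below estimates how many Core nodes are needed for a given amount of data. The function name `core_nodes_needed`, the replication factor, the usable storage per node, and the 25% headroom are all assumptions for illustration. Check the values that apply to your own cluster configuration before relying on the result.

```python
import math


def core_nodes_needed(data_gb, replication_factor, usable_gb_per_node, headroom=0.25):
    """Estimate the Core node count required to hold data_gb in HDFS.

    All parameters are assumptions supplied by the caller:
    - replication_factor: how many copies HDFS keeps of each block
    - usable_gb_per_node: storage each Core node contributes to HDFS
    - headroom: extra fraction reserved for intermediate data and growth
    """
    raw_gb = data_gb * replication_factor        # total bytes stored across replicas
    padded_gb = raw_gb * (1 + headroom)          # add spare capacity
    return max(1, math.ceil(padded_gb / usable_gb_per_node))


# 2,000 GB of data, 2 replicas, 1,000 GB usable per node, 25% headroom:
# 2000 * 2 = 4000 GB, * 1.25 = 5000 GB, / 1000 GB per node = 5 nodes
nodes = core_nodes_needed(2000, 2, 1000, 0.25)
print(nodes)  # 5
```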

When it comes to adding or removing capacity, you can deploy multiple clusters: it is easy to launch a new cluster and terminate it when you no longer need it. There is no limit to how many clusters you can have. You may want to use multiple clusters if you have multiple users or different jobs to run. For example, you can store your input data in S3 and launch one cluster for each application that needs to process this data. One cluster might be optimized for CPU and another one for storage. With Amazon EMR, it is also easy to resize a running cluster, for example to temporarily add more processing power or to shrink the cluster to save money.
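Resizing a running cluster that uses instance fleets can be sketched as follows. The Python below builds the keyword arguments for the boto3 EMR `modify_instance_fleet` call. The helper name `build_resize_request` and the cluster and fleet IDs are placeholders, not real resources. The actual API call is left as a comment because it needs AWS credentials and an existing cluster.

```python
def build_resize_request(cluster_id, fleet_id, on_demand_capacity, spot_capacity):
    """Return keyword arguments for the EMR modify_instance_fleet call.

    build_resize_request is a hypothetical helper, not part of any AWS SDK.
    """
    if on_demand_capacity < 0 or spot_capacity < 0:
        raise ValueError("target capacities cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceFleet": {
            "InstanceFleetId": fleet_id,
            "TargetOnDemandCapacity": on_demand_capacity,
            "TargetSpotCapacity": spot_capacity,
        },
    }


# Placeholder IDs; substitute the IDs of a real cluster and fleet.
request = build_resize_request("j-EXAMPLE", "if-EXAMPLE", 0, 12)
print(request["InstanceFleet"]["TargetSpotCapacity"])

# With credentials configured, the request could be sent like this:
# import boto3
# boto3.client("emr").modify_instance_fleet(**request)
```

Raising the target capacity adds processing power temporarily, and lowering it again shrinks the fleet to save money, which matches the two resize motivations described in the lecture.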

That now brings me to the end of this lecture and to the end of this introductory course covering Amazon Elastic MapReduce. You should now have a greater understanding of what this service is used for and how it can fit into your data analytic solutions.

If you have any feedback, positive or negative, please do contact us at support@cloudacademy.com. Your feedback is greatly appreciated. Thank you for your time and good luck with your continued learning of cloud computing.

Thank you.

About the Author
Students
228618
Labs
1
Courses
215
Learning Paths
178

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.