
EMR Architecture



This section of the AWS Certified Solutions Architect - Professional learning path introduces common AWS solution architectures relevant to the AWS Certified Solutions Architect - Professional exam and the services that support them. These services form a core component of running resilient and performant architectures. 

Learning Objectives

  • Learn how to utilize managed services and serverless architectures to minimize cost
  • Understand how to use AWS services to process streaming data
  • Discover AWS services that support mobile app development
  • Understand when to utilize serverless services within your AWS solutions
  • Learn which AWS services to use when building a decoupled architecture

There are three fundamental architectural components when it comes to Amazon EMR, all of which are types of nodes: Master Nodes, Core Nodes, and Task Nodes.

Firstly, what is a node? An individual node is an instance that forms the basis of the EMR cluster, the actual processing power that allows you to run your EMR jobs to process your data. 

As I explained, the cluster itself contains three different node types. Let’s take a look at each of them individually, starting with the Master Node.

There is only ever one Master node per cluster. It is responsible for managing the EMR cluster and runs the processes that coordinate distributed application processing. Examples of these processes include the YARN ResourceManager and HDFS NameNode services, which allow the Master node to manage resources for the applications using the cluster. The Master node also keeps track of the EMR jobs that have been submitted and holds the Hadoop log files, which you can access by connecting to the Master node directly.

The Core node is next in the chain of nodes following the Master and, as a result, is managed by the Master node. More than one Core node can be provisioned, allowing you to implement instance fleets. Instance fleets allow you to provision multiple Core nodes using up to 5 different instance types. Within the instance fleet, you can add or remove instances as required using automatic scaling while the cluster is in use.

Core nodes also run a number of processes, much like the Master node. These include the DataNode daemon, which manages data storage within the Hadoop Distributed File System (HDFS). The Core nodes also run the Task Tracker daemon, allowing parallel processing to take place.

Out of the three node types available, the Task nodes are the only ones that are optional within your cluster. Utilizing Task nodes allows you to implement a level of parallel compute operations against your data using Hadoop MapReduce tasks or Spark executors. Again, you can use automatic scaling to scale the Task nodes within your instance fleet.

To optimize your costs, using Spot instances within your instance fleet for your task nodes is a very effective way to carry out the processing of your EMR jobs. 
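As a rough sketch of how the three node types and Spot capacity could be expressed together, the Python below builds the instance fleet portion of a cluster launch request: one Master fleet, an On-Demand Core fleet, and a Spot-only Task fleet. The key names follow the shape of the `Instances.InstanceFleets` parameter of boto3's EMR `run_job_flow` call as I understand it. The instance types, capacities, timeout values, and the helper name `build_instance_fleets` are illustrative assumptions, and nothing here contacts AWS.

```python
def build_instance_fleets(core_units=2, task_spot_units=4):
    """Return an illustrative InstanceFleets list: Master, Core, and Task.

    All instance types and capacities are placeholder assumptions.
    """
    master = {
        "Name": "master-fleet",
        "InstanceFleetType": "MASTER",
        "TargetOnDemandCapacity": 1,  # a single Master node
        "InstanceTypeConfigs": [{"InstanceType": "m5.xlarge"}],
    }
    core = {
        "Name": "core-fleet",
        "InstanceFleetType": "CORE",
        # Core nodes hold HDFS data, so this sketch keeps them On-Demand.
        "TargetOnDemandCapacity": core_units,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "m5.2xlarge", "WeightedCapacity": 2},
        ],
    }
    task = {
        "Name": "task-fleet",
        "InstanceFleetType": "TASK",
        # Task nodes are optional and store no HDFS data, so Spot fits well.
        "TargetOnDemandCapacity": 0,
        "TargetSpotCapacity": task_spot_units,
        "InstanceTypeConfigs": [
            {"InstanceType": "m5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "c5.xlarge", "WeightedCapacity": 1},
            {"InstanceType": "r5.xlarge", "WeightedCapacity": 1},
        ],
        "LaunchSpecifications": {
            "SpotSpecification": {
                # Assumed values: wait 20 minutes for Spot capacity,
                # then fall back to On-Demand instead of failing.
                "TimeoutDurationMinutes": 20,
                "TimeoutAction": "SWITCH_TO_ON_DEMAND",
            }
        },
    }
    return [master, core, task]


fleets = build_instance_fleets()
for fleet in fleets:
    types = [cfg["InstanceType"] for cfg in fleet["InstanceTypeConfigs"]]
    print(fleet["InstanceFleetType"], types)
```

Listing several instance types in the Task fleet gives EMR more Spot pools to draw from, which is the main reason to combine instance fleets with Spot in the first place.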

More information on how Spot instances can be used for EMR cost optimization across your tasks can be found in this webinar, with guest speaker Chad Shmutzer, a Principal Developer Advocate from AWS.

When you provision your cluster, you need to think about the amount of data you want to store on the core nodes, which run HDFS. For processing, you can add as many task nodes as you want, and if you over-provision your cluster, you can reduce the number of task nodes on the fly without interrupting the running jobs.
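To make the storage side of that sizing decision concrete, here is a minimal back-of-the-envelope sketch in Python. It assumes the number of Core nodes is driven only by HDFS storage, that every Core node offers the same usable disk space, and that you know the HDFS replication factor in use. The function name `core_nodes_needed` and all of the figures are hypothetical.

```python
import math


def core_nodes_needed(data_gb, usable_gb_per_node, replication_factor):
    """Estimate how many Core nodes are needed to hold data_gb in HDFS.

    HDFS stores replication_factor copies of every block, so the raw
    storage required is data_gb * replication_factor.
    """
    if data_gb < 0 or usable_gb_per_node <= 0 or replication_factor < 1:
        raise ValueError("sizes must be positive and replication >= 1")
    raw_gb = data_gb * replication_factor
    # Always keep at least one Core node, even for tiny data sets.
    return max(1, math.ceil(raw_gb / usable_gb_per_node))


# 2,000 GB of data, 500 GB usable per node, two copies of each block.
print(core_nodes_needed(2000, 500, 2))  # 8
```

A real estimate would also leave headroom for intermediate job output and growth, which this sketch deliberately ignores.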

When it comes to adding or removing capacity, you can deploy multiple clusters if you need more capacity, and you can easily launch a new cluster and terminate it when you no longer need it. There is no limit to how many clusters you can have. You may want to use multiple clusters if you have multiple users or different jobs to run. For example, you can store your input data in S3 and launch one cluster for each application that needs to process this data. One cluster might be optimized for CPU and another one for storage. With Amazon EMR, it's also easy to resize a running cluster. You may want to resize a cluster to temporarily add more processing power or to shrink your cluster to save money.
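Resizing a running cluster that uses instance fleets can be sketched as a change to a fleet's target capacities. The Python below only builds the request arguments; the key names follow the shape of boto3's EMR `modify_instance_fleet` call as I understand it, while the helper name `build_resize_request` and the cluster and fleet IDs are hypothetical placeholders.

```python
def build_resize_request(cluster_id, fleet_id, on_demand_units, spot_units):
    """Build keyword arguments describing new target capacities for a fleet."""
    if on_demand_units < 0 or spot_units < 0:
        raise ValueError("target capacities cannot be negative")
    return {
        "ClusterId": cluster_id,
        "InstanceFleet": {
            "InstanceFleetId": fleet_id,
            "TargetOnDemandCapacity": on_demand_units,
            "TargetSpotCapacity": spot_units,
        },
    }


# Temporarily add processing power to the Task fleet, then shrink it again.
# Both IDs are made-up placeholders, not real resources.
scale_up = build_resize_request("j-EXAMPLECLUSTER", "if-EXAMPLETASK", 0, 12)
scale_down = build_resize_request("j-EXAMPLECLUSTER", "if-EXAMPLETASK", 0, 2)
print(scale_up["InstanceFleet"]["TargetSpotCapacity"])    # 12
print(scale_down["InstanceFleet"]["TargetSpotCapacity"])  # 2
```

With a boto3 EMR client, a request like this would presumably be passed along as `client.modify_instance_fleet(**scale_up)`; that call is left out here so the sketch runs without AWS credentials.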

That now brings me to the end of this lecture and to the end of this introductory course covering Amazon Elastic MapReduce. You should now have a greater understanding of what this service is used for and how it can fit into your data analytics solutions.

If you have any feedback, positive or negative, please do contact us at support@cloudacademy.com. Your feedback is greatly appreciated. Thank you for your time and good luck with your continued learning of cloud computing.

Thank you.

About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.