Serverless, Component Decoupling, and Solution Architectures (SAP-C02)

EMR Characteristics



This section of the AWS Certified Solutions Architect - Professional learning path introduces common AWS solution architectures relevant to the AWS Certified Solutions Architect - Professional exam and the services that support them. These services form a core component of running resilient and performant architectures. 


Learning Objectives

  • Learn how to utilize managed services and serverless architectures to minimize cost
  • Understand how to use AWS services to process streaming data
  • Discover AWS services that support mobile app development
  • Understand when to utilize serverless services within your AWS solutions
  • Learn which AWS services to use when building a decoupled architecture

As this is an introductory-level course, I will not take a deep dive into Hadoop distributions or the MapReduce algorithm. Instead, to keep things at a higher level, let me first discuss EMR's simplicity. You can go from having no EMR resources within AWS to a full EMR cluster ready to process your data in just a few clicks in the console. A few minutes later, Amazon EMR will have orchestrated all the resources required to carry out the processing.
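To give a feel for how little is needed to launch a cluster, here is a minimal sketch of the request you might build with boto3's EMR `run_job_flow` call. The cluster name, release label, and instance sizes are illustrative assumptions, not values from the course.

```python
# Illustrative parameters for launching a small EMR cluster.
# All names and sizes below are assumptions for this sketch.
cluster_request = {
    "Name": "example-analytics-cluster",       # hypothetical name
    "ReleaseLabel": "emr-6.15.0",              # assumed release label
    "Applications": [{"Name": "Spark"}, {"Name": "Hadoop"}],
    "Instances": {
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                     # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,    # keep cluster up after steps
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

def launch_cluster(request):
    """Submit the request; EMR then orchestrates the EC2 resources."""
    import boto3  # deferred so the sketch is readable without AWS credentials
    emr = boto3.client("emr")
    return emr.run_job_flow(**request)["JobFlowId"]
```

The same launch can be done in the console in a few clicks; the API form simply makes the handful of required inputs explicit.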

This is a huge improvement over traditional provisioning. You save time and money, as you're ready to do your own analysis in a matter of minutes, not hours or days. And if you have experience creating a manual Hadoop cluster setup, you will know how much time and effort this saves compared with provisioning your own hardware.

The pricing model is also very attractive. Traditionally, you would need to acquire all the hardware and related equipment to support the maximum load for your analytics. You would need several servers to compose your cluster, plus highly available storage and networking, and that's before the cost of energy and cooling within your own data center.

With EMR, however, the cost model is pay-as-you-go, meaning you only pay for the provisioned resources while your cluster is running. You can choose among different EC2 instance types and sizes for different levels of performance. If you need strong processing power for a short time, you can run 100 nodes for 1 hour instead of 10 nodes for 10 hours and pay the same price. You can also take advantage of EC2 pricing models such as Spot Instances and Reserved Instances.
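The 100-nodes-for-1-hour claim follows directly from billing by node-hours. A quick arithmetic check, using an assumed illustrative hourly rate:

```python
# Node-hours drive the bill: 100 nodes for 1 hour and 10 nodes for
# 10 hours consume the same 100 node-hours at a given On-Demand rate.
HOURLY_RATE = 0.25  # assumed illustrative per-node hourly price in USD

def cluster_cost(nodes, hours, rate=HOURLY_RATE):
    """Total cost of running `nodes` instances for `hours` hours."""
    return nodes * hours * rate

fast = cluster_cost(nodes=100, hours=1)   # finish in 1 hour
slow = cluster_cost(nodes=10, hours=10)   # finish in 10 hours
assert fast == slow == 25.0               # same node-hours, same price
```

The only difference is wall-clock time, which is why scaling out for a short burst is so attractive.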

Elasticity is another intrinsic cloud characteristic: the ability to grow or shrink resources on demand, whenever required. You can add or remove nodes on the fly. For example, if you underestimated the resources required when creating your cluster, you can add more core or task nodes to it.
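Adding task nodes on the fly can be sketched with boto3's EMR `add_instance_groups` call. The cluster ID, instance type, and group size here are hypothetical placeholders:

```python
# Parameters for growing a running cluster with extra task nodes.
# All values below are illustrative assumptions, not course values.
task_group = {
    "Name": "extra-task-nodes",
    "InstanceRole": "TASK",       # task nodes only run work, no HDFS data
    "InstanceType": "m5.xlarge",
    "InstanceCount": 4,
    "Market": "SPOT",             # task nodes are a good fit for Spot pricing
}

def add_task_nodes(cluster_id, group):
    """Attach a new instance group to an already-running cluster."""
    import boto3  # deferred so the sketch runs without AWS credentials
    emr = boto3.client("emr")
    return emr.add_instance_groups(JobFlowId=cluster_id, InstanceGroups=[group])
```

Because task nodes hold no HDFS data, they can also be removed just as easily when the burst of demand passes.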

Security is an important factor for AWS and all of its services, and EMR is no different. Your instances are secured by EC2 security groups: one security group for the master node and others for the remaining node types, which do not have external access by default. You can modify this behavior, but it is highly recommended not to open your cluster up to the world. Data is also secured on Amazon S3, and you can enable auditing with CloudTrail.
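If you do need to reach the master node, the safer modification is a narrow ingress rule rather than opening the group broadly. A sketch using the EC2 `authorize_security_group_ingress` call, with a hypothetical group ID and trusted CIDR range:

```python
# A narrow SSH ingress rule for the master node's security group.
# The CIDR range and group ID are hypothetical placeholders.
ssh_rule = {
    "IpProtocol": "tcp",
    "FromPort": 22,
    "ToPort": 22,
    "IpRanges": [{"CidrIp": "203.0.113.0/24"}],  # trusted range only, not 0.0.0.0/0
}

def allow_ssh(group_id, rule):
    """Permit SSH from a single trusted CIDR block only."""
    import boto3  # deferred so the sketch runs without AWS credentials
    ec2 = boto3.client("ec2")
    return ec2.authorize_security_group_ingress(
        GroupId=group_id, IpPermissions=[rule]
    )
```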

AWS has always taken great care to make services that integrate easily with one another, reducing as much as possible the complexity of managing, operating, and connecting the pieces when you build your applications. It's no different for EMR, where several storage services are available and easily integrated. S3, for example, is the most commonly used storage because of its durability, low price, and virtually unlimited capacity. You can also store your data in the cluster itself by taking advantage of the Hadoop Distributed File System (HDFS). You may store your results back to a DynamoDB table, and if you want to leverage your existing BI tools and data warehousing infrastructure, you can make your results available to Redshift, save them for archival purposes in Amazon S3 Glacier, or load them into a relational database system.
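The S3 integration shows up naturally in EMR "steps": a job submitted to the cluster can read input from and write results to `s3://` paths directly. Here is a sketch of such a step definition for the EMR `add_job_flow_steps` call; the bucket names and script path are hypothetical.

```python
# An EMR step that runs a Spark job reading from and writing to S3.
# Bucket names and the script path are illustrative assumptions.
spark_step = {
    "Name": "aggregate-logs",
    "ActionOnFailure": "CONTINUE",
    "HadoopJarStep": {
        "Jar": "command-runner.jar",  # built-in runner on EMR nodes
        "Args": [
            "spark-submit",
            "s3://example-bucket/jobs/aggregate.py",
            "--input", "s3://example-bucket/raw/",
            "--output", "s3://example-bucket/results/",
        ],
    },
}

def submit_step(cluster_id, step):
    """Queue the step on a running cluster."""
    import boto3  # deferred so the sketch runs without AWS credentials
    emr = boto3.client("emr")
    return emr.add_job_flow_steps(JobFlowId=cluster_id, Steps=[step])
```

Once the results land in S3, the same objects can feed Redshift, archival tiers, or other downstream systems without moving data through the cluster again.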

About the Author

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.