EMR Characteristics
Difficulty: Beginner
Duration: 11m
Students: 3050
Rating: 4.7/5
Description

This course provides an introduction to the big data processing service known as Amazon Elastic MapReduce, commonly referred to as EMR. You will learn the characteristics of the service and its base architecture.

If you have any feedback relating to this course, feel free to contact us at support@cloudacademy.com.

Learning Objectives

The objectives of this course are to provide a foundational understanding of Amazon Elastic MapReduce, allowing you to learn what it is, some of its characteristics, and its base architecture.

Intended Audience

This course is ideal for those looking to become a data scientist or a solutions architect. Also, if you are studying for the AWS Data Analytics - Specialty certification, this course provides great insight into EMR before diving deeper into the service.

Prerequisites

To get the most from this course, you should have a basic knowledge of the AWS platform. Some understanding of big data processing would also be beneficial.

Transcript

As this is an introductory-level course, I will not dive deep into Hadoop distributions or the MapReduce algorithm. Instead, to keep things at a higher level, let me first discuss EMR's simplicity. You can go from having no EMR resources within AWS to a full EMR cluster ready to process your data in just a few clicks in the console. A few minutes later, AWS EMR will have orchestrated all the resources required to carry out the processing.
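
To make this concrete, here is a minimal sketch of launching a small cluster programmatically with boto3 rather than through the console. The cluster name, instance types, release label, and region are illustrative assumptions, not values from the course.

```python
# A minimal sketch: launch a small EMR cluster with boto3.
# All names and sizes below are illustrative assumptions.
import boto3

emr = boto3.client("emr", region_name="us-east-1")

response = emr.run_job_flow(
    Name="demo-cluster",                       # hypothetical cluster name
    ReleaseLabel="emr-6.10.0",                 # example EMR release
    Applications=[{"Name": "Hadoop"}, {"Name": "Spark"}],
    Instances={
        "MasterInstanceType": "m5.xlarge",
        "SlaveInstanceType": "m5.xlarge",
        "InstanceCount": 3,                    # 1 master + 2 core nodes
        "KeepJobFlowAliveWhenNoSteps": True,   # keep cluster up after steps
    },
    JobFlowRole="EMR_EC2_DefaultRole",         # default EC2 instance profile
    ServiceRole="EMR_DefaultRole",             # default EMR service role
)
print("Cluster ID:", response["JobFlowId"])
```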

This is a huge improvement over traditional provisioning. You save time and money, as you're ready to run your own analysis in a matter of minutes, not hours or days. And if you have ever set up a Hadoop cluster manually, you will know how much time this saves compared to the effort of provisioning your own hardware.

The pricing model is also very attractive. Traditionally, you would need to acquire all the hardware and related equipment to support the maximum load of your analytics. You would need several servers to compose your cluster, plus highly available storage and networking, and that's before counting the cost of energy and cooling within your own data center.

However, with EMR the cost model is pay-as-you-go, meaning you only pay for the provisioned resources while your cluster is running. You can choose among different EC2 instance types and sizes for different levels of performance. If you need strong processing power for a short time, you can run 100 nodes for 1 hour instead of 10 nodes for 10 hours and pay the same price. You can also take advantage of EC2 pricing models such as Spot Instances and Reserved Instances.
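
As a quick sanity check on that claim, and to show how Spot capacity is requested, here is a short sketch. The hourly rate and bid price are made-up numbers, and the instance group shown is the kind of dictionary you would pass inside run_job_flow's Instances argument.

```python
# Pricing intuition: with per-instance-hour billing,
# 100 nodes x 1 hour costs the same as 10 nodes x 10 hours.
hourly_rate = 0.192             # assumed example rate, not an official price
print(100 * 1 * hourly_rate)    # 19.2
print(10 * 10 * hourly_rate)    # 19.2

# Requesting task nodes from the Spot market via an instance group config:
spot_task_group = {
    "Name": "task-spot",
    "InstanceRole": "TASK",
    "InstanceType": "m5.xlarge",
    "InstanceCount": 100,
    "Market": "SPOT",
    "BidPrice": "0.10",         # optional maximum Spot price in USD
}
```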

Elasticity is another intrinsic cloud characteristic: the ability to grow or shrink your resources whenever required, based on demand. You can add or remove nodes on the fly. For example, if you underestimated the resources required when creating your cluster, you can add more core or task nodes to it.
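
For example, a running cluster can be resized with a couple of API calls; here is a sketch with boto3, where the cluster ID is a placeholder.

```python
# A sketch of growing a running cluster's core instance group with boto3.
import boto3

emr = boto3.client("emr")
cluster_id = "j-XXXXXXXXXXXXX"      # placeholder cluster ID

# Find the core instance group of the cluster.
groups = emr.list_instance_groups(ClusterId=cluster_id)
core = next(g for g in groups["InstanceGroups"]
            if g["InstanceGroupType"] == "CORE")

# Scale the core group up to 8 nodes on the fly.
emr.modify_instance_groups(
    ClusterId=cluster_id,
    InstanceGroups=[{"InstanceGroupId": core["Id"], "InstanceCount": 8}],
)
```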

Security is an important factor for AWS and all of its services, and EMR is no different. Your instances are secured by EC2 security groups: one security group for the master node, and others for the remaining node types, which have no external access by default. You can modify this behavior, but it's highly recommended not to open your cluster up to the world. Your data is also secured on Amazon S3, and you can enable auditing with AWS CloudTrail.
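
If you want to control which security groups the cluster uses rather than letting EMR create them, you can name them at launch time. The group IDs below are placeholders; these fields sit inside the Instances argument of run_job_flow shown earlier.

```python
# A sketch of pinning the EMR-managed security groups at launch time.
instances_config = {
    "MasterInstanceType": "m5.xlarge",
    "SlaveInstanceType": "m5.xlarge",
    "InstanceCount": 3,
    # One security group for the master node...
    "EmrManagedMasterSecurityGroup": "sg-0123456789abcdef0",   # placeholder
    # ...and another for the core/task nodes, which have no
    # external access by default.
    "EmrManagedSlaveSecurityGroup": "sg-0fedcba9876543210",    # placeholder
}
```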

AWS has always taken great care to make its services integrate easily with one another, reducing as much as possible the complexity of managing, operating, and fitting the pieces together when you're building your applications. It's no different for EMR, where several storage services are available and easily integrated. S3, for example, is the most commonly used storage because of its durability, low price, and virtually unlimited capacity. But you can also store your data in the cluster itself by taking advantage of the Hadoop Distributed File System (HDFS). You may also store your results back to a DynamoDB table; if you want to leverage your existing BI tools and data warehousing infrastructure, you can make your results available to Redshift; you can save them for archival purposes with Amazon Glacier; and it's also possible to put them into a relational database system.
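
As an illustration of the S3 integration, here is a sketch that submits a Spark step reading input from S3 and writing results back to S3. The bucket names, script path, and cluster ID are all hypothetical.

```python
# A sketch: run a Spark script against S3 data as an EMR step.
import boto3

emr = boto3.client("emr")

emr.add_job_flow_steps(
    JobFlowId="j-XXXXXXXXXXXXX",                    # placeholder cluster ID
    Steps=[{
        "Name": "s3-wordcount",
        "ActionOnFailure": "CONTINUE",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",            # standard EMR step runner
            "Args": [
                "spark-submit",
                "s3://my-bucket/scripts/wordcount.py",  # hypothetical script
                "s3://my-bucket/input/",                # hypothetical input
                "s3://my-bucket/output/",               # hypothetical output
            ],
        },
    }],
)
```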

About the Author
Students: 229275
Labs: 1
Courses: 216
Learning Paths: 172

Stuart has been working within the IT industry for two decades, covering a huge range of topic areas and technologies, from data center and network infrastructure design to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to cloud computing, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016, Stuart was awarded the ‘Expert of the Year Award 2015’ by Experts Exchange for his knowledge sharing within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.