Elastic MapReduce (EMR)


Analytics Concepts
Duration: 1h 4m


In this course, we will explore the Analytics tools provided by AWS, including Elastic MapReduce (EMR), Data Pipeline, Elasticsearch, Kinesis, Amazon Machine Learning, and QuickSight, which is still in preview mode.

We will start with an overview of Data Science and Analytics concepts to give beginners the context they need to be successful in the course. The second part of the course will focus on the AWS offerings for Analytics: that is, how AWS structures its portfolio across the different processes and steps of big data processing.

As a fundamentals course, the requirements are kept simple so you can focus on understanding the different services from AWS. However, a basic understanding of the following topics is necessary:

  • As we are talking about technology and computing services, general IT knowledge is necessary: the basics of programming logic and algorithms, and some learning or working experience in the IT field.
  • We will give you an overview of data science concepts, but if these concepts are already familiar to you, it will make your journey smoother.
  • It is not mandatory, but it is helpful to have general knowledge of AWS, specifically how to access your account and use services such as S3 and EC2.

The following two courses from our portfolio can help you better understand the basics of AWS if you are just starting out:

- AWS Technical Fundamentals
- AWS Networking Features Essential for a Solutions Architect

If you have thoughts or suggestions for this course, please contact Cloud Academy at support@cloudacademy.com.


Welcome to the AWS Analytics Fundamentals course. In this video, we will cover the Amazon Elastic MapReduce service. By the end of this video, you will have seen the general features and concepts behind EMR, as well as how to provision a cluster.

Amazon Elastic MapReduce, or EMR, is an Amazon managed service to process and analyze vast amounts of data. EMR is based on the popular and solid Apache Hadoop framework, an open source distributed processing framework intended for big data processing. Organizations and companies can benefit greatly from using Amazon EMR because it abstracts away and reduces the complexity of the infrastructure layer.

If you have experience with Hadoop infrastructure, you know what I mean: from initial setup to keeping a cluster healthy, the effort involved is not trivial. So what Amazon did was encapsulate all the infrastructure of the Hadoop framework into an integrated environment, so you can launch a cluster in minutes and focus on the part that really matters, which is not managing infrastructure but getting your data processed according to your needs.

Amazon EMR securely and reliably handles your big data use cases, including log analysis, web indexing, data warehousing, machine learning, financial analysis, scientific simulation, and bioinformatics. Amazon EMR takes advantage of EC2 instances specially configured with the Hadoop framework to deliver petabyte-scale processing power. Besides the open source framework, the MapR distribution is also available.

Now let's look briefly at the other frameworks EMR supports. It also supports Spark and Presto, two other well-known frameworks in the big data space. Apache Spark on Hadoop is natively supported in Amazon EMR, and you can quickly and easily create managed Apache Spark clusters from the management console or the API. Presto is an open source distributed SQL query engine optimized for low-latency, ad-hoc analysis of data. You can also quickly and easily create a managed Presto cluster from the AWS console. Now let's talk about the general characteristics of EMR.

As this is a beginner-level course, we will not go deeply into Hadoop distributions or the MapReduce algorithms. The first characteristic worth mentioning is simplicity. Yes, it really is very simple. Going from zero to a complete data-processing cluster takes a few clicks in the console and a few minutes for AWS EMR to orchestrate all the needed resources. This is a huge improvement over traditional provisioning: you save time and money, as you are ready to run your analysis in a matter of minutes, not hours or days. And as we said before, if you have ever set up a Hadoop cluster manually, you know this really matters.
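To make the "few clicks" concrete, here is a minimal sketch of the same launch expressed programmatically. It builds the request a boto3 `run_job_flow` call would take; the log bucket and cluster name are hypothetical, and constructing the dict requires no AWS credentials.

```python
# Sketch of an EMR cluster launch request (boto3-style parameters).
# Bucket and cluster names are hypothetical examples.

def build_cluster_request(name, master_type="m1.large",
                          core_type="m1.large", core_count=2):
    """Assemble the parameters for emr_client.run_job_flow()."""
    return {
        "Name": name,
        "LogUri": "s3://my-emr-logs/",  # hypothetical log bucket
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": master_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": core_type,
                 "InstanceCount": core_count},
            ],
            # Keep the cluster alive after steps finish (standalone
            # cluster, as opposed to step execution).
            "KeepJobFlowAliveWhenNoSteps": True,
        },
    }

request = build_cluster_request("analytics-test")
# With credentials configured, you would then call:
#   import boto3
#   boto3.client("emr").run_job_flow(**request)
```

The console walkthrough later in this video performs the equivalent of this request interactively.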

The pricing model is also very attractive. Traditionally, you would need to acquire all the hardware and related equipment to support the maximum load of your analytics: several servers to compose your cluster, plus highly available storage and networking, not to mention energy and cooling for the devices. This made it hard for Hadoop and other distributed environments to fit within organizations' cost constraints. With EMR, you pay only while your cluster is running; when you stop it, you stop paying for it. Elasticity, the ability to grow or shrink resources whenever your needs change, is another intrinsic cloud characteristic. Besides a reliable environment setup, EMR reports the cluster's state to inform you of the current situation of your environment, and when you send a job for execution, it provides a status for each step.

Security is an important factor for AWS across all services, and EMR is no different. Your instances are secured by EC2 security groups, segmented into one security group for the master node and another for the slaves, with no external access by default. You can, of course, change this behavior, but it is highly recommended not to open your cluster to the world. The data is also secured on S3, and you can enable auditing with CloudTrail. As we said before, elasticity is the ability to grow and shrink resources on demand to meet your needs.

EMR is highly elastic, as you can add or remove nodes on the fly. For example, if you under-dimensioned your cluster at creation time, you can add more core or task nodes to it. One caveat is that core nodes host persistent data in HDFS, the Hadoop file system, and cannot be removed; core nodes should be reserved for the capacity that is required until your cluster completes. Task nodes, on the other hand, can be added and removed and do not contain HDFS, so they are ideal for capacity that is only needed on a temporary basis. In short, when you dimension your cluster, think about the amount of data you want to store on the core nodes, which run HDFS. For processing, you can add as many task nodes as you want, and if you over-dimensioned your cluster, you can reduce the number of task nodes on the fly without interrupting running jobs.
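The resize operation described above can be sketched as a boto3-style `modify_instance_groups` request. The instance group ID below is hypothetical; only the task group's count changes, leaving the core nodes (and their HDFS data) untouched.

```python
# Sketch: resizing the TASK instance group of a running cluster.
# The group ID is a hypothetical example.

def resize_task_group(task_group_id, new_count):
    """Build the parameters for emr_client.modify_instance_groups()."""
    if new_count < 0:
        raise ValueError("instance count cannot be negative")
    return {
        "InstanceGroups": [
            {"InstanceGroupId": task_group_id,
             "InstanceCount": new_count},
        ]
    }

# Remove all task nodes on the fly; running jobs are not interrupted.
params = resize_task_group("ig-TASKGROUP123", 0)
# With credentials: boto3.client("emr").modify_instance_groups(**params)
```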

There are two main options for adding or removing capacity. First, you can deploy multiple clusters: if you need more capacity, you can easily launch a new cluster and terminate it when you no longer need it, and there is no limit to how many clusters you can have. You may want to use multiple clusters if you have multiple users or different jobs to run. For example, you can store your input data in S3 and launch one cluster for each application that needs to process this data; one cluster might be optimized for CPU and another for storage. Second, you can resize a running cluster. With Amazon EMR, it is easy to resize a running cluster: you may want to temporarily add more processing power, or shrink your cluster to save money.

Beyond the great savings from reduced operations and setup costs, EMR also has a very attractive pay-as-you-go cost model. You pay only for the provisioned resources while your cluster is running, and you can choose among different EC2 instance types and sizes. If you need strong processing power for a short time, you can run 100 nodes for 1 hour instead of 10 nodes for 10 hours, and you pay the same price. You can also take advantage of special EC2 pricing options, like Spot instances and Reserved Instances. Use Spot if you need the best price to process your data and it does not matter if the running time is affected by Spot prices. Use Reserved Instances if you need a stable, constantly running cluster.
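The "100 nodes for 1 hour vs. 10 nodes for 10 hours" claim follows directly from per-node-hour billing, as this small sketch shows. The hourly rate is a made-up illustrative number, not a real EMR price.

```python
# Sketch: per-node-hour billing makes wide-and-short equal to
# narrow-and-long. The rate is hypothetical, not a real EMR price.

def cluster_cost(nodes, hours, rate_per_node_hour):
    """Total node-hours multiplied by the hourly rate."""
    return nodes * hours * rate_per_node_hour

rate = 0.10  # hypothetical $/node-hour
wide = cluster_cost(100, 1, rate)    # 100 nodes for 1 hour
narrow = cluster_cost(10, 10, rate)  # 10 nodes for 10 hours
assert wide == narrow  # both are 100 node-hours, same bill
```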

AWS always takes great care to make services that integrate easily with each other, to reduce as much as possible the complexity of managing, operating, and putting the pieces together when you build your applications. EMR is no different: several storage options are available and easily integrated. S3, for example, is the most commonly used storage because of its durability, low price, and virtually infinite capacity. But you can also store your data in the cluster itself by taking advantage of HDFS, the Hadoop file system. You may also store your results in a DynamoDB table, and if you want to leverage your existing BI tools and data warehousing infrastructure, you can make your results available to Redshift, save them for archival with Glacier, or put them into a relational database system.

Now let's understand how to make EMR work. First of all, as we stressed in the previous videos, you need a research problem and the data to analyze, and for EMR specifically, you need code developed in one of the supported languages in order to get answers to your questions. When you have all this, you upload the data and the code to S3. Then, after setting up your cluster using the console or API calls, you submit the job to the newly created cluster. You can check the metrics on CloudWatch to see what is going on in the cluster, and in the end, the result can be saved back to S3 or, with other tools in the middle, sent to Redshift or DynamoDB.
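The "submit the job" part of the workflow above can be sketched as a custom JAR step, the same kind of step shown later in the demo. The bucket names, JAR path, and cluster ID are hypothetical; the dict mirrors the shape a boto3 `add_job_flow_steps` call expects.

```python
# Sketch: a custom JAR step for an existing EMR cluster.
# Bucket names, JAR path, and cluster ID are hypothetical examples.

def build_jar_step(name, jar_s3_path, args):
    """Build one step for emr_client.add_job_flow_steps()."""
    return {
        "Name": name,
        "ActionOnFailure": "CONTINUE",  # keep the cluster alive on failure
        "HadoopJarStep": {
            "Jar": jar_s3_path,
            "Args": list(args),  # typically input and output S3 paths
        },
    }

step = build_jar_step(
    "cloudfront-log-analysis",
    "s3://my-code-bucket/analysis.jar",
    ["s3://my-data-bucket/input/", "s3://my-data-bucket/output/"],
)
# With credentials you would then call:
#   boto3.client("emr").add_job_flow_steps(
#       JobFlowId="j-XXXXXXXXXXXXX", Steps=[step])
```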

Now we are going to create a cluster and experiment a little with the power of the Hadoop framework. Okay, so let's create our EMR cluster. First we log into the AWS console, then we go to EMR, and here we are on the Getting Started page. We are in the Oregon region, and I have no cluster created, so we will create one. First we choose a name for the cluster. We can enable logging, and it is recommended to keep logging enabled. We create a standalone cluster, which means the cluster will live on after we submit the first job. There is also step execution: this means I could create the cluster only to execute a given task, after which it is automatically destroyed. For the software configuration, we use the Amazon distribution rather than MapR, and we keep the latest version as the default. The default instance type is m3.xlarge; as we are just testing, we will choose m1.large for cost savings. We choose three nodes: there is always one master, and in this case two core nodes. Here we can only add core and master nodes.

When we resize a cluster, we also get the ability to add task nodes, as they are processing-only nodes. Key pairs are needed if you want to access the instances; we select none, as for this test we do not access them. Then we hit Create cluster. This process can take several minutes while the cluster is being provisioned. In the meantime, we will look at other clusters which are already running in another region.

Here we can see that we have two terminated clusters and one running cluster. After clicking on it in the cluster list, we see the details of this cluster: the ID, which uniquely identifies it in AWS; the creation date; how long it has been running; information about the software status; and the monitoring from CloudWatch. The IsIdle metric is now zero, which means the cluster is running a job. We can see other metrics here, but we will not go into detail, as they are quite complex.

In the Steps tab, we can see what is running. For example, here I am running a custom JAR that processes CloudFront logs, so we have a running job. We will not go into detail about the jobs here, as they first require a deeper understanding of the MapReduce process. Let's check whether our cluster in Oregon is ready. Back in Oregon, we can see that it is. When a cluster is ready, it is always in the Waiting state, shown in green. There is no running job, just Hadoop debugging, which is a default step; no other tasks are included here. And that's it: in a couple of clicks, the cluster was ready.

Let's quickly go back to North Virginia, where we had some jobs executing. It is also ready now, which means the job has finished. The MapReduce task run by this JAR fetches logs from CloudFront and processes them to extract, for example, the top IP addresses.

Let's take a look at the output. It wrote its results to one of our buckets, CloudAcademy-test.com/emr. Now we go to S3 to see what was generated by this MapReduce job. Inside the EMR folder, temp-output-cf-reports, there is a file called client-ip-request-count. Let's open it. Here we can see a list of IPs with the count of accesses for each, so the MapReduce job was successful.
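The idea behind that client-ip-request-count output can be illustrated locally with a tiny map/reduce over fake log lines. Real CloudFront logs have many more fields; here each hypothetical line is reduced to just "ip path".

```python
# Sketch: counting requests per client IP, the essence of the
# MapReduce job in the demo. Log lines below are fabricated examples.

from collections import Counter

log_lines = [
    "203.0.113.5 /index.html",
    "198.51.100.7 /about.html",
    "203.0.113.5 /index.html",
]

# Map phase: emit the client IP for every request.
mapped = (line.split()[0] for line in log_lines)

# Reduce phase: sum the occurrences per IP.
ip_counts = Counter(mapped)

print(ip_counts.most_common(1))  # the top IP by request count
```

On a cluster, the map and reduce phases run in parallel across nodes over terabytes of logs; the logic is the same.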

This was just a simple example, and that was our goal: to give a basic overview of how to create a cluster and how to check the results of the submitted steps. I hope you have enjoyed it and learned a little, and see you in the next video. Bye.

About the Author

Fernando has solid experience with infrastructure and application management in heterogeneous environments, and has worked with Cloud-based solutions since the beginning of the Cloud revolution. Currently at Beck et al. Services, Fernando helps enterprises make a safe journey to the Cloud, architecting and migrating workloads from on-premises environments to public Cloud providers.