AOL use-case: how to exploit the power of Amazon EMR

Learn how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies (Amazon EMR)

“Migration” was a word that came up over and over again at last week’s AWS re:invent 2015, where Amazon announced a series of new features and services to make cloud migrations easier and more cost-effective.
One of the better-known companies currently using AWS is AOL. Durga Nemani, AOL Systems Architect, devoted his presentation to explaining how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies. AOL moved to AWS in 2014, migrating from a large (and expensive) in-house Hadoop cluster to an Amazon EMR (Elastic Map reduce) and Amazon S3 deployment for storing raw and processed data.
The main problem AOL’s data scientists had faced running a single in-house cluster, was the lack of scalability and flexibility. As their workload and dataset structures regularly changed, a single huge cluster was impossible to optimize. The “one size fits all” model simply did not work in this case.

AOL infrastructure powered by Amazon EMR

AOL now uses a hybrid approach: they process and store data using AWS services and then load their processed data into an in-house AOL database that is accessed by the AOL Reporting tool.
AOL uses Amazon S3 for storing raw and processed data, and Amazon EMR (Elastic Map Reduce) for running analytics tasks on top of a Hadoop cluster. Thanks to Amazon Web Services, AOL was able to abandon the single big cluster model in favor of several dozen EMR clusters of multiple sizes – each used when workload conditions justified it.
The ability to create EMR clusters on-demand allowed AOL to separate compute and storage jobs. Analyzed data could be retrieved using an AWS S3 client, instead of querying the Hadoop cluster and paying for a cluster running 24/7. The AOL team did a great job designing an EMR cluster orchestrator capable of creating a variable number of transient EMR clusters for processing the data collected during the day. Adopting the “Divide et impera” approach (Latin for “Divide and conquer”), the AOL orchestrator launches chains of EMR clusters, each one responsible for specific kind of jobs (Processing, Extracting, Loading, and Monitoring).
AOL also launches EMR clusters in parallel, to process the smallest data chunks possible in parallel and to reduce dependencies.
A typical AOL workflow consists of launching several Apache Hive and/or Apache PIG-equipped EMR clusters that read data from one S3 bucket and write to another. Up to 22 datasets are generated and 150 EMR clusters are launched during an “EMR pipeline”. All EMR clusters are checked by the AOL orchestrator that will also (re)launch new EMR clusters in case of error.

TCO analysis: how much does the EMR infrastructure cost?

AOL System Architects tried several infrastructure models and combinations to better understand the significance of service costs. In order to lower their infrastructure TCO, the AOL cluster orchestrator creates clusters that are able to complete assigned jobs in exactly 59 minutes. Why 59? Because any EC2 instance that’s part of an EMR cluster is billed in hourly increments, so terminating an EC2 instance soon after the 60 minute mark will incur two full hours of compute costs.
AOL also uses spot-instances for spinning up their EMR clusters, and they do it using multiple regions and Availability Zones; not only for High Availability, but also to benefit from the lowest available spot prices (without competing against themselves).

Amazon EMR suggestions and best practices

Monitoring and security are important. Therefore, don’t forget to:

  • Disable SSH access for EMR nodes.
  • Use logs for checking what caused job failures and use Application IDs to narrow down your searches.
  • Use the “Infrastructure as Code” pattern: Write configuration scripts for launching any EMR cluster and version it just like software source code.
  • Enable SNS notifications for service failures.
  • Use IAM Roles and Policies and enable Multi-Factor Authentication (MFA)
  • Create multiple CLI profiles.

In order to better track your costs:

  • Tag all AWS resources, so you’re able to understand the relevance of any expense item.
  • Enable CloudTrail.
  • Use EC2 spot instances.
  • Create CloudWatch Billing Alarms.

Also: read Amazon EMR: five ways to improve the way you use Hadoop.

Written by

Antonio Angelino

Antonio is an IT Manager and a software and infrastructure Engineer with more than 10 years of experience designing, implementing and deploying complex webapps using the best available technologies.

Related Posts

— February 11, 2019

WaitCondition Controls the Pace of AWS CloudFormation Templates

AWS's WaitCondition can be used with CloudFormation templates to ensure required resources are running.As you may already be aware, AWS CloudFormation is used for infrastructure automation by allowing you to write JSON templates to automatically install, configure, and bootstrap your ...

Read more
  • AWS
— January 24, 2019

The 9 AWS Certifications: Which is Right for You and Your Team?

As companies increasingly shift workloads to the public cloud, cloud computing has moved from a nice-to-have to a core competency in the enterprise. This shift requires a new set of skills to design, deploy, and manage applications in the cloud.As the market leader and most mature p...

Read more
  • AWS
  • AWS certifications
— November 28, 2018

Two New EC2 Instance Types Announced at AWS re:Invent 2018 – Monday Night Live

The announcements at re:Invent just keep on coming! Let’s look at what benefits these two new EC2 instance types offer and how these two new instances could be of benefit to you. If you're not too familiar with Amazon EC2, you might want to familiarize yourself by creating your first Am...

Read more
  • AWS
  • EC2
  • re:Invent 2018
— November 21, 2018

Google Cloud Certification: Preparation and Prerequisites

Google Cloud Platform (GCP) has evolved from being a niche player to a serious competitor to Amazon Web Services and Microsoft Azure. In 2018, research firm Gartner placed Google in the Leaders quadrant in its Magic Quadrant for Cloud Infrastructure as a Service for the first time. In t...

Read more
  • AWS
  • Azure
  • Google Cloud
Khash Nakhostin
— November 13, 2018

Understanding AWS VPC Egress Filtering Methods

In order to understand AWS VPC egress filtering methods, you first need to understand that security on AWS is governed by a shared responsibility model where both vendor and subscriber have various operational responsibilities. AWS assumes responsibility for the underlying infrastructur...

Read more
  • Aviatrix
  • AWS
  • VPC
— November 10, 2018

S3 FTP: Build a Reliable and Inexpensive FTP Server Using Amazon’s S3

Is it possible to create an S3 FTP file backup/transfer solution, minimizing associated file storage and capacity planning administration headache?FTP (File Transfer Protocol) is a fast and convenient way to transfer large files over the Internet. You might, at some point, have conf...

Read more
  • Amazon S3
  • AWS
— October 18, 2018

Microservices Architecture: Advantages and Drawbacks

Microservices are a way of breaking large software projects into loosely coupled modules, which communicate with each other through simple Application Programming Interfaces (APIs).Microservices have become increasingly popular over the past few years. The modular architectural style,...

Read more
  • AWS
  • Microservices
— October 2, 2018

What Are Best Practices for Tagging AWS Resources?

There are many use cases for tags, but what are the best practices for tagging AWS resources? In order for your organization to effectively manage resources (and your monthly AWS bill), you need to implement and adopt a thoughtful tagging strategy that makes sense for your business. The...

Read more
  • AWS
  • cost optimization
— September 26, 2018

How to Optimize Amazon S3 Performance

Amazon S3 is the most common storage options for many organizations, being object storage it is used for a wide variety of data types, from the smallest objects to huge datasets. All in all, Amazon S3 is a great service to store a wide scope of data types in a highly available and resil...

Read more
  • Amazon S3
  • AWS
— September 18, 2018

How to Optimize Cloud Costs with Spot Instances: New on Cloud Academy

One of the main promises of cloud computing is access to nearly endless capacity. However, it doesn’t come cheap. With the introduction of Spot Instances for Amazon Web Services’ Elastic Compute Cloud (AWS EC2) in 2009, spot instances have been a way for major cloud providers to sell sp...

Read more
  • AWS
  • Azure
  • Google Cloud
  • SpotInst
— August 23, 2018

What are the Benefits of Machine Learning in the Cloud?

A Comparison of Machine Learning Services on AWS, Azure, and Google CloudArtificial intelligence and machine learning are steadily making their way into enterprise applications in areas such as customer support, fraud detection, and business intelligence. There is every reason to beli...

Read more
  • AWS
  • Azure
  • Google Cloud
  • Machine Learning
— August 17, 2018

How to Use AWS CLI

The AWS Command Line Interface (CLI) is for managing your AWS services from a terminal session on your own client, allowing you to control and configure multiple AWS services.So you’ve been using AWS for awhile and finally feel comfortable clicking your way through all the services....

Read more
  • AWS