Learn how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies (Amazon EMR).
“Migration” was a word that came up over and over again at last week’s AWS re:Invent 2015, where Amazon announced a series of new features and services to make cloud migrations easier and more cost-effective.
One of the better-known companies currently using AWS is AOL. Durga Nemani, AOL Systems Architect, devoted his presentation to explaining how AOL reduced the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies. AOL moved to AWS in 2014, migrating from a large (and expensive) in-house Hadoop cluster to Amazon EMR (Elastic MapReduce) for processing, with Amazon S3 for storing raw and processed data.
The main problem AOL’s data scientists faced when running a single in-house cluster was the lack of scalability and flexibility. Because their workloads and dataset structures changed regularly, a single huge cluster was impossible to optimize. The “one size fits all” model simply did not work in this case.
AOL infrastructure powered by Amazon EMR
AOL now uses a hybrid approach: they process and store data using AWS services and then load their processed data into an in-house AOL database that is accessed by the AOL Reporting tool.
AOL uses Amazon S3 for storing raw and processed data, and Amazon EMR (Elastic MapReduce) for running analytics tasks on top of a Hadoop cluster. Thanks to Amazon Web Services, AOL was able to abandon the single big cluster model in favor of several dozen EMR clusters of multiple sizes – each used when workload conditions justified it.
The ability to create EMR clusters on demand allowed AOL to separate compute from storage. Analyzed data could be retrieved using an Amazon S3 client, instead of querying the Hadoop cluster and paying for a cluster running 24/7. The AOL team did a great job designing an EMR cluster orchestrator capable of creating a variable number of transient EMR clusters for processing the data collected during the day. Adopting the “Divide et impera” approach (Latin for “Divide and conquer”), the AOL orchestrator launches chains of EMR clusters, each one responsible for a specific kind of job (Processing, Extracting, Loading, and Monitoring).
AOL also launches EMR clusters in parallel, so that the smallest possible data chunks are processed concurrently and dependencies between jobs are reduced.
A typical AOL workflow consists of launching several Apache Hive and/or Apache Pig-equipped EMR clusters that read data from one S3 bucket and write to another. Up to 22 datasets are generated and 150 EMR clusters are launched during an “EMR pipeline”. All EMR clusters are monitored by the AOL orchestrator, which also (re)launches EMR clusters in case of error.
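The orchestration idea described above (independent chunks processed in parallel, with a relaunch on error) can be sketched in a few lines of Python. This is only an illustration, not AOL's orchestrator: the `launch` callable, the chunk names, and the retry limit are all assumptions. In a real deployment `launch` would start a transient EMR cluster and wait for it to finish.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_with_relaunch(chunk: str, launch: Callable[[str], bool], max_attempts: int = 3) -> int:
    """Launch a cluster for one chunk, relaunching it when the attempt fails.

    Returns the number of attempts used; raises if every attempt fails.
    """
    for attempt in range(1, max_attempts + 1):
        if launch(chunk):  # True means the cluster finished its job successfully
            return attempt
    raise RuntimeError(f"chunk {chunk!r} failed after {max_attempts} attempts")


def run_pipeline(chunks: List[str], launch: Callable[[str], bool], max_parallel: int = 4) -> Dict[str, int]:
    """Process independent chunks in parallel, one (hypothetical) transient cluster per chunk."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        attempts = pool.map(lambda c: run_with_relaunch(c, launch), chunks)
        return dict(zip(chunks, attempts))


# Hypothetical stand-in for "start an EMR cluster and wait for it":
# the chunk "hour=01" fails once, then succeeds on relaunch.
already_failed = set()


def fake_launch(chunk: str) -> bool:
    if chunk == "hour=01" and chunk not in already_failed:
        already_failed.add(chunk)
        return False
    return True


print(run_pipeline(["hour=00", "hour=01", "hour=02"], fake_launch))
```

Keeping the chunks independent is what makes the parallel map safe: no chunk waits on another, so a failure only costs a relaunch of that one cluster.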
TCO analysis: how much does the EMR infrastructure cost?
AOL System Architects tried several infrastructure models and combinations to better understand the significance of service costs. In order to lower their infrastructure TCO, the AOL cluster orchestrator creates clusters that are able to complete assigned jobs in exactly 59 minutes. Why 59? Because any EC2 instance that’s part of an EMR cluster is billed in hourly increments, so terminating an EC2 instance soon after the 60-minute mark will incur two full hours of compute costs.
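To make the 59-minute reasoning concrete, here is a small worked calculation, assuming the hourly billing model described above (every started hour is charged in full). The node count and the hourly rate are made-up numbers for illustration only.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged under hourly billing: every started hour counts in full."""
    return max(1, math.ceil(runtime_minutes / 60))


def cluster_cost(runtime_minutes: float, nodes: int, hourly_rate: float) -> float:
    """Total compute cost of a cluster whose nodes all run for the same time."""
    return billed_hours(runtime_minutes) * nodes * hourly_rate


# Hypothetical cluster: 10 nodes at $0.25 per node-hour.
print(cluster_cost(59, nodes=10, hourly_rate=0.25))  # 2.5
print(cluster_cost(61, nodes=10, hourly_rate=0.25))  # 5.0
```

Two extra minutes of runtime double the bill in this model, which is why sizing a cluster to finish just inside the hour matters.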
AOL also uses Spot Instances for spinning up their EMR clusters, spreading them across multiple regions and Availability Zones, not only for high availability but also to benefit from the lowest available Spot prices (without competing against themselves).
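One possible reading of this strategy is sketched below: rank the zones by their current Spot price and spread clusters over the cheapest few, so that they do not all bid in the same zone. The zone names, prices, and the round-robin assignment are assumptions made for the example; the talk summary does not say how AOL actually picks zones.

```python
from typing import Dict, List


def spread_clusters(cluster_names: List[str], spot_prices: Dict[str, float],
                    zones_to_use: int = 2) -> Dict[str, str]:
    """Assign clusters round-robin to the cheapest zones.

    spot_prices maps a zone name to its current Spot price per hour.
    """
    if not spot_prices:
        raise ValueError("no Spot prices supplied")
    # Cheapest zones first, keeping only as many as we want to use.
    ranked = sorted(spot_prices, key=spot_prices.get)[:zones_to_use]
    return {name: ranked[i % len(ranked)] for i, name in enumerate(cluster_names)}


# Hypothetical prices in USD per instance-hour.
prices = {"us-east-1a": 0.09, "us-east-1b": 0.07, "us-west-2a": 0.08}
print(spread_clusters(["extract", "process", "load"], prices))
```

In practice the prices would come from the Spot price history of each region rather than a hard-coded dictionary.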
Amazon EMR suggestions and best practices
Monitoring and security are important. Therefore, don’t forget to:
- Disable SSH access for EMR nodes.
- Use logs for checking what caused job failures and use Application IDs to narrow down your searches.
- Use the “Infrastructure as Code” pattern: write configuration scripts for launching any EMR cluster and version them just like software source code.
- Enable SNS notifications for service failures.
- Use IAM Roles and Policies and enable Multi-Factor Authentication (MFA).
- Create multiple CLI profiles.
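As an illustration of the “Infrastructure as Code” suggestion, the sketch below keeps a transient cluster definition in a plain Python function that can be versioned and reviewed like any other source file. The dictionary follows the request shape of the EMR `RunJobFlow` API, so it could be passed to boto3 as `boto3.client("emr").run_job_flow(**config)`. The bucket names, script path, tag values, release label, and instance type are placeholders, and the default EMR roles are assumed to exist in the account.

```python
from typing import Any, Dict, List


def hive_step(name: str, script_uri: str, input_uri: str, output_uri: str) -> Dict[str, Any]:
    """Describe one Hive script step that reads from one S3 location and writes to another."""
    return {
        "Name": name,
        "ActionOnFailure": "TERMINATE_CLUSTER",
        "HadoopJarStep": {
            "Jar": "command-runner.jar",
            "Args": [
                "hive-script", "--run-hive-script", "--args",
                "-f", script_uri,
                "-d", f"INPUT={input_uri}",
                "-d", f"OUTPUT={output_uri}",
            ],
        },
    }


def transient_cluster_config(name: str, release_label: str, instance_type: str,
                             instance_count: int, steps: List[Dict[str, Any]],
                             tags: Dict[str, str], log_uri: str) -> Dict[str, Any]:
    """Build the request for a cluster that terminates once its steps finish."""
    return {
        "Name": name,
        "ReleaseLabel": release_label,
        "LogUri": log_uri,  # step and application logs land here for debugging
        "Applications": [{"Name": "Hive"}],
        "Instances": {
            "MasterInstanceType": instance_type,
            "SlaveInstanceType": instance_type,
            "InstanceCount": instance_count,
            # Transient cluster: shut down when there are no more steps.
            "KeepJobFlowAliveWhenNoSteps": False,
            # No Ec2KeyName is set, so no SSH key pair is attached to the nodes.
        },
        "Steps": steps,
        "JobFlowRole": "EMR_EC2_DefaultRole",  # assumed default instance profile
        "ServiceRole": "EMR_DefaultRole",      # assumed default service role
        "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
    }


# Placeholder values for illustration only.
config = transient_cluster_config(
    name="clickstream-processing",
    release_label="emr-6.15.0",
    instance_type="m5.xlarge",
    instance_count=5,
    steps=[hive_step("sessionize", "s3://example-scripts/sessionize.q",
                     "s3://example-raw/2015-10-12/", "s3://example-processed/2015-10-12/")],
    tags={"team": "data-platform", "pipeline": "clickstream"},
    log_uri="s3://example-logs/emr/",
)
print(config["Instances"])
```

Because the function has no side effects, the same definition can be unit-tested and diffed in code review before any cluster is launched.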
In order to better track your costs:
- Tag all AWS resources, so you’re able to understand the relevance of any expense item.
- Enable CloudTrail.
- Use EC2 spot instances.
- Create CloudWatch Billing Alarms.
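A billing alarm can be kept in code as well. The sketch below builds the arguments for CloudWatch's `PutMetricAlarm` call on the `EstimatedCharges` billing metric; the result could be passed to boto3 as `boto3.client("cloudwatch", region_name="us-east-1").put_metric_alarm(**alarm)`, since billing metrics are published in us-east-1. The alarm name, threshold, and SNS topic ARN are hypothetical, and billing alerts are assumed to be enabled on the account.

```python
from typing import Any, Dict


def billing_alarm(threshold_usd: float, sns_topic_arn: str,
                  name: str = "monthly-estimated-charges") -> Dict[str, Any]:
    """Build PutMetricAlarm arguments that notify an SNS topic when charges pass a threshold."""
    if threshold_usd <= 0:
        raise ValueError("threshold must be positive")
    return {
        "AlarmName": name,
        "Namespace": "AWS/Billing",
        "MetricName": "EstimatedCharges",
        "Dimensions": [{"Name": "Currency", "Value": "USD"}],
        "Statistic": "Maximum",
        "Period": 21600,  # six hours; the billing metric is only updated a few times a day
        "EvaluationPeriods": 1,
        "Threshold": float(threshold_usd),
        "ComparisonOperator": "GreaterThanThreshold",
        "AlarmActions": [sns_topic_arn],
    }


# Hypothetical topic ARN and threshold.
alarm = billing_alarm(500, "arn:aws:sns:us-east-1:123456789012:billing-alerts")
print(alarm["Threshold"])  # 500.0
```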
If you’d like to read more about Amazon EMR, I suggest taking a look at this article: Amazon EMR: five ways to improve the way you use Hadoop.