Learn how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies (Amazon EMR)
“Migration” was a word that came up over and over again at last week’s AWS re:invent 2015, where Amazon announced a series of new features and services to make cloud migrations easier and more cost-effective.
One of the better-known companies currently using AWS is AOL. Durga Nemani, AOL Systems Architect, devoted his presentation to explain how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies. AOL moved to AWS in 2014, migrating from a large (and expensive) in-house Hadoop cluster to an Amazon EMR (Elastic Map reduce) and Amazon S3 deployment for storing raw and processed data.
The main problem AOL’s data scientists had faced running a single in-house cluster, was the lack of scalability and flexibility. As their workload and dataset structures regularly changed, a single huge cluster was impossible to optimize. The “one size fits all” model simply did not work in this case.
AOL infrastructure powered by Amazon EMR
AOL now uses a hybrid approach: they process and store data using AWS services and then load their processed data into an in-house AOL database that is accessed by the AOL Reporting tool.
AOL uses Amazon S3 for storing raw and processed data, and Amazon EMR (Elastic Map Reduce) for running analytics tasks on top of a Hadoop cluster. Thanks to Amazon Web Services, AOL was able to abandon the single big cluster model in favor of several dozen EMR clusters of multiple sizes – each used when workload conditions justified it.
The ability to create EMR clusters on-demand allowed AOL to separate compute and storage jobs. Analyzed data could be retrieved using an AWS S3 client, instead of querying the Hadoop cluster and paying for a cluster running 24/7. The AOL team did a great job designing an EMR cluster orchestrator capable of creating a variable number of transient EMR clusters for processing the data collected during the day. Adopting the “Divide et impera” approach (Latin for “Divide and conquer”), the AOL orchestrator launches chains of EMR clusters, each one responsible for specific kind of jobs (Processing, Extracting, Loading, and Monitoring).
AOL also launches EMR clusters in parallel, to process the smallest data chunks possible in parallel and to reduce dependencies.
A typical AOL workflow consists of launching several Apache Hive and/or Apache PIG-equipped EMR clusters that read data from one S3 bucket and write to another. Up to 22 datasets are generated and 150 EMR clusters are launched during an “EMR pipeline”. All EMR clusters are checked by the AOL orchestrator that will also (re)launch new EMR clusters in case of error.
TCO analysis: how much does the EMR infrastructure cost?
AOL System Architects tried several infrastructure models and combinations to better understand the significance of service costs. In order to lower their infrastructure TCO, the AOL cluster orchestrator creates clusters that are able to complete assigned jobs in exactly 59 minutes. Why 59? Because any EC2 instance that’s part of an EMR cluster is billed in hourly increments, so terminating an EC2 instance soon after the 60-minute mark will incur two full hours of compute costs.
AOL also uses spot-instances for spinning up their EMR clusters, and they do it using multiple regions and Availability Zones; not only for High Availability but also to benefit from the lowest available spot prices (without competing against themselves).
Amazon EMR suggestions and best practices
Monitoring and security are important. Therefore, don’t forget to:
- Disable SSH access for EMR nodes.
- Use logs for checking what caused job failures and use Application IDs to narrow down your searches.
- Use the “Infrastructure as Code” pattern: Write configuration scripts for launching any EMR cluster and version it just like software source code.
- Enable SNS notifications for service failures.
- Use IAM Roles and Policies and enable Multi-Factor Authentication (MFA)
- Create multiple CLI profiles.
In order to better track your costs:
- Tag all AWS resources, so you’re able to understand the relevance of any expense item.
- Enable CloudTrail.
- Use EC2 spot instances.
- Create CloudWatch Billing Alarms.
If you’re interested to read on about Amazon EMR, I suggest taking a look at this article Amazon EMR: five ways to improve the way you use Hadoop.
New Content: AWS Terraform, Java Programming Lab Challenges, Azure DP-900 & DP-300 Certification Exam Prep, Plus Plenty More Amazon, Google, Microsoft, and Big Data Courses
This month our Content Team continues building the catalog of courses for everyone learning about AWS, GCP, and Microsoft Azure. In addition, this month’s updates include several Java programming lab challenges and a couple of courses on big data. In total, we released five new learning...
Where Should You Be Focusing Your AWS Security Efforts?
Another day, another re:Invent session! This time I listened to Stephen Schmidt’s session, “AWS Security: Where we've been, where we're going.” Amongst covering the highlights of AWS security during 2020, a number of newly added AWS features/services were discussed, including: AWS Audit...
AWS re:Invent: 2020 Keynote Top Highlights and More
We’ve gotten through the first five days of the special all-virtual 2020 edition of AWS re:Invent. It’s always a really exciting time for practitioners in the field to see what features and services AWS has cooked up for the year ahead. This year’s conference is a marathon and not a...
WARNING: Great Cloud Content Ahead
At Cloud Academy, content is at the heart of what we do. We work with the world’s leading cloud and operations teams to develop video courses and learning paths that accelerate teams and drive digital transformation. First and foremost, we listen to our customers’ needs and we stay ahea...
Excelling in AWS, Azure, and Beyond – How Danut Prisacaru Prepares for the Future
Meet Danut Prisacaru. Danut has been a Software Architect for the past 10 years and has been involved in Software Engineering for 30 years. He’s passionate about software and learning, and jokes that coding is basically the only thing he can do well (!). We think his enthusiasm shines t...
New Content: AWS Data Analytics – Specialty Certification, Azure AI-900 Certification, Plus New Learning Paths, Courses, Labs, and More
This month our Content Team released two big certification Learning Paths: the AWS Certified Data Analytics - Speciality, and the Azure AI Fundamentals AI-900. In total, we released four new Learning Paths, 16 courses, 24 assessments, and 11 labs. New content on Cloud Academy At any ...
New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More
This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...
AWS Certification Practice Exam: What to Expect from Test Questions
If you’re building applications on the AWS cloud or looking to get started in cloud computing, certification is a way to build deep knowledge in key services unique to the AWS platform. AWS currently offers 12 certifications that cover major cloud roles including Solutions Architect, De...
Overcoming Unprecedented Business Challenges with AWS
From auto-scaling applications with high availability to video conferencing that’s used by everyone, every day — cloud technology has never been more popular or in-demand. But what does this mean for experienced cloud professionals and the challenges they face as they carve out a new p...
Constant Content: Cloud Academy’s Q3 2020 Roadmap
Hello — Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...
New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More
This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...
Blog Digest: Which Certifications Should I Get?, The 12 Microsoft Azure Certifications, 6 Ways to Prevent a Data Breach, and More
This month, we were excited to announce that Cloud Academy was recognized in the G2 Summer 2020 reports! These reports highlight the top-rated solutions in the industry, as chosen by the source that matters most: customers. We're grateful to have been nominated as a High Performer in se...