Learn how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies (Amazon EMR)
“Migration” was a word that came up over and over again at last week’s AWS re:Invent 2015, where Amazon announced a series of new features and services to make cloud migrations easier and more cost-effective.
One of the better-known companies currently using AWS is AOL. Durga Nemani, AOL Systems Architect, devoted his presentation to explaining how AOL was able to reduce the time and cost of processing massive amounts of clickstream data by leveraging AWS big data technologies. AOL moved to AWS in 2014, migrating from a large (and expensive) in-house Hadoop cluster to Amazon EMR (Elastic MapReduce) for processing, with Amazon S3 for storing raw and processed data.
The main problem AOL’s data scientists faced when running a single in-house cluster was the lack of scalability and flexibility. Because their workloads and dataset structures changed regularly, a single huge cluster was impossible to optimize. The “one size fits all” model simply did not work in this case.
AOL infrastructure powered by Amazon EMR
AOL now uses a hybrid approach: they process and store data using AWS services and then load their processed data into an in-house AOL database that is accessed by the AOL Reporting tool.
AOL uses Amazon S3 for storing raw and processed data, and Amazon EMR (Elastic MapReduce) for running analytics tasks on top of a Hadoop cluster. Thanks to Amazon Web Services, AOL was able to abandon the single big cluster model in favor of several dozen EMR clusters of multiple sizes, each used when workload conditions justified it.
The ability to create EMR clusters on demand allowed AOL to separate compute from storage. Analyzed data can be retrieved from S3 with an Amazon S3 client, instead of querying the Hadoop cluster and paying for a cluster running 24/7. The AOL team did a great job designing an EMR cluster orchestrator capable of creating a variable number of transient EMR clusters for processing the data collected during the day. Adopting the “Divide et impera” approach (Latin for “Divide and conquer”), the AOL orchestrator launches chains of EMR clusters, each one responsible for a specific kind of job (Processing, Extracting, Loading, and Monitoring).
AOL also launches EMR clusters in parallel, so that the smallest possible data chunks are processed concurrently and dependencies between jobs are reduced.
A typical AOL workflow consists of launching several EMR clusters equipped with Apache Hive and/or Apache Pig that read data from one S3 bucket and write to another. Up to 22 datasets are generated and 150 EMR clusters are launched during an “EMR pipeline”. All EMR clusters are monitored by the AOL orchestrator, which also (re)launches EMR clusters in case of error.
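The orchestration idea described above (one transient cluster per data chunk, chunks processed in parallel, failed clusters relaunched) can be sketched roughly as follows. This is not AOL's orchestrator, whose code was not published in the talk summary; `launch`, `run_with_relaunch`, `run_pipeline`, and the retry limit are hypothetical names and values chosen for illustration. The `launch` callable stands in for whatever actually starts an EMR cluster and waits for it to finish.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List


def run_with_relaunch(launch: Callable[[str], bool], chunk: str,
                      max_attempts: int = 3) -> int:
    """Run one transient cluster for `chunk`, relaunching it on failure.

    `launch` is a hypothetical callable that starts a cluster, waits for it
    to terminate, and returns True on success. Returns the attempts used.
    """
    for attempt in range(1, max_attempts + 1):
        if launch(chunk):
            return attempt
    raise RuntimeError(f"chunk {chunk!r} failed after {max_attempts} attempts")


def run_pipeline(launch: Callable[[str], bool], chunks: List[str],
                 max_parallel: int = 4) -> Dict[str, int]:
    """Process independent chunks in parallel, one transient cluster each."""
    with ThreadPoolExecutor(max_workers=max_parallel) as pool:
        attempts = pool.map(lambda c: run_with_relaunch(launch, c), chunks)
        # Consume the results inside the `with` block so errors surface here.
        return dict(zip(chunks, attempts))
```

Passing `launch` in as a parameter keeps the retry and parallelism logic separate from the AWS calls, which also makes it easy to exercise with a stub before pointing it at real clusters.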
TCO analysis: how much does the EMR infrastructure cost?
AOL System Architects tried several infrastructure models and combinations to better understand the significance of service costs. In order to lower their infrastructure TCO, the AOL cluster orchestrator creates clusters that are able to complete assigned jobs in exactly 59 minutes. Why 59? Because any EC2 instance that’s part of an EMR cluster is billed in hourly increments, so terminating an EC2 instance soon after the 60-minute mark will incur two full hours of compute costs.
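The arithmetic behind the 59-minute target can be shown with a small sketch. It assumes the hourly-increment billing model described above; the function names, the instance count, and the hourly rate are made up for illustration and are not AOL's figures.

```python
import math


def billed_hours(runtime_minutes: float) -> int:
    """Hours charged when every started hour is billed as a full hour."""
    return max(1, math.ceil(runtime_minutes / 60))


def cluster_cost(runtime_minutes: float, instance_count: int,
                 hourly_rate: float) -> float:
    """Total compute cost for a cluster whose instances all run equally long."""
    return billed_hours(runtime_minutes) * instance_count * hourly_rate


# Hypothetical cluster: 10 instances at 0.25 per instance-hour.
print(cluster_cost(59, 10, 0.25))  # 2.5 (one billed hour per instance)
print(cluster_cost(61, 10, 0.25))  # 5.0 (two billed hours per instance)
```

Under these assumptions, two extra minutes of runtime double the bill, which is why sizing a cluster to finish just under the hour matters.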
AOL also uses Spot Instances for its EMR clusters, spreading them across multiple regions and Availability Zones, not only for high availability but also to benefit from the lowest available Spot prices (without competing against themselves).
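One way to read "lowest prices without competing against themselves" is to rank zones by current Spot price and spread clusters across the few cheapest ones rather than piling them all into a single zone. The sketch below illustrates that reading only; the talk summary does not describe AOL's actual selection logic. `assign_zones`, `top_k`, and the prices are hypothetical. In practice, the price map might be built from the EC2 Spot price history API.

```python
from typing import Dict, List


def assign_zones(spot_prices: Dict[str, float], cluster_names: List[str],
                 top_k: int = 2) -> Dict[str, str]:
    """Spread clusters round-robin over the `top_k` cheapest zones."""
    if not spot_prices:
        raise ValueError("no spot prices available")
    # Zones sorted from cheapest to most expensive, truncated to top_k.
    cheapest = sorted(spot_prices, key=spot_prices.get)[:top_k]
    return {name: cheapest[i % len(cheapest)]
            for i, name in enumerate(cluster_names)}


# Hypothetical prices per Availability Zone.
prices = {"us-east-1a": 0.12, "us-east-1b": 0.08, "us-west-2a": 0.10}
print(assign_zones(prices, ["hive-1", "hive-2", "pig-1"]))
```

With the sample prices, clusters alternate between `us-east-1b` and `us-west-2a`, the two cheapest zones, so no single zone absorbs all of the demand.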
Amazon EMR suggestions and best practices
Monitoring and security are important. Therefore, don’t forget to:
- Disable SSH access for EMR nodes.
- Use logs for checking what caused job failures and use Application IDs to narrow down your searches.
- Use the “Infrastructure as Code” pattern: write configuration scripts for launching any EMR cluster and version them just like software source code.
- Enable SNS notifications for service failures.
- Use IAM Roles and Policies and enable Multi-Factor Authentication (MFA).
- Create multiple CLI profiles.
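As a rough illustration of the “Infrastructure as Code” suggestion in the list above, a cluster definition can be generated as plain data and committed to version control. The key names below are modeled on the EMR `RunJobFlow` request and should be checked against the current API reference before use. The dictionary is deliberately incomplete (a real request also needs items such as a release label and IAM roles), and the function name, bucket, instance type, and tags are hypothetical.

```python
import json
from typing import Dict


def build_cluster_config(name: str, instance_type: str, core_count: int,
                         log_uri: str, tags: Dict[str, str]) -> dict:
    """Return a transient-cluster definition as reviewable, versionable data."""
    return {
        "Name": name,
        "LogUri": log_uri,  # where job logs land, for debugging failures
        "Instances": {
            "InstanceGroups": [
                {"InstanceRole": "MASTER", "InstanceType": instance_type,
                 "InstanceCount": 1},
                {"InstanceRole": "CORE", "InstanceType": instance_type,
                 "InstanceCount": core_count},
            ],
            # Transient cluster: shut down once its steps have finished.
            "KeepJobFlowAliveWhenNoSteps": False,
        },
        # Sorted so the serialized file is stable across runs.
        "Tags": [{"Key": k, "Value": v} for k, v in sorted(tags.items())],
    }


config = build_cluster_config(
    "hive-processing", "m3.xlarge", 4, "s3://example-bucket/emr-logs/",
    {"team": "analytics", "pipeline": "clickstream"},
)
print(json.dumps(config, indent=2, sort_keys=True))
```

Serializing with `sort_keys=True` gives a deterministic file, so changes to a cluster definition show up as clean diffs in code review.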
In order to better track your costs:
- Tag all AWS resources, so you’re able to understand the relevance of any expense item.
- Enable CloudTrail.
- Use EC2 spot instances.
- Create CloudWatch Billing Alarms.
If you’re interested in reading more about Amazon EMR, I suggest taking a look at this article: “Amazon EMR: five ways to improve the way you use Hadoop”.