AWS re:Invent 2015 – Netflix and AWS

AWS-Netflix

How does Netflix operate on AWS?

Netflix has been on AWS since a devastating fire destroyed their own datacenter in 2010. By 2015, their Cloud migration was complete and, thanks to AWS, the scale they have achieved has been outstanding.

Josh Evans – Director of Operations Engineering at Netflix described the Netflix’s microservices architecture as a living organism, with critical components, internal flows, and failures. The infrastructure is composed of hundreds of completely decoupled and independent microservices involving thousands of daily production changes to many thousands of AWS instances.

Josh identifies two main challenges to achieving operational excellence:

Product innovation

In order to offer the best user experience – and therefore win their customers’ “moments of truth” (i.e., get them to watch more video content) – Netflix has to move and change fast.

Their innovation strategy involves the massive use of A/B tests on every facet of the product. During the last year, they ran more than 1,400 experiments (meaning at least 25 experiments running in parallel every day). Of course, the goal is to increase user engagement, and this explains why each user’s Netflix experience is sort of unique, both because of the customized recommendations they’re shown, and the unique combination of experiments.

Scale and complexity

Netflix currently handles hundreds of thousands of requests per second from about 60 countries. Their infrastructure runs multi-zone and multi-region, serving users from three different AWS regions. The only component running outside of AWS is their Netflix CDN, which currently covers about 37% of US Internet traffic.

Operations Engineering

Achieving operational excellence also involves a tough tradeoff between availability and rate of change (i.e. quality versus speed). Netflix is keen on trading some of their availability to enable fast change, and they approach the problem by means of continuous improvement of management, design, and function of operational environments. This kind of approach leads to greater quality, velocity, and competitive advantage.

The culture behind this choice can be summarized as “You build it, you run it”. It means 100% ownership, starting from designing, coding, building, testing and deploying…all the way to operating, configuring, monitoring, and responding (while doing it all globally!). They built their own software tools to enable this approach, like Spinnaker, Eureka, Hystrix, Atlas, and Vector (available on Github).

These tools are based on software engineering standards and advanced technologies:

  • Anomaly detection: to identify anomalous patterns on short windows of time series events.
  • Outlier detection and remediation: via unsupervised machine learning and clustering techniques.
  • Canary release process: new versions of the software are available to a small percentage of the traffic, with automatic canary analysis.
  • Unsupervised monitoring and decision making: take humans out of the equation and provide automatic alerts.
Netflix Fault Tolerance Traffic Map

Chaos Engineering

Another important component of Netflix’s approach is chaos engineering. Being aware that components are going to fail, they work hard on building confidence in the system’s capability to withstand turbulent conditions (directly in production). You can find their SimianArmy on Github. By using FIT (Fault-injection Testing) they can simulate service failures, both on an instance- and region-level.

Netflix Keystone

Director of Engineering at Netflix, Peter Bakas – after proudly taking a picture of the crowd – explained how Netflix handles data streams of up to 8 million events per second.

Keystone handles about 550 billion events every day (more than 8 million events per second) and manipulates more than one petabyte of data, composed of hundreds of event types. Their data pipeline solution is based on open source projects, such as Apache Kafka, Apache Chukwa, and Apache Samza, besides Docker and MySQL.Netflix Keystone Diagram

Netflix Core Team

Dave Hahn talked about how it feels and how it is possible for a few DevOps engineers to handle more than 37% of the US Internet. His team – the CORE team (Cloud Operations Reliability Engineering) – is responsible for crisis management, availability reporting, reliability best practices, AWS relationship, and operations education. It is mainly composed of crisis leaders and its goals are the following:

  • Protect customer experience. This is crucial at Netflix and is the key point of each operation.
  • Make failures unique. This means making errors happen only once, by identifying the real root of each problem and fixing it.
  • Achieve constant improvement. This takes a lot of individual effort and can be helped along by incident reviews and by encouraging honest and open feedback.

Dave described the DevOps culture they have built based on the 100% ownership concept and made easier by the many tools developed for software engineers to enable easy ownership, including service discovery, solid communication, automated recovery, continuous deployment, and data persistence.

Insights are a key factor as well: Netflix records about 2.5 billion metrics every day and needed in-house tools to help them visualize and analyze relevant patterns, via prediction and automation.

Avatar

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.


Related Posts

Joe Nemer
Joe Nemer
— October 14, 2020

New Content: AWS Data Analytics – Specialty Certification, Azure AI-900 Certification, Plus New Learning Paths, Courses, Labs, and More

This month our Content Team released two big certification Learning Paths: the AWS Certified Data Analytics - Speciality, and the Azure AI Fundamentals AI-900. In total, we released four new Learning Paths, 16 courses, 24 assessments, and 11 labs.  New content on Cloud Academy At any ...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— September 15, 2020

New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More

This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— August 28, 2020

AWS Certification Practice Exam: What to Expect from Test Questions

If you’re building applications on the AWS cloud or looking to get started in cloud computing, certification is a way to build deep knowledge in key services unique to the AWS platform. AWS currently offers 12 certifications that cover major cloud roles including Solutions Architect, De...

Read more
  • AWS
  • AWS Certifications
Patrick Navarro
Patrick Navarro
— August 25, 2020

Overcoming Unprecedented Business Challenges with AWS

From auto-scaling applications with high availability to video conferencing that’s used by everyone, every day —  cloud technology has never been more popular or in-demand. But what does this mean for experienced cloud professionals and the challenges they face as they carve out a new p...

Read more
  • AWS
  • Cloud Adoption
  • digital transformation
Avatar
Andrew Larkin
— August 18, 2020

Constant Content: Cloud Academy’s Q3 2020 Roadmap

Hello —  Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...

Read more
  • alibaba
  • AWS
  • Azure
  • content roadmap
  • Content updates
  • DevOps
  • GCP
  • Google Cloud
  • New content
Alisha Reyes
Alisha Reyes
— August 5, 2020

New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More

This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Alisha Reyes
Alisha Reyes
— July 16, 2020

Blog Digest: Which Certifications Should I Get?, The 12 Microsoft Azure Certifications, 6 Ways to Prevent a Data Breach, and More

This month, we were excited to announce that Cloud Academy was recognized in the G2 Summer 2020 reports! These reports highlight the top-rated solutions in the industry, as chosen by the source that matters most: customers. We're grateful to have been nominated as a High Performer in se...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Avatar
Cloud Academy Team
— July 9, 2020

Which Certifications Should I Get?

The old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and companies. With all that in mind, the s...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— June 19, 2020

Kickstart Your Tech Training With a Free Week on Cloud Academy

Are you looking to make a jump in your technical career? Want to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Kubernetes, Python, or another in-demand skill? Then you'll want to mark your calendar. Starting Monday, June 22 at 12:00 a.m. PDT (3:00 a.m. EDT), ...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Rebecca Willis
Rebecca Willis
— June 3, 2020

Azure vs. AWS: Which Certification Provides the Brighter Future?

More and more companies are using cloud services, prompting more and more people to switch their current IT position to something cloud-related. The problem is most people only have that much time after work to learn new technologies, and there are plenty of cloud services that you can ...

Read more
  • AWS
  • Azure
  • certification