AWS re:Invent 2015 – Netflix and AWS

AWS-Netflix

How does Netflix operate on AWS?

Netflix has been on AWS since a devastating fire destroyed their own datacenter in 2010. By 2015, their Cloud migration was complete and, thanks to AWS, the scale they have achieved has been outstanding.

Josh Evans – Director of Operations Engineering at Netflix described the Netflix’s microservices architecture as a living organism, with critical components, internal flows, and failures. The infrastructure is composed of hundreds of completely decoupled and independent microservices involving thousands of daily production changes to many thousands of AWS instances.

Josh identifies two main challenges to achieving operational excellence:

Product innovation

In order to offer the best user experience – and therefore win their customers’ “moments of truth” (i.e., get them to watch more video content) – Netflix has to move and change fast.

Their innovation strategy involves the massive use of A/B tests on every facet of the product. During the last year, they ran more than 1,400 experiments (meaning at least 25 experiments running in parallel every day). Of course, the goal is to increase user engagement, and this explains why each user’s Netflix experience is sort of unique, both because of the customized recommendations they’re shown, and the unique combination of experiments.

Scale and complexity

Netflix currently handles hundreds of thousands of requests per second from about 60 countries. Their infrastructure runs multi-zone and multi-region, serving users from three different AWS regions. The only component running outside of AWS is their Netflix CDN, which currently covers about 37% of US Internet traffic.

Operations Engineering

Achieving operational excellence also involves a tough tradeoff between availability and rate of change (i.e. quality versus speed). Netflix is keen on trading some of their availability to enable fast change, and they approach the problem by means of continuous improvement of management, design, and function of operational environments. This kind of approach leads to greater quality, velocity, and competitive advantage.

The culture behind this choice can be summarized as “You build it, you run it”. It means 100% ownership, starting from designing, coding, building, testing and deploying…all the way to operating, configuring, monitoring, and responding (while doing it all globally!). They built their own software tools to enable this approach, like Spinnaker, Eureka, Hystrix, Atlas, and Vector (available on Github).

These tools are based on software engineering standards and advanced technologies:

  • Anomaly detection: to identify anomalous patterns on short windows of time series events.
  • Outlier detection and remediation: via unsupervised machine learning and clustering techniques.
  • Canary release process: new versions of the software are available to a small percentage of the traffic, with automatic canary analysis.
  • Unsupervised monitoring and decision making: take humans out of the equation and provide automatic alerts.
Netflix Fault Tolerance Traffic Map

Chaos Engineering

Another important component of Netflix’s approach is chaos engineering. Being aware that components are going to fail, they work hard on building confidence in the system’s capability to withstand turbulent conditions (directly in production). You can find their SimianArmy on Github. By using FIT (Fault-injection Testing) they can simulate service failures, both on an instance- and region-level.

Netflix Keystone

Director of Engineering at Netflix, Peter Bakas – after proudly taking a picture of the crowd – explained how Netflix handles data streams of up to 8 million events per second.

Keystone handles about 550 billion events every day (more than 8 million events per second) and manipulates more than one petabyte of data, composed of hundreds of event types. Their data pipeline solution is based on open source projects, such as Apache Kafka, Apache Chukwa, and Apache Samza, besides Docker and MySQL.Netflix Keystone Diagram

Netflix Core Team

Dave Hahn talked about how it feels and how it is possible for a few DevOps engineers to handle more than 37% of the US Internet. His team – the CORE team (Cloud Operations Reliability Engineering) – is responsible for crisis management, availability reporting, reliability best practices, AWS relationship, and operations education. It is mainly composed of crisis leaders and its goals are the following:

  • Protect customer experience. This is crucial at Netflix and is the key point of each operation.
  • Make failures unique. This means making errors happen only once, by identifying the real root of each problem and fixing it.
  • Achieve constant improvement. This takes a lot of individual effort and can be helped along by incident reviews and by encouraging honest and open feedback.

Dave described the DevOps culture they have built based on the 100% ownership concept and made easier by the many tools developed for software engineers to enable easy ownership, including service discovery, solid communication, automated recovery, continuous deployment, and data persistence.

Insights are a key factor as well: Netflix records about 2.5 billion metrics every day and needed in-house tools to help them visualize and analyze relevant patterns, via prediction and automation.

Avatar

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.


Related Posts

Valery Calderón Briz
Valery Calderón Briz
— October 22, 2019

How to Go Serverless Like a Pro

So, no servers? Yeah, I checked and there are definitely no servers. Well...the cloud service providers do need servers to host and run the code, but we don’t have to worry about it. Which operating system to use, how and when to run the instances, the scalability, and all the arch...

Read more
  • AWS
  • Lambda
  • Serverless
Avatar
Stuart Scott
— October 16, 2019

AWS Security: Bastion Host, NAT instances and VPC Peering

Effective security requires close control over your data and resources. Bastion hosts, NAT instances, and VPC peering can help you secure your AWS infrastructure. Welcome to part four of my AWS Security overview. In part three, we looked at network security at the subnet level. This ti...

Read more
  • AWS
Avatar
Sudhi Seshachala
— October 9, 2019

Top 13 Amazon Virtual Private Cloud (VPC) Best Practices

Amazon Virtual Private Cloud (VPC) brings a host of advantages to the table, including static private IP addresses, Elastic Network Interfaces, secure bastion host setup, DHCP options, Advanced Network Access Control, predictable internal IP ranges, VPN connectivity, movement of interna...

Read more
  • AWS
  • best practices
  • VPC
Avatar
Stuart Scott
— October 2, 2019

Big Changes to the AWS Certification Exams

With AWS re:Invent 2019 just around the corner, we can expect some early announcements to trickle through with upcoming features and services. However, AWS has just announced some big changes to their certification exams. So what’s changing and what’s new? There is a brand NEW ...

Read more
  • AWS
  • Certifications
Alisha Reyes
Alisha Reyes
— October 1, 2019

New on Cloud Academy: ITIL® 4, Microsoft 365 Tenant, Jenkins, TOGAF® 9.1, and more

At Cloud Academy, we're always striving to make improvements to our training platform. Based on your feedback, we released some new features to help make it easier for you to continue studying. These new features allow you to: Remove content from “Continue Studying” section Disc...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • ITIL® 4
  • Jenkins
  • Microsoft 365 Tenant
  • New content
  • Product Feature
  • Python programming
  • TOGAF® 9.1
Avatar
Stuart Scott
— September 27, 2019

AWS Security Groups: Instance Level Security

Instance security requires that you fully understand AWS security groups, along with patching responsibility, key pairs, and various tenancy options. As a precursor to this post, you should have a thorough understanding of the AWS Shared Responsibility Model before moving onto discussi...

Read more
  • AWS
  • instance security
  • Security
  • security groups
Avatar
Jeremy Cook
— September 17, 2019

Cloud Migration Risks & Benefits

If you’re like most businesses, you already have at least one workload running in the cloud. However, that doesn’t mean that cloud migration is right for everyone. While cloud environments are generally scalable, reliable, and highly available, those won’t be the only considerations dri...

Read more
  • AWS
  • Azure
  • Cloud Migration
Joe Nemer
Joe Nemer
— September 12, 2019

Real-Time Application Monitoring with Amazon Kinesis

Amazon Kinesis is a real-time data streaming service that makes it easy to collect, process, and analyze data so you can get quick insights and react as fast as possible to new information.  With Amazon Kinesis you can ingest real-time data such as application logs, website clickstre...

Read more
  • amazon kinesis
  • AWS
  • Stream Analytics
  • Streaming data
Joe Nemer
Joe Nemer
— September 6, 2019

Google Cloud Functions vs. AWS Lambda: The Fight for Serverless Cloud Domination

Serverless computing: What is it and why is it important? A quick background The general concept of serverless computing was introduced to the market by Amazon Web Services (AWS) around 2014 with the release of AWS Lambda. As we know, cloud computing has made it possible for users to ...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
Joe Nemer
Joe Nemer
— September 3, 2019

Google Vision vs. Amazon Rekognition: A Vendor-Neutral Comparison

Google Cloud Vision and Amazon Rekognition offer a broad spectrum of solutions, some of which are comparable in terms of functional details, quality, performance, and costs. This post is a fact-based comparative analysis on Google Vision vs. Amazon Rekognition and will focus on the tech...

Read more
  • Amazon Rekognition
  • AWS
  • Google Cloud Platform
  • Google Vision
Alisha Reyes
Alisha Reyes
— August 30, 2019

New on Cloud Academy: CISSP, AWS, Azure, & DevOps Labs, Python for Beginners, and more…

As Hurricane Dorian intensifies, it looks like Floridians across the entire state might have to hunker down for another big one. If you've gone through a hurricane, you know that preparing for one is no joke. You'll need a survival kit with plenty of water, flashlights, batteries, and n...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • New content
  • Product Feature
  • Python programming
Joe Nemer
Joe Nemer
— August 27, 2019

Amazon Route 53: Why You Should Consider DNS Migration

What Amazon Route 53 brings to the DNS table Amazon Route 53 is a highly available and scalable Domain Name System (DNS) service offered by AWS. It is named by the TCP or UDP port 53, which is where DNS server requests are addressed. Like any DNS service, Route 53 handles domain regist...

Read more
  • Amazon
  • AWS
  • Cloud Migration
  • DNS
  • Route 53