How does Netflix operate on AWS?
Netflix began moving to AWS after a major database corruption in 2008 halted its DVD shipments for three days. By 2015, their cloud migration was essentially complete and, thanks to AWS, the scale they have achieved has been outstanding.
Josh Evans, Director of Operations Engineering at Netflix, described Netflix’s microservices architecture as a living organism, with critical components, internal flows, and failures. The infrastructure is composed of hundreds of decoupled, independent microservices, with thousands of daily production changes across many thousands of AWS instances.
Josh identifies two main challenges to achieving operational excellence:
In order to offer the best user experience – and therefore win their customers’ “moments of truth” (i.e., get them to watch more video content) – Netflix has to move and change fast.
Their innovation strategy involves the massive use of A/B tests on every facet of the product. During the last year, they ran more than 1,400 experiments, with at least 25 experiments running in parallel on any given day. The goal is to increase user engagement, which explains why each user’s Netflix experience is effectively unique: both the customized recommendations they are shown and the combination of experiments they are part of differ from user to user.
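To make the idea of a "unique combination of experiments" concrete, here is a minimal sketch of deterministic bucketing, a common way to assign users to many parallel A/B tests. It is not Netflix’s implementation: the experiment names, the `assign_variant` and `experience_for` helpers, and the hashing scheme are all assumptions made for illustration.

```python
from __future__ import annotations

import hashlib


def assign_variant(user_id: str, experiment: str, variants: list[str]) -> str:
    """Deterministically map a user to one variant of one experiment."""
    # Hash the experiment name together with the user id, so the same user
    # can fall into different buckets in different experiments.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big") % len(variants)
    return variants[bucket]


def experience_for(user_id: str, experiments: dict[str, list[str]]) -> dict[str, str]:
    """Return the combination of variants a user sees across all experiments."""
    return {
        name: assign_variant(user_id, name, variants)
        for name, variants in experiments.items()
    }


# Hypothetical experiments, for illustration only.
EXPERIMENTS = {
    "artwork_style": ["control", "large_tiles"],
    "autoplay_preview": ["control", "muted", "with_sound"],
}

print(experience_for("user-42", EXPERIMENTS))
```

Because the assignment depends only on the user id and the experiment name, a user sees the same variant on every visit without any stored state, and with dozens of experiments running in parallel the number of possible combinations grows multiplicatively.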
Scale and complexity
Netflix currently handles hundreds of thousands of requests per second from about 60 countries. Their infrastructure runs multi-zone and multi-region, serving users from three different AWS regions. The only component running outside of AWS is Netflix’s own CDN, which currently accounts for about 37% of US Internet traffic.
Achieving operational excellence also involves a tough tradeoff between availability and rate of change (i.e., quality versus speed). Netflix is willing to trade some availability for faster change, and approaches the problem through continuous improvement of the management, design, and function of its operational environments. This kind of approach leads to greater quality, velocity, and competitive advantage.
The culture behind this choice can be summarized as “You build it, you run it”. It means 100% ownership, starting from designing, coding, building, testing, and deploying, all the way to operating, configuring, monitoring, and responding (while doing it all globally!). They built their own software tools to enable this approach, such as Spinnaker, Eureka, Hystrix, Atlas, and Vector (all available on GitHub).
These tools are based on software engineering standards and advanced technologies:
- Anomaly detection: to identify anomalous patterns on short windows of time series events.
- Outlier detection and remediation: via unsupervised machine learning and clustering techniques.
- Canary release process: new versions of the software are exposed to a small percentage of the traffic, with automatic canary analysis.
- Unsupervised monitoring and decision making: take humans out of the equation and provide automatic alerts.
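As an illustration of the canary step in the list above, the following sketch compares the error rate of a canary against the baseline and returns an automatic verdict. It is a deliberately simple threshold check under stated assumptions, not Netflix’s canary analysis: the `Metrics` class, the `canary_verdict` function, and the `max_ratio`, `min_requests`, and `floor` parameters are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Metrics:
    """Request and error counts observed over the same time window."""
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        return self.errors / self.requests if self.requests else 0.0


def canary_verdict(baseline: Metrics, canary: Metrics,
                   max_ratio: float = 1.5,
                   min_requests: int = 1000,
                   floor: float = 0.001) -> str:
    """Return "continue", "promote", or "rollback" for a canary release."""
    # Too little traffic on the canary: keep observing before deciding.
    if canary.requests < min_requests:
        return "continue"
    # The canary may be at most max_ratio times worse than the baseline.
    # The floor avoids rejecting every canary when the baseline has ~0 errors.
    allowed = max(baseline.error_rate * max_ratio, floor)
    return "promote" if canary.error_rate <= allowed else "rollback"


baseline = Metrics(requests=100_000, errors=100)       # 0.1% errors
print(canary_verdict(baseline, Metrics(5_000, 5)))     # promote
print(canary_verdict(baseline, Metrics(5_000, 50)))    # rollback
print(canary_verdict(baseline, Metrics(200, 0)))       # continue
```

A production system would compare many metrics (latency, CPU, business counters) with proper statistical tests; the point here is only that the decision is made by code rather than by a human watching dashboards.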
Another important component of Netflix’s approach is chaos engineering. Knowing that components are going to fail, they work hard on building confidence in the system’s capability to withstand turbulent conditions (directly in production). You can find their SimianArmy on GitHub. By using FIT (Failure Injection Testing) they can simulate service failures at both the instance and region level.
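The idea behind failure injection can be sketched in a few lines: wrap a dependency so that a configurable fraction of calls fail, then check that the caller degrades gracefully. This is a toy, in-process illustration and not how FIT or the SimianArmy are implemented; `with_fault_injection`, `fetch_recommendations`, and the fallback value are hypothetical names chosen for the example.

```python
import random


class InjectedFailure(Exception):
    """Raised instead of calling the real dependency."""


def with_fault_injection(func, failure_rate: float, rng: random.Random):
    """Wrap func so that roughly failure_rate of the calls raise an error."""
    def wrapper(*args, **kwargs):
        # rng.random() is in [0.0, 1.0), so a rate of 1.0 always fails
        # and a rate of 0.0 never does.
        if rng.random() < failure_rate:
            raise InjectedFailure(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def fetch_recommendations(user_id: str) -> list:
    """Stand-in for a call to a personalization service."""
    return [f"personalized-title-for-{user_id}"]


def recommendations_with_fallback(fetch, user_id: str) -> list:
    """Serve a generic list when the personalization call fails."""
    try:
        return fetch(user_id)
    except InjectedFailure:
        return ["popular-title"]


rng = random.Random(0)  # seeded so the experiment is repeatable
always_failing = with_fault_injection(fetch_recommendations, 1.0, rng)
print(recommendations_with_fallback(always_failing, "user-42"))
# prints ['popular-title']
```

The experiment passes if users still get a usable response while the dependency is failing, which is the kind of confidence chaos engineering aims to build.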
Director of Engineering at Netflix, Peter Bakas – after proudly taking a picture of the crowd – explained how Netflix handles data streams of up to 8 million events per second.
Keystone, Netflix’s data pipeline, handles about 550 billion events every day (peaking at around 8 million events per second) and processes more than one petabyte of data, spanning hundreds of event types. It is based on open source projects such as Apache Kafka, Apache Chukwa, and Apache Samza, along with Docker and MySQL.
Netflix CORE Team
Dave Hahn talked about how it feels, and how it is possible, for a few DevOps engineers to handle more than 37% of US Internet traffic. His team, the CORE team (Cloud Operations Reliability Engineering), is responsible for crisis management, availability reporting, reliability best practices, the AWS relationship, and operations education. It is mainly composed of crisis leaders and its goals are the following:
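A quick back-of-the-envelope check helps relate the two throughput figures. Dividing the daily total by the number of seconds in a day gives the average rate; the snippet below compares it with the 8 million events per second figure, which is read here as a peak rate (an assumption, since the talk summary does not say so explicitly).

```python
EVENTS_PER_DAY = 550e9            # figure quoted for Keystone
SECONDS_PER_DAY = 24 * 60 * 60    # 86,400
QUOTED_RATE = 8e6                 # events per second, assumed to be a peak

average_rate = EVENTS_PER_DAY / SECONDS_PER_DAY
peak_to_average = QUOTED_RATE / average_rate

print(f"average: {average_rate / 1e6:.2f} million events/s")  # average: 6.37 million events/s
print(f"peak/average ratio: {peak_to_average:.2f}")           # peak/average ratio: 1.26
```

So 550 billion events per day corresponds to roughly 6.4 million events per second on average, and 8 million per second is about 26% above that average.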
- Protect customer experience. This is crucial at Netflix and is the key point of each operation.
- Make failures unique. This means making errors happen only once, by identifying the real root of each problem and fixing it.
- Achieve constant improvement. This takes a lot of individual effort and can be helped along by incident reviews and by encouraging honest and open feedback.
Dave described the DevOps culture they have built around the 100% ownership concept. It is made easier by the many tools developed to give software engineers easy ownership, covering service discovery, solid communication, automated recovery, continuous deployment, and data persistence.
Insights are a key factor as well: Netflix records about 2.5 billion metrics every day and needed in-house tools to help them visualize and analyze relevant patterns, via prediction and automation.