How Google, HP, and Etsy Succeed with DevOps

DevOps is now a mature practice, and there are many examples of companies adopting it to improve their existing processes and explore new frontiers. In this article, we’ll look at case studies from Google, HP, and Etsy. Each of these companies has succeeded by applying the Three Ways of DevOps.

To learn DevOps and build your skills with practical, hands-on labs, check out the Cloud Academy DevOps Training Library. It offers DevOps Certification Learning Paths, from fundamentals to advanced topics, to help you get certified in DevOps across cloud platforms.

Improving velocity and quality with trunk-based development at HP

The DevOps Handbook recounts Gary Gruver’s time as director of engineering for HP’s LaserJet firmware division. Here’s how Gruver describes his situation before applying DevOps practices:

Marketing would come to us with a million ideas to dazzle our customer, and we’d just tell them, ‘Out of your list, pick the two things you’d like to get in the next six to twelve months.’

Gruver’s objective was to improve velocity and quality. He estimated that developers spent only 5% of their time actually developing new features. The other 95% went into planning, porting code to dedicated branches, integrating code with other developers, and manual testing.

He planned to adopt trunk-based development (also known as continuous integration) backed by automated testing. Trunk-based development would allow his team to ship all of their firmware from a single code branch, which would also reduce the toil caused by integrating multiple code branches. Automated testing would drastically reduce the six-week manual test cycle and raise quality.
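To make the idea concrete, here is a minimal sketch of an automated regression test that gates every commit to trunk. It’s a toy illustration in Python, not HP’s actual firmware test harness; the render_page function and the resolutions it supports are hypothetical stand-ins.

    # Toy regression test run by the deployment pipeline on every commit to trunk.
    # If it fails, the change never reaches the shared branch, so trunk stays releasable.
    import unittest

    def render_page(dpi, paper):
        """Hypothetical stand-in for a firmware routine."""
        if dpi not in (300, 600, 1200):
            raise ValueError("unsupported resolution: %d" % dpi)
        return {"dpi": dpi, "paper": paper, "status": "ok"}

    class RenderPageRegressionTest(unittest.TestCase):
        def test_supported_resolutions(self):
            for dpi in (300, 600, 1200):
                self.assertEqual(render_page(dpi, "A4")["status"], "ok")

        def test_unsupported_resolution_is_rejected(self):
            with self.assertRaises(ValueError):
                render_page(150, "A4")

    if __name__ == "__main__":
        unittest.main()

A trunk-based workflow runs suites of such tests automatically on every change, so integration problems surface immediately instead of accumulating on long-lived branches.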

Four years later, the firmware team had adopted trunk-based development and automated testing. Testing printer firmware is no small task, so the team invested time in writing their own simulator. Eventually, their system scaled up to 2,000 simulators running on six server racks to support their deployment pipeline. These changes led to the following positive business outcomes:

  1. Cut regression test times from six weeks to one day
  2. Reduced overall development cost by 40%
  3. Went from 20 gated commits per day to over 100 commits per day, per developer

Gruver’s experience at HP demonstrates trunk-based development’s transformative power when backed by automated testing. There was also an unexpected benefit: he found that his engineers were happier after these changes, and after moving on to other companies they would pine for the workflow they had at HP. The research in Accelerate confirms his anecdotal evidence: adopting continuous delivery creates happier, more satisfied employees.

InfoSec visibility at Etsy

Etsy appears often in the DevOps literature thanks to their frequent blogging and conference talks in the movement’s early days. They were known for their focus on increasing deploy frequency and for espousing the wonderfully named “church of graphs”. They even graphed and monitored the amount of coffee in their office coffee pot.

You may question the utility of such an exercise, so consider for a moment what you could learn from the coffee pot capacity graph. You could learn how often the pot is empty or how often it’s used. You could use this information to decide to purchase another coffee pot (horizontal scale out) or a larger coffee pot (vertical scale up). You might even decide to throttle the most frequent coffee drinkers to save coffee for others. Now, you may be thinking that you don’t need a graph to make that decision. That’s true; a guess is perfectly acceptable in this scenario. However, the graph provides the data so that you don’t have to guess; instead, you can decide. That is really what the “church of graphs” is all about.
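As a toy illustration of that mindset, here is a hedged sketch of how a team might report the coffee pot’s fill level as a gauge metric so it can be graphed. The sensor read and the metric name are hypothetical, and the output simply mimics a StatsD-style gauge line rather than calling Etsy’s actual tooling.

    import random
    import time

    def read_coffee_level_percent():
        """Hypothetical sensor read; a real setup might query a scale or flow meter."""
        return random.uniform(0, 100)

    def send_gauge(name, value):
        """Stand-in for a metrics client; prints a StatsD-style gauge line."""
        print("%s:%.1f|g" % (name, value))

    if __name__ == "__main__":
        for _ in range(3):  # emit a few sample data points
            send_gauge("office.coffee_pot.level_percent", read_coffee_level_percent())
            time.sleep(1)

Once numbers like these are flowing into a graphing system, the scale-out versus scale-up question becomes a matter of reading a chart rather than guessing.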

Teams tend to start with standard infrastructure telemetry: data on latency, successful results, and errors. That’s useful in the beginning, but businesses require higher-level information as they grow. Nick Galbreath, then director of engineering at Etsy, set out to provide it.

His goal was to add InfoSec telemetry to production environments. He instructed developers to instrument unexpected errors, such as SQL syntax errors (which may indicate a SQL injection attack), and routine product events like password reset requests and successful or failed login attempts.
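A hedged sketch of what that kind of instrumentation might look like is shown below. The counter helper and the metric names are illustrative assumptions, not Etsy’s actual code; in production the counters would be sent to a metrics service rather than held in memory.

    from collections import Counter

    security_metrics = Counter()  # stand-in for a real metrics backend

    def record(metric):
        """Increment a named counter that feeds the team's security graphs."""
        security_metrics[metric] += 1

    def handle_login(password_ok):
        record("login.success" if password_ok else "login.failure")

    def handle_password_reset_request():
        record("password_reset.requested")

    def handle_db_error(error_message):
        # A SQL syntax error triggered by production traffic may indicate a
        # SQL injection attempt rather than a programming bug.
        if "syntax error" in error_message.lower():
            record("sql.syntax_error")

    # Example: after a burst of suspicious traffic, the counters become the graph's data.
    handle_login(False)
    handle_login(False)
    handle_db_error('ERROR: syntax error at or near "\'"')
    print(dict(security_metrics))

Graphed over time, counters like these are what made the attack volume visible to the whole team.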

The end result was shocking: the team realized they were under attack far more often than expected. The telemetry quantified the problem and provided a way to assess progress on eliminating potential issues. Here’s Galbreath:

One of the results of showing this graph was that developers realized that they were being attacked all the time! And that was awesome, because it changed how developers thought about the security of their code as they were writing the code.

Note the surprise and happiness in his quote. You can imagine a giddy engineering director smiling at the thought of this newfound perspective. The telemetry taught the team something new and ultimately allowed them to mitigate risks they had not previously been aware of. That’s awesome and cause for celebration!

Continuous learning at Google with SRE

Google pioneered SRE (Site Reliability Engineering). SRE began as a way to manage Google’s production operations requirements. Google’s organizational and technical scale makes them a special case, but we can still learn from them. Google is continually launching new products and services. Naturally, each has different operational requirements given their intended user base and business importance. Google’s SRE team supports these different environments through a launch readiness review (LRR) checklist and hand-off readiness review (HRR) checklist.

New services are run by the team that built them and only receive production traffic after passing the LRR. That team must then run the service in production for six months before it can consider handing off production operations to the SRE team, which requires passing the HRR checklist. This process reinforces institutional learning and exposes all engineers to production operations.

The SRE team maintains the LRR and HRR checklists, updating them after each launch to account for successes and failures. The checklists act as institutional memory that passes best practices on to a continually growing engineering team. The six-month self-run period gives each team invaluable hands-on production experience, which helps them improve the production environment and the operational traits of the software they ultimately ship. More importantly, it prevents teams from simply throwing services over the wall to SRE. These practices build trust between engineers while ensuring the SRE team is only responsible for vetted services. This is a critical step in the workflow because Google’s SRE team is responsible for Google’s most important services.

Google’s SRE approach demonstrates that it’s possible to maintain loosely coupled development and operations teams. The approach only works because product teams run their services in production themselves and because SREs are software engineers. Ben Treynor Sloss, SRE lead at Google, says SRE is what happens when you apply software engineering to operations. This organizational structure creates continuous learning throughout the organization without sacrificing quality or velocity.

Applying the Three Ways of DevOps

Each of these case studies relates to one of the Three Ways of DevOps. Let’s look at how each one fits into the wider DevOps philosophy, the use cases for the technical practices, and the business motivations behind them.

Gary Gruver’s work at HP demonstrates the use case for the two practices that underpin continuous delivery: trunk-based development and automated testing. These two practices target two important metrics: lead time and deployment frequency. The business case was clear. Developing the mandatory printer firmware was too slow and costly, which inhibited HP’s broader competitive strategy. If the team could ship faster and improve quality, then HP could iterate on their product line faster, providing more business value to their customers. This is the Principle of Flow in a nutshell.

Nick Galbreath’s work demonstrates the second principle: the Principle of Feedback. It calls for applying telemetry data to all SDLC phases, production being the most important. Production telemetry enables teams to make informed decisions about their production environment and also feeds back into subsequent development work. This relates to the mean-time-to-resolve (MTTR) metric. Teams should strive to reduce MTTR because a lower value means a more reliable production environment. However, reducing MTTR is only possible with adequate telemetry and automated alerts that detect issues before they become outages. Teams improve their telemetry over time as they learn to identify increasingly faint regression signals.
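As a minimal sketch of the “automated alerts” part, here is an illustrative threshold check on an error-rate metric. The function names and the 5% threshold are assumptions; a real system would evaluate this continuously against live telemetry and page an on-call engineer.

    def error_rate(errors, requests):
        """Fraction of requests that failed in the current window."""
        return errors / requests if requests else 0.0

    def should_alert(errors, requests, threshold=0.05):
        """Fire an alert when the error rate crosses the threshold."""
        return error_rate(errors, requests) > threshold

    assert should_alert(errors=120, requests=1000)      # 12% error rate: page someone
    assert not should_alert(errors=3, requests=1000)    # 0.3% error rate: all quiet

Catching a regression at this stage, before customers notice, is what keeps MTTR low.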

This leads directly to the third principle: continuous learning and experimentation. No business is static; businesses must continually adapt and learn to stay ahead of the competition. Google’s SRE approach demonstrates the ability to experiment with new organizational structures that support business goals. Google’s production operations are certainly a few standard deviations away from the mean, so they require engineers specifically skilled in that environment. At the same time, Google cannot divorce all engineers from production, because developers would lose touch with how their software behaves once deployed. Allowing, and more importantly trusting, the SRE team to handle operations has achieved higher levels of reliability and allowed best practices to spread throughout the organization via the checklists.

Your path to DevOps success

You can achieve results like those at Google, HP, and Etsy for your team or business with the right education and training. You can come at this from any of the three principles. I recommend starting by learning how to measure DevOps success. Then you’re ready to apply the three principles. Each principle leads into the next, so identify a high-impact area and go from there.

Cloud Academy’s deep catalog of DevOps learning paths covers everything you need to know to achieve the Principle of Flow. The AWS Monitoring and Auditing Learning Path trains you to achieve the Principle of Feedback. The DevOps Culture Learning Path prepares you to build the required culture of continuous learning and experimentation.
