SRE Reducing Toil

The course is part of this learning path

Site Reliability Engineering (SRE) Foundation

Difficulty: Intermediate
Duration: 15m
Students: 85
Rating: 5/5

Description

This course looks at what toil is, and why having less of it is a good thing. Toil, as quoted here by Google, is "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows." By the end of this course, you will have a clear understanding of what toil is, how to recognize it, and how to address and replace it with automation.

If you have any feedback relating to this course, please contact us at support@cloudacademy.com.

Learning Objectives

  • Understand what toil is and how to recognize it
  • Learn the negative impact that toil can have on organizations, teams, and individuals
  • Explore how toil can be reduced within the context of site reliability engineering

Intended Audience

  • Anyone interested in learning about SRE and its fundamentals
  • Software Engineers interested in learning about how to use and apply SRE within an operations environment
  • DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization

Prerequisites

To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.

Resources

Link to the YouTube video referenced in this lecture: Automate Yourself out of a Job

Transcript

Welcome back. In this course, I'm going to review what toil is, and why having less of it is a good thing. By the end of this course you will have a clear understanding of what toil is, how to recognize it, and how to address and replace it with automation. Right, let's begin.

Toil, as quoted here by Google, is "the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and that scales linearly as the service grows." Let's now pull out and highlight each of the key defining characteristics of toil.

Starting with manual: if you're doing a lot of manual, hands-on activity, then this is without doubt toil. Day-to-day examples of manual toil could be manual or semi-manual releases, connecting to infrastructure to check something, or constant password resets.

Now, when Google analyzed their own toil, they identified their top three types of toil as being: one, interrupts, that is, non-urgent service-related messages and/or emails; two, on-call or urgent responses; and three, releases and pushes, which were often performed manually or at best semi-manually. This is still toil and should be removed. Recognizing and removing this type of toil can be accomplished by automating deployments, approving deployments through tools such as Slack, and monitoring metrics automatically rather than connecting to infrastructure by hand, moving from an eyes-on-glass mode to an alerting mode. Creative experimentation to solve a problem or create a solution may be repetitive, but it is not toil.

Next up is repetitive. Repetitive work is toil: doing the same thing over and over again is toil and should be removed. For example, testing once manually is fine, but doing the same thing twice means it should at least be recorded. Doing manual tests over and over is toil, and should be automated where financially viable. If the first thing you do each morning is acknowledge an overnight alert, e.g. disk space over 80% or CPU at 100%, without doing anything about it, then this is toil. Interrupts are distractions, e.g. dealing with non-urgent service-related emails and/or messages.
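To make the overnight disk space alert example a little more concrete, here is a minimal sketch of a check that does something with the number rather than waiting for a person to acknowledge it. The function names and the message text are hypothetical, and the 80% threshold is simply the figure mentioned above. A real setup would hook this into whatever alerting or cleanup tooling you already use.

```python
import shutil


def disk_usage_percent(path: str = ".") -> float:
    """Return the percentage of disk space used on the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def needs_action(percent_used: float, threshold: float = 80.0) -> bool:
    """True when usage is strictly above the threshold (80% is an assumed default)."""
    return percent_used > threshold


def report(path: str = ".") -> str:
    """Build a message that says what should happen next, not just that an alert fired."""
    percent = disk_usage_percent(path)
    if needs_action(percent):
        return f"{path}: {percent:.1f}% used, above threshold, start cleanup or escalate"
    return f"{path}: {percent:.1f}% used, no action needed"


if __name__ == "__main__":
    print(report("."))
```

The point of the sketch is the shape of the change: the repetitive "look and acknowledge" step becomes a check that only involves a person when there is something to do.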

Number one on Google's top three toil list was interrupts: non-urgent service-related messages and emails. This is certainly a form of repetitive toil and, again, should be removed. Next up is automatable. If work can be automated, then it should be. You may not immediately recognize such work as being automatable, but take time to think laterally about your day-to-day jobs and the opportunities to automate such work.

For example, consider the following. One, organizations often schedule physical meetings that you need to attend in person to approve changes, for example change advisory board (CAB) meetings. Consider changing this to an online virtual meeting. This is not automation in the usual sense of writing a script, but it is a style of automation that reduces toil. Two, manually switching on equipment, e.g. monitoring screens or TVs, every morning because it is automatically powered down at the end of the day is also something that can be automated. And three, many organizations have manual processes around creating, changing, and deleting users of services, usually via helpdesk tickets and follow-up manual actions within LDAP directories such as Active Directory. More often than not, this is something that should be automated, again reducing overall toil.

Tactical toil is interesting and perhaps less obvious. Consider the following: organizations often have known workarounds for urgent issues, using a manual step if something in the system doesn't work. Accepting this as normal is accepting toil. Toil can impact end users too, not just backend teams. Being on call is par for the course when it comes to providing 24/7 support, and this comes with unavoidable on-call alerts. Quite often these result in one-off tactical fixes.

Recall that number two on Google's top three toil list was on-call or urgent responses.

Next is no enduring value. If you're performing a lot of work with no enduring value, or undifferentiated heavy lifting, then this is also to be considered toil. For example, if the service is delivering the same value after a task is completed, then that task was toil. A more concrete example of this might be responding to a user request to extract data. This will tend to please that particular user, but the value of the service as a whole has not increased, and the work is therefore toil. This is potentially automatable via self-service functionality.

Finally, work is toil when it scales linearly. For example, using cloud auto-scaling or internal infrastructure auto-scaling is scalable, whereas manually adding infrastructure as usage of the service grows is not. The service should be allowed to scale without manual intervention. When recognizing what toil is, it's also important to recognize what toil isn't. Let's do so now.
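As a rough illustration of letting the service scale without manual intervention, the sketch below computes how many instances a given load would need. It is a simple sizing rule under stated assumptions, not the algorithm of any particular cloud auto scaler. The function name, the load and capacity units, and the minimum and maximum bounds are all hypothetical.

```python
import math


def desired_instances(current_load: float,
                      capacity_per_instance: float,
                      minimum: int = 1,
                      maximum: int = 100) -> int:
    """Return the instance count needed for current_load, clamped to [minimum, maximum].

    Load and capacity must use the same unit, e.g. requests per second (assumed).
    """
    if capacity_per_instance <= 0:
        raise ValueError("capacity_per_instance must be positive")
    needed = math.ceil(current_load / capacity_per_instance)
    return max(minimum, min(maximum, needed))


if __name__ == "__main__":
    # 950 requests/s at 100 requests/s per instance rounds up to 10 instances.
    print(desired_instances(950, 100))
```

When a rule like this runs on a schedule or in response to metrics, growth in usage no longer translates into a matching growth in hands-on work.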

Toil should not be confused with regular work, or with making improvements or implementing feature requests. Toil must relate to a service under management. Creative experimentation to solve a problem or create a solution may be repetitive, but it is not toil. Different and unrelated tasks are not toil either: it has to be the same task for it to be considered toil.

Now that we understand what toil is and how to recognize it, let's move forward and elaborate on why toil is a bad thing from the perspective of organizations, teams, and individuals. High toil is bad, period. Depending on your perspective, toil has different impacts: as seen here, the impact of toil at the organizational level is different from its impact on the individual. Pause here and take some time to reflect on the different types of high-toil impact, and then consider your current role within your own organization, the impact of toil within it, and its impact on yourself.

The following quote, provided by Rundeck, gives insight into what happens when toil is left unaddressed: if you aren't careful, the level of toil in an organization can increase to the point where the organization won't have the capacity needed to stop it. This can be summed up as engineering bankruptcy, a technical debt metaphor. Engineering bankruptcy highlights the problem of unaddressed toil. Unaddressed toil can go critical, forming a chain reaction where the future is nothing but toil.

It ends up taking so much of everyone's time and effort that there is no capacity left to do anything about it. Future products and services cannot be developed or improved, stifling innovation and resulting ultimately in engineering bankruptcy. Let's now focus our energies on how we can go about reducing toil. Before we do so, consider the quote here given by Google: "SRE is what happens when you ask a software engineer to design an operations team." The intention expressed here is to show that software engineering and automation are key components in solving the problems posed by toil.

Solutions to toil require engineering effort. SRE, as we saw on the previous slide, is all about using engineering to solve operational problems. Some toil examples and how to address them include: one, manual releases, with toil reduced by creating external automation to perform automated releases. Two, manually scaling infrastructure, with toil reduced by using external automation such as cloud-based auto scalers to perform the required scaling. Three, manual password resets, with toil reduced by enhancing the service so that this is offered as a self-service feature. And four, extracting data, with toil reduced by using internal automation, such as a database query exposed through a feature or tool, so that users can again self-serve. Pause here and take some time to consider what work you perform, whether it be daily, weekly, or unscheduled work. In the next slide, you'll be given the opportunity to assess each identified work item as potential toil.

Based on the work items you identified in the previous slide, now perform the following three-part workflow to identify and reduce toil. Part one, identifying toil. Consider your "what we do all day" work items. Look for those that fit the model of toil, e.g. manual, repetitive, automatable, tactical, no enduring value, or linear scaling. If you find any, note the toil type, remembering that some might be more than one type. For example, something may be manual, repetitive, and automatable.
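If it helps to work through part one in a structured way, the identification step can be sketched as a small data model. This is only an illustration: the `WorkItem` class, its method names, and the example items are hypothetical, and the sketch assumes that matching any one of the six characteristics is enough to flag an item as a toil candidate worth a closer look.

```python
from dataclasses import dataclass, field

# The six characteristics of toil named in the definition.
TOIL_TRAITS = (
    "manual",
    "repetitive",
    "automatable",
    "tactical",
    "no enduring value",
    "scales linearly",
)


@dataclass
class WorkItem:
    """One entry from a 'what we do all day' list (hypothetical structure)."""
    name: str
    traits: set = field(default_factory=set)

    def toil_types(self) -> list:
        """Return the toil characteristics this item matches, in definition order."""
        return [trait for trait in TOIL_TRAITS if trait in self.traits]

    def is_toil_candidate(self) -> bool:
        """Assumption: any matching characteristic makes the item a candidate."""
        return bool(self.toil_types())


if __name__ == "__main__":
    items = [
        WorkItem("weekly release", {"manual", "repetitive", "automatable"}),
        WorkItem("design review"),
    ]
    for item in items:
        print(item.name, item.is_toil_candidate(), item.toil_types())
```

An item flagged here still needs judgment: as noted earlier, creative experimentation can be repetitive without being toil.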

Part two, addressing toil. Outline potential solutions to reduce your identified toil. Automation may be a large part of the answer, but new tools, new techniques, and different ways of working might also be part of the solution. Solutions will usually involve either external or internal automation. External automation may involve scripts and/or automation tools outside of the service. Internal automation, on the other hand, may be delivered as part of the service itself, or by enhancing the service so that it does not require any further intervention.

Part three, reducing toil. Prioritize identified toil items in terms of value and ease of delivery. Let's move on now and talk about how we need to make engineering time available. Addressing and reducing toil requires engineering time, so some time must be ring-fenced to deal with toil. The Google way is to set a 50% limit on the time engineers spend on toil. This means at least 50% of each SRE's time should be spent on engineering project work that will either reduce future toil or add service features. The 50% rule ensures that no one team or person becomes the ops team or person. This ensures that the required site reliability knowledge and skills are evenly dispersed across the entire SRE team. The 50% rule is an average, to reflect real-world scenarios. Your own requirements and limits may differ slightly.
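The two ideas in this part, prioritizing toil items and keeping toil under a time limit, can be sketched in a few lines. Treat this as an illustration under assumptions: the 1 to 5 scoring scale, the choice to rank by value multiplied by ease, and all of the names are hypothetical. Only the 50% default comes from the rule described above.

```python
def toil_share(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def within_toil_limit(toil_hours: float, total_hours: float, limit: float = 0.5) -> bool:
    """True when the toil share is at or below the limit (50% by default)."""
    return toil_share(toil_hours, total_hours) <= limit


def prioritize(items: list) -> list:
    """Rank (name, value, ease) tuples, highest value * ease first.

    value and ease are assumed to be scores from 1 (low) to 5 (high).
    """
    return sorted(items, key=lambda item: item[1] * item[2], reverse=True)


if __name__ == "__main__":
    print(within_toil_limit(18, 40))  # 18 of 40 hours is 45% toil
    backlog = [
        ("password resets", 2, 2),
        ("manual releases", 5, 3),
        ("data extracts", 4, 2),
    ]
    for name, value, ease in prioritize(backlog):
        print(name, value * ease)
```

Because the 50% rule is described as an average, a check like this makes most sense over a longer window, such as a quarter, rather than a single week.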

In the case study presented here, Slack provides insights into their path towards implementing SRE. In particular, they scaled up from 100 AWS EC2 instances to 15,000 within a short four-year period. This caused excessive toil due to low-quality, noisy alerting. Unfortunately, this resulted in their ops team being so consumed by interrupt-driven toil that they were unable to make progress on improving reliability.

Now, to address this, Slack explicitly committed to the importance of site reliability over feature velocity, with operational ownership of services then being pushed back to the development teams. This ultimately resulted in the development teams making the necessary code fixes to reduce and prevent excessive and unnecessary future incident alerts. The conclusion here is that the toil was identified, addressed and reduced, a good result.

A common proposition often raised when addressing toil is the idea of automating everything, but an equally important question should be raised, which is, is this worth doing? To answer this, consider the following. There are costs and benefits associated with the engineering effort when automating away toil. You need to consider the engineering investment versus the payback when the toil has been removed.

Consider, for example, a service with a five-year life. If you do something every week and it takes you exactly one minute to do it, for example acknowledging a Monday morning disk space alert, that adds up to about 260 minutes over the life of the service, so you have a budget of roughly four hours to automate that task away. If it takes longer than that to automate it, then it's probably not worth the effort.
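The arithmetic behind that budget is easy to check. The sketch below works it through, assuming 52 runs per year for a weekly task. The function names are hypothetical, and the comparison ignores things like the cost of maintaining the automation, which a fuller analysis would include.

```python
def automation_budget_minutes(service_life_years: int,
                              runs_per_year: int,
                              minutes_per_run: float) -> float:
    """Total manual time the task would consume over the life of the service."""
    return service_life_years * runs_per_year * minutes_per_run


def worth_automating(automation_effort_minutes: float, budget_minutes: float) -> bool:
    """True when building the automation costs no more than the time it saves."""
    return automation_effort_minutes <= budget_minutes


if __name__ == "__main__":
    # Weekly one-minute task over five years: 5 * 52 * 1 = 260 minutes.
    budget = automation_budget_minutes(5, 52, 1)
    print(budget, "minutes, about", round(budget / 60, 1), "hours")
    print(worth_automating(3 * 60, budget))  # a 3-hour automation effort
    print(worth_automating(5 * 60, budget))  # a 5-hour automation effort
```

The same calculation scales quickly: a task that takes ten minutes, or that runs daily instead of weekly, justifies a much larger engineering investment.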

Another interesting perspective on this question is provided in the following YouTube-hosted video. In particular, consider reviewing the section titled Automate Yourself out of a Job. Now that you are familiar with toil, consider what you have learnt about it, and then take some time to reflect on the following question: what are the benefits of toil reduction for individuals and teams within your own organization?

All right, we are almost finished reviewing toil. But before I conclude this course, consider the following case study presented by Accenture. Accenture found that identifying, addressing, and reducing toil within their own organization had the following important benefits. One, reducing toil was hugely positive in protecting the team. Two, reducing toil also reduced staff turnover. And three, with toil removed, real work surfaced and became highly visible within the organization.

Okay, that completes this course. In this course, you learned about what toil is, how to identify, address, and reduce it, and ultimately why having less of it is a good thing. Okay, close this course, and I'll see you shortly in the next one.

About the Author
Students: 36,581
Labs: 33
Courses: 93
Learning paths: 23

Jeremy is the DevOps Content Lead at Cloud Academy where he specializes in developing technical training documentation for DevOps.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 20+ years. In recent times, Jeremy has been focused on DevOps, Cloud, Security, and Machine Learning.

Jeremy holds professional certifications for both the AWS and GCP cloud platforms.