SRE Tools & Automation



This course delves into the subject of tools and automation within site reliability engineering (SRE). Automation is carried out in SRE to solve practical problems, typically those identified as toil. And having the right tools for the right job is important when performing SRE. By the end of this course, you'll have a clear understanding of the available tools and practices and how to apply each to a particular SRE automation requirement.

If you have any feedback relating to this course, please contact us at

Learning Objectives

  • Understand what automation means and the role it plays in SRE
  • Learn about the instrumentation required for automation to work
  • Learn how automation tools can be used to secure your infrastructure
  • Explore the plethora of tools and techniques to automate your workloads

Intended Audience

  • Anyone interested in learning about SRE and its fundamentals
  • Software Engineers interested in learning about how to use and apply SRE within an operations environment
  • DevOps practitioners interested in understanding the role of SRE and how to consider using it within their own organization


To get the most out of this learning path, you should have a basic understanding of DevOps, software development, and the software development lifecycle.


Link to the YouTube video reference in the course:


Welcome back. In this course, I'm going to review the subject of SRE Tools and Automation. Automation is done in SRE to solve practical problems typically those identified as toil. And having the right tools for the right job is important when performing SRE. By the end of this course, you'll have a clear understanding of the available tools and practices and how to apply each to a particular SRE automation requirement. Right, let's begin. To begin with, let's define automation.

Automation is helpful. It improves many things. For example, consider the following. Consistency. A machine executing a script over and over again will always outperform and be more consistent than a human. Engineering. A platform upon which to build, reuse, and extend on. Faster action and faster fixes. And the obvious one, time savings. Now it's important to use automation to solve problems mostly those that have a business meaning, for example, eliminating toil or improving SLOs. Automation requires tooling, therefore you need to pick the right tool for the right job. Automation costs. Upfront engineering time and effort will be required to put automation in place. And finally, consider having measurable outcomes for your automation to ensure that it's providing the expected benefits.
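The point about upfront cost and measurable outcomes can be made concrete with a little arithmetic. The sketch below is only an illustration: the function name `break_even_runs` and the example figures are assumptions, not part of the course material. It estimates how many executions of a task are needed before the time saved repays the engineering time spent automating it.

```python
def break_even_runs(build_hours: float, manual_minutes: float, automated_minutes: float) -> float:
    """Return how many runs are needed before automation repays its upfront cost."""
    saved_per_run = manual_minutes - automated_minutes
    if saved_per_run <= 0:
        raise ValueError("automation saves no time per run")
    # Convert the upfront engineering effort to minutes, then divide by the saving per run.
    return (build_hours * 60) / saved_per_run


# Hypothetical figures: 16 hours to build, 30 minutes by hand, 2 minutes automated.
runs = break_even_runs(build_hours=16, manual_minutes=30, automated_minutes=2)
print(f"Break-even after about {runs:.1f} runs")
```

Tracking a figure like this before and after automating a task is one simple way to check that the automation is delivering the expected benefit.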

So what role does automation play in SRE? For starters, automation has already accomplished a lot to improve service delivery, typically as part of a DevOps adoption. Consider the DevOps delivery pipeline shown here. Developers can check in changes at one end, and those changes will flow through the pipeline towards production. The color coding shows the different kinds of environment being used. Here, blue represents test, green represents pre-prod (sometimes called staging), and orange represents production (sometimes called live). Let's begin to consider some of the issues with this approach.

DevOps pipelines are often driven by developers and engineers wanting more changes reliably flowing through to production. For developers and engineers, responsibility and accountability typically end at the end of test. Despite the intent of DevOps to break down silos, it's still common for other teams to own and support pre-prod and prod. The push to realize value from development effort means that new features are pushed onto the team supporting production, often with little-to-no knowledge transferred. The main operations bottleneck of deployment is addressed through automated deployment, meaning an ever-increasing number of deployments being made. The production support team is then expected to keep this ever-growing list of features and services running despite the uniqueness of production. Therefore, all penalties are on them, leading to excessive toil and bigger support costs.

Developers often assume that all environments are consistent; however, environment mismatch is often the root cause of deployment problems. Developers tend to take on ownership of local, dev, and test environments, but are often locked out of prod and/or pre-prod, which are typically owned by Ops or another department. Pipelines typically focus on adding more and more testing steps, both functional and nonfunctional. This introduces false confidence that features will work when they get to production. Test environments try to be, but are rarely if ever, identical to production. Prod has real user data, real live dependencies, and different resources, for example, servers and networks. Production is the ultimate test ground.

Monitoring and alerting is configured on things that we know about and monitors things that we know have gone wrong in the past. However, again the uniqueness of production usually means incidents and issues are caused by unique circumstances, things that we can't hope to predict and therefore do not plan to monitor and therefore miss entirely. Now that we know where improvements can be made, let's go ahead and discuss what good automation looks like in the context of SRE.

Automation is operations-led to ensure that reliability engineering priorities are met. Environments must be provisioned using Infrastructure-as-Code and Configuration-as-Code processes. Infrastructure as Code is code that you commit, the same as application code, and the same code is used to create all environments: dev, test, pre-prod, and prod. We use the concept of immutability, that is, environments do not mutate; when a change is required, it is made using a clean rebuild.
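As a rough illustration of "the same code creates all environments" and of immutability, here is a minimal Python sketch. It is not a real provisioning tool: `EnvironmentSpec`, `render`, the image tag, and the instance counts are all hypothetical names and values chosen for the example.

```python
import dataclasses


@dataclasses.dataclass(frozen=True)  # frozen: a spec cannot be mutated once created
class EnvironmentSpec:
    name: str
    instance_count: int
    image: str


# One shared definition; only the sizing differs per environment (hypothetical values).
BASE_IMAGE = "web:1.4.2"
SIZES = {"dev": 1, "test": 1, "pre-prod": 2, "prod": 4}


def render(env: str) -> EnvironmentSpec:
    """Build the spec for any environment from the same committed definition."""
    return EnvironmentSpec(name=env, instance_count=SIZES[env], image=BASE_IMAGE)


prod = render("prod")
# A change is expressed as a new spec (a clean rebuild), never an in-place edit.
rebuilt = dataclasses.replace(prod, instance_count=8)
print(prod, rebuilt, sep="\n")
```

Because every environment comes from the same `render` function, a difference between test and prod has to be visible in the committed definition rather than hidden in a manual change.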

Taking this approach, all environments are therefore consistent and reflect prod. As an outcome, environments provisioned using Infrastructure and Configuration as Code will be consistent, repeatable, and production ready, will be testable and auditable, and you'll be able to easily reproduce errors in non-prod environments to aid troubleshooting activities when required. Next, we need to perform automated functional and nonfunctional tests in production.

In SRE, the pipeline is designed to focus on introducing more and more reliability. Quite often testing stops before production. SRE requires that some testing is still performed in production, both functional and nonfunctional. A functional test may perform a transaction in prod using a dummy account. Nonfunctional tests include for example security and performance testing or even just health tests.

For example, is the system running? Can I connect to it? Prod testing enables automated rollback when required, canary testing for example. If a test fails in production, that is, the canary dies, then we automatically roll back to the previous version. We can also use immutability or dual prod for this, e.g., blue/green or A/B deployments. In this case, we deploy the green version of the code and rerun all tests. If any test fails, we don't migrate users across. If all tests pass, then we migrate users across to the green version. Performing this, we find that systems become well-tested, although issues can still be discovered in production. The cost of change is lowered and the risk of regression is reduced.
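The blue/green decision described above can be sketched in a few lines of Python. This is an illustrative outline only: `blue_green_cutover` and the check names are hypothetical, and the checks are stubbed with simple functions where a real pipeline would call the green environment.

```python
from typing import Callable, Dict


def blue_green_cutover(checks: Dict[str, Callable[[], bool]]) -> str:
    """Run every check against green and return the version that should serve users."""
    failures = [name for name, check in checks.items() if not check()]
    if failures:
        # Any failing test means users stay on the known-good blue version.
        print(f"Keeping users on blue; failed checks: {failures}")
        return "blue"
    print("All checks passed; migrating users to green")
    return "green"


# Stubbed checks standing in for health, connectivity, and dummy-account tests.
active = blue_green_cutover({
    "is_running": lambda: True,
    "can_connect": lambda: True,
    "dummy_transaction": lambda: True,
})
```

The same shape works for a canary: run the checks against the canary, and treat a "blue" result as the trigger for an automatic rollback.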

Using versioned and signed artifacts to deploy system components. The build step needs to version and digitally sign the components that make up the service, and all components should be securely stored in a suitable artifact repository, for example, Nexus or Artifactory. In doing so, the promotion of change is automated, dependency errors can all but be eliminated, and it's easier to determine security vulnerabilities.
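To show the idea of versioning and signing an artifact, here is a small sketch using only the Python standard library. One important assumption: it uses an HMAC with a shared key as a stand-in for a real digital signature, whereas a production pipeline would normally sign with an asymmetric key pair and organization-approved certificates. The function names and the key are hypothetical.

```python
import hashlib
import hmac


def sign_artifact(content: bytes, version: str, key: bytes) -> dict:
    """Record the version, a SHA-256 digest, and a keyed signature over both."""
    digest = hashlib.sha256(content).hexdigest()
    message = f"{version}:{digest}".encode()
    signature = hmac.new(key, message, hashlib.sha256).hexdigest()
    return {"version": version, "sha256": digest, "signature": signature}


def verify_artifact(content: bytes, record: dict, key: bytes) -> bool:
    """Recompute the signature from the content and compare in constant time."""
    expected = sign_artifact(content, record["version"], key)
    return hmac.compare_digest(expected["signature"], record["signature"])


key = b"demo-key-not-for-production"  # placeholder secret for the example only
record = sign_artifact(b"compiled service binary", "1.4.2", key)
print(verify_artifact(b"compiled service binary", record, key))
print(verify_artifact(b"tampered binary", record, key))
```

A deployment step that refuses any artifact failing verification ensures that only the exact, versioned output of the build is promoted.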

Next, we need to have instrumentation in place to make the service externally observable. We often test that monitoring is set up and providing data points in pre-prod. We also need to test the same in prod. We also need to ensure that we have the correct data and service level indicators being returned as well as log files being generated and stored. Tools like Nagios, Prometheus, Splunk, and Catchpoint are available to instrument services so that they can be made externally observable.

Tools like Logstash allow us to aggregate log files, which are useful in cases of failure and to support blameless post-mortems. Performing this, we find that security and audit events are centralized, it assists with protective monitoring, and you can reduce the mean time to fix by granting developers read-only access to logs. Understanding and dealing with future growth. We can use performance testing to simulate load and confirm that our service can scale up to meet required future growth.

For example, auto scaling can be used to ensure services scale to meet predicted and unpredicted demand. When this takes place, toil is minimized upfront, rework of the system is reduced, and the overall total cost of ownership, TCO, for the service is lowered.
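One common way to reason about auto scaling is a proportional rule: scale the replica count by the ratio of observed load to target load, within fixed bounds. The sketch below assumes that rule; the function name, the bounds, and the load figures are hypothetical and not taken from any particular platform.

```python
import math


def desired_replicas(current: int, observed_load: float, target_load: float,
                     min_replicas: int = 1, max_replicas: int = 10) -> int:
    """Scale proportionally to load, rounding up and clamping to the allowed range."""
    if target_load <= 0:
        raise ValueError("target_load must be positive")
    wanted = math.ceil(current * observed_load / target_load)
    return max(min_replicas, min(max_replicas, wanted))


# 4 replicas at 90% average utilization with a 60% target: scale out.
print(desired_replicas(current=4, observed_load=90, target_load=60))
# The same 4 replicas at 30% utilization: scale in.
print(desired_replicas(current=4, observed_load=30, target_load=60))
```

The upper bound matters as much as the formula: it caps cost when demand is far beyond what was predicted.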

Finally, consider having a clear anti-fragility strategy in place. Consider using chaos engineering to test failures and then address any discovered failures that lead to service disruption. Make sure that a DR plan is in place and has been practiced and validated using fire drills. Ensure that you have the right on-call personnel ready and available and the correct tools in place to alert and support unplanned incidents.
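A very small taste of chaos engineering is to inject failures into a dependency on purpose and confirm that the caller degrades gracefully. The sketch below is a toy under stated assumptions: `FlakyDependency` and `get_value` are hypothetical names, and the failure is simulated in-process rather than in a real environment.

```python
import random


class FlakyDependency:
    """Hypothetical downstream service wrapper that fails at a configurable rate."""

    def __init__(self, failure_rate: float, rng: random.Random):
        self.failure_rate = failure_rate
        self.rng = rng

    def fetch(self) -> str:
        if self.rng.random() < self.failure_rate:
            raise ConnectionError("injected failure")
        return "fresh"


def get_value(dependency: FlakyDependency, cached: str = "cached") -> str:
    """Caller under test: fall back to a cached value when the dependency fails."""
    try:
        return dependency.fetch()
    except ConnectionError:
        return cached


rng = random.Random(42)  # seeded so the experiment is repeatable
dependency = FlakyDependency(failure_rate=0.3, rng=rng)
results = [get_value(dependency) for _ in range(1000)]
print(f"served from cache {results.count('cached')} times out of {len(results)}")
```

If the caller raised instead of falling back, the experiment would have surfaced a failure mode to fix before it causes a real service disruption.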

Consider using tools such as PagerDuty, VictorOps, and Squadcast. Performing this ensures that the availability and integrity risks to the system are appropriately mitigated. Mitigations are evidenced and tested, again reducing risk. Now, having SRE-led service automation puts extra focus on prod, but it also aids DevOps, which gains the wisdom of production.

Finally, site reliability engineering and site reliability engineers themselves can say no to production deployments if they believe reliability is not in place. Consider the following: SRE automation is not just about service automation; instead, automation is done in SRE to solve practical problems, typically those identified as toil. Respondents to the 2019 Catchpoint SRE Survey indicate that toil-reducing automation is either nonexistent or, at best, minimal within their own organizations.

Again, automation shouldn't just be about automating the service, configuration, deployment, etc. Instead, SRE requires individuals to work like software engineers on things that are operations-related. Alright, let's move on and now briefly discuss the concept of automation types and the hierarchy which ranks them, an idea originated by Google and which is relevant to organizations embracing SRE.

Interestingly, automation can be considered to exist in different forms, and not only that, but there is a hierarchy of automation types as seen here. Starting at the bottom and moving up, we find that automation matures as you step up the hierarchy. For example, consider a database which needs to be managed and maintained. At the very bottom, we find that database failure is addressed manually. At the top end of the hierarchy, we find that automation is exhibited in the form of a database which is self-aware, detecting its own internal problems and performing automatic failovers, all without human intervention.

Moving on now, let's consider secure automation and how tools in automation can be used to enhance the level of security protection. Secure automation. Automation removes the chance of human error or willful sabotage and provides security opportunities. We can secure automated steps in the pipeline. We cannot provenly secure manual steps. Artifacts generated and used by the pipeline can be validated and checked for compliance.

DevOps advocates a pipeline of delivery. DevSecOps works to secure that particular pipeline. SRE places extra emphasis on security of production. Secure build. SRE is moving towards a mandate that everything should be in code: infrastructure-as-code, configuration-as-code, as well as the application code. All code should be written and developed securely. Code is then committed securely into code repositories with regulated access. All build artifacts are digitally signed using the organization's approved certificates. And published security coding practices are not only embraced and implemented, they are considered the norm.

Secure test. Immutability of infrastructure and applications is a key DevOps and SRE concept. The same artifact is deployed across all environments using environment variables and configuration to handle any deltas. Secure and un-secure test data should be used to test the security boundaries of the service. Secure staging. Staging environments are also immutable. The same artifact is deployed to staging and or pre-prod environments.
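Deploying the same artifact everywhere and handling deltas through environment variables might look like the following sketch. The variable names `DB_HOST` and `LOG_LEVEL`, their defaults, and the `load_config` helper are assumptions made for the example.

```python
import os
from typing import Mapping

# Defaults baked into the artifact; each environment overrides only what differs.
DEFAULTS = {"DB_HOST": "localhost", "LOG_LEVEL": "INFO"}


def load_config(env: Mapping[str, str] = os.environ) -> dict:
    """Read each setting from the environment, falling back to the built-in default."""
    return {key: env.get(key, default) for key, default in DEFAULTS.items()}


# The same code runs in every environment; only the injected variables change.
print(load_config({"DB_HOST": "db.staging.internal"}))
print(load_config({"DB_HOST": "db.prod.internal", "LOG_LEVEL": "WARNING"}))
```

Because the artifact itself never changes between test, staging, and production, what was tested is exactly what gets deployed.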

Staging data, if cloned directly from prod, may include personally identifiable information. If this is so, then you need to consider how data-handling rules such as GDPR apply. Consider also implications when testing financial and or payment-related data. Here, PCI compliance has to be checked. Testing in test environments is often stateless whereas we often introduce state in pre-prod and staging.

Staging and pre-prod environments are typically where dependencies and or integrations are tested. A dependent service may need to change as it may not be suitably secure. These are proxy security requirements that typically need to be delivered before production deployment. As the quote displayed here implies, beware of the false confidence testing gives even when done as robustly as in staging and pre-prod.

Secure production. Again, production environments are also immutable. The same artifact again is deployed to the production environment. Production data requires data security compliance, for example, GDPR, PCI, and/or SOX. Dedicated security scanning should be used to try and uncover security vulnerabilities. There may be other compliance requirements and audit needs. Failure testing can prove to auditors that there are sufficient controls in place to ensure smooth running and address going-concern issues.

The quote displayed here outlines that compliance can be made easier through pipelines of delivery and as code for configuration and infrastructure. Automation tools. In this section, we'll look at the tools that are available to perform service automation, and then provide you with an exercise to help you discover and think about automation tools that you may already be using or should be considering.

Now to begin with, rather than do a fashion parade of the latest tools, we'll instead look at the types of tools available, the business protocols around which tools are used, and how to ensure the best tools are picked for the right job. Before we begin, consider the following case story presented by Standard Chartered. At Standard Chartered, it was acknowledged that tooling was required to enable SRE principles. The strategy did not name the tools that needed to be used, but instead set principles at the global level.

For example, "everything is code" means everything: software, infrastructure, and configuration should be as code. It does not dictate which tool to use, for example, Terraform or Puppet; therefore, as long as teams were embracing the principle, they had autonomy to cherry pick the right tools themselves. Consider the use of automation within your own organization: what type it is, where it is used, the tools used, and what the business reason was that justified it. To help you along, consider the following table.

In the following slides, I'll then categorize and review each of the stated automation types. Beginning with manage. Audit management. The use of automated tools to ensure products and services are auditable, including keeping audit logs of build, test, and deploy activities, auditing configurations and users, as well as log files from production operations. Authentication and authorization.

Mechanisms for ensuring appropriate access to products, services, and tools. For example, user and password management and two-factor authentication. Cloud providers use their own tools such as AWS IAM. DevOps score. A metric showing DevOps adoption across an organization and the corresponding impact on delivery velocity. Value stream management. The ability to visualize the flow of value delivery through the DevOps lifecycle. GitLab CI and the Jenkins extension DevOptics can provide this visualization.

Next up, plan. Issue tracking. Tools like Jira, Trello, CA's Agile Central, and VersionOne can be used to capture incidents or backlogs of work. Kanban Boards. On the back of issue tracking, the same tools can represent delivery flow through Scrum and Kanban workflow boards. Time tracking. Similarly, issue tracking tools also allow for time to be tracked either against individual issues or other work or project types. Agile portfolio management involves evaluating in-flight projects and proposed future initiatives to shape, and govern the ongoing investment in projects and discretionary work. Again CA's Agile Central and VersionOne are examples.

Service desk. ServiceNow is a well-used platform for managing the lifecycle of services as well as internal and external stakeholder engagement. Requirements management. Tools that handle requirements definition, traceability, hierarchies, and dependency. Often also handles code requirements and test cases for requirements. Quality management. Tools that handle test case planning, test execution, defect tracking, severity, and priority analysis. For example, CA's Agile Central.

Next up, create. Source code management. Tools to securely store source code and make it available in a scalable multi-user environment. Git and SVN are popular examples. Code review. The ability to perform peer code reviews to check quality can be enforced through tools like Gerrit, Team Foundation Service, Crucible, and GitLab. Wiki. Knowledge sharing can be enabled by using tools like Confluence, which create a rich Wiki of content. Web IDE. Tools that have a web client integrated development environment. Enables developer productivity without having to use a local development tool.

Snippets. Stored and shared code snippets to allow collaboration around specific pieces of code. Also allows code snippets to be used in other code-bases. Both Bitbucket and GitLab allow this. Moving on to verify. Continuous integration refers to integrating, building, and testing code within the development environment. Code quality. Also referred to as code analysis. Sonar and Checkmarx are examples of tools that automatically check the seven main dimensions of code quality: comments, architecture, duplication, unit test coverage, complexity, potential defects, and language rules.

Performance testing. Performance testing is the process of determining the speed, responsiveness, and stability of a computer, network, software program, and/or device under a workload. Usability testing. Usability testing is a way to see how easy to use something is by testing it with real users. Tools can be used to track how a user works with a service. For example, scroll recording, eye tracking, and mouse tracking.

Moving on to package. Package registry. A repository for software packages, artifacts and their corresponding metadata. Can store files produced by an organization itself or for third party binaries. Artifactory and Nexus are amongst the most popular. Container registry. Secure and private registry for container images. Typically allowing for easy upload and download of images from build tools.

Dependency proxy. For many organizations, it's desirable to have a local proxy for frequently used upstream images or packages. In the case of CI/CD, the proxy is responsible for receiving a request and returning the upstream image from a registry, acting as a pull-through cache. Helm chart registry. Helm charts describe related Kubernetes resources. Artifactory and Codefresh support a registry for maintaining master records of Helm charts.

Dependency firewall. Many projects depend on packages that may come from unknown or unverified providers, introducing potential security vulnerabilities. There are tools to scan dependencies but that is done after they are downloaded. These tools prevent those vulnerabilities from being downloaded to begin with. Moving on to secure. Static application security testing tests applications from the inside out by looking at source code, byte code, or binaries.

Dynamic application security testing tests applications from the outside in to detect security vulnerabilities. Interactive application security testing combines both SAST and DAST approaches but involves application tests changing in real time based on information fed back from SAST and DAST, creating new test cases on the fly. Synopsys, Acunetix, Parasoft, and Quotium are solutions evolving in this direction. Secret detection. Secret detection aims to prevent sensitive information like passwords, authentication tokens, and private keys being unintentionally leaked as part of the repository content.
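Secret detection is often implemented as pattern matching over repository content. The sketch below is a deliberately small illustration, not a replacement for a real scanner: the three patterns and the `find_secrets` helper are assumptions chosen for the example, and real tools use far larger rule sets plus entropy checks.

```python
import re

# A few illustrative patterns; the rule names are hypothetical.
PATTERNS = {
    "aws_access_key_id": re.compile(r"AKIA[0-9A-Z]{16}"),
    "private_key": re.compile(r"-----BEGIN (?:RSA |EC )?PRIVATE KEY-----"),
    "hardcoded_password": re.compile(r"password\s*=\s*['\"][^'\"]+['\"]", re.IGNORECASE),
}


def find_secrets(text: str) -> list:
    """Return (line_number, rule_name) for every line matching a secret pattern."""
    findings = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                findings.append((line_no, name))
    return findings


sample = 'user = "bob"\npassword = "hunter2"\nkey = AKIAABCDEFGHIJKLMNOP\n'
for line_no, rule in find_secrets(sample):
    print(f"line {line_no}: possible {rule}")
```

Running a check like this as a pre-commit hook or pipeline step stops a leaked credential before it reaches the shared repository.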

Dependency scanning. Used to automatically find security vulnerabilities in your dependencies while you are developing and testing your applications. Synopsys, Gemnasium, Retire.js, and bundler-audit are popular tools in this area. Container scanning. When building a container image for your applications, tools can run a security scan to ensure it does not have any known vulnerability in the environment where your code is shipped. Black Duck, Synopsys, Snyk, Clair, and Klar are examples.

License compliance. Tools such as Black Duck and Synopsys perform checks to ensure licenses of your dependencies are compatible with your application and either approve or blacklist them. A vulnerability database is aimed at collecting, maintaining, and disseminating information about discovered computer security vulnerabilities. This is then checked as part of the delivery pipeline. Fuzzing. Fuzzing or fuzz testing is an automated software testing technique that involves providing invalid, unexpected, or random data as inputs to a service and then watching the results.
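Fuzzing can be illustrated with a toy harness. In this sketch everything is hypothetical: `parse_percentage` is a made-up target containing a deliberate bug, and the harness treats `ValueError` as a clean rejection of bad input while reporting any other exception as a crash worth investigating.

```python
import random


def parse_percentage(raw: str) -> int:
    """Toy target with a deliberate bug: empty input raises IndexError."""
    text = raw.strip()
    if text[-1] == "%":  # bug: fails when text is empty
        text = text[:-1]
    value = int(text)
    if not 0 <= value <= 100:
        raise ValueError("out of range")
    return value


def random_inputs(rng: random.Random, count: int, alphabet: str = " 0123456789%-a"):
    """Yield short random strings, including empty ones."""
    for _ in range(count):
        length = rng.randint(0, 4)
        yield "".join(rng.choice(alphabet) for _ in range(length))


def fuzz(target, inputs) -> list:
    """Feed each input to the target; collect anything other than a clean rejection."""
    crashes = []
    for raw in inputs:
        try:
            target(raw)
        except ValueError:
            pass  # expected: invalid input rejected cleanly
        except Exception as exc:
            crashes.append((raw, type(exc).__name__))
    return crashes


crashes = fuzz(parse_percentage, random_inputs(random.Random(0), 500))
print(f"{len(crashes)} crashing inputs found")
```

Seeding the random generator keeps a fuzz run reproducible, which makes it practical to rerun the same inputs after a fix.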

Next, release. Continuous delivery is a software development discipline where you build software in such a way that the software can be released to production at any time. Release orchestration. Typically a deployment pipeline used to detect any changes that will lead to potential problems in production. Orchestrating other tools will identify performance, security, or usability issues. Tools like Jenkins and GitLab CI can orchestrate releases.

Pages. For creating supporting web pages automatically as part of a CI/CD pipeline. Review apps allow code to be committed and launched in real time. Environments are spun up to allow developers to review their applications. GitLab CI has this capability. Incremental rollout. Incremental rollout means deploying many small, gradual changes to a service instead of a few large ones. Users are then incrementally moved across to the new version of the service until eventually all users are moved across. Sometimes referred to by colored environments, e.g., blue/green deployments.

Canary deployments. Similar to incremental rollout, it is where a small portion of the user base is updated to a new version first. This subset, the canaries, then serves as the proverbial canary in the coal mine. If something goes wrong, then the release is rolled back and only a small subset of the users has been impacted. Feature flags, sometimes called feature toggles, are a technique that allows system behavior to change without changing the underlying code, through the use of flags to decide which behavior is invoked. This is primarily a programming practice, although there are tools such as LaunchDarkly which can help with flag management and invocation.
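Canary rollouts and feature flags both need a stable way to decide which users see the new behavior. One simple approach, sketched here as an assumption rather than how any particular tool works, is to hash the user ID into a bucket from 0 to 99 and enable the flag for buckets below the rollout percentage. The flag name and helper functions are hypothetical.

```python
import hashlib

# Rollout percentage per flag (hypothetical flag name and value).
FLAGS = {"new_checkout": 10}


def bucket(key: str) -> int:
    """Map a string to a stable bucket in the range 0-99."""
    digest = hashlib.sha256(key.encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(flag: str, user_id: str, flags: dict = FLAGS) -> bool:
    """A user sees the new behavior if their bucket falls below the rollout percentage."""
    percent = flags.get(flag, 0)  # unknown flags are off by default
    return bucket(f"{flag}:{user_id}") < percent


for user in ["alice", "bob", "carol"]:
    path = "new" if is_enabled("new_checkout", user) else "old"
    print(f"{user} -> {path} checkout")
```

Because the bucket is derived from the user ID, a user who is in the canary at 10% stays in it as the percentage is raised, and setting the percentage to 0 acts as an instant rollback without a redeploy.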

Release governance. Release governance is all about the controls and automation, security compliance or otherwise, that ensure your releases are managed in an auditable and trackable way in order to meet the need of the business to understand what is changing. Secrets management. Secrets management refers to the tools and methods for managing digital authentication credentials, secrets, including passwords, keys, APIs, and tokens for use in applications, services, privileged accounts, and other sensitive parts of the IT ecosystem.

Next up, configure. Auto DevOps. Auto DevOps brings DevOps best practices to your project by automatically configuring software development life cycles. It automatically detects, builds, tests, deploys, and monitors applications. GitLab and AWS CodePipeline are strong examples. ChatOps. The ability to execute common DevOps transactions directly from chat (build, deploy, test, incident management, rollback, et cetera), with the resulting output sent back to the ChatOps channel.

Runbooks. A collection of procedures necessary for the smooth operation of a service. Previously manual in nature, they are now usually automated with tools like Ansible. Serverless. A code execution paradigm where you do not need to manage the underlying infrastructure or dependencies. Instead, a piece of code is executed by a service provider, typically a cloud provider, which takes over the creation of the execution environment. Lambda functions in AWS and Microsoft's Azure Functions are good examples. Next up, monitor. Metrics. Tools that collect and display performance metrics for deployed apps, such as Prometheus.

Logging. The capture, aggregation, and storage of all logs associated with system performance including but not limited to process calls, events, user data, responses, errors, and status codes. Logstash and Nagios are popular examples. Tracing. Tracing provides insight into the performance and health of a deployed application, tracking each function or microservice which handles a given request.

Cluster monitoring. Tools that let you know the health of your deployed environments running in clusters such as Kubernetes. Error tracking. Tools to easily discover and show the errors that an application may be generating along with the associated data. Incident management involves capturing the who, what, when of service incidents, and the onward use of this data in ensuring service level objectives are being met.

Synthetic monitoring. The ability to monitor service behavior by creating scripts to simulate the action or path taken by a customer or end user and the associated outcome. Status page. Service pages that easily communicate the status of services to customers and end users. The last category, defend. RASP, runtime application self-protection. Tools that actively monitor and block threats in the production environment before they can exploit vulnerabilities. WAF, web application firewall. Tools that examine traffic being sent to an application and can block anything that looks malicious.

Threat detection refers to the ability to detect and report attacks, and to support the response to them. Intrusion detection systems and denial-of-service protection systems allow for some level of threat detection and prevention. UEBA, user and entity behavior analytics, is a machine learning technique to analyze normal and abnormal user behavior with the aim of preventing the latter. Vulnerability management is about ensuring that assets and applications are scanned for vulnerabilities, and then the subsequent processes to record, manage, and mitigate those vulnerabilities. DLP, data loss protection. Tools that prevent files and content from being removed from within a service environment or organization.

Storage security. A specialty area of security that is concerned with securing data storage systems and ecosystems and the data that resides on these systems. And finally, container network security. Used to give confidence that an app running on a container cluster alongside other apps is not subject to unintended use by those apps, and that there is no unintended network traffic between them. Okay, we're almost finished, but first, consider the Ironies of Automation, a YouTube-hosted video which is worth a watch. One of the key messages provided within is the idea that automation is not an end in itself; rather, it is a means to an end. Also, automation needs to be managed appropriately and, like services themselves, it has a lifespan.

Okay, that now completes this course. In this course, you learned about SRE tooling and practices and how to use automation to improve and support your DevOps delivery pipelines. Go ahead and close this course, and I'll see you shortly in the next one.

About the Author

Jeremy is a Content Lead Architect and DevOps SME here at Cloud Academy where he specializes in developing DevOps technical training documentation.

He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 25+ years. In recent times, Jeremy has been focused on DevOps, Cloud (AWS, Azure, GCP), Security, Kubernetes, and Machine Learning.

Jeremy holds professional certifications for AWS, Azure, GCP, Terraform, Kubernetes (CKA, CKAD, CKS).