I spent the last month reading The DevOps Handbook, a great book regarding DevOps principles, and how tech organizations evolved and succeeded in applying them.
As a software engineer, you may think that DevOps is a bunch of people that deploy your code on production, and who are always accountable for keeping the infrastructure up and running. But, here is the key: DevOps are principles — not people — and the principles should be embraced and adopted by the entire organization, including Operations, Product and project owners, and especially by software engineers.
Over the last three years as a software engineer, I saw our team progressively adopting DevOps principles. In this kind of journey, it’s difficult to understand the real triggers of change, but these are the crucial principles and actions that reshaped our way of doing things from my point of view.
The Cloud Academy DevOps Playbook is an ideal start point for any organization, team, or practitioner looking to transform their business by moving to a DevOps culture. The Playbook enables you to quickly absorb and get started using the fundamental practices of DevOps, AGILE, and continuous delivery/continuous integration.
Critical DevOps principles
1. Reproducible local environments
Making your local environment reproducible and portable using Docker should be the foundation of your transformation plan, and this should be the top priority. Adopting Docker and its ecosystem to build and run local environments allows software engineers to work with local environments with the same configuration reducing unpredictable conditions. At Cloud Academy, we began with taking one week to onboard and set up the local environment for a new employee and transitioned to taking less than two days.
2. Trunk-based development
GitFlow is great for solid and mature products. But if you need to iterate frequently and deliver value as fast as you can to validate your features, it probably is not the best approach. Trunk-based development encourages small batches and short-lived feature branches to reduce the release life cycle. If it fits well, it can drastically increase the number of deployments per day and reduce the number of changes per deployment.
By relying on frequent releases and fast pipelines, software engineers are more likely to start introducing smaller changes. This will help improve peer review readability and reduce deployment unpredictability.
3. Decouple deployments from releases
This can be achieved by introducing a feature toggle service that allows dark launching features to release safely, hiding them completely to the customers. Feature toggles let you enable a feature based on multiple conditions, such as environment, user base segmentation, etc. In this way, a feature can be released in production, turned on only for internal users, and can be evaluated safely without affecting the entire user base.
By adopting feature toggling, software engineers are able to deploy as soon as they are confident in their features and have more time for testing and monitoring — this will also significantly reduce uncertainties before the release. Decoupling deployment from releases also lets software engineers be more focused on the development and less on the planning of releases together with stakeholders.
4. Involve software engineers in the on-call rotation
I know, you are probably thinking, “I won’t be responsible for servers, databases, and other architectural stuff, I’m a software engineer!” But here is the key: Empower software engineers to deploy autonomously with trunk-based development and their features. It’s great to speed up deployments but it is also an act of trust. Software engineers are more likely to keep the attention of deployment aspects if they are responsible and accountable for that.
Involving software engineers in the on-call rotation is highly beneficial to raise awareness of the consequences of their technical choices and share the pain with Ops engineers at 3 a.m. More conscious software engineers will be witnesses and ambassadors in their team to prevent incidents and bad technical choices.
5. Implement just culture and blameless postmortems
Complex systems are difficult to fully understand and change safely. Failures and incidents will always happen but, how the organization will respond to them is the essence of its culture. Blaming people as soon as an incident occurs and being brutal against failures will lead to fear and issues hidden to avoid consequences.
Embrace failures and learn from them — it’s the foundation for the growth of an organization in any department. When a team applies a shared responsibility for problems and mistakes are seen as opportunities for growth, everyone can learn to prevent future failures without being afraid for every critical task that is asked to perform.
Blameless postmortems are a great tool to implement just culture. After every important incident, it’s important to have a safe place to analyze the mistake by focusing on the timeline, failures, and mistakes without blaming individuals for their actions. Instead, try to identify why that mistake was likely to happen in the first place.
People involved are encouraged to share their experiences and thoughts safely and propose possible countermeasures based on the analysis made. The outcomes will be tracked and implemented to be sure that the incident produced positive learning and improvements have been introduced to prevent it from happening again. After every postmortem, the incident documentation should be shared and accessible as widely as possible to make the local learnings global.
6. Provide self-service access to telemetry
A data-driven approach is how Ops engineers work every day collecting and analyzing metrics to improve the infrastructure, but it’s also highly adopted in other teams. Collecting technical data historically wasn’t a big deal for organizations. You can produce tons of system and application logs and other metrics, the real challenge is aggregate and make these metrics and numbers available to everyone.
As soon as you start creating charts and dashboards and perform analysis on those representations, you should make it available to everyone inside the department or even in the entire organization. At Cloud Academy, we installed monitors in the entire office with product and engineering metrics, so everyone can see the average response time, CPU utilization, requests count, etc. In this way, our infrastructure health and performance are not an Ops prerogative, but every software engineer can read data and see if something is happening.
DevOps must be guided and supported by organizational changes. Merely applying technical actions could work in a short period of time and bust a few processes, but eventually, it will bump into the organizational rigidity. The best results come from Ops and Devs working together, sharing knowledge, and supporting each other.
New Content: Platforms, Programming, and DevOps – Something for Everyone
This month our team of expert certification specialists released three new or updated learning paths, 16 courses, 13 hands-on labs, and four lab challenges! New content on Cloud Academy You can always visit our Content Roadmap to see what’s just released as well as what’s coming soon....
New Content: Focus on DevOps and Programming Content this Month
This month our team of expert certification specialists released 12 new or updated learning paths, 15 courses, 25 hands-on labs, and four lab challenges! New content on Cloud Academy You can always visit our Content Roadmap to see what’s just released as well as what’s coming soon. Ja...
New Content: Get Ready for the CISM Cert Exam & Learn About Alibaba, Plus All the AWS, GCP, and Azure Courses You Know You Can Count On
This month our team of intrepid certification specialists released five learning paths, seven courses, 19 hands-on labs, and three lab challenges! One particularly interesting new learning path is Certified Information Security Manager (CISM) Foundations. After completing this learn...
New Content: AWS Terraform, Java Programming Lab Challenges, Azure DP-900 & DP-300 Certification Exam Prep, Plus Plenty More Amazon, Google, Microsoft, and Big Data Courses
This month our Content Team continues building the catalog of courses for everyone learning about AWS, GCP, and Microsoft Azure. In addition, this month’s updates include several Java programming lab challenges and a couple of courses on big data. In total, we released five new learning...
Using Docker to Deploy and Optimize WordPress at Scale
Here at Cloud Academy, we use WordPress to serve our blog and product/public pages, such as the home page, the pricing page, etc. Why WordPress? With WordPress, the marketing and content teams can quickly and easily change the look & feel and the content of the pages, without rein...
New Content: AWS Data Analytics – Specialty Certification, Azure AI-900 Certification, Plus New Learning Paths, Courses, Labs, and More
This month our Content Team released two big certification Learning Paths: the AWS Certified Data Analytics - Speciality, and the Azure AI Fundamentals AI-900. In total, we released four new Learning Paths, 16 courses, 24 assessments, and 11 labs. New content on Cloud Academy At any ...
New Content: Azure DP-100 Certification, Alibaba Cloud Certified Associate Prep, 13 Security Labs, and Much More
This past month our Content Team served up a heaping spoonful of new and updated content. Not only did our experts release the brand new Azure DP-100 Certification Learning Path, but they also created 18 new hands-on labs — and so much more! New content on Cloud Academy At any time, y...
Docker Image Security: Get it in Your Sights
For organizations and individuals alike, the adoption of Docker is increasing exponentially with no signs of slowing down. Why is this? Because Docker provides a whole host of features that make it easy to create, deploy, and manage your applications. This useful technology is especiall...
Constant Content: Cloud Academy’s Q3 2020 Roadmap
Hello — Andy Larkin here, VP of Content at Cloud Academy. I am pleased to release our roadmap for the next three months of 2020 — August through October. Let me walk you through the content we have planned for you and how this content can help you gain skills, get certified, and...
New Content: Alibaba, Azure AZ-303 and AZ-304, Site Reliability Engineering (SRE) Foundation, Python 3 Programming, 16 Hands-on Labs, and Much More
This month our Content Team did an amazing job at publishing and updating a ton of new content. Not only did our experts release the brand new AZ-303 and AZ-304 Certification Learning Paths, but they also created 16 new hands-on labs — and so much more! New content on Cloud Academy At...
New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More
This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...
New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More
This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...