Managing & Investigating Service Incidents on GCP
Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in the endeavor.
Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course will discuss the strategy for managing such roles and effectively managing the team. Part of managing the team is having a process for turnover of team members, managing the workload of the team, developing and scaling a reporting structure, and maintaining team productivity.
Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication within the team and external to the team is paramount. This is especially true for keeping stakeholders informed.
The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the information needed.
If you have any feedback relating to this course, please contact us at email@example.com.
- Understand how to handle personnel to aid incident response
- Learn how to manage roles within a team
- Learn how to investigate incidents effectively
This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.
- An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
- A good understanding of managing service issues
- Knowledge of issue mitigation practices
- An understanding of logging and monitoring concepts
- High-level knowledge of how roles should interact
Any team will inevitably have turnover. This is to say that at some point new team members will come on board, and existing team members will leave after varying lengths of time. This could be for many different reasons. New team members may be required due to an expansion in workload. Existing team members could move to another team, leave the organization, rotate off as part of a rotating schedule, or simply need a break from a demanding environment.
In this video, we will look at scaling the response team and delegation of duties, avoiding exhaustion for team members, and rotating personnel. When an incident becomes too much work to manage for a response team, scaling the team may provide a more immediate resolution and relieve some of the load across the team. Often when a new team member comes on board, there's an acclimation period, especially if that person has never worked with the team.
Let's go over a transition checklist. The new team member should feel welcome and comfortable with the team in order to kickstart the acclimation process. Introducing the team and identifying roles is a good start. Ensure that the new member has a good understanding of what is expected. Make sure the new member knows who to turn to for help with various issues. Key members of the team, such as team leads and team members that the person will work closely with, should spend some extra time with the new member. Spending extra time up front will help get the new member up to speed. Grant the new member the necessary permissions and add the person to the communication channels.
Now that additional team members have been added to the team, it is especially important to have good communication and delegation. If not managed properly, a larger team can be a detriment to progress rather than a boost to productivity. For each new team member added, the complexity of managing communication and duties grows much faster than the head count, because every new person adds a potential line of communication to every existing member. For that reason, it is vital that new leadership roles are created if needed so that the reporting chain is more manageable.
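To make that growth concrete, here is a minimal sketch. It assumes a common simplification that the video does not state: every pair of team members is one potential communication channel, so a team of n people has n * (n - 1) / 2 channels. The function name communication_channels is hypothetical and chosen only for illustration.

```python
def communication_channels(team_size: int) -> int:
    """Count the distinct pairs in a team of the given size: n * (n - 1) / 2.

    Assumption for illustration: each pair of people is one potential
    one-to-one communication channel.
    """
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


if __name__ == "__main__":
    # Compare how the channel count grows as the team scales.
    for size in (3, 5, 10, 20):
        print(f"{size:>2} people: {communication_channels(size):>3} possible channels")
```

Under this model, doubling a team from 10 to 20 people roughly quadruples the channels (45 to 190), which is one way to see why adding leads and a clear reporting chain helps as the team scales.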
It may be necessary to restructure the team roles to balance the team and distribute the workload. No matter how much we love what we do, at some point exhaustion sets in. Recognizing the signs of exhaustion is the first step in heading it off. Most often a person's demeanor changes to a more negative outlook. The person may lose interest in a once-loved job, have trouble getting enough sleep, become easily agitated, be unable to feel satisfied, or feel restless and anxious. Coping with exhaustion or burnout is vital for the individual team members and for the team. Reduce fatigue by taking scheduled rests or break periods, such as a 15-second break every 15 minutes and a five-minute break every one or two hours.
Exercise is a good way to reduce stress and clear one's mind. Engaging socially can be a welcome distraction. Reduce stress levels with proven stress relief techniques. If exhaustion gets too severe, a break from the project or immediate duties may be required.
Team members rolling off and on projects is fairly common. Frequent personnel changes can cause productivity to stagnate if not managed properly. Have plans in place for onboarding new members and for role changes. These plans should cover things such as getting new members the proper permissions and making sure they know what their duties are.
When changing a role, if possible, have the person on the way out spend some time with the person rolling into the role. If not possible, have another team member debrief with the vacating team member so that information can be relayed. Good documentation around the role's activities is always warranted.
Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture.
Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.
Cory is now enjoying working as a contractor and author.