Managing & Investigating Service Incidents on GCP
Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in that effort.
Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course will discuss the strategy for managing such roles and effectively managing the team. Part of managing the team is having a process for turnover of team members, managing the team's workload, developing and scaling a reporting structure, and maintaining team productivity.
Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication within the team and external to the team is paramount. This is especially true for keeping stakeholders informed.
The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the monitoring and logging information it needs.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Understand how to handle personnel to aid incident response
- Learn how to manage roles within a team
- Learn how to investigate incidents effectively
This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.
- An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
- A good understanding of managing service issues
- Knowledge of issue mitigation practices
- An understanding of logging and monitoring concepts
- High-level knowledge of how roles should interact
Throughout this course, we have explored many topics as they relate to managing service incidents. Let's do a brief review.
We started with learning about role context and defining authoritative roles. There are two role contexts that we've talked about, authoritative and IAM. IAM roles are roles that are defined in GCP to provide permissions.
Authoritative roles are assigned for governance over an area. There can be many IAM and authoritative roles. Other than incident commander, there is no set formula for what is required since role definition is greatly dependent upon structure and need.
Authoritative roles can expand as need requires so that the workload is manageable. The incident commander can assign roles of assistant incident commander, communications lead, operations lead, or any other role needed. These authoritative roles will need access in GCP, in the form of IAM roles. There can be a lot of variation in the types of IAM roles that are needed. So, it may be necessary to create new roles or modify current roles as needs change.
As the incident response team grows, it may be difficult to manage personnel rolling on and off of the team. Groups can help with assigning permissions. Instead of having to assign one or more roles to each member, roles can be assigned to groups and then team members assigned to the groups.
Incident and request management can have a broad scope. This involves handling requests for impact assessment, the incident resolution process, providing status updates, documenting processes and state, and establishing clear communication.
Being able to handle personnel effectively is essential. This could involve scaling the team or rotating personnel. Balancing workload includes expanding the team and downsizing when necessary. Knowing the signs of burnout and having a strategy in place to help the team avoid exhaustion is essential to staying productive. Also, having a plan for onboarding new team members will enable quicker contribution.
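To make the idea of group-based assignment concrete, here is a minimal sketch in Python. It only builds the role bindings as plain data and does not call any GCP API. The group addresses and the mapping of authoritative roles to IAM roles are hypothetical examples, not something prescribed by the course; the `role`/`members` binding shape and the `group:` member prefix follow the general form of an IAM policy.

```python
from collections import defaultdict

# Hypothetical groups, one per authoritative role, each mapped to the
# IAM roles that role might need. Adjust to your own structure.
GROUP_ROLES = {
    "group:incident-commanders@example.com": [
        "roles/monitoring.viewer",
        "roles/logging.viewer",
    ],
    "group:ops-leads@example.com": [
        "roles/compute.admin",
        "roles/logging.viewer",
    ],
}


def build_bindings(group_roles):
    """Invert a group -> roles mapping into a list of role bindings."""
    members_by_role = defaultdict(set)
    for group, roles in group_roles.items():
        for role in roles:
            members_by_role[role].add(group)
    # Sort for stable, reviewable output.
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


bindings = build_bindings(GROUP_ROLES)
for binding in bindings:
    print(binding["role"], "->", ", ".join(binding["members"]))
```

With this layout, rotating someone on or off the team means changing group membership only; the bindings themselves stay the same.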
Investigating an incident is a vital part of the maintenance process. Findings can help discover vulnerabilities, prevent similar incidents, and enhance user service. This includes identifying probable causes of a service failure, ranking their probability based on observed behavior, investigating to isolate the likely cause, and identifying alternatives to mitigate the issue.
Congratulations on finishing the course. I wish you success in your upcoming endeavors.
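As a rough illustration of ranking probable causes by observed behavior, the sketch below scores each hypothesis from an initial estimate plus counts of supporting and contradicting observations. The scoring formula, the field names, and the sample causes are all assumptions made for this example; the course does not prescribe a specific method, and a real team would tune or replace this heuristic.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    cause: str
    prior: float        # rough initial likelihood between 0 and 1 (assumed)
    supporting: int     # observations consistent with this cause
    contradicting: int  # observations that argue against it


def score(h: Hypothesis) -> float:
    """Illustrative heuristic: evidence for raises the score, evidence against lowers it."""
    return h.prior * (1 + h.supporting) / (1 + h.contradicting)


def rank(hypotheses):
    """Return hypotheses ordered from most to least likely under the heuristic."""
    return sorted(hypotheses, key=score, reverse=True)


# Hypothetical candidate causes for a service failure.
candidates = [
    Hypothesis("network partition", prior=0.2, supporting=0, contradicting=2),
    Hypothesis("bad deploy", prior=0.5, supporting=3, contradicting=0),
    Hypothesis("database saturation", prior=0.3, supporting=1, contradicting=1),
]

ranked = rank(candidates)
for h in ranked:
    print(f"{h.cause}: {score(h):.2f}")
```

The top-ranked cause is where the investigation would start; as new observations come in, the counts are updated and the list is re-ranked.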
Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture.
Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.
Cory is now enjoying working as a contractor and author.