Managing & Investigating Service Incidents on GCP
Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in that endeavor.
Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course will discuss the strategy for managing such roles and effectively managing the team. Part of managing the team is having a process for turnover of team members, managing the workload of the team, developing and scaling a reporting structure, and maintaining team productivity.
Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication within the team and external to the team is paramount. This is especially true for keeping stakeholders informed.
The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the information needed.
If you have any feedback relating to this course, please contact us at email@example.com.
- Understand how to handle personnel to aid incident response
- Learn how to manage roles within a team
- Learn how to investigate incidents effectively
This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.
- An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
- A good understanding of managing service issues
- Knowledge of issue mitigation practices
- An understanding of logging and monitoring concepts
- High-level knowledge of how roles should interact
In this video, we will take a look at how roles play a part in effectively managing service incidents.
First, I'd like to take a moment to clarify the context of the term role. Two different contexts for the term will be used: IAM role and authoritative role. Google Cloud Platform's Identity and Access Management system, often referred to as IAM, has a component within it named role.
A role is little more than a named list of permissions that can be assigned to one or more IAM accounts or groups in order to easily grant those permissions. There are three types of IAM roles: primitive, predefined, and custom. The primitive roles of owner, editor, and viewer predate the curated roles, which are the predefined and custom roles.
The curated roles allow for more precise control over permissions on resources. We will walk through these roles throughout the course, but we will also use the term role in a looser sense, meaning roles that are not strictly defined in IAM but are used as a reference to authority over an area of concern. This type of role is a named set of duties and responsibilities assigned to individuals who are likely members of a group that is focused on an overarching goal.
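To make the three IAM role types concrete, here is a minimal sketch that tells them apart by name. It assumes the usual GCP naming pattern: primitive roles are exactly roles/owner, roles/editor, and roles/viewer, predefined roles look like roles/<service>.<name>, and custom roles live under a project or organization. The function name classify_role and the example custom role are hypothetical and used only for illustration.

```python
# The three primitive roles named in the transcript, as IAM role names.
PRIMITIVE_ROLES = {"roles/owner", "roles/editor", "roles/viewer"}


def classify_role(role_name: str) -> str:
    """Return 'primitive', 'predefined', or 'custom' for an IAM role name.

    This is a name-based heuristic for illustration, not a call to the IAM API.
    """
    if role_name in PRIMITIVE_ROLES:
        return "primitive"
    # Custom roles are scoped to a project or organization (assumed pattern).
    if role_name.startswith(("projects/", "organizations/")) and "/roles/" in role_name:
        return "custom"
    # Anything else under roles/ is treated as a predefined role.
    if role_name.startswith("roles/"):
        return "predefined"
    raise ValueError(f"Unrecognized role name: {role_name!r}")


if __name__ == "__main__":
    for name in (
        "roles/editor",
        "roles/logging.viewer",
        "projects/my-project/roles/incidentResponder",  # hypothetical custom role
    ):
        print(name, "is", classify_role(name))
```

The check is deliberately simple. In a real project the authoritative list of roles comes from IAM itself rather than from string matching.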
For example, a developer is responsible for an application while an operations engineer is responsible for infrastructure.
In summary, IAM roles refer to permissions over resources in GCP, and authoritative roles refer to responsibility over an area. Even though these role contexts are distinct, often an authoritative role will need a corresponding IAM role or roles to perform the duties necessary to effectively manage service incidents.
Dealing with service incidents is often too much for one person to handle. Defining roles and assigning governance and responsibility to those roles is of great importance for managing service incidents efficiently and effectively. Roles also help to narrow the focus for each person and make communication on a given topic, both within the team and external to it, more streamlined. It also helps to have designated roles for touching parts of the system.
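As a rough illustration of how an authoritative role can line up with one or more IAM roles, the sketch below turns a list of people and their authoritative roles into IAM-style policy bindings (a role plus its members). The mapping itself is an assumption made for the example, as are the member addresses and the names AUTHORITATIVE_TO_IAM and build_bindings. Which IAM roles a developer or operations engineer actually needs depends on your organization.

```python
from collections import defaultdict
from typing import Dict, List

# Hypothetical mapping from authoritative roles to IAM roles.
# The IAM role names are real predefined roles, but the pairing is illustrative.
AUTHORITATIVE_TO_IAM: Dict[str, List[str]] = {
    "developer": ["roles/logging.viewer"],
    "operations_engineer": ["roles/logging.viewer", "roles/compute.admin"],
}


def build_bindings(assignments: Dict[str, str]) -> List[dict]:
    """Build IAM-style bindings from {member: authoritative_role}.

    Each binding has the same shape as an entry in an IAM policy:
    {"role": <iam role>, "members": [<member>, ...]}.
    """
    members_by_role = defaultdict(set)
    for member, authoritative_role in assignments.items():
        for iam_role in AUTHORITATIVE_TO_IAM[authoritative_role]:
            members_by_role[iam_role].add(member)
    # Sort for stable, readable output.
    return [
        {"role": role, "members": sorted(members)}
        for role, members in sorted(members_by_role.items())
    ]


if __name__ == "__main__":
    team = {
        "user:dev@example.com": "developer",
        "user:ops@example.com": "operations_engineer",
    }
    for binding in build_bindings(team):
        print(binding)
```

Keeping the mapping in one place means that when someone takes on or leaves an authoritative role, only the assignment changes and the permissions follow from it.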
Role definitions start with the Incident Commander. The Incident Commander is in charge of a service incident and has the authority to delegate responsibility to other roles and make decisions. From here, any number of roles and types of roles can be instantiated as needed, and it doesn't need to be done all at once.
Roles can expand and shrink as demands and responsibility scale. It's a good idea to establish a set of core roles and then scale as needed. One role to consider is an Operations Lead. An Operations Lead collects details about the incident and can make changes to the state of the system or service, such as running commands on the system to determine the cause, gathering log files and metrics, or implementing fixes.
Another important role is a Communications Lead. A Communications Lead helps notify people who may be interested in the incident. This could be to coordinate efforts within the team or to update affected parties outside the team. They can also help get the word out in the event that a specialist needs to be brought in from another part of the organization.
If the incident is complex, the Incident Commander can be inundated with requests, input, decisions, and so forth. In that case, an Assistant Incident Commander may be needed. An assistant can be delegated authority over a subset of roles and responsibilities. In fact, more than one assistant can exist, with the workload split between them. Each role would report up to the Assistant Incident Commander responsible for that role, and each Assistant Incident Commander would then report relevant data up to the Incident Commander.
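The reporting structure just described can be sketched as a simple lookup of who reports to whom. This is only an illustration: the two assistants and the way roles are split between them are assumptions, and REPORTS_TO and reporting_chain are hypothetical names.

```python
from typing import Dict, List, Optional

# Hypothetical reporting structure: each role maps to the role it reports to.
# The Incident Commander sits at the top and reports to no one.
REPORTS_TO: Dict[str, Optional[str]] = {
    "Incident Commander": None,
    "Assistant Incident Commander A": "Incident Commander",
    "Assistant Incident Commander B": "Incident Commander",
    "Operations Lead": "Assistant Incident Commander A",
    "Communications Lead": "Assistant Incident Commander B",
}


def reporting_chain(role: str, reports_to: Dict[str, Optional[str]] = REPORTS_TO) -> List[str]:
    """Return the path from a role up to the top of the reporting structure."""
    chain = [role]
    seen = {role}
    while reports_to[chain[-1]] is not None:
        manager = reports_to[chain[-1]]
        if manager in seen:
            # Guard against a misconfigured structure that loops back on itself.
            raise ValueError(f"Reporting loop detected at {manager!r}")
        chain.append(manager)
        seen.add(manager)
    return chain


if __name__ == "__main__":
    print(" -> ".join(reporting_chain("Operations Lead")))
```

Because every chain ends at the Incident Commander, adding or removing an assistant only changes entries in the lookup, which matches the idea that roles can expand and shrink as the incident demands.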
Another important role is the subject matter expert. There can be several of these roles, since it is likely that expertise will be needed in more than one area. Subject matter experts are usually in the trenches with the people who are working hands-on. They can be of great help when assessing and repairing service interruptions and issues.
Other roles can be defined as needed. For instance, there may need to be a person in charge of coordinating meetings, scheduling meeting rooms, and gathering provisions. No matter how many roles are used, the Incident Commander is in charge. The Incident Commander is the one person everyone looks to for direction and decision making. Authoritative roles are great for separating areas of concern, but what about roles that help get work done in the system? Let's look at IAM roles next.
Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture.
Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.
Cory is now enjoying working as a contractor and author.