Managing & Investigating Service Incidents on GCP
Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in that endeavor.
Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course will discuss the strategy for managing such roles and effectively managing the team. Part of managing the team is having a process for turnover of team members, managing the workload of the team, developing and scaling a reporting structure, and maintaining team productivity.
Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication within the team and external to the team is paramount. This is especially true for keeping stakeholders informed.
The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the information needed.
If you have any feedback relating to this course, please contact us at email@example.com.
- Understand how to handle personnel to aid incident response
- Learn how to manage roles within a team
- Learn how to investigate incidents effectively
This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.
- An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
- A good understanding of managing service issues
- Knowledge of issue mitigation practices
- An understanding of logging and monitoring concepts
- High-level knowledge of how roles should interact
Incident and request management can have a broad scope. It involves handling requests for impact assessment, following an incident resolution process, providing status updates, documenting processes and state, and establishing clear communication.
In this video, we will explore each of these areas. Often when there is a rollout of a new product or implementation, or an incident or issue with the system, an assessment of what effect it has on other services is warranted. Handling requests for impact assessment is an essential task, but it can be manageable. Each request is typically unique, but it can follow the same process.
Gathering information about the event is the first step. What type of event it is, when it will happen or did happen, what systems are connected, and how long the event is expected to last are some examples of the type of information needed.
Once the information is gathered, the system should be scoped to determine which of the parts likely to be impacted need to be assessed. The assessment is then done to determine the likely impacts and to identify areas of concern.
Reporting the findings is the next step, and the report will be used to provide support for the event.
The incident resolution process formalizes the stages that should take place to efficiently resolve incidents. It is important to have a system for queuing and tracking incidents, such as Kanban boards, service management tools, or a whiteboard: just something to give people a reference for what is happening with the incident.
The other part of request handling is the process by which requests are handled. The process can be broken down into five phases: identification, coordination, resolution, closure, and continuous improvement. Identification of the issue can be done through either automated or manual reporting. Coordination of efforts is essential so that work can be streamlined and duplicate or counterproductive work can be prevented.
Resolution involves gathering information about the incident, assessing the issue, determining what resources are affected, limiting damage, fixing the issue, correcting any affected systems, and communicating status.
Closure is an important aspect of the process. It allows time to reflect on the events, what we learned, what was well done, and what needs improvement. Continuous improvement uses closure to formulate new strategies and prepare training.
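As a rough illustration, the five phases just described could be modeled as an ordered sequence that an incident moves through. This is only a minimal sketch: the names Phase, next_phase, and walk_phases are hypothetical and are not part of any Google Cloud Platform tool, and a real process may loop back to earlier phases.

```python
from enum import Enum
from typing import List, Optional


class Phase(Enum):
    """The five phases of the request handling process, in order."""
    IDENTIFICATION = 1
    COORDINATION = 2
    RESOLUTION = 3
    CLOSURE = 4
    CONTINUOUS_IMPROVEMENT = 5


def next_phase(current: Phase) -> Optional[Phase]:
    """Return the phase that follows `current`, or None after the last one."""
    if current is Phase.CONTINUOUS_IMPROVEMENT:
        return None
    return Phase(current.value + 1)


def walk_phases() -> List[str]:
    """List the phase names in the order an incident passes through them."""
    names = []
    phase: Optional[Phase] = Phase.IDENTIFICATION
    while phase is not None:
        names.append(phase.name)
        phase = next_phase(phase)
    return names


if __name__ == "__main__":
    print(" -> ".join(walk_phases()))
```

Keeping the phases in one explicit, ordered type makes it easy for a tracking board or script to show where each incident currently sits.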
Status updates are essential in keeping the immediate team, external teams, stakeholders, and other interested parties informed about an incident. There are many ways to distribute the information. Some of the popular methods include using a specialized tool, chat apps, planning boards, email, and even the phone. Status updates should be sent as frequently as possible, even if there's nothing to report. This keeps everyone informed and will keep inquiries at bay.
Documenting the process and the state of the system at each major change helps to establish any trends that may be present, and it can be helpful to know exactly when the state changed. An organization may have its own standards for recording state changes. Some items to include are the timestamp of the state change, the state after the change, the actions or events that may have led to the state change, and any relevant notes about the event.
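To make the items above concrete, here is a minimal sketch of what a state change record could look like. The StateChange class, its field names, and the record_change helper are assumptions made for illustration; an organization's own standard may call for different fields or formats.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import List, Optional


@dataclass
class StateChange:
    """One entry in a hypothetical state change log."""
    timestamp: datetime                               # when the state changed
    new_state: str                                    # the state after the change
    causes: List[str] = field(default_factory=list)   # actions or events that may have led to it
    notes: str = ""                                   # any relevant notes about the event

    def to_log_line(self) -> str:
        """Render the entry as a single line of pipe-separated text."""
        causes = "; ".join(self.causes) if self.causes else "unknown"
        return (
            f"{self.timestamp.isoformat()} | state={self.new_state} "
            f"| causes={causes} | notes={self.notes}"
        )


def record_change(
    log: List[StateChange],
    new_state: str,
    causes: List[str],
    notes: str = "",
    when: Optional[datetime] = None,
) -> StateChange:
    """Append a new entry to `log`, defaulting the timestamp to the current UTC time."""
    entry = StateChange(
        timestamp=when or datetime.now(timezone.utc),
        new_state=new_state,
        causes=list(causes),
        notes=notes,
    )
    log.append(entry)
    return entry


if __name__ == "__main__":
    history: List[StateChange] = []
    record_change(history, "degraded", ["configuration rollout"], "error rate rising")
    record_change(history, "healthy", ["rollback"], "error rate back to normal")
    for change in history:
        print(change.to_log_line())
```

Recording every change in the same shape, with a timezone-aware timestamp, is what makes it possible to look back later and see exactly when the state changed and what preceded it.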
Perhaps the single most important aspect of managing incidents is communication. Communication is essential for coordinating efforts, informing on status, and keeping everyone calm by letting them know what actions are being taken. Today there are many ways to communicate among teammates, internal teams, external teams, company-wide, or with the public.
Choosing a platform or platforms that can deliver the message to the intended audience with as little difficulty as possible shouldn't be very hard given the wealth of options out there today. Let's run down some of these choices.
Chat apps like Slack, Microsoft Teams, and Skype are good for team communication. Kanban, Scrum, or Trello boards keep tabs on work done. Email notifications can be a good option for some organizations. Using a phone for one-on-one conversations or a small audience is great for immediacy. An audible message system that can play back recordings could be used. A website whose content can be updated easily is also a good choice. Many service tools have incident tracking, messaging, and notification systems. Whatever the method or methods chosen, the messages still need to be put on the wire. This could be done by automation, an individual, or a team. But even the best messaging system is useless if it is not used.
Stakeholders are an important part of any project. Keeping stakeholders informed and having a good relationship with them will help keep the project running smoothly. There are some things to keep in mind that may help when dealing with stakeholders. Know who the stakeholders are and what role they play. It's also a good idea to know which stakeholders need to be kept up-to-date constantly. Establish a trust relationship with stakeholders. This is something that needs to be built over time. Be open with the information when communicating. Being transparent will go a long way in establishing trust. Have a good idea of what is important to which stakeholders and try to meet those mandates.
Receive feedback with an open mind. Listen to the stakeholders' concerns and suggestions. When issues arise, be open about options to resolve the issue. Make good on promises. If a promise can't be met, make sure to communicate that to the stakeholders early rather than late, and explain why.
Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture.
Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.
Cory is now enjoying working as a contractor and author.