Managing & Investigating Service Incidents on GCP
Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in that work.
Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course discusses a strategy for managing those roles and for managing the team effectively. Part of managing the team is having a process for turnover of team members, managing the team's workload, developing and scaling a reporting structure, and maintaining team productivity.
Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication within the team and external to the team is paramount. This is especially true for keeping stakeholders informed.
The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the information needed.
If you have any feedback relating to this course, please contact us at firstname.lastname@example.org.
- Understand how to handle personnel to aid incident response
- Learn how to manage roles within a team
- Learn how to investigate incidents effectively
This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.
- An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
- A good understanding of managing service issues
- Knowledge of issue mitigation practices
- An understanding of logging and monitoring concepts
- High-level knowledge of how roles should interact
IAM roles are tangible roles defined in Google Cloud Platform. Each role groups a set of permissions under a name. Permissions follow the naming convention service.resource.verb. For instance, bigquery.jobs.list consists of the service, bigquery (BigQuery), the resource, jobs, and the verb, list.
Google Cloud Platform provides many predefined roles that are ready to use. If none of the predefined roles suits the need, a custom role can be defined and given the required permissions.
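To make the naming convention concrete, here is a minimal Python sketch that splits a permission string into its three parts. The `Permission` type and `parse_permission` helper are hypothetical names used only for illustration; they are not part of any Google Cloud library.

```python
from typing import NamedTuple


class Permission(NamedTuple):
    """The three parts of a permission name (illustrative type only)."""
    service: str
    resource: str
    verb: str


def parse_permission(name: str) -> Permission:
    """Split a permission string of the form service.resource.verb."""
    parts = name.split(".")
    # Exactly three non-empty, dot-separated parts are expected.
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected service.resource.verb, got {name!r}")
    return Permission(*parts)


print(parse_permission("bigquery.jobs.list"))
# Permission(service='bigquery', resource='jobs', verb='list')
```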
Let's walk through creating a custom role. I've opened my browser and navigated to Google Cloud Platform. I first need to make sure that the desired project is selected in the dropdown at the top, Cloudacademy in this case. Then I'll click on the menu button and select IAM. Down towards the bottom of the side menu is the Roles menu item. Clicking it opens the roles page. There are a lot of predefined roles listed, and one of them likely fits what is needed. However, let's create one.
Clicking the create role button at the top of the roles list opens the create role page. I'll give the role a title of Service Incident Engineer. There are a few restrictions on the characters available, and the title is limited to one hundred characters. Next, a description of the role would be helpful. It has no character restrictions and is limited to 256 characters. I'll enter a description of "A role defined for incident engineers". The ID can only include letters, numbers, periods, and underscores and is limited to 30 characters. I'll enter the role name in Pascal case.
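The field rules just described can be checked before opening the console. The sketch below is only an illustration: the limits are copied from the narration above and may differ from what the console currently enforces, the `validate_role_fields` helper is a hypothetical name, and the ID value assumes that "the role name in Pascal case" means ServiceIncidentEngineer.

```python
import re
from typing import List

# Limits as stated in the walkthrough; treat them as assumptions and
# confirm them against the console before relying on them.
TITLE_MAX = 100
DESCRIPTION_MAX = 256
ID_MAX = 30
ID_PATTERN = re.compile(r"[A-Za-z0-9._]+")  # letters, numbers, periods, underscores


def validate_role_fields(title: str, description: str, role_id: str) -> List[str]:
    """Return a list of problems; an empty list means every check passed."""
    problems = []
    if len(title) > TITLE_MAX:
        problems.append(f"title exceeds {TITLE_MAX} characters")
    if len(description) > DESCRIPTION_MAX:
        problems.append(f"description exceeds {DESCRIPTION_MAX} characters")
    if not ID_PATTERN.fullmatch(role_id):
        problems.append("ID may only contain letters, numbers, periods and underscores")
    if len(role_id) > ID_MAX:
        problems.append(f"ID exceeds {ID_MAX} characters")
    return problems


print(validate_role_fields(
    "Service Incident Engineer",
    "A role defined for incident engineers",
    "ServiceIncidentEngineer",  # assumed Pascal-case ID
))
```

An empty list means the values used in this walkthrough satisfy the stated rules.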
Next, a launch stage can be selected from the dropdown. Options of alpha, beta, general availability, and disabled exist and allow for gradual rollout of a new role. I'll leave the stage set to alpha. Finally, we can choose the permissions that are grouped under the role.
Clicking the add permissions button brings up a modal. There is a "filter permissions by role" dropdown at the top, which can be used to narrow the many permission options. Clicking the dropdown opens an additional modal. The text field at the top can be used to search for a role by name or partial name. This is useful if you know that the permissions you need are already part of a known role.
I'll enter "stack" in the search box. The result is several roles that belong to the Stackdriver service. I'll select "Stackdriver Accounts Editor" and then click outside the modal to close it. We can see that the list of permissions has been narrowed to just those that are part of the Stackdriver Accounts Editor role.
Notice there are three columns: a checkbox, the permission name, and a status. The only one that appears applicable is the "resourcemanager.projects.get" permission, so I'll choose that one. Clicking add will close the modal and add the permission to the list of permissions for the custom role.
We now see that "No assigned permissions" has changed to "1 assigned permission". Clicking the blue checkbox beside the permission that was just added deselects the permission from the list, effectively revoking that permission. This makes it easy to deselect permissions and then select them again without having to search for them.
The table of permissions also has its own filter box. The question mark button offers further filtering options, and the button beside the question mark contains options for additional table headings. Clicking the "Show added and removed permissions" button will show which permissions are currently added to the role.
Most of the time, more than one permission is needed. More permissions can be added to the list by clicking the add permissions button once again and repeating the process. Once you are satisfied with the configuration, clicking the create button at the bottom will finalize the new role, and clicking cancel will discard the progress. I want to save these changes, so I'll click the create button.
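The same role can be described as data rather than as console clicks. The following sketch assembles a role definition as a Python dictionary and prints it as JSON. The field names (roleId, title, description, stage, includedPermissions) and the upper-case stage values follow my understanding of the IAM API's role resource and should be verified against the official reference. The `build_role_request` helper is hypothetical, the ID assumes the Pascal-case form of the title, and the permission is written in the form IAM uses, resourcemanager.projects.get.

```python
import json
from typing import Dict, Iterable

# The four launch stages offered in the walkthrough's dropdown,
# written in an assumed upper-case spelling.
LAUNCH_STAGES = {"ALPHA", "BETA", "GA", "DISABLED"}


def build_role_request(role_id: str, title: str, description: str,
                       stage: str, permissions: Iterable[str]) -> Dict:
    """Assemble a custom-role definition as plain data (illustrative only)."""
    if stage not in LAUNCH_STAGES:
        raise ValueError(f"unknown launch stage: {stage!r}")
    return {
        "roleId": role_id,
        "role": {
            "title": title,
            "description": description,
            "stage": stage,
            # De-duplicate and sort so the output is stable.
            "includedPermissions": sorted(set(permissions)),
        },
    }


request = build_role_request(
    "ServiceIncidentEngineer",               # assumed Pascal-case ID
    "Service Incident Engineer",
    "A role defined for incident engineers",
    "ALPHA",
    ["resourcemanager.projects.get"],
)
print(json.dumps(request, indent=2))
```

Keeping the definition as data makes it easy to review a role and store it in version control before anything is created in the project.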
The roles list page is displayed once again. I'll use the filter box to search for the new role. We can see that the role has been added. A role can be edited to change some of the details of its definition. Clicking on the ellipsis beside the role and then selecting the edit option will open the edit page. The title, description, stage, and permissions can be changed.
New permissions can be added with the add permissions button, and permissions can be removed by clearing the check mark beside them. Selecting update will save the changes and return to the roles page. I'll select cancel and then confirm to go back to the roles page without saving the changes.
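When editing a role, it helps to see exactly what would change before selecting update. As a rough sketch, the difference between the saved and edited permission lists can be computed with set operations. The `permission_changes` helper is hypothetical, and the permission names in the example are used only for illustration.

```python
from typing import Dict, Iterable, List


def permission_changes(before: Iterable[str], after: Iterable[str]) -> Dict[str, List[str]]:
    """Report which permissions an edit would add and which it would remove."""
    before_set, after_set = set(before), set(after)
    return {
        "added": sorted(after_set - before_set),
        "removed": sorted(before_set - after_set),
    }


saved = ["resourcemanager.projects.get"]
edited = ["logging.logEntries.list", "monitoring.timeSeries.list"]
print(permission_changes(saved, edited))
```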
Clicking on the ellipsis beside a role brings up options. More options are available on custom roles than on predefined roles. Predefined roles can't be edited, deleted, or disabled, but they can be used as a starting point for creating a new role. Often, it is desirable to have permissions similar to those of an existing role, with a few more or a few fewer. Once again, I'll use the filter box to search for our custom role.
Clicking the ellipsis beside our custom role gives us the options of using this role as a starting point for a new role, disabling it, deleting it, and editing it, as you've seen. I'll choose Delete to remove this role. The role's status has been changed to deleted.
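The option of using an existing role as a starting point amounts to copying its permissions and then adding or removing a few. A minimal sketch of that idea follows. The `derive_permissions` helper is hypothetical, and the base permission list is made up for illustration rather than taken from any actual predefined role.

```python
from typing import Iterable, List


def derive_permissions(base: Iterable[str],
                       add: Iterable[str] = (),
                       remove: Iterable[str] = ()) -> List[str]:
    """Start from an existing role's permissions, then add and remove a few."""
    return sorted((set(base) | set(add)) - set(remove))


# Hypothetical permissions of the role used as the starting point.
base_role = ["logging.logEntries.list", "resourcemanager.projects.get"]

new_role = derive_permissions(
    base_role,
    add=["monitoring.timeSeries.list"],
    remove=["logging.logEntries.list"],
)
print(new_role)
# ['monitoring.timeSeries.list', 'resourcemanager.projects.get']
```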
Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture.
Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.
Cory is now enjoying working as a contractor and author.