Managing and Investigating Service Incidents on GCP
Course Introduction
Difficulty
Intermediate
Duration
37m
Students
272
Ratings
4.6/5
Description

Managing and investigating service incidents is an important part of the maintenance process. It is a necessity that can be laborious, but with the right organization, an understanding of the systems, knowledge of the processes, and the discipline to adhere to best practices, it can be optimized. This course focuses on the main parts of managing service incidents and on using Google Cloud Platform to aid in that effort.

Perhaps the most important aspect of managing service incidents is managing the personnel involved. With that comes the need to manage their roles and responsibilities. This course discusses a strategy for defining those roles and managing the team effectively. Part of managing the team is having a process for turnover of team members, managing the team's workload, developing and scaling a reporting structure, and maintaining team productivity.

Perhaps the second most important aspect of managing service incidents is establishing effective communication. Constant and effective communication, both within the team and with people outside it, is paramount. This is especially true for keeping stakeholders informed.

The course will also discuss tooling to aid in monitoring and incident resolution, specifically Google Cloud Platform’s Stackdriver service. The service makes investigating service incidents easier by giving the response team the information needed.

If you have any feedback relating to this course, please contact us at support@cloudacademy.com.

Learning Objectives

  • Understand how to handle personnel to aid incident response
  • Learn how to manage roles within a team
  • Learn how to investigate incidents effectively

Intended Audience

This course is suited to anyone wanting to learn about incident handling using Google Cloud Platform.

Prerequisites

  • An active Google Cloud Platform account with admin permissions in order to administer roles, create test infrastructure, and configure operational tooling
  • A good understanding of managing service issues
  • Knowledge of issue mitigation practices
  • An understanding of logging and monitoring concepts
  • High-level knowledge of how roles should interact

Transcript

Welcome to the Managing and Investigating Service Incidents on Google Cloud Platform course. This video will introduce you to the course, but first, allow me to introduce myself. My name is Cory W. Cordell, and I'll be your instructor for this course. I've worked as a Cloud DevOps Engineer and Architect on both Greenfield and existing DevOps integrations for small to large companies across the globe, in an effort to establish DevOps culture and best practices. And working with cloud-based services for the past six years, I've experienced firsthand the trials and growth of several platform providers as they continue to develop new offerings and solutions. I look forward to accompanying you on your endeavor to learn more about the Google Cloud Platform.

This course is for anyone who wishes to utilize Google Cloud Platform and learn more about managing service incidents. With this audience in mind, the course is designed to teach you some of the GCP service incident management concepts you need to efficiently navigate and mitigate those incidents, such as role management, communication, and identifying and evaluating service issues.

Let's look at the course learning objectives. In this course, you will learn about strategizing roles to achieve effective and efficient resolution. Handling service requests for impact assessment. Keeping people informed with status updates. Recording incident state changes. Communication channels for notification. Response team scaling and delegation. Avoiding burnout. Rotating roles. And stakeholder relationship management.

To follow along in this course you will need: An active Google Cloud Platform or GCP account. Admin permissions on the account in order to administer roles, create test infrastructure, and configure optional tooling. A good understanding of managing service issues. Knowledge of issue mitigation practices. Understanding of logging and monitoring concepts. And a high-level knowledge of how roles should interact.

If you need help for any reason, please contact support@cloudacademy.com. After completing, don't forget to rate the course.

Lectures

  • Role Context and Defining Authoritative Roles
  • Defining IAM Roles
  • Defining Groups
  • Incident and Request Management
  • Personnel Management
  • Course Conclusion

About the Author

Cory W. Cordell is an accomplished DevOps Architect, Software Engineer, and author. He started his DevOps career as a DevOps Engineer for a large bank where he helped implement DevOps practices and tooling and establish a DevOps culture. 

Cory then accepted a position with a global firm to build a DevOps department. He led a team of DevOps Engineers to establish best practices and train development teams on tooling and those practices. He worked to help development teams migrate their applications to Azure Kubernetes Service and establish pipelines to build, test, and deploy code. Realizing that a substantial gap existed in the toolchain, he developed an application to aid in infrastructure tracking and to provide UI abilities for teams to view application status for their software.

Cory is now enjoying working as a contractor and author.