Features of FinOps
The FinOps LifeCycle
As spending on the public cloud is increasing globally, companies are looking for ways to reduce cost and increase efficiency. Financial Operations, or FinOps, is similar to DevOps, which enables companies to accelerate technology delivery. FinOps is a new operating model that maximizes the value of an organization's cloud investment.
In this course, you are going to learn about FinOps Principles and how to build FinOps Teams, as well as the three phases of the FinOps Lifecycle. Specifically, you will learn how to apply FinOps processes and practices to reduce rates and avoid unnecessary cloud costs.
If you have any feedback on this course, please get in touch with us at firstname.lastname@example.org.
- Understand what makes the cloud so powerful and why it is changing how businesses operate
- Understand what makes cloud challenging from a technology, management, and financial perspective
- Learn about the six FinOps Principles and how to build successful FinOps Teams
- Learn about FinOps capabilities and how to build a common language within your organization
- Learn about the anatomy of a cloud bill and how to take advantage of the Basic Cloud Equation
- Learn about the three phases of the FinOps Lifecycle and how to build successful processes and practices to reduce rates and avoid cost
This course is for engineers, operations, and Finance people looking to understand how to improve efficiency and reduce cost in the cloud.
To get the most out of this course, you should have a foundational understanding of cloud concepts, specifically how compute and storage are provisioned and billed in the cloud. Some familiarity with rate reduction and cost avoidance methods in the cloud would also be helpful but is not essential.
In the Optimize Phase, we target, define, and document optimization opportunities. The Optimize Phase makes use of the FinOps Principles number 2: "The business value of cloud drives decisions", number 5: "A centralized team drives FinOps", and number 6: "Take advantage of the variable cost model of the cloud".
Our goal is to use the basic cloud equation of Usage times Rate equals Cost to reduce cloud spend. Our primary levers during the Optimize Phase are Rate Reduction and Cost Avoidance, and this section is going to walk you through the details of how to use them. A prerequisite from the Inform Phase is that we are able to identify workload owners. Without being able to pinpoint specific usage and quantify potential savings, the recommendations we share will not be actionable.
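To make the two levers concrete, here is a minimal sketch of the basic cloud equation. The function name and all numbers are made up for illustration; they are not taken from any provider's pricing.

```python
def cloud_cost(usage_hours: float, rate_per_hour: float) -> float:
    """Basic cloud equation: Usage times Rate equals Cost."""
    return usage_hours * rate_per_hour


# Hypothetical baseline: 1,000 hours of usage at a rate of 0.10 per hour.
baseline = cloud_cost(1000, 0.10)

# Rate Reduction lowers the rate while usage stays the same.
after_rate_reduction = cloud_cost(1000, 0.08)

# Cost Avoidance lowers the usage while the rate stays the same.
after_cost_avoidance = cloud_cost(600, 0.10)
```

Both levers reduce the same product, which is why they can be worked on independently and combined.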
Let's start with Rate Reduction which is, with a few exceptions, done centrally by the FinOps team. The tools here are Enterprise Discount Agreements, Private Pricing Agreements, Prepayment Products, and Spot or preemptible service offerings.
Enterprise Discount Agreements are legal contracts between a cloud provider and a customer that specify a minimum spend on cloud services over a specific time duration in return for a discount. For example, from the Amazon Web Services documentation, we see that a volume discount of more than 10 percent is offered for an annual commitment of over 10 million dollars. Generally, the discount is proportional to the commitment, and the requirements may vary between cloud providers.
Cloud financial forecasting will have a critical role in determining what financial commitment can be applied over which time duration. Executives, Finance, and Legal will need to review the proposal and sign off on it. Typically if a commitment is not met by a customer, the remaining financial amount is due at the end of the commitment duration. For example, if a customer commits to spending 12 million dollars a year but only spends 10 million dollars, the remaining 2 million dollars will need to be paid to the cloud provider at the end of the year.
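The shortfall in this example can be sketched as a small calculation. The function name is hypothetical, and real agreements may define the true-up differently.

```python
def commitment_shortfall(committed: float, actual_spend: float) -> float:
    """Amount still owed at the end of the term if spend fell short.

    Returns 0 when the commitment was met or exceeded.
    """
    return max(0.0, committed - actual_spend)


# The example from the text: 12 million committed, 10 million spent.
owed = commitment_shortfall(12_000_000, 10_000_000)
```

This is the number a forecast should try to keep at zero when choosing the size of the commitment.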
Private Pricing Agreements, also called Rate Cards, are service-specific contracts with a minimum spend over a specific time duration in return for a discount. For example when a customer commits to spending 4 million dollars a year on a specific cloud service the cloud provider may offer a 20 percent discount in return.
It is possible for a customer to have Enterprise Discount Agreements and Private Pricing Agreements at the same time. As Enterprise Discount Agreements are renewed, the cloud provider may offer to incorporate some of the Private Pricing Agreements in the new agreement.
It is also possible to change or to adjust both Enterprise Discount Agreements and Private Pricing Agreements by so-called amendments. These amendments may change the initial commitment or they can adjust the verbiage only without changing the commitment. And amendments will need to be reviewed and signed off by the cloud vendor and the customer's Legal department at the minimum.
The FinOps team will need to reach out to Finance to find out how the discounts are accounted for on the customer side. For example, discounts can be applied directly to cloud usage as a rate reduction, or discounts can be aggregated in a special budget that is managed separately. How these discounts are applied will also affect the reports the FinOps team built in the Inform Phase.
Next, let's talk about Prepayment Products which are also managed centrally by the FinOps team. This will take advantage of the cloud spend as a whole by balancing the needs of all teams and not just optimizing individual teams in isolation. Amazon Web Services offers Savings Plans and Reserved Instances for multiple services, Google Cloud Platform offers Sustained Use Discounts and Committed Use Discounts, while Microsoft Azure offers Reservations.
These prepayment products offer a discount in return for a usage commitment over a time duration. For example, a customer may purchase twenty AWS RIs for m5.large in Northern Virginia for a duration of one year for a discount of around 38 percent. This means that the customer will pay for one year of this usage whether they use the service or not.
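Because the commitment is paid whether or not it is used, it helps to know how much of it has to be used before it beats list price. A minimal sketch, assuming a flat discount and a placeholder list rate of 0.10 per hour (not an actual AWS price):

```python
HOURS_PER_YEAR = 24 * 365


def break_even_utilization(discount: float) -> float:
    """Fraction of committed hours that must be used to match list price."""
    return 1.0 - discount


def commitment_savings(list_rate: float, discount: float, instances: int,
                       used_fraction: float) -> float:
    """Savings over one year versus paying list price for the used hours.

    A negative result means the commitment cost more than list price.
    """
    committed_cost = list_rate * (1.0 - discount) * instances * HOURS_PER_YEAR
    list_price_cost = list_rate * instances * HOURS_PER_YEAR * used_fraction
    return list_price_cost - committed_cost


# Twenty instances at a 38 percent discount, as in the example above.
break_even = break_even_utilization(0.38)  # 62 percent of committed hours
fully_used = commitment_savings(0.10, 0.38, 20, used_fraction=1.0)
half_used = commitment_savings(0.10, 0.38, 20, used_fraction=0.5)
```

With a 38 percent discount, the purchase only pays off if more than about 62 percent of the committed hours are actually used.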
To get started, the FinOps team appoints a person to own all prepayment purchases. That person establishes a purchasing cadence, for example, once a month, and makes regular, small, and non-controversial purchases to build muscle around this process.
Following the Crawl, Walk, Run methodology, the appointed person monitors how much of the existing prepayment products are being used, so-called Utilization, and how much of the list price usage is covered by prepayment products, which is called Coverage. The idea is to maximize both Utilization and Coverage.
A good starting goal for Utilization is about 80 percent while any percentage for Coverage is better than no Coverage at all. A more advanced goal would be close to 100 percent Utilization and above 90 percent Coverage.
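Utilization and Coverage can be sketched as two ratios over the same period. This is a simplified model that counts hours instead of dollars; the class, the field names, and the 98 percent threshold standing in for "close to 100 percent" are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class PrepaymentSnapshot:
    purchased_hours: float    # hours of commitment that were paid for
    used_hours: float         # purchased hours that matched running usage
    total_usage_hours: float  # all eligible usage, covered or not

    @property
    def utilization(self) -> float:
        """Share of the purchased commitment that was actually used."""
        return self.used_hours / self.purchased_hours

    @property
    def coverage(self) -> float:
        """Share of all eligible usage that was covered by commitments."""
        return self.used_hours / self.total_usage_hours


def meets_starting_goal(s: PrepaymentSnapshot) -> bool:
    # Starting goal from the text: about 80 percent Utilization.
    return s.utilization >= 0.80


def meets_advanced_goal(s: PrepaymentSnapshot, near_full: float = 0.98) -> bool:
    # Advanced goal: close to 100 percent Utilization, above 90 percent Coverage.
    return s.utilization >= near_full and s.coverage > 0.90


snapshot = PrepaymentSnapshot(purchased_hours=1000, used_hours=850,
                              total_usage_hours=2000)
```

In this made-up snapshot, Utilization is 85 percent and Coverage is 42.5 percent, so the starting goal is met but there is room to buy more.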
Some Prepayment Products require active management, like Amazon Web Services Convertible Reserved Instances, while other products are self-managed, like Amazon Web Services Savings Plans. The appointed person will need to establish a cadence to regularly perform Modifications and Exchanges of managed Prepayment Products.
For larger purchases, for example, 5 percent or more of the annual cloud spend, the FinOps team will need to reach out to Finance as these purchases may require additional approvals. Think of it this way: someone from the leadership team needs to decide whether the money is better spent on special projects than on saving cost in the cloud.
Any upfront cost of Prepayment Products will need to be amortized. The FinOps team will need to align with Finance and Accounting before making purchases. And any amortization will affect the reports the FinOps team built in the Inform Phase.
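As an illustration of what amortization does to reporting, here is a minimal straight-line sketch. The method and the numbers are assumptions; Finance and Accounting decide the actual treatment.

```python
def amortization_schedule(upfront_cost: float, term_months: int) -> list[float]:
    """Spread an upfront payment evenly across the months of the term."""
    monthly = upfront_cost / term_months
    return [monthly] * term_months


# Hypothetical purchase: 12,000 paid upfront for a twelve-month term.
schedule = amortization_schedule(12_000.0, 12)
```

Instead of one spike in the purchase month, each month of the term carries an equal share, which is what the Inform Phase reports would show.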
Now let's look at Spot or preemptible service offerings where management is decentralized within the engineering teams. Preemptible here means interruptible, specifically a virtual machine may be terminated by the cloud provider. Amazon Web Services offers a Spot Market, where virtual machines receive up to 90 percent discount over list price. Google Cloud Platform offers Preemptible virtual machines, and Microsoft Azure offers Low Priority and Spot virtual machines.
The FinOps team has to challenge the engineers and architects to find applications that can take advantage of Spot or preemptible service offerings. This may require code changes to make the application stateless so it can handle terminations by the cloud vendor. Stateless here means that an application stores any transient information outside the virtual machine. For example, user choices or intermediate calculation results need to be stored within a different cloud service, like an object store or a file server.
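The idea of keeping transient state outside the virtual machine can be sketched as follows. The `InMemoryStore` class is a stand-in for an external service such as an object store, and all names here are hypothetical.

```python
class InMemoryStore:
    """Stand-in for an external store that survives VM termination."""

    def __init__(self) -> None:
        self._data: dict[str, int] = {}

    def load(self, key: str) -> int:
        return self._data.get(key, 0)

    def save(self, key: str, value: int) -> None:
        self._data[key] = value


def process_items(items, store, job_id, interrupt_after=None):
    """Sum the items, saving progress to the store after each one.

    interrupt_after simulates the cloud provider terminating the VM
    after that many items; the function then returns None.
    """
    position = store.load(f"{job_id}/position")
    total = store.load(f"{job_id}/total")
    handled = 0
    while position < len(items):
        if interrupt_after is not None and handled >= interrupt_after:
            return None  # simulated termination, local state is lost
        total += items[position]
        position += 1
        # A real store would need to write these two values atomically.
        store.save(f"{job_id}/total", total)
        store.save(f"{job_id}/position", position)
        handled += 1
    return total


store = InMemoryStore()
items = [1, 2, 3, 4, 5]
first_run = process_items(items, store, "job-1", interrupt_after=2)
second_run = process_items(items, store, "job-1")  # resumes at item 3
```

The second run picks up where the first one stopped because nothing important lived only on the terminated machine.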
Container-based applications, such as those running on Docker or Kubernetes, are usually good candidates to use Spot or preemptible service offerings as these applications are more likely to be stateless already.
There are many techniques to maximize savings. For example, a broker application can monitor Spot market prices and swap regular virtual machines for Spot instances when the price is low and reverse the process as market prices increase. A similar approach is to stop Spot instances, a form of hibernation, when the price increases and start them up again when the price drops.
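The decision at the heart of such a broker can be sketched in a few lines. The threshold of half the list price is an arbitrary assumption, and a real broker would also have to fetch prices and move workloads.

```python
def choose_capacity(spot_price: float, list_price: float,
                    max_spot_ratio: float = 0.5) -> str:
    """Pick Spot while its price stays at or below a share of list price."""
    if spot_price <= list_price * max_spot_ratio:
        return "spot"
    return "on-demand"


# Hypothetical prices per hour for the same virtual machine type.
when_cheap = choose_capacity(spot_price=0.03, list_price=0.10)
when_expensive = choose_capacity(spot_price=0.08, list_price=0.10)
```

The same check could drive the hibernation variant by returning "stop" instead of "on-demand" when the price is too high.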
Cloud providers are increasing their support for Spot or preemptible service offerings. For example, Amazon Web Services offers Spot for SageMaker and Fargate, and expanded Auto Scaling to support Spot Fleet. The latter allows engineers to specify an assortment of virtual machine types and sizes to compensate for availability shortages. However, this requires the application to be flexible enough to use different CPU and memory configurations. Otherwise, any additional resources will go unused, which can result in a higher cost.
This concludes Rate Reduction methods, now let's look at Cost Avoidance methods. Specifically, I am going to talk about Cloud Parking, Waste Reduction, Right-sizing, and Re-Architecting.
Cloud Parking takes advantage of the elasticity in the cloud by turning off resources when they are not needed. This can be done manually or on a schedule. Another option is to look at patterns over the last 4 weeks and build schedules automatically. A simple example is to turn on development workloads during business hours. For example, turning virtual machines on for 12 hours 5 days a week results in cost avoidance of 64 percent.
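The 64 percent figure follows directly from the share of the week during which the virtual machines are off. A minimal sketch, assuming cost scales linearly with running hours:

```python
HOURS_PER_WEEK = 24 * 7  # 168


def parking_cost_avoidance(hours_on_per_day: float,
                           days_on_per_week: int) -> float:
    """Fraction of weekly cost avoided by running only on a schedule."""
    hours_on = hours_on_per_day * days_on_per_week
    return 1.0 - hours_on / HOURS_PER_WEEK


# 12 hours a day, 5 days a week: 60 of 168 hours running.
avoided = parking_cost_avoidance(12, 5)
```

Sixty running hours out of 168 leaves about 64 percent of the week parked.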
Whenever you use automation to turn off workloads, you will need to provide engineers with an easy way to turn the workloads back on or even to be excluded from the automation. The goal is to provide guardrails instead of gatekeepers to promote the speed of innovation.
Next, let's look at Waste Reduction. Cloud waste consists of resources that have been requested but are not fully used or not used at all. This is similar to forgetting to turn off the lights after leaving a room. Cloud providers do not natively turn off resources when they are not in use. That would be similar to a sensor that turns off the lights when no one is in a room.
Almost every cloud service will incur a cost when not in use, with the exception of some serverless offerings. The FinOps Foundation GitHub repository has an extensive list of so-called waste sensors, essentially code examples that surface potential savings opportunities.
Using the Crawl, Walk, Run approach, the FinOps team can start communicating potential waste by sharing spreadsheets with leadership and engineers. To make recommendations actionable, waste needs to be attributed to workload owners, and the potential savings opportunity needs to be quantified as a financial number.
The next step is to put the financial number in the context of the workload by establishing a waste percentage KPI and providing historic trending of the KPI. For example, instead of reporting that an application has 1,000 dollars in waste, it is more impactful to show that the waste of an application has been steadily growing over the last three months and is now at 20 percent, which amounts to 1,000 dollars of potential savings.
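A waste percentage KPI with a simple trend check can be sketched like this. Only the final month (1,000 dollars at 20 percent) comes from the example above; the earlier months and the function names are made up.

```python
def waste_percentage(waste: float, total_cost: float) -> float:
    """Waste as a percentage of the workload's total cost."""
    return waste * 100 / total_cost


def is_steadily_growing(values: list[float]) -> bool:
    """True if every value is higher than the one before it."""
    return all(later > earlier for earlier, later in zip(values, values[1:]))


# (waste, total cost) per month for one application, oldest first.
monthly = [(500, 5000), (750, 5000), (1000, 5000)]
kpi_trend = [waste_percentage(w, t) for w, t in monthly]
growing = is_steadily_growing(kpi_trend)
```

Reporting the trend of 10, 15, and 20 percent next to the 1,000 dollars gives the workload owner both the size and the direction of the problem.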
Some workloads will not be able to improve their utilization or reduce waste. Examples are applications that are statically provisioned with overhead to handle usage spikes, applications without an upgrade path that would utilize auto-scaling, and software that is nearing end-of-life. For these cases, the FinOps team will need to build an exception process to exclude these workloads from a waste report or dashboard.
Any exceptions should be made available in a waste exception report and the exception granted should be periodically reviewed. This will remind leadership and engineers of the wasteful workloads and eventually lead to a replacement or decommissioning of them.
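One way to picture the exception process is to split the waste data into a regular report and an exception report, and to flag exceptions that are due for review. The data model and the 90-day review period are assumptions for illustration.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class WasteException:
    workload: str
    reason: str
    granted_on: date


def split_waste_report(waste_by_workload: dict[str, float],
                       exceptions: list[WasteException]):
    """Return (regular report, exception report) as two dictionaries."""
    excepted = {e.workload for e in exceptions}
    report = {w: c for w, c in waste_by_workload.items() if w not in excepted}
    exception_report = {w: c for w, c in waste_by_workload.items()
                        if w in excepted}
    return report, exception_report


def due_for_review(exceptions: list[WasteException], today: date,
                   review_after_days: int = 90) -> list[WasteException]:
    """Exceptions that were granted at least review_after_days ago."""
    limit = timedelta(days=review_after_days)
    return [e for e in exceptions if today - e.granted_on >= limit]


exceptions = [WasteException("legacy-billing", "nearing end-of-life",
                             date(2024, 1, 15))]
waste = {"legacy-billing": 1000.0, "web-frontend": 250.0}
report, exception_report = split_waste_report(waste, exceptions)
to_review = due_for_review(exceptions, today=date(2024, 6, 1))
```

Excepted workloads leave the main report but stay visible in their own report, so the waste is not forgotten.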
Right-sizing is a special case of waste where a workload is provisioned on a cloud resource that is not an ideal fit. Think of someone renting a six-bedroom house but it is just one person and a cat. If the person doesn't have to pay the rent, they will continue to live in a house that is too large for them.
Right-sizing is a widespread issue that often requires additional tooling to gain visibility into how much was actually used of the size that was requested.
Let's look at containers, which Kubernetes runs in pods, spelled P-O-D-S. They are configured using a requested and a maximum size for CPU and memory. The requested size is the minimum resource with which the container will be started. For memory, the maximum means that if the container ever grows beyond that size, it will be automatically terminated, while CPU usage above the maximum is throttled instead. The FinOps team will need to build additional tooling to collect data about how much of the requested size was actually used. For example, an engineer could configure a container to have a requested size of one gigabyte of memory, but only use one hundred megabytes, or 10 percent.
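The calculation such tooling would perform can be sketched as follows, treating one gigabyte as 1,000 megabytes to match the example. How the usage numbers are collected is left out, and the function names are hypothetical.

```python
def request_utilization(used_mb: float, requested_mb: float) -> float:
    """Fraction of the requested memory that was actually used."""
    return used_mb / requested_mb


def unused_request_mb(used_mb: float, requested_mb: float) -> float:
    """Requested memory that was reserved but never used."""
    return max(0.0, requested_mb - used_mb)


# The example from the text: 1 gigabyte requested, 100 megabytes used.
utilization = request_utilization(used_mb=100, requested_mb=1000)
unused = unused_request_mb(used_mb=100, requested_mb=1000)
```

Here 900 megabytes are reserved for the container and unavailable to other workloads, even though they are never touched.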
Even serverless cloud offerings are affected by right-sizing. This is because a serverless resource can be incorrectly sized for the duration of the usage. For example, Amazon Web Services Lambda functions incur a cost based on how much memory was requested over how much time. The amount of CPU assigned to a Lambda function is proportional to the amount of memory requested. It is possible to give a Lambda function very little memory and have it running for a long time which increases cost. We can avoid cost by giving the Lambda function a little more memory which will result in a shorter run time which reduces cost.
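The trade-off between memory and run time can be sketched with a simplified cost model of memory times duration. The price per gigabyte-second is a placeholder and the two run times are invented; actual durations have to be measured, and per-request fees are ignored.

```python
def invocation_cost(memory_mb: int, duration_seconds: float,
                    price_per_gb_second: float = 0.00002) -> float:
    """Cost of one invocation: memory in GB times seconds times price."""
    return (memory_mb / 1024) * duration_seconds * price_per_gb_second


# Hypothetical measurements for the same function at two memory sizes.
small_and_slow = invocation_cost(memory_mb=128, duration_seconds=10.0)
larger_and_fast = invocation_cost(memory_mb=256, duration_seconds=4.0)
```

In this made-up case, doubling the memory cuts the run time by more than half, so the larger configuration is the cheaper one.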
Next within the Optimize Phase is Re-Architecting. This means that the current configuration of a workload is not taking advantage of the elasticity property of the cloud or not fully using cloud resources. Good examples are when a workload uses expensive third-party software licenses, doesn't auto scale, or the disaster recovery scenario requires partially or fully scaled up resources in the failover region.
For example, even modern software like Apache Cassandra doesn't automatically scale depending on user demand. In fact, even scaling Apache Cassandra manually is somewhat of a challenge. Migrating to a cloud-native NoSQL database is a common solution path in this situation.
Another common scenario is disaster recovery that needs partially or fully scaled resources in the failover region. The former is often called a pilot light, as a minimal footprint receives data during regular operations and can be scaled up during a failover situation.
Both scenarios can be implemented relatively quickly but are not ideal from a cost and efficiency perspective. A better method is to use an active-active approach, where production workloads are taking traffic in both regions, and scale automatically to accommodate failover scenarios. While this method requires more engineering work, it has the benefit that it can be tested on a regular basis. This will increase confidence in the solution and also has the benefit that it can handle usage spikes more gracefully.
To summarize the FinOps Optimize Phase: We use Rate Reduction and Cost Avoidance to improve efficiency and reduce cost. Rate Reduction methods are Enterprise Discount Agreements, Private Pricing Agreements, Prepayment Products, and Spot or preemptible service offerings. And Cost Avoidance methods are Cloud Parking, Waste Reduction, Right-sizing, and Re-Architecting.
Dieter Matzion is a member of Intuit’s Technology Finance team supporting the AWS cost optimization program.
Most recently, Dieter was part of Netflix’s AWS capacity team, where he helped develop Netflix’s rhythm and active management of AWS including cluster management and moving workloads to different instance families.
Prior to Netflix, Dieter spent two years at Google working on the Google Cloud offering focused on capacity planning and resource provisioning. At Google he developed demand-planning models and automation tools for capacity management.
Prior to that, Dieter spent seven years at PayPal in different roles ranging from managing databases, network operations, and batch operations, supporting all systems and processes for the corporate functions at a daily volume of $1.2B.
A native of Germany, Dieter has an M.S. in computer science. When not at work, he prioritizes spending time with family and enjoying the outdoors: hiking, camping, horseback riding, and cave exploration.