Managing Cost Drivers



Cloud computing providers like Amazon Web Services, Microsoft Azure, and Google Cloud Platform are becoming a larger part of our IT budget, making it necessary to understand their cost. We may even be surprised to see public cloud bills come in higher than expected. I am going to take a closer look at the top contributors and what we can do to reduce overall spending while maintaining innovation velocity.

In this course, you'll learn what makes the cloud such an attractive solution, what drives cloud adoption, and what the typical costs of cloud computing are. You'll learn about a wide range of cloud cost optimization techniques, the best practices for cost management, and how to gamify the cloud cost experience.

If you have any feedback relating to this course, please let us know at

Learning Objectives

  • Understand what makes cloud attractive and how adoption will drive cost
  • Learn how to gain visibility into cloud cost and how to hold departments accountable for their spending
  • Learn about cloud cost drivers and how to get the most out of your budget
  • Discover how to establish best practices and build a culture of cost-consciousness

Intended Audience

This course is for executives, architects, and technical leads looking to understand what drives public cloud cost and to learn about best practices of cloud cost optimization.


Prerequisites

To get the most out of this course, you should have a basic understanding of cloud concepts. Some familiarity with cloud services like compute and storage would also be helpful but is not required.


Welcome back to our cost-optimization strategies for the cloud course. I am Dieter Matzion and I will be your instructor for this lecture. In this lecture, we're going to talk about managing cost drivers. We will learn about cost-optimization activities that need to be performed on a regular basis to reduce cloud spending. I'm going to start with a few simple ones and work my way to more sophisticated methods. Unlike in a data center, where infrastructure incurs cost whether it is used or not, the elasticity property of the cloud permits you not to pay for assets when not in use. However, the responsibility of turning assets off is yours, and until you do so, your cloud provider will keep charging you.

The most basic way to reduce cost is to identify unused assets and return them to your cloud provider. For example, unused virtual machines keep incurring a cost until you terminate them. You will need to build a list of the most common cloud services your business uses and determine for which of them you are being charged. A manual approach is to use your cost-reporting infrastructure to periodically generate a list of unused resources and to terminate the ones no longer needed.
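The manual report described above can be sketched in a few lines of Python. This is a minimal illustration, not a provider API: the `VmRecord` fields, the 2% CPU threshold, and the sample rows are all assumptions standing in for whatever your cost-reporting infrastructure exports.

```python
from dataclasses import dataclass


@dataclass
class VmRecord:
    """One row of a hypothetical cost report export."""
    vm_id: str
    department: str
    avg_cpu_percent: float  # average CPU over the reporting window
    monthly_cost: float     # cost in your billing currency


def find_unused(records, cpu_threshold=2.0):
    """Return VMs whose average CPU is below the threshold, costliest first."""
    idle = [r for r in records if r.avg_cpu_percent < cpu_threshold]
    return sorted(idle, key=lambda r: r.monthly_cost, reverse=True)


# Illustrative sample rows (assumed values, not real billing data).
report = [
    VmRecord("vm-a", "marketing", 0.4, 140.0),
    VmRecord("vm-b", "payments", 37.5, 560.0),
    VmRecord("vm-c", "marketing", 1.1, 70.0),
]

for vm in find_unused(report):
    print(f"{vm.vm_id} ({vm.department}): review for termination, {vm.monthly_cost:.2f}/month")
```

Sorting by cost puts the largest potential savings at the top of the list, which helps when the review is still done by hand.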

A more efficient solution is to use automation to perform this cleanup activity for you. Another straightforward example is old files no longer needed by your business. In the data center, your storage will fill up and your engineers will build cleanup scripts. In the cloud, with virtually unlimited storage, this need disappears. However, your cloud provider still needs to maintain your old files and will charge you for them every month. While this cost will be minimal at the beginning of your cloud journey, it can grow to the point where spending some engineering cycles on cleanup is offset by the savings.

Compared to a data center, you may be relatively unaware of some cloud services incurring cost when not in use, for example, block storage volumes and their snapshots. Virtual block storage is raw, block-level storage that can be used like a virtual disk, including formatting and mounting. Higher-level operations like cloning and snapshotting are typically supported. Block storage can be unattached, meaning the virtual disk is not used by any virtual machine, but it will still incur a cost. Snapshots taken from block storage volumes will also incur a cost until they are deleted.
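A rough way to see where unattached volumes and their snapshots are costing money is to total them per department. The sketch below assumes a hypothetical inventory export; the tuple layouts, identifiers, and costs are made up for illustration and do not correspond to any provider's API.

```python
from collections import defaultdict

# Hypothetical inventory rows: (volume_id, department, attached_to, monthly_cost)
volumes = [
    ("vol-1", "search", "vm-a", 20.0),
    ("vol-2", "search", None, 35.0),
    ("vol-3", "billing", None, 10.0),
]
# Hypothetical snapshot rows: (snapshot_id, source_volume_id, monthly_cost)
snapshots = [
    ("snap-1", "vol-2", 4.0),
    ("snap-2", "vol-1", 6.0),
    ("snap-3", "vol-3", 1.5),
]


def zombie_cost_by_department(volumes, snapshots):
    """Sum the monthly cost of unattached volumes and their snapshots per department."""
    owner = {}                      # unattached volume id -> department
    totals = defaultdict(float)
    for volume_id, department, attached_to, cost in volumes:
        if attached_to is None:
            owner[volume_id] = department
            totals[department] += cost
    for _, source_volume_id, cost in snapshots:
        if source_volume_id in owner:
            totals[owner[source_volume_id]] += cost
    # Highest cost first, so the biggest offenders lead the report.
    return dict(sorted(totals.items(), key=lambda item: item[1], reverse=True))


print(zombie_cost_by_department(volumes, snapshots))
```

Snapshots of volumes that are still attached are left out here on purpose, since they may be legitimate backups.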

As before, you can use a manual approach to identify departments with the highest cost of unused block storage and later use automation to back up and delete unused volumes and snapshots. Keep in mind that these so-called zombie resources cost you money. You want to get rid of them.

Cloud providers offer numerous options for compute and storage, and they maintain these offerings in multiple regions distributed around the globe. The freedom of choice the cloud offers can result in your developers spreading their choices a little too much. For example, without oversight or governance, you may see workloads utilizing a variety of virtual machines and storage options in a large number of regions. We call this effect resource fragmentation, and it will add complexity to cost optimization and potentially add unnecessary challenges to cost reduction efforts.

For example, vendor volume discounts or prepay options may be specific to certain compute or storage options, making it more difficult to take advantage of them. It is best practice to somewhat restrict these choices for your developers without impeding the velocity of innovation. Your organization may want to put forth a guideline specifying what compute and storage options are available to your developers for common workloads and allow exceptions for more specific use cases. This guideline needs to also include a primary and secondary region recommendation.
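Such a guideline can also be checked automatically. The following is a minimal sketch under stated assumptions: the instance families mirror the Amazon Web Services example used in this lecture, while the two region names, the workload tuples, and the `violations` helper are hypothetical and only serve the illustration.

```python
# Illustrative guideline. The two region names are assumptions for this sketch.
GUIDELINE = {
    "families": {"m4", "c4", "i3", "r4"},
    "regions": {"us-west-2", "us-east-1"},  # primary and secondary (assumed)
}


def violations(workloads, guideline=GUIDELINE):
    """Return (name, reason) pairs for workloads outside the guideline."""
    found = []
    for name, instance_type, region in workloads:
        family = instance_type.split(".")[0]  # "m4.large" -> "m4"
        if family not in guideline["families"]:
            found.append((name, f"family {family} is not in the standard"))
        if region not in guideline["regions"]:
            found.append((name, f"region {region} is not in the standard"))
    return found


# Hypothetical workloads: (name, instance type, region)
workloads = [
    ("api", "m4.large", "us-west-2"),
    ("batch", "m3.xlarge", "us-west-2"),
    ("etl", "c4.2xlarge", "eu-central-1"),
]

for name, reason in violations(workloads):
    print(f"{name}: {reason}")
```

A report like this is meant to start a conversation about exceptions, not to block deployments outright.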

The secondary region is to be used for high availability and disaster recovery, or HADR for short. For example, using Amazon Web Services, you can establish a standard that includes the latest instance families, like M4, C4, I3, and R4, and two regions as the standard regions for deployment. This will limit the effort the cost optimization team needs to perform on a regular basis and allow you to maximize benefits from discounts.

The next cost-optimization activities require somewhat more involvement from your developers, as services will need to be restarted or migrated. As your business grows and evolves in the cloud, utilizing modern design patterns and technologies, the infrastructure of your cloud provider evolves as well. Cloud services that were state of the art a few years ago will be updated or replaced. Applications deployed on virtual machines in the past may be running on older generation hardware.

In my experience, virtual machines running on more modern hardware are generally cheaper and more performant. In a nutshell, there's really no reason not to migrate your application to the latest generation virtual machines. Using your reporting infrastructure, you need to identify applications on older generation virtual machines and work with your developers to migrate them to the latest generation. This will typically require going through an additional release cycle, which may not have been planned for in the current release schedule. Partner with your development leads to raise awareness around this need and have them drive these cost-optimization efforts.

Even applications running on latest generation virtual machines may have been over-provisioned, meaning they are only utilizing a fraction of the compute or storage while your business keeps paying for the full size allocated. Unless you have this capability already, you may need to expand your cloud reporting to include utilization metrics on a more detailed level, like CPU, memory, disk, and network bandwidth, to get a more complete picture of how hot or cold your workloads are running.
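Once such utilization metrics are available, flagging workloads that run cold is straightforward. The sketch below is one possible approach under assumptions: the metric names, the 40% threshold, and the sample numbers are hypothetical, and a real report would likely look at disk and network as well.

```python
def overprovisioned(workloads, peak_threshold=40.0):
    """Return names of workloads whose peak CPU and peak memory both stay under the threshold (percent)."""
    return [
        name
        for name, metrics in workloads.items()
        if metrics["peak_cpu"] < peak_threshold
        and metrics["peak_memory"] < peak_threshold
    ]


# Hypothetical peak utilization per workload over the reporting window, in percent.
utilization = {
    "checkout": {"peak_cpu": 85.0, "peak_memory": 60.0},
    "reports": {"peak_cpu": 12.0, "peak_memory": 25.0},
    "search": {"peak_cpu": 30.0, "peak_memory": 70.0},
}

print(overprovisioned(utilization))  # ['reports']
```

Using peak rather than average values is a deliberately cautious choice, so that a workload is only flagged when it never comes close to its allocated size.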

Cloud providers offer their services in a somewhat flexible style, typically allowing you to scale vertically by selecting bigger or smaller virtual machines. Right-sizing is the process where you match your compute and storage with the demand of your workloads. To right-size your workloads, developers will need to go through the same process as with migrating to latest generation virtual machines.

Use elasticity to turn off workloads that are only required a fraction of the time, for example, MapReduce clusters needed only a few hours a day, or user acceptance testing clusters needed a few days before a major release. Also consider using preemptible or evictable cloud services for workloads that are not business critical.

Consider refactoring workloads that require a lot of headroom to handle peak demand. For example, if your application requires a virtual machine to be provisioned several sizes larger than needed most of the time just to be able to perform at peak, look into redesigning that application to be stateless so that it can take advantage of cloud elasticity by scaling horizontally instead of vertically.
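The difference between the two scaling styles can be made concrete with a little arithmetic. This sketch assumes that price scales linearly with capacity and uses a made-up daily demand curve; both are simplifications for illustration, not figures from a provider.

```python
import math


def vertical_units(hourly_demand):
    """Capacity-hours paid for when one machine is sized for the peak all day."""
    return max(hourly_demand) * len(hourly_demand)


def horizontal_units(hourly_demand, unit_capacity=1):
    """Capacity-hours paid for when identical small machines are added and removed each hour."""
    return sum(math.ceil(d / unit_capacity) * unit_capacity for d in hourly_demand)


# Hypothetical demand in capacity units per hour: quiet for 20 hours, peak for 4.
demand = [2] * 20 + [8] * 4

print("vertical:", vertical_units(demand))
print("horizontal:", horizontal_units(demand))
```

With this demand curve, the vertically scaled machine pays for peak capacity around the clock, while the horizontally scaled fleet pays for peak capacity only during the four busy hours.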

To make an application stateless, you can, for example, store the state of your virtual machine in a caching service or a message queue. This will allow you to handle peak demand by adding more of the same type of virtual machines instead of having to over-provision by using larger virtual machines.

Cloud providers offer a multitude of storage options, from premium storage tiers with high throughput and low latency at the high end of the cost spectrum down to colder storage tiers at a lower cost. For example, Microsoft Azure storage offers four redundancy levels: locally redundant, zone-redundant, geo-redundant, and read-access geo-redundant, where the geo-redundant option incurs about twice the cost of the locally redundant one. Your developers will need to be made aware of the different options and price differences so they can choose the option that matches the business requirement.

Another example is Amazon Web Services Elastic Block Store, which offers four tiers of block-level storage: Provisioned IOPS SSD, also called io1, General Purpose SSD, called gp2, Throughput Optimized HDD, called st1, and Cold HDD, called sc1. The Provisioned IOPS SSD option can be further enhanced by adding input/output operations per second, or IOPS, to the service. The more IOPS are added, the higher the cost of the storage volume will be.

You need to reach out to your developers to educate them on the different storage options and their costs so they can choose the most efficient ones for their workloads.

Data goes through a lifecycle from collection or creation, through processing or analysis, to storage and, eventually, deletion. Your business will need to build policies for data retention that comply with the regulations of the country where the data is stored. Based on your data retention policy and utilizing your reporting infrastructure, you will need to build processes to migrate data from hotter, more costly storage options to colder, less costly options to reduce the cost of ownership of your data. For example, Amazon Web Services Simple Storage Service, or S3 for short, offers three storage tiers: Standard, Infrequent Access, and Glacier.

The AWS S3 service also offers automated lifecycle management. It allows you to define transition and expiration actions for objects in S3 buckets. In addition, AWS Glacier offers Vault Lock policies that allow you to specify controls such as write once read many, or WORM for short, which prevent future edits and so enforce compliance controls.
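To give an idea of what transition and expiration actions look like, the sketch below builds a lifecycle configuration as a plain Python dictionary. The structure follows the shape accepted by the S3 lifecycle API (for example as the `LifecycleConfiguration` argument of boto3's `put_bucket_lifecycle_configuration`), but the code only builds and prints it and never calls AWS. The `lifecycle_rule` helper, the `logs/` prefix, and the day counts are assumptions for illustration, not recommendations.

```python
import json


def lifecycle_rule(prefix, to_ia_days, to_glacier_days, expire_days):
    """Build one rule: Standard -> Infrequent Access -> Glacier -> delete."""
    if not (to_ia_days < to_glacier_days < expire_days):
        raise ValueError("days must increase from first transition to expiration")
    return {
        "ID": f"tiering-{prefix.strip('/') or 'all'}",
        "Filter": {"Prefix": prefix},
        "Status": "Enabled",
        "Transitions": [
            {"Days": to_ia_days, "StorageClass": "STANDARD_IA"},
            {"Days": to_glacier_days, "StorageClass": "GLACIER"},
        ],
        "Expiration": {"Days": expire_days},
    }


# Hypothetical policy: logs cool down after 30 and 90 days and expire after a year.
config = {"Rules": [lifecycle_rule("logs/", 30, 90, 365)]}
print(json.dumps(config, indent=2))
```

The actual day counts should come from your data retention policy rather than from engineering convenience.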

Moving your data from general purpose SSD block storage to Standard S3 results in a cost reduction of about 5X, and moving data from Standard S3 to Glacier gives you another 5X reduction. The total cost drop between SSD and Glacier is, in fact, 25X, something that your business cannot afford to overlook. Engage with your development leads and make them aware of these options so they can include data lifecycle management in their designs.
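The 25X figure follows from multiplying the two reductions. As a small worked example, using only the approximate ratios quoted above and relative units instead of real prices:

```python
import math


def compound_reduction(*factors):
    """Multiply successive cost-reduction factors into one overall factor."""
    return math.prod(factors)


SSD_TO_S3 = 5       # about 5X, from general purpose SSD to Standard S3
S3_TO_GLACIER = 5   # about 5X, from Standard S3 to Glacier

total = compound_reduction(SSD_TO_S3, S3_TO_GLACIER)
print(f"SSD to Glacier: {total}X")  # SSD to Glacier: 25X

# Relative cost per stored gigabyte, normalized so that Glacier = 1.
glacier = 1
s3_standard = glacier * S3_TO_GLACIER
ssd = s3_standard * SSD_TO_S3
print(f"SSD : S3 Standard : Glacier = {ssd} : {s3_standard} : {glacier}")
```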

Before we conclude this lecture, I want to walk you through the cost-optimization activity that requires the most effort in my opinion: refactoring and potentially re-architecting applications to utilize lower-cost-tier cloud services for compute and storage. For example, developers may choose an in-memory storage option for fastest access in the early stages of building applications to ensure a timely delivery to production. However, using large farms of high-memory compute is not the most cost-efficient solution.

Even though cloud providers now offer terabyte memory compute, your application may eventually reach the limits of scaling vertically. Once the usefulness of an application has been established, for example, because it has become part of your company's service offerings and is not going away any time soon, you need to reassess how that application performs cost-wise. The goal is to look for ways to reduce the total cost of ownership of that application. Applications that were developed using a more monolithic design may benefit from being redesigned into a more service-oriented architecture, where fine-grained services are loosely coupled by lightweight protocols.

For example, an application relying on a large in-memory database can instead utilize a caching service that ages out older objects to a persistent storage tier. Your cloud provider has had to cater to these needs before, and by now offers a multitude of services with synchronous and asynchronous protocols to support an event-driven architecture. For example, Amazon Web Services offers ElastiCache, which supports two open-source in-memory caching engines: Memcached and Redis.

For intermediate persistence, you may want to evaluate not-so-obvious choices, like Elasticsearch, which allows you to quickly store data in a structured form and integrates with many third-party tools, like Kibana and Logstash, that provide easy data visualization. And, of course, every cloud provider will offer a large array of SQL and NoSQL storage options.
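The caching pattern mentioned above, where objects age out of memory into a persistent tier, can be sketched in plain Python. This is a toy model under assumptions: `AgingCache` is a hypothetical class, a dictionary stands in for the persistent store, and a managed service such as Memcached or Redis would replace the in-memory part in practice.

```python
from collections import OrderedDict


class AgingCache:
    """Small in-memory cache that moves its least recently used objects to a slower store."""

    def __init__(self, capacity, persistent_store):
        self.capacity = capacity
        self.hot = OrderedDict()      # in-memory tier, most recently used last
        self.cold = persistent_store  # stand-in for a database or object store

    def put(self, key, value):
        self.hot[key] = value
        self.hot.move_to_end(key)
        while len(self.hot) > self.capacity:
            old_key, old_value = self.hot.popitem(last=False)
            self.cold[old_key] = old_value  # age out instead of dropping

    def get(self, key):
        if key in self.hot:
            self.hot.move_to_end(key)
            return self.hot[key]
        value = self.cold[key]  # raises KeyError if the key was never stored
        self.put(key, value)    # promote back into memory on access
        return value


store = {}
cache = AgingCache(capacity=2, persistent_store=store)
for key, value in [("a", 1), ("b", 2), ("c", 3)]:
    cache.put(key, value)

print(store)           # {'a': 1}
print(cache.get("a"))  # 1
```

Only the working set needs to fit in expensive memory; everything else sits in the cheaper tier until it is requested again.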

Lastly, I want to make you aware that preemptible or evictable cloud services can also be used for business-critical workloads if your application incorporates termination in its design. One of the higher-visibility customer testimonials is from the San Francisco-based ride-sharing company, Lyft. Their developers state that they were able to reduce their monthly cloud bill by 90% by changing just four lines of code. Independent of that data point, I see a trend of businesses utilizing evictable compute to their advantage. The reason is that developers excel in overcoming technical challenges. Building solutions with the technology available is part of software engineering history.
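What incorporating termination into the design can mean is easiest to see in a small simulation. The sketch below is not Lyft's approach and not a provider API: `should_stop` is a hypothetical hook standing in for whatever eviction notice your provider sends, and the `checkpoint` dictionary stands in for durable storage outside the machine.

```python
def process(items, checkpoint, should_stop):
    """Process items from the last checkpoint; return False if evicted before finishing."""
    start = checkpoint.get("next_index", 0)
    for index in range(start, len(items)):
        if should_stop():
            return False  # evicted: progress up to here is already saved
        checkpoint.setdefault("results", []).append(items[index] * 2)  # placeholder work
        checkpoint["next_index"] = index + 1  # record progress after each item
    return True


# Simulate an eviction notice that arrives after three items have been processed.
calls = {"count": 0}


def evicted_after_three():
    calls["count"] += 1
    return calls["count"] > 3


checkpoint = {}  # stand-in for durable storage outside the machine
items = [1, 2, 3, 4, 5]

finished = process(items, checkpoint, evicted_after_three)
print(finished, checkpoint["next_index"])  # False 3

# A replacement machine picks up from the checkpoint.
finished_after_resume = process(items, checkpoint, lambda: False)
print(finished_after_resume, checkpoint["results"])  # True [2, 4, 6, 8, 10]
```

Because progress is saved after every item, an eviction costs at most the item in flight, which is what makes cheaper evictable capacity usable for work that has to finish.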

If you incorporate cost in your deliverables, developers will come up with innovative solutions that help you to balance cost with revenue.

In this lecture, we worked our way from relatively straightforward cost-optimization activities to more elaborate ways to reduce cost that require application design changes. We learned how cost reporting and automation will help you on your cost optimization journey.

About the Author

Dieter Matzion is a member of Intuit’s Technology Finance team supporting the AWS cost optimization program.

Most recently, Dieter was part of Netflix’s AWS capacity team, where he helped develop Netflix’s rhythm and active management of AWS including cluster management and moving workloads to different instance families.

Prior to Netflix, Dieter spent two years at Google working on the Google Cloud offering focused on capacity planning and resource provisioning. At Google he developed demand-planning models and automation tools for capacity management.

Prior to that, Dieter spent seven years at PayPal in different roles, ranging from managing databases to network operations and batch operations, supporting all systems and processes for the corporate functions at a daily volume of $1.2B.

A native of Germany, Dieter has an M.S. in computer science. When not at work, he prioritizes spending time with family and enjoying the outdoors: hiking, camping, horseback riding, and cave exploration.