Designing cost optimized storage solutions


Designing Cost-Optimized Architectures


In this module, we will first introduce the concepts of cost optimization and how AWS compute services can be selected and applied to optimize costs. We will review the various instance types and how the available purchasing options can be selected and combined to provide a cost-optimized solution.

Next, we review how we can optimize storage costs by selecting the appropriate storage services or storage classes to create the most economical way to store objects and data in AWS cloud storage.


Hello and welcome to this lecture on how to design cost-optimized storage. AWS provides many storage services, and one of the requirements of the Solutions Architect Associate exam is being able to select the right service to deliver the best optimized solution. So I'm going to try to be as practical as possible to help you gain those skills. Let's assume we are working for a company, an online business which provides a portal for customers to request and have fulfilled legal services such as contracts and agreements. The company is facing a storage exhaustion issue in its current data center, and the decision has been made by the executive to migrate their applications and data services to AWS to increase elasticity, scalability, and durability.

So we have 160 terabytes of scanned legal documents which we want to shift to cloud storage, and our first task is to determine what is the best service to use and, secondly, how we'd go about transferring these objects to the cloud. Should we use an object store or should we use Amazon RDS? Should we use DynamoDB or perhaps even Amazon Aurora? Now the key requirement here is that we need to store unstructured objects of various sizes and formats. These are scanned documents, and they include TIFF files, JPEG files, and all sorts of other formats. Amazon S3 would be an ideal first solution for storing these scanned documents as it provides highly durable, highly available, elastic object storage. We don't need to pre-provision Amazon S3 storage, and the service will scale upwards to accommodate any future volumes of data. We wouldn't choose a relational database like Amazon RDS for storing objects like this, as a relational database would be unlikely to be efficient and the cost would likely outweigh the benefit. We don't have many structured query requirements for this data.

The same would go for DynamoDB. DynamoDB stores structured data indexed by primary key, and it allows low latency read and write access to items ranging from one byte up to 400 kilobytes. Amazon S3, on the other hand, stores unstructured blobs and is suited for storing large objects of up to five terabytes. So in order to optimize costs across AWS services, large objects or infrequently accessed data sets should be stored in Amazon S3. Let's explain the options we have for object storage in AWS and how we can apply these options to meet specific storage use cases.

Amazon S3 is a key-based object store. Essentially, you can store and retrieve any type of object in Amazon S3: image files, videos, documents, archives. It scales quickly and dynamically, so you don't need to pre-provision space for objects before you store them. There are three storage classes for Amazon S3, and each provides a level of availability, performance, and cost which makes it a highly flexible solution. So let's learn to recognize and explain the three storage classes so we can determine which to apply to any given scenario we are presented with in an exam.

So first up is the Amazon S3 Standard Class. This storage class provides the maximum durability and availability combination of the storage classes, so it suits most use cases. In fact, it's hard to find a use case that doesn't suit Standard Class Amazon S3 storage, and so it makes an ideal place to start when building a new service or migrating objects to the cloud. So with the decision to use Amazon S3 made, what would be the optimal way to import our scanned documents to Amazon S3? Copying 160 terabytes of data over a T1 connection is going to take a very long time, which is unlikely to suit our migration plan. There are a number of formulas out there you can apply to calculate just how long it would take to copy data across internet connections of various speeds. But immediately, as architects, we need to see that this is a potential delay, as a lot of network bandwidth and time will be required to copy an archive of that size.
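To make that delay concrete, here is a rough back-of-envelope sketch of the transfer-time arithmetic; the link speeds and the 80% sustained-utilization figure are illustrative assumptions, not measured values:

```python
def transfer_days(size_tb: float, link_mbps: float, utilization: float = 0.8) -> float:
    """Rough transfer time in days for size_tb terabytes over a
    link_mbps megabit/s link at the given sustained utilization."""
    bits = size_tb * 1e12 * 8                      # decimal terabytes -> bits
    seconds = bits / (link_mbps * 1e6 * utilization)
    return seconds / 86400

# 160 TB over a 1.544 Mbps T1 line versus a 1 Gbps link
t1_years = transfer_days(160, 1.544) / 365
gbps_days = transfer_days(160, 1000)
print(f"T1: ~{t1_years:.0f} years, 1 Gbps: ~{gbps_days:.1f} days")
```

Even over a full gigabit link the copy takes weeks, which is why an offline transfer device comes into the picture next.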

So this is our first optimization point with storage. We should look to import the data using AWS Snowball devices of 80 terabytes each, and keep the on-premises store synchronized using the aws s3 sync command once we've imported those archives to Amazon S3 using the Snowball devices. We can upload objects of up to five gigabytes in size in a single operation with the command line tool, and if we have objects greater than five gigabytes that we need to synchronize or copy, then we can use the multipart upload feature. Snowball devices have a typical turnaround of five to seven days, so it's a very quick and efficient way of transferring large data sets.
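As a sketch of the planning behind multipart upload, the arithmetic below applies S3's documented limits (objects above 5 GB must use multipart upload, parts must be at least 5 MB except the last, and an upload can have at most 10,000 parts); the 100 MiB default part size is just an example choice:

```python
import math

MAX_PARTS = 10_000
MIN_PART = 5 * 1024**2   # 5 MiB minimum part size (except the last part)

def plan_multipart(object_bytes: int, part_bytes: int = 100 * 1024**2):
    """Return (part_size, part_count) honouring S3's multipart limits.
    Grows the part size if the object would otherwise need > 10,000 parts."""
    part = max(part_bytes, MIN_PART)
    if math.ceil(object_bytes / part) > MAX_PARTS:
        part = math.ceil(object_bytes / MAX_PARTS)
    return part, math.ceil(object_bytes / part)

# a 50 GiB scanned-archive bundle with 100 MiB parts
size, count = plan_multipart(50 * 1024**3)
print(count)  # → 512
```

In practice a tool such as the AWS CLI or boto3 handles this part splitting automatically; the sketch just shows the constraints it works within.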

All right, so with that done, once we have our system running, we can start to look at ways of optimizing our storage further, and that's where I'd like to introduce the Infrequent Access Storage Class. This storage class is ideal for warm data. If Amazon S3 Standard Class is great for, let's call it, hot data, then Infrequent Access Class is ideal for warm data. We still need the durability, but the availability and retrieval time are not so mission critical. So Infrequent Access Storage Class suits archived versions of files or non-critical backup data sets, transcoded media files, for example. The cost per gigabyte is cheaper than Standard Class S3, but there is a per-gigabyte retrieval cost for Infrequent Access Class. So if we need to retrieve assets regularly, Standard Class would probably end up being more cost-efficient.

All the storage classes support lifecycle rules, which are an ideal way to shift older versions of our legal documents to Infrequent Access Storage Class. As an example, we may have versions of our legal documents that aren't the primary version. The primary version we'll keep in Amazon S3 Standard Storage Class. But for those older versions which we don't need to recover quickly, which we can happily wait a period of time to recover, and which we are unlikely to recover on a regular basis, shifting those to Infrequent Access Storage Class would make a lot of sense.
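As an illustration, a lifecycle rule along these lines keeps current versions in Standard while moving noncurrent (older) versions to Standard-IA. The dict mirrors the shape boto3's put_bucket_lifecycle_configuration accepts; the bucket prefix and the 30-day figure are made-up example values:

```python
# Lifecycle rule: current versions stay in Standard, noncurrent (older)
# versions move to Standard-IA after 30 days. Prefix is illustrative.
lifecycle = {
    "Rules": [
        {
            "ID": "older-versions-to-ia",
            "Status": "Enabled",
            "Filter": {"Prefix": "legal-documents/"},
            "NoncurrentVersionTransitions": [
                {"NoncurrentDays": 30, "StorageClass": "STANDARD_IA"}
            ],
        }
    ]
}

# With boto3 this would be applied roughly as:
# s3.put_bucket_lifecycle_configuration(
#     Bucket="my-legal-docs", LifecycleConfiguration=lifecycle)
print(lifecycle["Rules"][0]["NoncurrentVersionTransitions"][0]["StorageClass"])
```

Note this particular rule assumes the bucket has versioning enabled, since it acts on noncurrent versions.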

Okay, so that's the first cost saving we could make. And as our data volumes grow, the cost, of course, will continue to go up. So if we can shift 30 or 40% of our 160 terabytes to Infrequent Access Storage Class, we can save money. Now as our migration project progresses and our experience grows, we would more than likely consider shifting older archives directly to Infrequent Access Storage. Standard Storage Class is often the entry point to Amazon S3 because it's the easiest. But once we have more experience, we would consider importing older data directly to Infrequent Access Storage Class.
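A back-of-envelope sketch of that saving, using illustrative per-gigabyte monthly prices (roughly the published us-east-1 rates at the time of writing; actual prices vary by region and over time), and ignoring IA's per-gigabyte retrieval charges, which would erode the saving if the data were accessed often:

```python
# Illustrative per-GB monthly prices (assumed, not authoritative):
STANDARD_PER_GB = 0.023
IA_PER_GB = 0.0125

total_gb = 160 * 1000          # 160 TB in decimal GB
moved_gb = total_gb * 0.40     # shift 40% of the archive to Standard-IA

monthly_saving = moved_gb * (STANDARD_PER_GB - IA_PER_GB)
print(f"${monthly_saving:,.0f}/month")   # $672/month
```

The exact number matters less than the shape of the calculation: storage savings scale linearly with the volume moved, so classifying data correctly pays off more as the archive grows.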

Now the third storage class is Amazon Glacier, and that is essentially cold storage for cold data. Glacier suits archives, tape library replacements, and anything where we may need to keep a version of a backup or an archive to meet compliance requirements. Objects in Glacier have a relatively long retrieval time: when we request something from Amazon S3 Standard Class, it's instant; it's very, very fast and highly available. With Infrequent Access Class, availability is lower, but retrieval is something we're not going to be doing a lot of. And with Glacier, it can take three to five hours for a requested archive to be retrieved. So it's perfect for anything where we don't have a time constraint. The anti-pattern for Glacier is anything where we need to recover data quickly, so it doesn't suit warm standby or active-active recovery models, or anything where we need to recover things within a time constraint or as quickly as possible. Now you can use lifecycle rules to shift assets from the Standard or Infrequent Access storage classes to Amazon Glacier. If we have older backups and tape archives, we might want to consider importing those directly to Glacier, and also setting up rules to shift older assets to Glacier after a period of time.
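As a sketch, a rule like the following transitions current versions to Glacier after a year and then expires them after seven years for compliance retention. It follows the same lifecycle-rule shape boto3 accepts; the prefix and day counts are example values, not recommendations:

```python
# Illustrative lifecycle rule for cold archives: to Glacier after one
# year, expired after seven. Prefix and durations are made up.
archive_rule = {
    "ID": "cold-archives-to-glacier",
    "Status": "Enabled",
    "Filter": {"Prefix": "closed-cases/"},
    "Transitions": [{"Days": 365, "StorageClass": "GLACIER"}],
    "Expiration": {"Days": 7 * 365},
}
print(archive_rule["Transitions"][0]["StorageClass"])  # GLACIER
```

A rule like this would sit alongside any Infrequent Access transitions in the same bucket's lifecycle configuration.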

Okay, there's one more storage class, which is the One Zone Infrequent Access Storage Class. In April 2018, AWS introduced Amazon S3 One Zone Infrequent Access. This class is designed for the same eleven 9s of durability as the other storage classes; however, One Zone IA stores objects in a single availability zone, which brings the price in significantly lower than Amazon S3 Standard IA, currently around 20% less. Amazon S3 Standard, S3 Standard Infrequent Access, and Amazon Glacier all distribute data across a minimum of three geographically separated availability zones, which gives us the highest possible level of resilience; One Zone IA does not have that protection if the availability zone is lost.

Now the S3 One Zone Infrequent Access saves you cost by storing infrequently accessed data in a single availability zone. So Amazon S3 Standard IA is a great choice for long-term storage of anything that is infrequently accessed, whereas Amazon S3 One Zone IA provides a lower price point for any other infrequently accessed data such as duplicates of backups or data summaries or anything that can be regenerated. 
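We can sanity-check that roughly 20% figure with illustrative per-gigabyte monthly prices (assumed us-east-1 rates at the time of writing; treat the exact numbers as examples, not a price list):

```python
# Illustrative per-GB monthly prices (assumed):
standard_ia = 0.0125
one_zone_ia = 0.0100

discount = 1 - one_zone_ia / standard_ia
print(f"{discount:.0%}")   # 20%
```

So for regenerable data like backup duplicates, the single-AZ trade-off buys a fifth off the IA storage bill.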

As for choosing the right storage class after you have been running your environment for a while, there is a really useful Amazon S3 tool that can help, called Storage Class Analysis. The Storage Class Analysis tool helps you observe your data access patterns over time and gathers information to help you improve the lifecycle management of your Standard and Standard IA storage classes. After you configure a storage class analysis filter, the tool observes the access patterns of your filtered data sets for 30 days or longer to gather information before giving you a result. The analysis continues to run after the initial result and updates as access patterns change, which is really useful.
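As a sketch, a Storage Class Analysis filter with a CSV export destination looks roughly like the configuration below; this is the shape I believe boto3's put_bucket_analytics_configuration accepts, and the bucket names, IDs, and prefixes are all made up for illustration:

```python
# Assumed shape of an S3 analytics (Storage Class Analysis) configuration.
# All names here are hypothetical examples.
analytics_config = {
    "Id": "legal-docs-analysis",
    "Filter": {"Prefix": "legal-documents/"},
    "StorageClassAnalysis": {
        "DataExport": {
            "OutputSchemaVersion": "V_1",
            "Destination": {
                "S3BucketDestination": {
                    "Format": "CSV",
                    "Bucket": "arn:aws:s3:::my-analysis-results",
                    "Prefix": "storage-class-analysis/",
                }
            },
        }
    },
}
print(sorted(analytics_config))  # ['Filter', 'Id', 'StorageClassAnalysis']
```

The same filter-plus-export idea can also be set up entirely from the S3 console if you prefer not to script it.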

Okay, a few other optimizations for us to consider. Amazon S3 provides excellent speed at low latency, so if your typical workload involves only occasional bursts of 100 requests per second, you really don't need to look at other performance optimizations. There are a couple of things to consider, however. Transfer Acceleration is one, and it's best when you're submitting data from distributed client locations over the public internet. Transfer Acceleration improves the performance of transfers across the public internet, so when we have variable network conditions that would make throughput poor, Transfer Acceleration can really help. If you need to upload to a centralized bucket from various locations around the globe, or if you're transferring large volumes of data across continents on a regular basis, Transfer Acceleration can improve that performance for you. Transfer Acceleration supports all bucket-level features, including multipart upload. The service does attract a small additional charge.
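Once acceleration is enabled on a bucket, clients reach it through a distinct accelerate endpoint. A small helper to build that hostname (the bucket name is a made-up example, and the bucket must actually have acceleration enabled for the endpoint to work):

```python
def accelerate_endpoint(bucket: str, dualstack: bool = False) -> str:
    """Build the S3 Transfer Acceleration endpoint hostname for a bucket."""
    suffix = ("s3-accelerate.dualstack.amazonaws.com" if dualstack
              else "s3-accelerate.amazonaws.com")
    return f"{bucket}.{suffix}"

print(accelerate_endpoint("my-legal-docs"))
# my-legal-docs.s3-accelerate.amazonaws.com
```

SDKs and the AWS CLI can be configured to use this endpoint automatically, so in practice you rarely build the hostname by hand; the point is that acceleration is opted into per request, not forced on a bucket.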

So if we're importing our 160 terabytes of scanned legal documents, should we use Transfer Acceleration rather than AWS Snowball? Well, Snowball is ideal when we need to move large batches of data all at once. So no, we want to use Snowball first because it allows us to shift all that data. While Transfer Acceleration can speed up the transfer rate, the time it would take to move such a large archive over the network would be prohibitive and would negatively impact our project.

Now Snowball has a typical five to seven-day turnaround time. So for our scenario, we want to use both: we perform our initial heavy lift with two 80-terabyte Snowball devices, and then we use Transfer Acceleration for our incremental, ongoing sync tasks. Now, what about CloudFront? Would that speed things up? Should we choose Transfer Acceleration or CloudFront if we're shifting objects around? Transfer Acceleration optimizes the TCP protocol and adds additional intelligence between the client and the Amazon S3 bucket, so that makes Transfer Acceleration a better choice if maximum throughput is what we're after. If you have objects that are smaller than a gigabyte, or if a data set is less than one gigabyte in size, then we should consider using Amazon CloudFront's PUT or POST commands to get optimal performance.
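That rule of thumb can be captured in a tiny helper; note the one-gigabyte threshold is the lecture's heuristic, not a hard service limit:

```python
def upload_path(object_bytes: int, one_gb: int = 10**9) -> str:
    """Rule of thumb from this lecture: CloudFront PUT/POST for objects
    under ~1 GB, Transfer Acceleration for larger transfers."""
    return "cloudfront" if object_bytes < one_gb else "transfer-acceleration"

print(upload_path(200 * 10**6), upload_path(5 * 10**9))
# cloudfront transfer-acceleration
```

In a real design you would also weigh factors the heuristic ignores, such as request rates and client locations.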

Now let's imagine we have a Direct Connect service in place. Would we use Transfer Acceleration over that? Well, we've got to keep in mind that Direct Connect uses a private connection, while Transfer Acceleration is best for submitting data from distributed client locations over the public internet, where variable network conditions make throughput poor. It is possible to use Transfer Acceleration over Direct Connect, but remember the two are slightly different: Direct Connect is a private connection, whereas Transfer Acceleration is designed to optimize transfers across the public domain. You can also use Transfer Acceleration with storage gateways and third-party services: configure the gateway's bucket destination to use the Amazon S3 Transfer Acceleration endpoint domain name, and transfers will gain some benefit.

So the reality is Amazon S3 provides excellent speed at low latency, and if your typical workload involves only occasional bursts of 100 requests per second, you don't need to do anything special. But if you are routinely processing 100 or more requests per second, there are some things you can do to optimize object store performance. If the bulk of the workload consists of GET requests, then using Amazon CloudFront will improve performance: using CloudFront as a distributed CDN takes some of the load off your Amazon S3 bucket. If your requests are typically a mix of GET, PUT, DELETE, or GET Bucket (i.e. you need to list objects in a bucket as quickly as possible), then choosing appropriate key names for your objects ensures the best performance, as it provides low latency access to the Amazon S3 index.

Now a common practice I like to use is including the date and time in the key name, but it tends to be counterproductive to how Amazon S3 stores and retrieves objects, and here's why. S3 keeps an index of object key names in each AWS region. Object key names are stored in UTF-8 binary ordering across multiple partitions in that index, and the key name dictates which partition the key will be stored in. So when we use a sequential prefix, such as a timestamp or an alphabetical sequence, that increases the likelihood that S3 will target a specific partition for a large number of your keys, and that can overwhelm the I/O capacity of the partition, which can impact performance. If you can introduce some randomness into your key name prefixes, the key names, and therefore the I/O load, will be distributed across more than one partition.

One way to introduce randomness to key names is to add a hexadecimal hash string as a prefix to the key name. For example, you can compute an MD5 hash of the character sequence that you plan to assign as the key name, then pick a specific number of characters from the hash and add them as a prefix to the key name. Using appropriate key names also ensures scalability regardless of the number of requests you send per second. And this only applies, of course, if your workload is consistently exceeding 100 requests per second; if you don't have this type of volume, then it's not going to be an issue. However, as a best practice, you should avoid using sequential key names.
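A minimal sketch of that hashing approach, using Python's standard hashlib; the key name and the four-character prefix length are example values:

```python
import hashlib

def randomized_key(key_name: str, hash_chars: int = 4) -> str:
    """Prefix a key with the first few hex characters of its MD5 hash so
    keys spread across S3 index partitions, per the historical guidance
    this lecture describes."""
    prefix = hashlib.md5(key_name.encode("utf-8")).hexdigest()[:hash_chars]
    return f"{prefix}/{key_name}"

print(randomized_key("2018-04-30-contract-00123.pdf"))
```

Because the hash is derived from the key name itself, the same document always maps to the same randomized key, so lookups stay deterministic.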

Adding randomness improves performance, but it does make it a bit harder to return ordered lists with operations such as GET Bucket, so a good practice is to add more prefixes to your key names before the hash string so you can group objects together. That means ordered lists returned by the GET Bucket operation will be grouped by the prefixes you've added. You can also reverse the order of your sequences to improve distribution further.

Okay, that completes our storage optimization lecture for the Solution Architect Associate Exam. There are, of course, many more facets to optimization which we can explore in subsequent lectures once you've aced the exam. If you want to extend your knowledge now, just search for migration or scenario on the Cloud Academy website. Okay, see you in the next lecture.

About the Author

Andrew is fanatical about helping business teams gain the maximum ROI possible from adopting, using, and optimizing Public Cloud Services. Having built 70+ Cloud Academy courses, Andrew has helped over 50,000 students master cloud computing by sharing the skills and experiences he gained during 20+ years leading digital teams in code and consulting. Before joining Cloud Academy, Andrew worked for AWS and for AWS technology partners Ooyala and Adobe.