Designing Storage solutions in AWS - Level 2
Amazon S3 and Glacier
1h 14m

This course covers the core learning objectives needed to meet the requirements of the 'Designing Storage solutions in AWS - Level 2' skill.

Learning Objectives: 

  • Understand AWS storage services that can be used with hybrid or non-cloud-native applications
  • Evaluate which storage services can scale to accommodate unpredictable future needs
  • Understand the different performance and cost options available for different storage types, including block, object and file storage

Hello and welcome to this lecture. Let me start by providing a high-level overview of Amazon S3 and Glacier. Amazon S3 is a fully managed, object-based storage service that is highly available, highly durable, very cost-effective, and widely accessible. S3 is also promoted as having unlimited storage capacity, making this service extremely scalable, far more scalable than your own on-premises storage solutions could ever be.

Amazon Glacier is an extremely low-cost, long-term, durable storage solution which is often referred to as cold storage, ideally suited for long-term backup and archival requirements. It's capable of storing the same data types as Amazon S3, effectively any object. However, it does not provide instant access to your data.

To understand the costs associated with S3 and Glacier, I will start by looking at the storage classes available to you when storing data. Over the years there have been a number of changes to these services, including an array of additional storage classes.

Each class offers different attributes to fit specific needs, such as the number of availability zones used to store your object data, the minimum storage duration (in days), the minimum billable object size, durability and availability percentages, and retrieval times and fees. As you can see from this chart, as of January 2020 there are 6 different storage classes.

Using this table, we can clearly see the differences between each of them. AWS summarises each of these storage classes as follows: 

  • S3 Standard - General purpose storage for any type of data, typically used for frequently accessed data
  • S3 Intelligent-Tiering - Automatic cost savings for data with unknown or changing access patterns
  • S3 Standard - Infrequent Access - For long-lived but infrequently accessed data that needs millisecond access
  • S3 One Zone - Infrequent Access - For re-creatable infrequently accessed data that needs millisecond access
  • S3 Glacier - For long-term backups and archives with retrieval options from 1 minute to 12 hours
  • S3 Glacier Deep Archive - For long-term data archiving that is accessed once or twice in a year and can be restored within 12 hours

For those curious about the ‘automatic cost-saving’ feature of Intelligent-Tiering, let me provide a high-level overview of its operation. Depending on the access patterns of objects in the Intelligent-Tiering class, S3 will move the objects between two different tiers: frequent access and infrequent access. These tiers are part of the Intelligent-Tiering class itself and are separate from the existing classes. When objects are moved to Intelligent-Tiering, they are placed within the ‘frequent access’ tier. If an object is not accessed for 30 days, then AWS will automatically move that object to the cheaper tier, known as the ‘infrequent access’ tier. Once that same object is accessed again, it will automatically be moved back to the ‘frequent access’ tier.
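The tier movement just described can be sketched as a small simulation. This is only an illustrative model of the behaviour as the lecture explains it, not how S3 is implemented; the class name, the object key and the day-counter approach are all assumptions made for the example.

```python
from dataclasses import dataclass

INFREQUENT_AFTER_DAYS = 30  # threshold stated in the lecture


@dataclass
class TieredObject:
    """Hypothetical model of one object in the Intelligent-Tiering class."""
    key: str
    last_access_day: int = 0
    tier: str = "frequent"  # new objects start in the frequent access tier

    def access(self, day: int) -> None:
        """Record an access; an accessed object returns to the frequent tier."""
        self.last_access_day = day
        self.tier = "frequent"

    def evaluate(self, day: int) -> None:
        """Move the object to the cheaper tier after 30 days without access."""
        if day - self.last_access_day >= INFREQUENT_AFTER_DAYS:
            self.tier = "infrequent"


obj = TieredObject("reports/2020-01.csv")  # hypothetical key
obj.evaluate(day=29)
tier_day_29 = obj.tier        # still in the frequent access tier
obj.evaluate(day=30)
tier_day_30 = obj.tier        # 30 days without access: moved down
obj.access(day=45)
tier_after_access = obj.tier  # accessed again: moved back up
```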

When selecting the class for your data, you need to ask yourself the following questions, which will help you identify which storage class you should be using and, in turn, the most cost-effective option: How critical is the data? How reproducible is the data, so can it easily be created again if need be? What is the access pattern of the data likely to be? Will latency be a factor when accessing the object? Answering these questions will help you to establish which options are viable and which are not.

Another point to bear in mind is that some of these storage classes have a tiered pricing structure. What do I mean by this? Well, for the Standard and Intelligent-Tiering storage classes, your costs vary depending on how much data is stored within a single month. As you can see from the table below (based on the London region), for Standard storage the price per gigabyte is reduced as you add more and more data within the same month.
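To make the tiered structure concrete, here is a minimal sketch of how a tiered monthly storage bill could be calculated. The tier sizes and per-GB rates below are hypothetical placeholders, not the London prices from the table; substitute the current figures from the AWS pricing page.

```python
# Each tier is (size of the tier in GB, price per GB). None means "everything above".
# These boundaries and rates are HYPOTHETICAL, for illustration only.
HYPOTHETICAL_TIERS = [
    (51_200, 0.024),   # first 50 TB
    (460_800, 0.023),  # next 450 TB
    (None, 0.022),     # anything over 500 TB
]


def tiered_storage_cost(gb_stored, tiers):
    """Charge each slice of the stored data at the rate of the tier it falls in."""
    remaining = gb_stored
    total = 0.0
    for tier_size_gb, price_per_gb in tiers:
        if remaining <= 0:
            break
        billable = remaining if tier_size_gb is None else min(remaining, tier_size_gb)
        total += billable * price_per_gb
        remaining -= billable
    return total


small_bill = tiered_storage_cost(1_000, HYPOTHETICAL_TIERS)   # entirely in tier one
large_bill = tiered_storage_cost(61_200, HYPOTHETICAL_TIERS)  # spills into tier two
```

Only the data above a tier boundary is charged at the lower rate, so the average price per gigabyte falls gradually as the total grows.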

A similar pricing structure exists for Intelligent-Tiering too, as shown, specifically for the frequent access tier. But take note, you are also charged a monitoring and automation fee, which covers the cost of monitoring your objects and automating their movement between the different tiers.

For all other storage classes, a flat rate exists regardless of how much you store, as can be seen here.

Using the right storage class is one way to optimize your costs. However, reviewing the profile of your data is not the only factor when optimizing the cost of your storage. You should also be familiar with request and data retrieval costs, data transfer costs, and management and replication costs. So let’s take a look at each of these in a bit more detail and how they impact your costs, starting with request and data retrieval costs.

Firstly, let me look at the request costs. Requests can be split into the following main request types:

  • PUT
  • COPY
  • POST
  • LIST
  • GET
  • Lifecycle Transition

These requests themselves are split into 3 categories, each with a different price point, but all are costed per 1000 requests. This table shows the request costs from the London region for each storage class. Do bear in mind that both the DELETE and CANCEL requests are free. 
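As a rough sketch of the per-1000 pricing model, the function below totals a month of requests while skipping the free DELETE and CANCEL requests. The rates in the dictionary are hypothetical placeholders rather than the London figures from the table.

```python
FREE_REQUESTS = {"DELETE", "CANCEL"}  # free, as noted in the lecture

# HYPOTHETICAL prices in USD per 1000 requests, for illustration only.
HYPOTHETICAL_RATES_PER_1000 = {
    "PUT": 0.005,
    "COPY": 0.005,
    "POST": 0.005,
    "LIST": 0.005,
    "GET": 0.0004,
    "LIFECYCLE_TRANSITION": 0.01,
}


def request_cost(counts, rates_per_1000):
    """Sum the cost of each request type, priced per 1000 requests."""
    total = 0.0
    for request_type, count in counts.items():
        if request_type in FREE_REQUESTS:
            continue  # DELETE and CANCEL requests are not charged
        total += (count / 1000) * rates_per_1000[request_type]
    return total


monthly_requests = {"PUT": 10_000, "GET": 1_000_000, "DELETE": 5_000}
monthly_request_cost = request_cost(monthly_requests, HYPOTHETICAL_RATES_PER_1000)
```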

Now from a retrieval perspective, the costs are very different, specifically when we look at S3 Glacier and S3 Glacier Deep Archive. Instead of basing the costs on per 1000 requests, the cost is associated on a per-gigabyte basis. As you can see in the table there are no retrieval costs for the S3 standard and intelligent tiering classes and $0.01 cost per GB for Infrequent access and one-zone infrequent access. However, when we get to the Glacier classes, the costs are expanded depending on which data retrieval method you use.

Expedited. This is used when you have an urgent requirement to retrieve your data, but the request has to be less than 250 MB. The data is then made available to you in one to five minutes.

Standard. This can be used to retrieve any of your archives no matter their size, but your data will be available in three to five hours. So it takes much longer than the Expedited option. 

Bulk. This option is used to retrieve petabytes of data at a time. However, this typically takes between five and twelve hours to complete. This is the cheapest of the retrieval options. So the retrieval option you choose, and with it the retrieval speed and cost, really depends on how much data you need and how quickly you need it.

Provisioned Capacity Unit. Finally, this option allows you to pay a fixed upfront fee for a given month to expedite data retrievals from Glacier vaults. This is typically used if you have a lot of data stored in Glacier and you are planning to perform a greater than usual number of retrievals within a given month, each with a quick retrieval time. A similar breakdown exists for Glacier Deep Archive as well, for its Standard and Bulk retrieval fees.
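The trade-off between the three S3 Glacier retrieval options can be sketched as a simple selection rule: pick the cheapest option that still meets your deadline. The worst-case times and the 250 MB limit come from the descriptions above; the function name and the cheapest-first ordering (Bulk, then Standard, then Expedited) are assumptions based on the statement that quicker retrievals cost more.

```python
from typing import Optional

# (name, worst-case minutes, exclusive size limit in MB or None), cheapest first.
RETRIEVAL_OPTIONS = [
    ("Bulk", 12 * 60, None),
    ("Standard", 5 * 60, None),
    ("Expedited", 5, 250),
]


def choose_retrieval_option(size_mb: float, deadline_minutes: float) -> Optional[str]:
    """Return the cheapest option whose worst-case time meets the deadline."""
    for name, worst_case_minutes, size_limit_mb in RETRIEVAL_OPTIONS:
        if size_limit_mb is not None and size_mb >= size_limit_mb:
            continue  # Expedited requests must be less than 250 MB
        if worst_case_minutes <= deadline_minutes:
            return name
    return None  # no option can meet this deadline for this size


urgent_small = choose_retrieval_option(size_mb=100, deadline_minutes=10)
overnight = choose_retrieval_option(size_mb=100, deadline_minutes=24 * 60)
same_day_large = choose_retrieval_option(size_mb=500, deadline_minutes=6 * 60)
urgent_large = choose_retrieval_option(size_mb=500, deadline_minutes=10)
```

Provisioned capacity is left out of this sketch because it is a monthly commitment rather than a per-request choice.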

Data Transfer Costs. It can be difficult to understand what data transfer costs will be incurred when storing data on Amazon S3, and it really depends on how and where you are transferring data to and from. Let me try and break it down for clarity.

Firstly, let’s look at when it’s free to transfer data into and out of Amazon S3 (Please note this excludes S3 Transfer Acceleration, which I will cover separately). Data transfer is free when:

  • Data is transferred INTO Amazon S3 from the internet.
  • Data is transferred OUT to your EC2 instances which reside in the same Region as the source S3 bucket in which the data is located.
  • Data is transferred OUT to Amazon CloudFront.

When looking at transferring data OUT, the following costs apply when transferring out to the internet (again, this is taken from the London region). As you can see, the more you transfer out per month, the cheaper and more cost-effective the data transfer rates become. When transferring out to other AWS services, other than Amazon CloudFront, which is free, costs are charged at a per-gigabyte rate. For London, this is a flat rate across all services as of January 2020 at $0.02 per gigabyte.
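The free and charged cases above can be captured in a short sketch. The $0.02 per gigabyte figure is the London rate quoted in this lecture as of January 2020; the function names and the labels used for each peer are hypothetical, and transfers out to the internet are deliberately excluded because their tiered rates are not given here.

```python
LONDON_AWS_SERVICE_RATE = 0.02  # USD per GB, the lecture's January 2020 London figure


def is_free_transfer(direction: str, peer: str, same_region: bool = False) -> bool:
    """Return True for the three free cases listed in the lecture."""
    if direction == "IN" and peer == "internet":
        return True
    if direction == "OUT" and peer == "ec2" and same_region:
        return True
    if direction == "OUT" and peer == "cloudfront":
        return True
    return False


def aws_service_transfer_cost(gb: float, peer: str, same_region: bool = False) -> float:
    """Cost of transferring data OUT of S3 to another AWS service."""
    if peer == "internet":
        raise ValueError("internet transfer uses tiered rates not modelled here")
    if is_free_transfer("OUT", peer, same_region):
        return 0.0
    return gb * LONDON_AWS_SERVICE_RATE


same_region_ec2 = aws_service_transfer_cost(500, "ec2", same_region=True)
other_region_ec2 = aws_service_transfer_cost(500, "ec2")
to_cloudfront = aws_service_transfer_cost(500, "cloudfront")
```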

Transfer Acceleration. When we look at Transfer acceleration, the pricing structure for transfer costs changes and this is largely due to the fact that your data is routed through an optimized network path to Amazon S3 via CloudFront edge locations.

Whereas normal data transfer into Amazon S3 from the internet is free, with Transfer Acceleration there is a cost per gigabyte, dependent on which edge location is used. Also, there is an increased cost for any data transferred OUT of S3, either to the internet or to another Region, again due to the edge location acceleration involved.

Management and Replication. The final element of cost association with Amazon S3 and associated Glacier tiers relates to Management and replication of your data. From an S3 management perspective, there are three different features that if enabled on your bucket have an associated cost for each. These features are:

  • Amazon S3 inventory (Used for auditing and reporting for replication and encryption actions). This feature is priced per million objects listed.
  • Analytics, which is used to analyze access patterns to assist in ensuring you are using the right storage class for your objects. This feature is priced per million objects monitored per month.
  • Object tagging, which allows you to categorize your storage using tags. This feature is priced per 10,000 tags applied per month.

The pricing shown reflects the current London region pricing at the time of recording this course.

S3 Batch Operations. Batch operations allow you to carry out management operations across millions or even billions of your S3 objects at the same time, using a single API request or the S3 Management Console.

Batch operations also integrate with AWS CloudTrail to monitor all changes made using the APIs selected. It also includes the ability to notify you when specific events occur and provide a completion report keeping you aware of the progress of your batch changes.

Being able to run huge batch management operations across your data in S3 can save you a huge amount of time compared with developing alternative methods to achieve the same result. It’s also compatible with AWS Lambda, allowing you to run your functions across billions of objects at once.

Pricing for this feature has two price points, firstly on per batch job, and secondly per million object operations performed.

S3 and Glacier Select. S3 and Glacier Select is available with all storage classes except Glacier Deep Archive. Select allows you to use SQL expressions to retrieve only the data that you want from your objects instead of the whole objects, which could be many gigabytes in size. This enables you to retrieve the data faster and more cheaply!

Again, there are 2 price points related to Select: data scanned (per GB) and data returned (per GB). And much like when we looked at retrieval costs, Glacier is broken down into the 3 different retrieval modes: Expedited, Standard and Bulk.

Replication. There are 2 different modes for S3 replication: CRR, Cross-Region Replication between 2 different buckets, and SRR, Same-Region Replication between 2 different buckets. There are no specific costs for the use of the S3 replication feature itself. Instead, you are simply charged for the storage class in your destination, where your replicated objects will reside. You will also incur costs for any COPY and PUT requests, which will be based upon the rates of the destination region. When using Cross-Region Replication, there will also be the addition of inter-region data transfer fees, which will be priced according to the source region.
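The cost components of replication can be laid out in a small sketch. Every rate passed in below is a hypothetical placeholder, and the example assumes one PUT request per replicated object, which is a simplification made for illustration.

```python
def replication_cost(
    gb_replicated: float,
    objects_replicated: int,
    dest_storage_per_gb: float,      # storage class rate in the DESTINATION region
    dest_put_per_1000: float,        # PUT/COPY request rate in the DESTINATION region
    cross_region: bool = False,
    source_transfer_per_gb: float = 0.0,  # inter-region rate of the SOURCE region
) -> dict:
    """Break a replication bill into storage, request and transfer components."""
    storage = gb_replicated * dest_storage_per_gb
    requests = (objects_replicated / 1000) * dest_put_per_1000
    # Inter-region data transfer only applies to Cross-Region Replication.
    transfer = gb_replicated * source_transfer_per_gb if cross_region else 0.0
    return {
        "storage": storage,
        "requests": requests,
        "transfer": transfer,
        "total": storage + requests + transfer,
    }


# HYPOTHETICAL rates, for illustration only.
srr = replication_cost(100, 2_000, dest_storage_per_gb=0.02, dest_put_per_1000=0.005)
crr = replication_cost(
    100, 2_000, dest_storage_per_gb=0.02, dest_put_per_1000=0.005,
    cross_region=True, source_transfer_per_gb=0.02,
)
```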

S3 Replication Time Control. S3-RTC is an advanced feature built on top of S3 replication that provides an SLA that 99.9% of objects will be replicated within 15 minutes of the start of the upload. This process can be monitored through the use of CloudWatch metrics (which are charged separately). The cost to use S3-RTC itself is currently set at a flat rate across all regions, as shown.

Data Management controls. The final point I want to make about Amazon S3 and cost optimization relates to Versioning and Lifecycle policies, both of which can have an impact on your overall costs.

Versioning. When you enable versioning on a bucket, it allows for multiple versions of the same object to exist. This is useful to allow you to retrieve previous versions of a file or recover from an accidental deletion, or indeed an intentional, malicious deletion of an object. Versioning is not enabled by default. However, once you have enabled it, versioning will be an added cost to you, as you are storing multiple versions of the same object and, as we know, the Amazon S3 cost model is based on actual usage of storage.

Lifecycle Policies. Amazon S3 lifecycle rules provide an automatic method of managing the life of your data while it is being stored within a particular storage class. By adding a lifecycle rule to a bucket, you are able to configure and set specific criteria that can automatically move your data from one storage class to another, or delete it from S3 altogether. You may want to implement these measures as a cost-saving exercise, by moving data to a cheaper storage class after a set period of time. Or perhaps you may only be required to keep some data for a set period of time, for example, 90 days, before it can be deleted. By setting up a lifecycle policy, you can configure a bucket to automatically delete anything older than 90 days.
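As an illustration, a lifecycle rule of this kind might look like the following when expressed in the structure that boto3's put_bucket_lifecycle_configuration call accepts. The rule ID, the prefix, the bucket name and the 30-day transition are hypothetical choices for the example; only the 90-day deletion comes from the scenario above.

```python
import json

lifecycle_configuration = {
    "Rules": [
        {
            "ID": "archive-then-expire",    # hypothetical rule name
            "Filter": {"Prefix": "logs/"},  # hypothetical prefix to match
            "Status": "Enabled",
            # Move objects to a cheaper storage class after 30 days (assumed value).
            "Transitions": [{"Days": 30, "StorageClass": "STANDARD_IA"}],
            # Delete objects once they are older than 90 days.
            "Expiration": {"Days": 90},
        }
    ]
}

print(json.dumps(lifecycle_configuration, indent=2))

# With boto3 installed and credentials configured, this could be applied with:
# import boto3
# boto3.client("s3").put_bucket_lifecycle_configuration(
#     Bucket="example-bucket",  # hypothetical bucket name
#     LifecycleConfiguration=lifecycle_configuration,
# )
```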

Ok, so now we have reviewed the wide variety of charges that can be incurred when using Amazon S3 and Glacier. Some of these you may already be familiar with, some might have been new to you.

The main points of consideration when trying to optimize your storage costs when using these services focus on the following:

  • Storage class (Priced per GB of storage): You need to understand the profile of your storage, its access patterns, its criticality, its availability and whether latency plays a factor when accessing the data. Understanding more about your data will help you to select the most cost-efficient storage class. For example, it would be an unnecessary cost to store your secondary backup of on-premises data in the S3 Standard storage class. Instead, it would make more financial sense to use the Infrequent Access or One Zone - Infrequent Access storage class.
  • Data requests (Priced per 1000 requests): This requires you to again understand the data request patterns of your objects to allow you to gain a comprehensive understanding of your predicted costs. Identify who or what is going to be accessing your data and how often.
  • Retrieval requests (Priced per GB retrieved): This element plays a much bigger factor when utilizing S3 Glacier and Glacier Deep Archive due to the different retrieval methods when using these services. Your chosen method will be very dependent on how quickly you need to retrieve your data: the quicker you need it, the more you will pay.

Sometimes there can be confusion between Data requests, such as a GET request and data retrieval. The key difference here is to understand that a GET Request is simply the process of requesting a file retrieval, priced per 1000 requests, whereas data retrieval is charged per GB of the actual data being retrieved.

  • Data Transfer (Priced per GB): Once your data is in Amazon S3, what do you intend to do with it? Will it simply be kept there for cold storage, for example in Glacier or Deep Archive, or will it be shared across numerous applications and services and transferred out to different regions? If you plan to transfer data out from your bucket to EC2 instances, for example, then ensure you architect your infrastructure so that those EC2 instances and buckets are in the same region as each other to take advantage of the free data transfer. Understand where your data is being transferred to and from, as this could help you optimize your costs. Before using Transfer Acceleration, ask yourself: do you actually need it, and is it worth the additional cost? Can you architect your infrastructure better through the use of regions to remove the need for Transfer Acceleration?
  • Management Operations: S3 offers some great management features, and it can be easy to select these options when storing your data, but you should ONLY do so if they are going to provide a positive impact and are financially viable. Again, it comes down to understanding the data profile of your objects.
  • Replication: Depending on your design, service and resiliency needs you may need to adopt a level of replication, and when doing so be mindful of the fact you will be charged for storage and requests made in the destination bucket, plus any additional costs if using replication time control.
  • Data Management controls: Be aware of the additional costs that are incurred when versioning is enabled on a bucket. Make use of lifecycle policies to automatically move or delete your data based upon your own data policies, this could lead to significant savings.

One final note before I finish this lecture, please ensure that you check all the latest prices on the official AWS Amazon S3 pricing page when designing your solutions.



About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.