AWS Data Services
The course is part of these learning paths
One of the core building blocks of Infrastructure as a Service (IaaS) is that of storage, and AWS provides a wide range of storage services that allow you to architect the correct solution for your needs. Understanding what each of these services is and what they have been designed and developed for, gives you the knowledge to implement best practices ensuring your data is stored, transmitted and backed up in the most efficient and scalable way. This course will focus on each of the storage services provided by AWS and will explain what the service is, its key features and when and why you might use the service within your own environment.
The objectives of this course are to provide:
- An overview and introduction to the different AWS storage services
- An understanding of how to transfer data into and out of AWS
- The knowledge to confidently select the most appropriate storage service for your needs
This course is designed as an introduction to the AWS storage services and methods of storing data. As a result, this course is suitable for:
- Those who are starting out their AWS journey to understand the various services that exist and their use case
- Storage engineers responsible for maintaining and storing data within the enterprise
- Security engineers who secure and safeguard data within AWS
- Those who are looking to begin their certification journey with either the AWS Cloud Practitioner or one of the 3 Associate level certifications
This is an entry-level course to AWS storage services and so no prior knowledge of these services are required, however, a basic understanding of Cloud Computing and awareness of AWS would be beneficial but not essential.
If you have thoughts or suggestions for this course, please contact Cloud Academy at email@example.com.
- Blog: AWS Global Infrastructure
Hello and welcome to this lecture, where we'll introduce the Amazon Simple Storage Service commonly known as S3. Amazon S3 is probably the most used storage service that is provided by AWS simply down to the fact it can be used for many different use cases and called upon by many different AWS services.
Amazon S3 is a fully managed object based storage that is high available, highly durable, very cost effective, and widely accessible. S3 is also promoted as having unlimited storage capabilities making this service extremely scalable, far more scalable than your own on-premise storage solutions could ever be. There is however, limitations to the individual size of a single file that it can support. The smallest files size that it supports is zero bytes, and the largest file size is five terabytes. Although there is this size limitation on the maximum file size, it's one that many of us will not perceive as an ongoing inhibitor in the majority of use cases.
When data is uploaded within S3 you as the customer are required to specify the regional location for that data to be placed in. By specifying your region for your data Amazon S3 will then store and duplicate your uploaded data multiple times across multiple available zones within that region to increase both its durability and availability. For more information on regions, availability zones, and other AWS global infrastructure components please visit the following blog page.
Objects stored in S3 have a durability of 11 nines (99.999999999%) and so, the likelihood of losing data is extremely rare and this is down to the fact that S3 stores numerous copies of the same data in different availability zones. The availability of S3 data objects is currently four nines (99.99%). And the different between availability and durability is this. When looking at availability AWS ensure that the up time of Amazon S3 is 99.99% to enable you to access your stored data. The durability percentage refers to the probability of maintaining your data without it being lost through corruption, degradation of data, or other unknown potential damaging effects. When uploading objects to S3, specific objects are used to manage your data.
To store objects in S3 you first need to define and create a bucket. You can think of a bucket as a parent folder for your data. This bucket name must be completely unique, not just within the region you specify, but globally across all other S3 buckets that exist, which there are many millions. Once you have created your bucket name you can begin to upload your data. You can upload your data directly into that bucket or you can, if required, create folders under your bucket to store your data in for easier management. By default, there is a limit of a hundred buckets that you are allowed to create within your AWS account but this can be increased if requested through AWS.
Objects that are then stored in these buckets have a unique object key that defines the object across the flat address space of S3. Although folders can provide additional management from a data organization point of view, Amazon S3 is not a file system and so, specific features of Amazon S3 work at the bucket level rather than specific folder levels. Let's now take a closer look at some of the features offered by Amazon S3, starting with an overview of the different storage classes that are available.
There are a number of different storage classes within S3, all of which offer different performance features and costs. And it's down to you to select the storage class that you require for the data. These classes are as follows. Standard, Standard Infrequent Access, Intelligent Tiering, One Zone Infrequent Access, and Reduced Redundancy. But this Reduced Redundancy option is no longer recommended by AWS, and I'll explain why as we go. It's best to review the differences between these classes from within a table to enable you to understand the key points of difference. Now as you can see, the main differences between the classes is the durability and availability percentages that each class offers. And to help us look at the differences between these classes, we can split them into two different categories, data that is accessed frequently and data that is accessed infrequently.
For data that is accessed frequently, you have two options, either Standard or Reduced Redundancy Storage, RRS. The default storage class option for your objects is a Standard class unless you specify otherwise. The Reduced Redundancy Storage class is an older storage class option and is no longer recommended by AWS, as the Standard Storage class is now more cost effective and offers greater durability. For data that is accessed infrequently, we've looked at Standard-IA and One Zone-IA. Although they offer the same speed of access as Standard, these infrequent access storage classes charge an additional cost to retrieve and access the data. So the data in these classes are typically long live data objects that require very little need to be accessed. The difference between One Zone-IA compared to Standard-IA is the durability aspect. One Zone-IA does not replicate its data across multiple availability zones, and so only offers a 99.5% availability SLA. With this in mind, it should only be used for data that can be reproduced. Between Standard-IA and One Zone-IA, One Zone-IA is the cheapest due to its limitation on a single availability zone.
Finally, we have the Intelligent Tiering Storage class, which is great for unpredictable access patterns. And this has the ability to optimize your cost of your storage by moving data objects between different tiers based upon its usage. Intelligent Tiering will move data between two tiers, a frequently accessed tier and a more cost effective infrequent accessed tier. If data has not been accessed for 30 days or more then Intelligent Tiering will move data into the Infrequent Accessed tier. The next time this object is accessed, it will be moved back into the Frequently Accessed tier and the 30-day timer will be reset. There are no retrieval costs for your data like there is with Standard-IA and One Zone-IA. However, do be aware that there is a small monthly cost associated to each object that is monitored by the Intelligence Tiering class. And each object must be larger than 128 kilobytes.
So before selecting your class for your data, you really need to be asking yourself the following questions. How often is the data likely to be accessed? How critical is my data? How reproducible is the data? Can it be easily created again if need be? And do I know the access patterns of my data? When looking at the data within your bucket, the S3 console will also show you which storage class that the object belongs to. Amazon Glacier is another storage class. However, it is also a separate service to Amazon S3. There are interactions between the two services where S3 allows you to use lifecycle rules to move data from S3 to Glacier as the Standard storage class for archival purposes. But I shall cover more on Amazon Glacier in the next lecture.
Access control lists. Access control lists or ACLs are again another method of controlling who has access to your bucket. However, they only control access for users outside of your own AWS account, such as access from other AWS accounts, or public access. ACLs are not as granular as bucket policies can be and so the permissions are broad in access. For example, list objects and write objects. You may be familiar with a number of recent security instance in which huge amounts of data have been unnecessarily exposed on Amazon S3 because the owners of that data failed to restrict public access to these buckets which may have contained personally identifiable information. Understanding who has access to your buckets and data is essential when using Amazon S3 due to the potential of it being accessible across the internet.
Data encryption, S3 offers a number of different encryption mechanisms to allow you to be able to encrypt your data. These methods cover both server side encryption and client side encryption options. Which include server side encryption with S3 managed keys, known as SSE-S3. Server side encryption with KMS managed keys, known as SSE-KMS. And KMS is the key management service. More information on KMS can be found here within our existing course. Server side encryption with customer managed keys, known as SSE-C. And then we have client side encryption with KMS managed keys, known as CSE-KMS. And finally, client side encryption with customer managed keys, known as CSE-C. The main difference between client side and server side encryption is the location at which the encryption takes place. Server side encryption takes place within a AWS S3 and client side encryption occurs on your client prior to uploading your objects. S3 also fully supports encryption in transit via SSL secure sockets layer. For a graphical view of the process of encryption and decryption for each of these please view the following infographic.
I now want to discuss a couple of features relating to data management. Versioning, when you enable versioning on a bucket it allows for multiple versions of the same object to exist. This is useful to allow you to retrieve previous versions of a file or recover from some accidental deletion, or indeed intended malicious deletion of an object. Versioning is managed and created automatically by the bucket when you overwrite the same object. For easier management, S3 will only display the latest version of the object within the console but it does provide a way of viewing all versions as and when you need to see the previous versions. Versioning is not enabled by default, however once you have enabled it you need to be aware of two main points. Firstly, you can't disable versioning. You can suspend it on the bucket which will prevent any further versions from being created of your objects, but you can't disable it altogether. Secondly, versioning will be an added cost to you as you are storing multiple versions of the same object and the Amazon S3 cost model is based on actual usage of data.
Lifecycle rules, lifecycle rules in AWS provide an automatic method of managing the life of your data while it is being stored on Amazon S3. By adding a lifecycle rule to a bucket you are able to configure and set specific criteria that can automatically move your data from one class to another, move it to Amazon Glacier, or delete it from Amazon S3 altogether. You may want to do this as a cost saving exercise, by moving data to a cheaper storage class after a set period of time. For example, 30 days. And once those 30 days are up, Amazon S3 will then automatically change your storage class off that data as per the lifecycle rule. Another example would be that you may only be required to keep the data for a set period of time before it can be deleted, again, saving you money on storage. In this scenario you can set the bucket's lifecycle policy to automatically delete anything older than 90 days for example. The time frames are up to you and your own set requirements.
Let me now talk about some of the common use cases of Amazon S3 that are used by many organizations. AWS is commonly used in a number of different use cases, due to the features I've already discussed making it widely accessible and usable for different data types. So let me now just cover a few different scenarios where Amazon S3 would be a good solution for your storage requirement, starting with data backup. Many people find that the highly scalable and reliable components of Amazon S3 an ideal choice to store data backup. Either for existing AWS resources that you are using or for your own on-premise data. AWS also offers solutions to help you manage the transfer of your on-premise production data to Amazon S3 as a backup to your primary storage on site. Later in this course I will be discussing methods on how this can be achieved. With its ability to scout enormously and the flexibility of being able to retrieve your data with ease, it's easy to see why S3 makes a great data backup solution. When your data is stored on S3 it can be, permissions allowing, accessible from anywhere you have an internet connection if required. This is another reason S3 makes a great service for data backup solutions allowing you to retrieve your data from anywhere should you need to access it.
Static content and websites, S3 is perfect for storing static data such as images and video which are used on almost every website. Every object can also be referenced directly via unique URL or by a content delivering network such as Amazon CloudFront which interacts closely with S3. If your website is entirely static then your whole website can in fact be hosted on Amazon S3 providing a highly scalable and cost effective method for running your website. I will cover Amazon CloudFront later in the course, however if you need additional information on both Amazon CloudFront and running the static website within Amazon S3, please see the following links to our existing courses and labs on this topic.
Large data sets, S3 is also great to store computational, scientific, and statistical data allowing you to perform big data analytics. This makes this kind of data accessible to multiple parties at once for analysis without impacting performance thanks to the horizontal scaling abilities of S3. Due to the size of some of this data it's also a very cost effective method of storing large amounts of data that can be easily accessed and shared by a number of people.
Integration with other AWS services, Amazon S3 is widely used by a number of other AWS services within AWS to help those services perform their own functionality and features behind the scenes. For example, the Elastic Block Store service known as EBS, which I'll be discussing more in detail in an upcoming lecture, is able to create a backup of itself and store this backup on Amazon S3 as a snapshot. However, unlike the buckets that you create and the data that you store on S3 your EBS snapshots are not visible in any S3 buckets that you own. The snapshots are managed by AWS and hidden from S3 console as these snapshots are only backed by S3 and do not offer you the ability to manage their storage requirements as there is no need for you to so. Using S3 for this purpose, makes your EBS snapshots highly available and highly resilient. Another example would be logging. Many services use Amazon S3 to store their logs, such as AWS CloudTrail. AWS CloudTrail is a service that records and tracks all API called made within your AWS account. These API calls are recorded as events and then stored within a log file. This log file is then stored on S3. Again, due to the highly scalable and reliable nature of this service it makes sense for other services to use S3 for purposes such as this. In this instance, you are able to view the CloudTrail logs by one of your configured buckets specified during the CloudTrail creation. Amazon S3 can also be used as an origin for an Amazon CloudFront distribution. During the configuration of your CloudFront distribution, which I'll explain in further detail in a later lecture, you are able to specify a bucket to be used that stores files that are to be distributed out to AWS edge locations to help reduce web access latency for your end users.
As with many services, the cost of S3 storage varies depending on the region you select. Let me look at the example of the London region, across the three different storage classes. The first thing that I want to point is that the reduced redundancy storage class is actually more expensive than the standard storage. Originally, reduced redundancy storage was introduced to reduce the cost over that of the standard class as a trade off against lower durability. However, for many regions this is no longer the case as you can see. It is now more cost effective to use the standard class which provides a greater level of resilience for a cheaper price. If you want to optimize your cost then the preferable option would be to use the infrequent access storage class. However, the availability of this class drops to 99.9% instead of 99.99%. The durability remains the same reaching 11 nines of durability. The more storage you use on S3 the cost of each gigabyte reduces when certain thresholds are reached.
When looking at your S3 costs you might feel you are only charged for the storage itself per gigabyte, however there are a number of other cost elements to S3 and it's worth being aware of a couple of these. Including request costs and data transfer costs. Your request costs are based on actions such as put, copy, post, and get requests and are charged on a per 10,000 requests basis. Again, it will depend on your storage class as to the cost of these requests. As an example, when using the standard storage class using the London region it will cost you just over five cents per 10,000 put, copy, post, or list requests. For get requests and all other types of request it will cost you just over four cents per 10,000 requests. Data being transferred into S3 is free, however to transfer data out to another region costs two cents per gigabyte. For the latest information on costs please refer to the Amazon S3 pricing page shown here.
Although I have covered a number of reasons as to why S3 is great for different storage solutions, it's not a catch all storage service. For example, it's not ideal for the following scenarios. Archiving data for long term use, perhaps for compliance. When the data is dynamic and changing very fast. If the data being stored requires a file system. And structured data that needs to queried. There are other storage services that I will discuss that are far more suited to perform these functionalities.
About the Author
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data centre and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 50+ courses relating to Cloud, most within the AWS category with a heavy focus on security and compliance
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.