This course covers the core learning objectives to meet the requirements of the 'Designing Storage solutions in AWS - Level 1' skill
Learning Objectives:
- Understand the different AWS storage services that are available
- Analyze the differences between Block, Object and File storage
- Understand what workloads are suitable for Amazon EBS
- Understand what workloads are suitable for Amazon S3
Amazon FSx. When looking at how to store data within AWS, there are multiple ways to get the job done. However, each service that can store data is more or less tailored to a specific use case. The first storage option you probably learned about was Amazon S3, the Simple Storage Service. This service is incredibly durable, robust, and can handle nearly infinite amounts of data. However, the problem you might encounter when working with Amazon S3 is that it is an object-based storage system. This means that if you decide to update a file you have stored within the service, even if you only change a tiny portion of that file, you'll have to re-upload the entire object all over again. This is incredibly wasteful, both time-wise and network-wise, if your solution involves constantly updating files with small changes. In general, Amazon S3 is a write once, read many kind of storage. So what should we do if we want to write our files more than just once in a while?
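To make that object-storage limitation concrete, here is a minimal boto3 sketch (the bucket and key names are hypothetical): even a one-line edit means re-uploading the whole object, because S3 exposes no partial-update operation.

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical bucket and key, purely for illustration.
bucket, key = "my-example-bucket", "reports/annual.csv"

# Download the whole object and change one line locally...
s3.download_file(bucket, key, "/tmp/annual.csv")
with open("/tmp/annual.csv", "a") as f:
    f.write("one,new,row\n")

# ...and since there is no "patch" API, the only way to persist the
# edit is to upload the entire object again, however small the change.
s3.upload_file("/tmp/annual.csv", bucket, key)
```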
Well, this is where having a dedicated file system starts to become a good idea. At the moment, there are two options within AWS for creating a file system: the first is to use Amazon EFS, the Elastic File System, and the second is to use Amazon FSx. Amazon EFS versus Amazon FSx in general. Naturally, there might be a little confusion about which of these file systems you are supposed to use. Let's take a few moments here to separate the two so that you can make a good decision on what you might need for your architectures. EFS. Starting off, Amazon EFS provides a fairly simple and scalable file system that can be used for Linux-based workloads. It is a regional service that stores your data within and across multiple Availability Zones, providing high availability and strong durability for your data. Amazon EFS is a managed NAS filer for your EC2 instances based on NFS version 4. It allows you to mount the file system on various EC2 instances and on-premises compute devices.
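As a rough sketch of how little setup EFS demands, here is how a file system might be created with boto3; the token, subnet, and security group values are placeholders for resources in your own VPC, not prescribed names.

```python
import boto3

efs = boto3.client("efs")

# Create the file system itself; EFS grows and shrinks automatically,
# so no capacity needs to be declared up front.
fs = efs.create_file_system(
    CreationToken="demo-efs",        # idempotency token (hypothetical)
    PerformanceMode="generalPurpose",
    Encrypted=True,
)

# A mount target per Availability Zone gives instances in that AZ an
# NFSv4 endpoint they can mount with the standard Linux NFS client.
efs.create_mount_target(
    FileSystemId=fs["FileSystemId"],
    SubnetId="subnet-0123456789abcdef0",
    SecurityGroups=["sg-0123456789abcdef0"],
)
```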
FSx. Amazon FSx, however, comes in two different flavors. The first option is FSx for Windows File Server. It provides a fully managed Microsoft Windows file system, which means it will work well for your Windows-based applications that require file storage. The second option is Amazon FSx for Lustre, which is more focused on high-performance computing. This system is POSIX-compliant and ready for use by Linux-based applications. Now, we're gonna go much deeper into the details of FSx here. But at a high level, you can start to make a decision between these services based on your needs. EFS: I just need a dead simple managed file system that multiple Linux instances can communicate with. FSx for Windows File Server: I need a managed file system that can handle Windows workloads. FSx for Lustre: I work with high-performance applications that need a shared file system where speed is super important.
Alrighty, now that we have a basic idea of where everything sits within the AWS wheelhouse, we can start to dive into the specifics of FSx for Windows File Server and FSx for Lustre. If you are more interested in Amazon EFS, please take a look over here for more details on that service. FSx for Windows File Server. Amazon FSx for Windows File Server provides a highly scalable, fully managed file storage solution that is accessible over the Server Message Block (SMB) protocol. Simply put, it is a shared file system that allows multiple computers to connect to a single location to access the files and folders they need. There are a number of applications and workloads that need access to a shared file system, and FSx for Windows File Server provides a simple-to-use platform that works well for home directories, line-of-business applications, web servers and content management, software development environments, various media workflows, and data analytics. It is accessible from Windows machines, from Linux machines using the cifs-utils tools, and from macOS instances and devices. You can have thousands of connections active concurrently, which can come from almost anywhere in the world. Since the service is built on Windows Server, it allows you to define a number of administrative features, such as user quotas, end-user file restore, and Microsoft Active Directory integration. This AD can be an AWS-managed AD or a self-managed one from on premises. This means you can have strong integration with your organization that allows for both authentication and authorization of your users and files within the system. You can even use your own ACLs and share-level access controls.
Now, when it comes down to the raw specifications of FSx for Windows File Server, it is able to handle storage from 32 gigabytes all the way up to 64 terabytes of active file data. If your solution requires more storage than this, you have the ability to combine multiple FSx file systems together using Microsoft Distributed File System (DFS). By using Microsoft DFS, you can create a combined folder structure that can store hundreds of petabytes of data. Besides total data storage, you also have to specify how much throughput you expect the system to require on a per-second basis. You can start as low as eight megabytes per second and move all the way up to two gigabytes per second of network throughput capacity. Since workloads are often spiky rather than a steady stream, FSx operates on a network I/O credit basis: you accrue credits when throughput is lower than the baseline limits and spend them when operating above the baseline. This allows the system to burst its network throughput when a lot of data needs to move through the network at one time. Now let's take a look at this table to get an idea of how high the throughput can burst if you have enough credits available. If these numbers still do not meet your desired network throughput, you can again distribute the data amongst many file systems to get around these limits. FSx also provides a fast in-memory cache on the file server that will greatly increase performance for your most frequently accessed data.
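To tie the sizing knobs together, here is a hedged boto3 sketch of creating an FSx for Windows File Server file system. The subnet, security group, and directory IDs are placeholders, and the capacity and throughput values are simply illustrative points within the ranges discussed above.

```python
import boto3

fsx = boto3.client("fsx")

# Both storage capacity and baseline throughput are declared up front;
# bursting above the baseline then rides on network I/O credits.
response = fsx.create_file_system(
    FileSystemType="WINDOWS",
    StorageCapacity=300,                      # GiB, within the 32 GB-64 TB range
    SubnetIds=["subnet-0123456789abcdef0"],
    SecurityGroupIds=["sg-0123456789abcdef0"],
    WindowsConfiguration={
        "ThroughputCapacity": 32,             # baseline MB/s (8 up to 2048)
        "DeploymentType": "SINGLE_AZ_2",
        "ActiveDirectoryId": "d-0123456789",  # AWS Managed Microsoft AD (placeholder)
    },
)
print(response["FileSystem"]["FileSystemId"])
```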
Data deduplication. Another great feature of the service is that FSx has a data deduplication option that can be turned on to help save space on the network file system. Large data sets that are used by multiple users or systems will oftentimes contain redundant data, which increases your data storage costs if not dealt with. By turning on data deduplication, the system will remove redundant data by storing duplicated portions of the data only one time. Deduplication runs as a background process that does not significantly affect the performance of the file system, and it is a transparent part of the file system that will not be obvious to your connected users or clients. Data deduplication is automatic and will continue to scan your file systems in the background, looking for any extra copies of data.
Encryption and security. Amazon FSx for Windows File Server encrypts your file system at rest, and encrypts data in transit using SMB Kerberos session keys when you access your file system from clients that support SMB 3.0 and higher. Maintenance and backups. You'll have to specify a time window for your maintenance to take place. Maintenance happens fairly infrequently, about once every several weeks, and takes up only a small fraction of the 30-minute time slot. During this time, a single-AZ file system will be temporarily unavailable, while a multi-AZ setup will automatically fail over and fail back between the AZs. Separately from the maintenance window, your system can perform automatic daily backups of your file system. You can state how long you wish them to be kept, ranging from zero to 35 days. If you wanna keep backups for longer than 35 days, you'll have to start a user-initiated backup. These are retained based on your own preferences and will have to be manually deleted on your own time. Creating and using the Windows File Server file system. When you create an FSx for Windows File Server file system, you'll have the option to deploy either a single-AZ or multi-AZ system. Multi-AZ deployments help with fault tolerance and high availability, but will cost a little more to run. Upon creation, FSx will build a Windows file server and a share for you to access. This is accessed via an elastic network interface, a network adapter that FSx places within your VPC, allowing your instances to communicate with the file system.
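As a rough boto3 sketch of the backup side (the file system ID is a placeholder), a user-initiated backup that outlives the 35-day automatic retention could look like this:

```python
import boto3

fsx = boto3.client("fsx")

# Automatic daily backups are configured on the file system itself
# (e.g. WindowsConfiguration's AutomaticBackupRetentionDays, 0-35).
# A user-initiated backup, by contrast, is kept until you delete it.
backup = fsx.create_backup(
    FileSystemId="fs-0123456789abcdef0",   # placeholder ID
    Tags=[{"Key": "Purpose", "Value": "pre-migration snapshot"}],
)
print(backup["Backup"]["BackupId"])

# Deleting it later is an explicit, manual action:
# fsx.delete_backup(BackupId=backup["Backup"]["BackupId"])
```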
You also have the option to connect your on-premises users, servers, and instances to FSx if you have a Direct Connect or VPN connection that can reach the FSx network adapter. To actually map a share from FSx to your local file system or computer, you'll need to get the DNS name from the FSx console. You can then plug it right into Windows Explorer and map it like a network drive. Now that we have an idea of what FSx for Windows File Server can do, I think we can take a look at FSx for Lustre and see some of the differences. FSx for Lustre. Amazon FSx for Lustre is another file system that is available within the FSx service. As the name implies, this version is built on top of the open-source Lustre file system. Behind the scenes, it's a shared POSIX file system that was purposely designed to work with applications that require extremely fast file storage. It is able to scale its performance, providing up to hundreds of gigabytes per second of throughput, millions of IOPS, and sub-millisecond latency. Lustre can also provide concurrent access for hundreds of thousands of compute cores at once. That number is just mind-boggling, honestly. Why does high-performance storage like this matter? Well, when performing large amounts of parallel processing with many compute resources, you need a storage system that is equally efficient and performant in retrieving and writing that data. Without this, a compute cluster would have to spend more time idle while waiting for storage to return whatever data is needed. High-performance storage reduces these bottlenecks, allowing you to get the full power out of your HPC workloads.
What this means in a use-case scenario is that FSx for Lustre works particularly well for workloads where speed and connectivity matter, such as high-performance compute, machine learning, video processing, and much, much more. Like many other AWS services that implement open-source technologies, AWS manages the setup, updating, and general provisioning of the underlying service. Choosing the right storage types for your architecture. If you're going to use FSx for Lustre, this means you've determined a need for some type of high-performance file storage. One of the most important things to consider when designing your architectures around storage is understanding whether your application is sensitive to throughput or to IOPS. If your application requires the lowest latency, is IOPS-intensive, and features small random file operations, you are gonna wanna use some type of solid state disk. SSDs are particularly good at hopping around the disk and grabbing data quickly from random locations. If your application requires the maximum amount of throughput, where you're accessing large files or have many sequential file operations, you'll wanna use a hard disk drive. HDDs are not good at jumping around because there's a physical component to these devices. They are, however, very good at reading data that sits right next to each other extremely quickly. And you can even provision an SSD cache on top of your hard disk drive to provide sub-millisecond latencies and higher IOPS for your most frequently accessed files.
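A hedged boto3 sketch of that choice: creating a throughput-oriented, HDD-backed Lustre file system with the optional SSD read cache. The subnet ID is a placeholder and the sizes are illustrative.

```python
import boto3

fsx = boto3.client("fsx")

# HDD storage suits sequential, throughput-heavy access; the optional
# SSD read cache (DriveCacheType="READ") serves the most frequently
# read files at SSD latencies.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageType="HDD",                    # "SSD" for latency/IOPS-sensitive workloads
    StorageCapacity=6000,                 # GiB; HDD capacity comes in fixed increments
    SubnetIds=["subnet-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "PERSISTENT_1", # HDD storage requires a persistent deployment
        "PerUnitStorageThroughput": 12,   # MB/s per TiB of storage
        "DriveCacheType": "READ",         # the SSD cache mentioned above
    },
)
print(response["FileSystem"]["FileSystemId"])
```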
Cloud bursting. There are many scenarios where your data might be located on-premises, but you don't have the necessary compute resources available locally to do the work that you need to do. This is where the idea of cloud bursting comes into play, and it is fully supported by FSx for Lustre. You can spin up your compute cluster in the cloud and create a new file system for those resources to share. You can then mount that Lustre file system from on-premises using a Direct Connect or VPN connection, and move your data temporarily into the file system so that it's physically close to the compute cluster. Once you've performed the desired work on your data, you can move everything back locally and shut down the resources you no longer need.

Lustre also has deep integrations with Amazon S3. When using FSx for Lustre, you have the ability to link an S3 bucket into the file system and access the objects of that bucket just as if they were normal files. When the contents of your S3 bucket change, the FSx file system automatically updates to show these changes. This is a very powerful workflow for many machine learning scenarios and other high-performance compute applications. For many workloads of this caliber, you'll oftentimes collect your data sets within S3, using it as an easy central storage container. If you wanted to perform some type of compute on these datasets, you would normally have to be in charge of moving the data around, copying it back and forth between a file system and S3. With FSx for Lustre, this burden is dealt with for you. What is happening behind the scenes here is that all of the files within S3 get loaded up as a sort of icon or reference. You can see the files, but in actuality, nothing yet exists on the disk, as it were. When you go to actually access a file for the first time, FSx will do a sort of lazy loading and copy that S3 object into the file system. When you go to save the files, those are then back-filled into S3. If you wanna create new files within the system, FSx will build those and save them as objects within S3 as well. When using Amazon S3 as your data store like this, you can actually just shut down your file system and any associated compute when you're not actively using it, helping you to save a lot of money. Additionally, you have the ability to link the same S3 bucket across multiple Availability Zones. This allows your S3 bucket to be the single source of truth for all your data between AZs, giving you the ability to have multiple HPC clusters in different AZs working on the same data set without too much worry.
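To make the S3 linkage concrete, here is a minimal boto3 sketch (bucket name, subnet ID, and sizes are hypothetical) that creates a scratch Lustre file system whose namespace is lazily loaded from, and written back to, an S3 bucket:

```python
import boto3

fsx = boto3.client("fsx")

# ImportPath lazily surfaces the bucket's objects as file references;
# ExportPath is where changed and newly created files are written back.
response = fsx.create_file_system(
    FileSystemType="LUSTRE",
    StorageCapacity=1200,                 # GiB; the smallest scratch size
    SubnetIds=["subnet-0123456789abcdef0"],
    LustreConfiguration={
        "DeploymentType": "SCRATCH_2",    # short-lived, burst-style workloads
        "ImportPath": "s3://my-dataset-bucket",
        "ExportPath": "s3://my-dataset-bucket/results",
    },
)
fs_id = response["FileSystem"]["FileSystemId"]

# Because S3 remains the source of truth, the file system can simply be
# deleted when the compute job is done, and recreated later as needed:
# fsx.delete_file_system(FileSystemId=fs_id)
```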
Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.
To date, Stuart has created 150+ courses relating to cloud computing, reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.
Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.
He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.
In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.
Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.