Amazon FSx for Lustre Makes High Performance Computing More Accessible

One service that caught my attention among the AWS re:Invent announcements was Amazon FSx for Lustre. While we have seen a lot of performance improvements in cloud compute and networking recently, I/O performance feels quite static despite often being the bottleneck in high-end computing. So the Amazon FSx for Lustre announcement really piqued my interest.

Lustre (the name is a blend of Linux and cluster) is a highly performant parallel file system. Lustre is by no means new. The Lustre file system was owned by Sun Microsystems, then Oracle, before ending up at Seagate, who gave the project back to the open source community in 2015. The community surrounding the project remains strong, and up until Linux 4.18 the Lustre client was included in the Linux kernel's staging tree.

Lustre is a powerful tool, but it is quite complex to implement and manage. Having Lustre available as an Amazon managed service means high-performance I/O can be more accessible and user-friendly, especially to organizations starting out with data analysis.

There are many performant file systems out there. The Google File System (GFS) and the Hadoop Distributed File System (HDFS) both deliver high-performance I/O as cluster-based file systems, but for non-Hadoop AWS users, the file-system-as-a-service choices were quite limited. The most common cloud file system is the Network File System (NFS). NFS was developed by Sun Microsystems and is also common on UNIX systems. There are other file systems and protocols, such as Server Message Block (SMB), used by Windows, and the Apple Filing Protocol (AFP), but let’s not digress. The main issue is that NFS works really well, but not at scale. Even as network performance has improved, the way NFS works with files hasn’t really changed much, save for supporting larger transfer sizes. Because of the way NFS fetches file attributes, metadata can become the bottleneck.

Amazon Elastic File System or AWS Storage Gateway can serve the majority of general file system use cases, but when we want to move large files around quickly for data analysis, data modeling and machine learning, the I/O constraints of NFS and SMB begin to show. (Not to get you sidetracked, but did you hear about GPU Acceleration for Faster Inferencing?)

Lustre is a parallel file system that is similar to NFSv4 in many ways, but different when you need it to scale. Both NFS and a parallel file system use a client that accesses shared storage over the network. The big difference is the I/O performance. A parallel file system client like Lustre’s can open more connections, and so drive higher I/O, than an NFS client can over a single TCP/IP connection to a server. The parallel file system is built to scale: it can support very high numbers of files, and managing metadata is a key part of its design. In Lustre, metadata requests are handled by a metadata server (MDS), independently of the servers that deliver the file data itself. That is why parallel file systems have become the preferred solution for high-performance computing systems like the ones we see at organizations like NASA.
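To make the scaling argument a little more concrete, here is a toy model of how a parallel file system can spread one file across several storage targets so that clients can read from all of them at once. It assumes simple round-robin striping, which is how Lustre lays out file data by default as far as I understand it; the function names, stripe size and target count are all illustrative, and a real layout is configurable and more involved.

```python
from collections import Counter

MIB = 1024 * 1024  # bytes in one mebibyte


def stripe_targets(file_size, stripe_size, stripe_count):
    """Return the storage target index for each stripe-sized chunk of a file.

    Chunks are assigned round-robin: chunk 0 to target 0, chunk 1 to
    target 1, and so on, wrapping around after stripe_count targets.
    """
    chunks = -(-file_size // stripe_size)  # ceiling division
    return [chunk % stripe_count for chunk in range(chunks)]


def bytes_per_target(file_size, stripe_size, stripe_count):
    """Return a dict mapping target index to the bytes of the file it holds."""
    totals = Counter()
    targets = stripe_targets(file_size, stripe_size, stripe_count)
    for chunk, target in enumerate(targets):
        start = chunk * stripe_size
        # The last chunk may be shorter than a full stripe.
        totals[target] += min(stripe_size, file_size - start)
    return dict(totals)


if __name__ == "__main__":
    # Hypothetical example: a 10 MiB file, 1 MiB stripes, 4 storage targets.
    layout = bytes_per_target(10 * MIB, 1 * MIB, 4)
    for target, size in sorted(layout.items()):
        print(f"target {target}: {size // MIB} MiB")
```

Because no single server holds the whole file, a client (or many clients) can pull different chunks from different targets in parallel, while the metadata server only has to answer the question of where the chunks live.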

Having Lustre available as a managed AWS solution makes it possible for the rest of us to build data solutions at scale. So let’s think about how we can start to use Amazon FSx for Lustre to solve business problems.

Imagine we want to analyze three years of archived customer support tickets to look for patterns. The first problem for non-data scientists starting a data project is often just accessing and loading data. We know we have terabytes of data we can analyze, as we’ve been storing support ticket archives and logs as objects in Amazon S3 buckets. However, the naming of the buckets and the objects themselves has been haphazard over the years as people have come and gone from the process. Using the Amazon FSx for Lustre service, we can create a file system to sit in front of part or all of an S3 bucket or buckets. It takes a few minutes for Lustre to “read” the content and the associated metadata within a bucket using an ImportPath, and then present that as a file structure. The ImportPath “read” doesn’t move any items. The object data is only copied from the bucket when it is needed. We can manage access control using security groups, and Amazon FSx for Lustre is HIPAA and PCI compliant. We also get granular performance reporting from Amazon CloudWatch, so there are potentially more performance metrics to draw on.
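As a rough sketch of what creating such a file system might look like from code, the snippet below builds the parameters for the boto3 FSx `create_file_system` call with an ImportPath pointing at an S3 prefix. The bucket name, subnet ID, security group ID and the 3600 GiB capacity are placeholders I have made up for illustration, and `build_lustre_request` is a hypothetical helper, not part of any AWS library. Check the current FSx documentation for valid capacities and options before relying on it.

```python
def build_lustre_request(import_path, subnet_id, security_group_ids,
                         storage_capacity_gib=3600):
    """Build keyword arguments for the boto3 FSx create_file_system call.

    import_path is the S3 bucket or prefix the file system sits in front of.
    All values passed in here are placeholders chosen by the caller.
    """
    if not import_path.startswith("s3://"):
        raise ValueError("import_path must look like s3://bucket/optional-prefix")
    return {
        "FileSystemType": "LUSTRE",
        "StorageCapacity": storage_capacity_gib,  # in GiB; assumed value
        "SubnetIds": [subnet_id],
        "SecurityGroupIds": list(security_group_ids),
        "LustreConfiguration": {"ImportPath": import_path},
    }


def create_file_system(request):
    """Send the request to AWS; needs boto3 installed and AWS credentials."""
    import boto3  # imported lazily so the builder works without boto3

    fsx = boto3.client("fsx")
    response = fsx.create_file_system(**request)
    return response["FileSystem"]["FileSystemId"]


if __name__ == "__main__":
    # Hypothetical identifiers; replace with real ones from your account.
    request = build_lustre_request(
        import_path="s3://example-support-archive/tickets/",
        subnet_id="subnet-0123456789abcdef0",
        security_group_ids=["sg-0123456789abcdef0"],
    )
    print(request)
    # Uncomment to actually create the file system (this incurs cost):
    # print(create_file_system(request))
```

Keeping the request builder separate from the AWS call makes it easy to review exactly which bucket prefix and security groups are about to be used before anything is created.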

Once we’ve created our file system, we can expect very fast access to it from numerous machines or clients, whether they run on EC2 or on-premises. For on-premises clients, you will get the best performance using AWS Direct Connect rather than a VPN. We might start with a small number of processing clients, but that number could swell dramatically as we start to harness the compute processing available to us. The business benefit of using the Amazon FSx for Lustre service is that we won’t be constrained by I/O, or by the need to manage and scale Lustre nodes ourselves.
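On a single client, one simple way to take advantage of that I/O headroom is to read many files concurrently rather than one at a time. The sketch below walks a directory tree and reads every file using a thread pool. The `/mnt/fsx` mount point and the worker count are assumptions for illustration, and nothing here is specific to Lustre: it works against any mounted directory.

```python
import os
import sys
from concurrent.futures import ThreadPoolExecutor


def read_file(path, chunk_size=1024 * 1024):
    """Read a file in chunks and return the number of bytes read."""
    total = 0
    with open(path, "rb") as handle:
        while True:
            chunk = handle.read(chunk_size)
            if not chunk:
                break
            total += len(chunk)
    return total


def read_tree_in_parallel(root, workers=8):
    """Read every file under root concurrently; return {path: bytes_read}."""
    paths = [
        os.path.join(directory, name)
        for directory, _, names in os.walk(root)
        for name in names
    ]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        sizes = list(pool.map(read_file, paths))
    return dict(zip(paths, sizes))


if __name__ == "__main__":
    # /mnt/fsx is an assumed mount point; pass another path as an argument.
    root = sys.argv[1] if len(sys.argv) > 1 else "/mnt/fsx"
    results = read_tree_in_parallel(root)
    print(f"read {len(results)} files, {sum(results.values())} bytes in total")
```

In a real analysis job, `read_file` would be replaced by whatever parsing or processing the ticket archives need, and the same pattern can be repeated across many client machines.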

Were you at re:Invent this year? If you didn’t get to stop by our booth and wanted to talk to us, don’t despair. Set up a conversation with us today.

Cloud Academy