How Alfresco scaled to billions of documents on AWS
John Newton – Founder and, since 2005, CTO at Alfresco – used his AWS re:Invent presentation to talk about how Alfresco has been scaling to billions of documents and building apps capable of accessing that huge amount of content…all while moving from large data centers to cost-effective management on the Cloud.
Alfresco completely embraced the open-source model and built a collaborative environment that currently supports more than 1800 customers, eleven million users, seven billion documents, and less than 400 employees.
Why is content at scale important?
The initial challenge was to store one billion documents, which was quite an impressive amount of data ten years ago – definitely over the petabyte scale. Today, of course, searching Google for the word “Amazon” will return that many pages, but things were different in 2005. Apparently someone tried configuring one million SharePoint servers back then, but of course that doesn’t work well.
The motivation behind this challenge can be identified in the incredible digital transformation that is driving huge flows of content: Cloud, Mobile, Social Networks, Big Data, etc., creating a whole new range of digital business. ECM (Enterprise Content Management), for instance, is a six billion dollar market.
So what are the main use cases for content at scale?
- enterprise document libraries.
- medical records.
- transaction and logistic records.
- government archives.
- claims processing.
- research and analysis.
- real-time video.
- discovery and litigation.
- loans and policies.
- IoT (Internet of Things).
Given this wide range of use cases, you can see why the numbers have grown so high: users need to search and retrieve documents, sync and share files, manage and archive all kinds of data content like records, images, and media. That’s why we have witnessed a conceptual transition from Content to Data, Files, and then EFSS. And that’s why John Newton admitted that working with such content architectures is a significant big data problem.
Since the main use case that drove Alfresco’s innovation was related to insurance companies, they also jumped on to the new Amazon Aurora database as soon as they could.
What is content at scale?
Content at scale is not just a matter of billion of documents. It also means dealing with a lot of geographically distributed users, who demand a certain level of read/write throughput.
Naturally, concurrency and volume size are serious and constant concerns, and large repositories in particular require both scaling up (clustered servers, databases, indexes, read replicas, etc) and scaling out (sharding, federation, replication, shared nothing, etc).
In the face of these issues, traditional approaches are limited in what they can provide for redundancy, elasticity, agility, geographic distribution, provisioning, and administration.
Why Amazon Aurora?
John decided to move to Amazon Aurora for three main reasons:
- Aurora is highly available (sync/async replication).
- Aurora offers a significantly more efficient use of network I/O.
- Aurora is self-healing and fault-tolerant, with instant crash recovery.
To illustrate the kind of modifications he required to move his system to Aurora, John showed us a blank page: beyond a simple configuration switch, no modification was required.
The Alfresco team also worked on some large scale benchmarking for concurrent loads and access (BM4), involving 1.2 billion documents, 500 simulated concurrent users (with Selenium) during 1 hour of constant load.
The system completed more than 15 million transactions, with a load-rate of 1200/s, 80% DB CPU load in bulk load, and Aurora’s indexes worked efficiently at 3.2TB. There were no size-related bottlenecks and John assured his audience that the very same infrastructure could sustain up to 20 billion documents.
Google Cloud Platform Certification: Preparation and Prerequisites
Google Cloud Platform (GCP) has evolved from being a niche player to a serious competitor to Amazon Web Services and Microsoft Azure. In 2019, research firm Gartner placed Google in the Leaders quadrant in its Magic Quadrant for Cloud Infrastructure as a Service for the second consecuti...
New Lab Challenges: Push Your Skills to the Next Level
Build hands-on experience using real accounts on AWS, Azure, Google Cloud Platform, and more Meaningful cloud skills require more than book knowledge. Hands-on experience is required to translate knowledge into real-world results. We see this time and time again in studies about how pe...
New on Cloud Academy: AWS Solution Architect Lab Challenge, Azure Hands-on Labs, Foundation Certificate in Cyber Security, and Much More
Now that Thanksgiving is over and the craziness of Black Friday has died down, it's now time for the busiest season of the year. Whether you're a last-minute shopper or you already have your shopping done, the holidays bring so much more excitement than any other time of year. Since our...
Understanding Enterprise Cloud Migration
What is enterprise cloud migration? Cloud migration is about moving your data, applications, and even infrastructure from your on-premises computers or infrastructure to a virtual pool of on-demand, shared resources that offer compute, storage, and network services at scale. Why d...
6 Reasons Why You Should Get an AWS Certification This Year
In the past decade, the rise of cloud computing has been undeniable. Businesses of all sizes are moving their infrastructure and applications to the cloud. This is partly because the cloud allows businesses and their employees to access important information from just about anywhere. ...
AWS Regions and Availability Zones: The Simplest Explanation You Will Ever Find Around
The basics of AWS Regions and Availability Zones We’re going to treat this article as a sort of AWS 101 — it’ll be a quick primer on AWS Regions and Availability Zones that will be useful for understanding the basics of how AWS infrastructure is organized. We’ll define each section,...
Application Load Balancer vs. Classic Load Balancer
What is an Elastic Load Balancer? This post covers basics of what an Elastic Load Balancer is, and two of its examples: Application Load Balancers and Classic Load Balancers. For additional information — including a comparison that explains Network Load Balancers — check out our post o...
Advantages and Disadvantages of Microservices Architecture
What are microservices? Let's start our discussion by setting a foundation of what microservices are. Microservices are a way of breaking large software projects into loosely coupled modules, which communicate with each other through simple Application Programming Interfaces (APIs). ...
Kubernetes Services: AWS vs. Azure vs. Google Cloud
Kubernetes is a popular open-source container orchestration platform that allows us to deploy and manage multi-container applications at scale. Businesses are rapidly adopting this revolutionary technology to modernize their applications. Cloud service providers — such as Amazon Web Ser...
AWS Internet of Things (IoT): The 3 Services You Need to Know
The Internet of Things (IoT) embeds technology into any physical thing to enable never-before-seen levels of connectivity. IoT is revolutionizing industries and creating many new market opportunities. Cloud services play an important role in enabling deployment of IoT solutions that min...
Which Certifications Should I Get?
As we mentioned in an earlier post, the old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and compan...
How to Go Serverless Like a Pro
So, no servers? Yeah, I checked and there are definitely no servers. Well...the cloud service providers do need servers to host and run the code, but we don’t have to worry about it. Which operating system to use, how and when to run the instances, the scalability, and all the arch...