Scaling Massive Content with Alfresco and Amazon Aurora

How Alfresco scaled to billions of documents on AWS

John Newton, founder of Alfresco and its CTO since 2005, used his AWS re:Invent presentation to explain how Alfresco has been scaling to billions of documents and building apps capable of accessing that huge amount of content, all while moving from large data centers to cost-effective management in the cloud.

Alfresco completely embraced the open-source model and built a collaborative environment that currently supports more than 1,800 customers, eleven million users, and seven billion documents, all with fewer than 400 employees.

Why is content at scale important?

The initial challenge was to store one billion documents, which was an impressive amount of data ten years ago and definitely beyond the petabyte scale. Today, of course, searching Google for the word “Amazon” will return that many pages, but things were different in 2005. Apparently someone tried configuring one million SharePoint servers back then, but of course that didn’t work well.

The motivation behind this challenge lies in the digital transformation that is driving huge flows of content (cloud, mobile, social networks, big data, and so on) and creating a whole new range of digital businesses. ECM (Enterprise Content Management), for instance, is a six-billion-dollar market.

So what are the main use cases for content at scale?

  • enterprise document libraries.
  • medical records.
  • transaction and logistic records.
  • government archives.
  • claims processing.
  • research and analysis.
  • real-time video.
  • discovery and litigation.
  • loans and policies.
  • IoT (Internet of Things).

Given this wide range of use cases, you can see why the numbers have grown so high: users need to search and retrieve documents, sync and share files, and manage and archive all kinds of content, such as records, images, and media. That’s why we have witnessed a conceptual transition from Content to Data, to Files, and then to EFSS (Enterprise File Sync and Share). And that’s why John Newton admitted that working with such content architectures is a significant big data problem.

Since the main use case that drove Alfresco’s innovation was related to insurance companies, they also jumped on to the new Amazon Aurora database as soon as they could.

What is content at scale?

Content at scale is not just a matter of billions of documents. It also means dealing with a large number of geographically distributed users who demand a certain level of read/write throughput.

Naturally, concurrency and volume size are serious and constant concerns, and large repositories in particular require both scaling up (clustered servers, databases, indexes, read replicas, etc.) and scaling out (sharding, federation, replication, shared nothing, etc.).
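
To make the scaling-out idea more concrete, here is a minimal sketch of hash-based sharding, one of the techniques listed above. It is an illustration only and not Alfresco's implementation: the function names `shard_for` and `distribute`, the document IDs, and the shard count are all hypothetical.

```python
import hashlib
from collections import defaultdict


def shard_for(document_id: str, shard_count: int) -> int:
    """Return a stable shard index in [0, shard_count) for a document ID."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    # SHA-256 gives the same result on every machine and every run,
    # unlike Python's built-in hash(), which is salted per process.
    digest = hashlib.sha256(document_id.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


def distribute(document_ids, shard_count):
    """Group document IDs by the shard that would own them."""
    shards = defaultdict(list)
    for document_id in document_ids:
        shards[shard_for(document_id, shard_count)].append(document_id)
    return dict(shards)


if __name__ == "__main__":
    ids = [f"doc-{n}" for n in range(10)]  # hypothetical document IDs
    for shard, members in sorted(distribute(ids, 4).items()):
        print(shard, members)
```

A plain modulo scheme like this one moves most documents to a different shard whenever the shard count changes, which is one reason large repositories often prefer consistent hashing or a lookup directory.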

In the face of these issues, traditional approaches are limited in what they can provide for redundancy, elasticity, agility, geographic distribution, provisioning, and administration.

Why Amazon Aurora?

Alfresco’s solution is based on Amazon’s RDS, EBS, S3 and Glacier services. Their whole system is open source and developed in Java (you can read more about getting involved here).

John decided to move to Amazon Aurora for three main reasons:

  1. Aurora is highly available (sync/async replication).
  2. Aurora offers a significantly more efficient use of network I/O.
  3. Aurora is self-healing and fault-tolerant, with instant crash recovery.

To illustrate the kind of modifications he required to move his system to Aurora, John showed us a blank page: beyond a simple configuration switch, no modification was required.
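
As a rough sketch of what such a configuration switch can look like, the snippet below models database settings and changes only the endpoint. It assumes the application was already talking to a MySQL-compatible database (Aurora offers MySQL compatibility); the `DatabaseSettings` class, the hostnames, and the database name are hypothetical and are not taken from Alfresco's actual configuration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class DatabaseSettings:
    """Connection settings an application might read from its config file."""
    host: str
    port: int
    database: str

    def jdbc_url(self) -> str:
        # MySQL-style JDBC URL; the format is the same before and after the move.
        return f"jdbc:mysql://{self.host}:{self.port}/{self.database}"


# Hypothetical settings before the move.
before = DatabaseSettings(host="mysql.example.internal", port=3306, database="alfresco")

# The "configuration switch": only the endpoint changes, nothing else.
after = replace(before, host="aurora-cluster.example.internal")

print(before.jdbc_url())
print(after.jdbc_url())
```

Because the wire protocol and URL format stay the same in this sketch, no application code has to change, which is the point of the blank page.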

The Alfresco team also ran a large-scale benchmark for concurrent load and access (BM4), involving 1.2 billion documents and 500 simulated concurrent users (driven with Selenium) over one hour of constant load.

The system completed more than 15 million transactions, with a load rate of 1200/s and 80% database CPU load during bulk load, and Aurora’s indexes worked efficiently at 3.2 TB. There were no size-related bottlenecks, and John assured his audience that the very same infrastructure could sustain up to 20 billion documents.
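
To put these figures in perspective, here is a small back-of-the-envelope calculation. It rests on two assumptions that the talk summary does not confirm: that the 15 million transactions were all completed within the one-hour run, and that the 1200/s figure is the document ingestion rate during bulk load.

```python
SECONDS_PER_HOUR = 3600
SECONDS_PER_DAY = 86_400

transactions = 15_000_000            # lower bound: "more than 15 million"
test_duration_s = SECONDS_PER_HOUR   # assumed: one hour of constant load
documents = 1_200_000_000            # 1.2 billion documents in the repository
load_rate_per_s = 1200               # assumed: documents loaded per second

# Average transaction throughput over the benchmark window.
avg_tps = transactions / test_duration_s

# Time needed to bulk-load the whole repository at the quoted rate.
bulk_load_seconds = documents / load_rate_per_s
bulk_load_days = bulk_load_seconds / SECONDS_PER_DAY

print(f"average throughput: {avg_tps:.0f} transactions/s")
print(f"bulk load time: {bulk_load_days:.2f} days")
```

Under these assumptions the benchmark averages a little over 4,000 transactions per second, and loading the full data set at the quoted rate would take roughly a week and a half.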

Written by

Alex Casalboni

Alex is a Software Engineer with a great passion for music and web technologies. He's experienced in web development and software design, with a particular focus on frontend and UX.
