This section of the AWS Certified Solutions Architect - Professional learning path introduces you to the AWS database services relevant to the SAP-C02 exam. We then understand the service options available and learn how to select and apply AWS database services to meet specific design scenarios relevant to the AWS Certified Solutions Architect - Professional exam.
Want more? Try a Lab Playground or do a Lab Challenge!
- Understand the various database services that can be used when building cloud solutions on AWS
- Learn how to build databases using Amazon RDS, DynamoDB, Redshift, DocumentDB, Keyspaces, and QLDB
- Learn how to create ElastiCache and Neptune clusters
- Understand which AWS database service to choose based on your requirements
- Discover how to use automation to deploy databases in AWS
- Learn about data lakes and how to build a data lake in AWS
Choosing an AWS database service can be a very overwhelming process. Ultimately, there’s no perfect database out there that supports all data types and access patterns, ACID transactions, and strong consistency, scales limitlessly, maintains itself, and is cost-effective. I keep looking for one, but so far I haven't found it.
However, I have created a list of generic questions that can help you evaluate the AWS database services and narrow the list of options until you select the right one.
The first question you should ask yourself is:
“Do I need a relational database?”
Years ago, the answer to this question would almost certainly have been “yes! Of course, I need a relational database”. However, as more and more companies have journeyed down the quest to create the perfect database, consumers now have more options than ever. And while relational databases still dominate and are present in most major software systems, they’re no longer the only database in most architectures. With the rise of microservices and decoupling, companies are backing their applications with several databases and choosing the right database for the service or feature at hand.
With all of the options, you now have to take a closer look at what a relational database can provide you and if it's necessary for your use case. Let’s look at a use case for an e-commerce site that needs to determine if they’re using a relational or non-relational database.
This e-commerce site needs to store information about customers, their orders, and the products in their inventory. The company is off to a good start because they understand what their data is, including how it’s structured, the size of it, and the velocity of it.
They also can form relationships between this data easily. They know that customers place orders on their products. This lends itself to the table model easily. There’s a table of customer information that references an order ID. This would then point to a table that houses order information. The order table then might reference the product ID, which points to another table that contains information about product inventory. By joining these tables together, it prevents duplication of the data.
Now perhaps this company also has account balances for their customers, and it’s considered a priority to them that those balances always display the most up-to-date amount. In doing this, they want to ensure that if one part of a transaction fails, the whole transaction fails and the database remains unchanged. So, they’re requiring a database that has ACID compliance and strong consistency.
So, does this company need a relational database?
Well, the company understands their data is highly structured. That’s a check
The company requires relationships between their data. Another check.
Is it a benefit that these relationships are enforced by the database in a strict schema? Yes, in this case, it is.
And the company is requiring ACID compliance. Check.
So we can assume that yes, they need a relational database. After that’s decided, they can then ask themselves the next question, which is:
Do you need a managed or an unmanaged database?
You can run almost any database engine on an EC2 instance, and you have complete control and flexibility over the database itself, the operating system, and the software stack. This might be helpful if there are certain database features that aren’t available on an Amazon-managed service for a particular engine you use or if you require access to the underlying infrastructure.
Now the downside of running EC2 is the maintenance. For this option, you’ll need database administrators to handle the backing up, replication, patching, and clustering of the database, which can ultimately drive up the cost of this solution.
If you need a relational database that is unmanaged, then the answer is simple, run the relational engine of your choice, such as SQL Server, MariaDB, MySQL, Oracle or a data warehousing engine on EC2.
If you prefer not to manage the maintenance, backups, replication and patching of the underlying infrastructure, then a managed service would be a better option. In that case, your choices of managed relational databases on AWS come down to two main options: Amazon RDS and Amazon Redshift.
We can further decide which one to use by asking another question:
“Do you need an OLAP or an OLTP database?”
If you need a fully managed relational database that is used for OLAP use cases, then you should use Amazon Redshift. It is a data warehouse service that offers extremely fast query times for petabyte-scale data. Like other relational databases, Redshift is best used when your data is well-defined: you understand its structure, volume, and schema, so that it can be imported into a system that processes and indexes that data for faster queries. And because your data has already been processed and indexed, you can get very fast results for forecasting, reporting and dashboards, or machine learning use cases.
You do have to be mindful with Redshift pricing, as you’re paying for server uptime and not paying by query, so consistency of use is a factor to consider when using this service. The other factor is availability requirements. It’s a single-Region, single-AZ service by default. Although there are ways to set up disaster recovery instances, it can essentially double your cost.
With Redshift, you access your data by connecting using ODBC or JDBC or with the Redshift Query Editor to run SQL commands. If you need a fully managed relational database that is used for OLTP use cases, then you should use Amazon RDS. RDS supports 6 main database engine options, across three different categories:
The open-source options are MySQL, MariaDB, and PostgreSQL. The more commercial engine options are Oracle and SQL server. And then you have the cloud-native option of Amazon Aurora.
“So which engine do you choose?”
This often comes down to four main considerations:
The first is the skills your team has. Perhaps you have team members that have previously built a content management system that relied on a MySQL database, so they can bring that expertise to a new project.
Then there’s developer preference. I had a team member who had a preference for building his proof of concepts with MariaDB. To him, the database was well-understood and since speed was often important in his POCs, he knew that he could get the database up and running quickly. That way, he could move on to more important aspects of the application.
You also have to consider the types of frameworks and languages you’re using. For example, PostgreSQL is very popular among a lot of Python developers. This leads to database adapters, documentation, and support for the Python community that might make configuration and troubleshooting easier.
And last, you have to consider the existing database technology you’re currently using. This is especially common with commercial databases, where organizations have an existing Oracle and SQL server database that runs on-premises they want to migrate to the cloud.
In these cases, they have to be careful with RDS limitations. For example, RDS does not enable access to the underlying operating system or root access. In some cases, you may have legacy applications that need that additional access management to configure your application.
If you wanted this level of access, it used to be that you had to run the database on EC2. Now there’s another option called RDS Custom. With RDS Custom, you can run either an Oracle and SQL Server instance, and have the flexibility to customize patching, install third-party applications, and customize high-availability configurations. Additionally, you’ll have more granular access control such as host-level access and the ability to modify file systems.
Meanwhile, RDS will still take care of the setup, operation, and scaling of your database. This is what I refer to as a “semi-managed” database, where you still want some of the maintenance to be done on your behalf, but need the access and customization flexibility of running an “unmanaged” database.
There are some limitations with using RDS Custom that you’ll want to make sure you read through first. For example, RDS Custom for Oracle does not enable you to configure kerberos authentication, turn on storage auto-scaling, or use the Oracle multitenant architecture. For SQL Server, you can’t use read replicas or modify the storage size after creating the database, or create a multi-AZ deployment. If you need these features, running these engines on EC2 may be the better option.
You’ll also want to factor in your scalability requirements when choosing RDS as well. For MariaDB, MySQL, PostgreSQL, and Oracle on both RDS and RDS Custom, you can have up to 64 TiB of storage. For SQL Server instances on both RDS and RDS Custom, you can have up to 16 TiB of storage.
Now the last engine to talk about is Amazon Aurora, which is Amazon’s proprietary database engine. Aurora is a PostgreSQL and MySQL compatible database. Using Aurora instead of RDS MySQL or RDS PostgreSQL is said to provide greater performance and durability for your data because of its architecture. While RDS follows the traditional database model with primary and secondary databases plus up to 5 read replicas for every engine, Aurora decouples compute and storage, enabling the storage tier to span across six nodes in three separate AZs by default. Unlike RDS, it supports multi-master capabilities and up to 15 read replicas. It also supports faster scaling, automatic storage scaling by default, and faster failover.
Because of these features, it’s generally considered the best choice for relational OLTP workloads that operate at a massive scale with consistent traffic patterns. The maximum storage limitation for both MySQL and PostgreSQL on Aurora is 128 TB of storage. However, some engine versions can only scale to 64 TB.
If you don’t have consistent traffic, but still require an enterprise-level database that can operate at large scale, you should consider aurora serverless. This provides all the advantages of regular Amazon Aurora, but you don’t have to pay for uptime when the database is sitting idle not serving requests. Instead, it will automatically stop and start compute nodes to meet the needs of your traffic patterns.
In summary, if you need an OLAP data warehouse, Amazon Redshift is your best choice. If you need a fully-managed OLTP database that runs MariaDB, MySQL, PostgreSQL, Oracle, or SQL server, then choose Amazon RDS. If you need a semi-managed OLTP database that runs Oracle and SQL server, then run Amazon RDS custom. If RDS or RDS Custom have limitations on features that your workload requires, use EC2. If you need a fast database that operates at massive scale with consistent traffic, use Aurora. If you need a fast database that operates at massive scale, but you have bursty traffic - use Aurora serverless. That’s all for this one - see you next time.
Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.