Amazon Neptune


Course Introduction
RDS vs. EC2
RDS vs. EC2
DynamoDB Accelerator

The course is part of this learning path

Start course
4h 21m

This section of the AWS Certified Solutions Architect - Professional learning path introduces you to the AWS database services relevant to the SAP-C02 exam. We then understand the service options available and learn how to select and apply AWS database services to meet specific design scenarios relevant to the AWS Certified Solutions Architect - Professional exam. 

Want more? Try a Lab Playground or do a Lab Challenge

Learning Objectives

  • Understand the various database services that can be used when building cloud solutions on AWS
  • Learn how to build databases using Amazon RDS, DynamoDB, Redshift, DocumentDB, Keyspaces, and QLDB
  • Learn how to create ElastiCache and Neptune clusters
  • Understand which AWS database service to choose based on your requirements
  • Discover how to use automation to deploy databases in AWS
  • Learn about data lakes and how to build a data lake in AWS

Hello, and welcome to this lecture on Amazon Neptune. Amazon Neptune may not be as widely utilized as perhaps Amazon RDS or Amazon DynamoDB, simply due to what it is designed for. Amazon Neptune is a fast, reliable, secure, and fully managed graph database service.

For those who are unfamiliar with what a graph database is useful, they are essentially used to help you both store and navigate relationships between highly connected data which could contain billions of separate relationships. As a result, graph databases are ideal if their focus is on being able to identify these relationships of interconnected data, rather than the actual data itself. Trying to perform queries against complex relationships will be very difficult in a normal relational database model. And so graph databases are recommended in this scenario instead.

Before I continue, let me run through some of the use cases to help you understand when and where you might use Amazon Neptune as a graph database to solidify the importance of being able to query complex relationships.

Social networking. Graph databases are a powerful asset when used within a social networking environment. As you can imagine, there are vast webs of tightly networked data that run across social networking platforms, and understanding these relationships and being able to query against them is vital to being able to build and maintain effective social network applications. For example, being able to present the latest feed updates to your end users with relevant news from all the groups that they belong to can be easily generated using graph databases, thanks to the highly scalable and enhanced performance of Amazon Neptune.

Fraud detection. Security should always be a number one priority in any cloud deployment solution, and using Amazon Neptune can help you from a security standpoint, using its high performance capabilities. If you are carrying out financial transactions within your environment, then you can build applications that allow Neptune to analyze the financial relationships of transactions to help you detect potential fortunate activity patterns in near real time response times. For example, you might be able to detect that multiple parties are trying to use the same financial details, all from various different locations.

Recommendation engines. Recommendation engines are widely used across many different websites, largely, eCommerce sites that recommend products based upon your search and purchase history. Using Neptune as a key component within your recommendation engine allows it to perform complex queries based upon various different activities and operations made by the user that will help determine recommendations of what your customer may like to purchase next.

I've simply highlighted some of the common scenarios where you might use Amazon Neptune within your solutions. However, there are many, many more use cases available that focus on the relationships between vast amounts of highly interconnected data sets.

Now we have more of an understanding of when and where you might use Amazon Neptune, let's take a look at some of its components.

Amazon Neptune uses its own graph database engine and supports two graph query frameworks. These being Apache Tinkerpop Gremlin, and this allows you to query your graph running on your Neptune database, using the Gremlin traversal language. And we have the Worldwide Web Consortium Sparql. The Sparql query language has been designed to work with the internet and can be used to run queries against your Neptune database graph.

When creating your Amazon Neptune database, you must create a name for your Neptune database cluster, but what is a cluster?

An Amazon Neptune database cluster is comprised of a single, or if required, multiple database instances across different availability zones, in addition to a virtual database cluster volume which contains the data across all instances within the cluster. The single cluster volume consists of a number of Solid State Discs, SSDs. As your graph database grows, your shared volume will automatically scale an increase in size as required to a maximum of 64 terabytes.

To ensure high availability is factored into Neptune, each cluster maintains a separate copy of the shared volume in at least three different availability zones. This provides a high level of durability to the data. 

From a storage perspective, Amazon Neptune has another great feature to help with the durability and reliability of data being stored across your shared cluster, this being Neptune Storage Auto-Repair.

Storage Auto-Repair will automatically find and detect any segment failures that are present in the SSDs that make up the shared volume, and then automatically repair that segment using the data from the other volumes in the cluster. This ensures that the data loss is minimized and the need to restore from a failure is drastically reduced.

Similarly to other AWS database services, Amazon Neptune also has the capability to implement and run replica instances. If replicas are used, then each Neptune cluster will contain a primary database instance, which will be responsible for any read and write operations. The Neptune replicas, however, are used to scale your read operations, and so support read-only operations to the same cluster volume that the primary database instance connects to. As the replicas connect to the same source data as the primary, any read query results served by the replicas have minimal lag, typically less than a 100 milliseconds after new data has been written to the volume.

A maximum limit of 15 replicas per crust exists which can span multiple availability zones. And this ensures that should have failure occur in the availability zone hosting the primary database, one of the Neptune read replicas in a different AZ will be promoted to the primary database instance, and adopt both read and write operations. This process usually takes about 30 seconds.

Data is synchronized between the primary database instance and each replica synchronously. And in addition to providing a failover to your primary database instance, they offer support to read only queries. These queries can be served by your replicas, instead of utilizing resources on your primary instance. When you have created your Amazon Neptune database, you need to understand how to connect to it, and this is achieved through endpoints. An endpoint is simply a URL address and a port that points to your components. There are three different types of Amazon Neptune endpoints, these being cluster endpoint, reader endpoint, and instance endpoint.

Let me take a quick look at each of these endpoints individually, starting with a cluster endpoint. For every Neptune database cluster that you create, you will have a cluster endpoint, and this points directly to the current primary database instance of that cluster. This endpoint should be used by applications that required both read and write access to the database. Earlier, I explained that if a primary instant fails and you have a read replica available, then Neptune will automatically failover to one of these replicas and act as the primary, providing read and write access. When this happens, the cluster endpoint will then point to the new primary instance without any changes required by your applications accessing the database.

Reader endpoints. As you might expect by the name, this endpoint is purely used to connect to any read replicas you might have configured. This is used to allow applications to access your database on a read only basis for queries. Only a single reader endpoint exists, even if you have multiple read replicas. Connection served by the read replicas will be performed on a round-robin basis, and it's important to point out that the endpoint does not load balance your traffic in any way across the available replicas in your cluster.

Instance endpoints. For every instance within your cluster, including your primary and read replica instances, they will each have their own unique instance endpoint that will point to itself. This allows you to direct certain traffic to specific instances within the cluster. You might want to do this for load balancing reasons across your applications reading from your replicas.

About the Author
Learning Paths

Danny has over 20 years of IT experience as a software developer, cloud engineer, and technical trainer. After attending a conference on cloud computing in 2009, he knew he wanted to build his career around what was still a very new, emerging technology at the time — and share this transformational knowledge with others. He has spoken to IT professional audiences at local, regional, and national user groups and conferences. He has delivered in-person classroom and virtual training, interactive webinars, and authored video training courses covering many different technologies, including Amazon Web Services. He currently has six active AWS certifications, including certifications at the Professional and Specialty level.