Introduction to Cosmos DB
Introduction to Using Cosmos DB
Introduction to Creating an App with Cosmos DB
Cosmos DB is one of many database solutions in a crowded market. From DynamoDB to Cassandra to CockroachDB, the questions one would naturally ask when examining Cosmos DB are, “what makes it special, and how can I get started with it?”
This course answers both of those questions as thoroughly and concisely as possible. This course is for anyone with an interest in database technologies or creating software using Microsoft Azure. Whether you are a DevOps engineer, a database admin, a product manager, or a sales expert, this course will help you learn how to power your technology with one of Azure's most innovative data solutions.
From this course, you will learn how to make use of Cosmos DB's incredible flexibility and performance. This course is made up of nine comprehensive lectures to guide you from the basics to how to create your own app using Cosmos DB.
- Learn the basic components of Cosmos DB
- Learn how to use Cosmos DB via the web portal, libraries, and CLI tools
- Learn how to create an application with Cosmos DB as the data backend
- People looking to build applications using Microsoft Azure
- People interested in database technologies
- General knowledge of IT architecture
- General knowledge of databases
For this section we're going to focus on Cosmos DB's unique capabilities. We are not going to cover literally every single thing Cosmos DB can do. There is a lot of overlap with other database technologies, and quite frankly, if you just want a feature list, you're better off reading the documentation, linked below. Instead, in this lesson, we're going to focus on the six most important and compelling capabilities unique to Cosmos DB. Those six features are, in order, global distribution of data, serverless architecture, multi-model support, throughput and consistency guarantees, partitioning, and security.
Let's start by talking about Cosmos DB's multi-region support. This is one of the main reasons enterprises choose to use Cosmos DB: it is designed from the ground up to support access patterns from all over the planet. With over 50 geographic locations for its data centers, Cosmos DB users can ensure minimal latency for their users. What's more, new locations are added regularly. Multi-region logic is deeply integrated into the Cosmos DB service. As the user, you can associate as many geographic regions with your data as you want.
You can tune consistency levels for read and write operations to improve availability or data precision. You can set entire regions as read only, write only, or read-write. Furthermore, you get built-in failover that lets you set priorities for each region, so you can decide what happens if one of your US data centers goes down, for example. You can plan for exactly which regions take precedence and how you will recover. In section two, we'll go into how all of these geolocation features are used via the Cosmos DB REST API and web console.
The next key thing to introduce is Cosmos DB's serverless architecture. As described previously, Cosmos DB is an example of a database as a service. You do not set up database servers and manage them; instead, you just get an endpoint for your app to use. The currency for making use of the Cosmos DB endpoint is known as request units. These determine how much you pay and what sort of performance guarantees you can get. The larger your data, the more frequent your queries, the more indexing you do, and the more consistency you demand, the larger the number of request units you will need. The nice thing about this system is that it greatly simplifies your data layer. You don't need to think about memory, CPU, hardware provisioning, OS optimization, updates and patches, SSL certs, et cetera. All of this operational overhead of managing a database is gone. The time saved alone could translate into more than enough cost savings to offset the cost of the needed request units. Cosmos DB's serverless architecture also ensures strong SLAs. You get a guarantee of 99.999% uptime, far better than what a typical tech company achieves on its own. You'll also get first-class integration with other Azure services and great support. So let's move on and talk about the multi-model data support.
This, I think, is perhaps Cosmos DB's most intriguing feature. Cosmos DB offers APIs for Cassandra, Gremlin, MongoDB, SQL, and Table (key-value). This means that with a single Cosmos DB account, you can run multiple database engines. So if a portion of your data is best suited for Cassandra, you can set up Cassandra keyspaces. And then if a subset of your data needs a document paradigm, you can use Mongo. If you happen to need a graph database and a relational database as well, you can add those too using Gremlin and SQL. This means you have the flexibility of using the right data model for the job. It also means you can migrate an existing heterogeneous data architecture into Cosmos DB with little hassle. Now, there are some important trade-offs and restrictions that come from using multiple APIs. We'll dig into those in later sections.
Next, let's talk about throughput and consistency. As far as the CAP theorem goes, Cosmos DB is very strong on partition tolerance and availability. Like Cassandra, consistency is tunable, and throughput is a function of your request units and consistency settings. Cosmos DB features five different consistency settings.
In order from strongest to weakest guarantees, they are strong, bounded staleness, session, consistent prefix, and eventual. If you need to ensure that the most recent data is always read, choose the strong consistency level. It ensures that no reads are processed until writes are completed durably by a quorum of replicas. With bounded staleness, you get a configurable level of consistency. Reads will lag behind writes by either an adjustable time interval or a number of item revisions. Then there's the session consistency level. This gives you a read-your-own-writes guarantee, suitable for scenarios where you need guarantees at the level of individual clients. It's considerably cheaper than the bounded staleness and strong consistency levels, but you get no consistency guarantee outside of an individual client session. And then lastly you have the consistent prefix and eventual consistency levels. Both of these guarantee that your data will eventually converge to the most recently written value. With consistent prefix, you at least get an additional guarantee that data will never be out of order. So even if you don't get the most recent data on read, you can at least be sure that you're not skipping over data inadvertently. Both of these consistency levels allow for fast throughput and are relatively inexpensive. The more inconsistency you can tolerate, the more money you can save on request unit usage.
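To make the five consistency levels a little more concrete, here is a minimal toy sketch in Python. It is not how Cosmos DB is implemented and it does not use any Cosmos DB library; the class ToyReplicatedStore, its methods, and the session_token and max_lag parameters are all hypothetical names invented for illustration. It models one ordered log of writes and a single replica that lags behind, and shows which value a read could return under each level.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class ToyReplicatedStore:
    """Toy model (an assumption, not Cosmos DB internals): one write log, one lagging replica."""

    log: List[str] = field(default_factory=list)  # every committed write, in order
    replica_applied: int = 0  # how many log entries the replica has applied so far

    def write(self, value: str) -> int:
        """Append a write and return its position, used here as a stand-in session token."""
        self.log.append(value)
        return len(self.log)

    def replicate(self, count: int = 1) -> None:
        """Let the lagging replica catch up by `count` writes."""
        self.replica_applied = min(len(self.log), self.replica_applied + count)

    def read(self, level: str, session_token: int = 0, max_lag: int = 0) -> Optional[str]:
        """Return the value a reader would see under the given consistency level."""
        latest = len(self.log)
        if level == "strong":
            visible = latest  # always the most recent write
        elif level == "bounded_staleness":
            # May lag, but by no more than `max_lag` versions.
            visible = max(self.replica_applied, latest - max_lag)
        elif level == "session":
            # At least as new as this client's own last write.
            visible = max(self.replica_applied, session_token)
        elif level in ("consistent_prefix", "eventual"):
            # Whatever the replica has. This toy applies writes in order, so it
            # cannot show the out-of-order reads that eventual consistency allows.
            visible = self.replica_applied
        else:
            raise ValueError(f"unknown consistency level: {level}")
        return self.log[visible - 1] if visible > 0 else None


if __name__ == "__main__":
    store = ToyReplicatedStore()
    store.write("v1")
    store.replicate()              # the replica has seen v1 only
    store.write("v2")
    my_token = store.write("v3")   # this client wrote v3

    print(store.read("strong"))                             # v3
    print(store.read("bounded_staleness", max_lag=1))       # v2
    print(store.read("session", session_token=my_token))    # v3
    print(store.read("session"))                            # v1 (a different client)
    print(store.read("eventual"))                           # v1
```

The point of the sketch is only the ordering of guarantees: the weaker the level, the more the read is allowed to lean on a replica that has not caught up yet, which is also why weaker levels are cheaper to serve.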
Let's now talk about partitioning and indexing a bit. In Cosmos DB, there are physical partitions, which comprise hardware resources (SSD storage, CPU, memory), and logical partitions, which are a subset of the physical ones. In other words, a physical partition may be made up of several logical partitions. The basic abstraction for sets of data is a container. A container in Cosmos DB can span multiple physical partitions and will be responsible for storing your collections, graphs, SQL tables, et cetera. Every document in Cosmos DB is uniquely identifiable by the combination of its partition key and row key. The partition key, specifically, acts as a logical partition for your data and helps to create boundaries that enable Cosmos DB to map data to specific physical resources. In Cosmos DB, the data for a single logical partition must reside on a single physical partition. When designing collections of data, two critical things to think about are partition key and indexing. On the latter point, Cosmos DB automatically indexes all of your data. However, it's possible to create custom indexing policies that let you tune trade-offs between query throughput and consistency. Regarding partition keys, however, you'll have to think carefully about the nature of your data to decide on a proper partition key. People coming from the Cassandra world will have some good intuition here.
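As a rough illustration of the logical versus physical partition relationship just described, here is a small Python sketch. It is an assumption-laden toy: the partition count, the function names physical_partition_for and place_items, the customerId field, and the hash-then-modulo placement are all made up for this example and are a simplification, not Cosmos DB's actual placement logic. What it does show is the rule from above: every item sharing a partition key lands on the same physical partition, and one physical partition can hold several logical ones.

```python
import hashlib
from collections import defaultdict

PHYSICAL_PARTITIONS = 4  # hypothetical number of physical partitions


def physical_partition_for(partition_key: str, n: int = PHYSICAL_PARTITIONS) -> int:
    """Map a partition key to a physical partition with a stable hash (toy scheme)."""
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % n


def place_items(items, key_field, n: int = PHYSICAL_PARTITIONS):
    """Group items as physical partition -> logical partition (key value) -> items."""
    placement = defaultdict(lambda: defaultdict(list))
    for item in items:
        key = str(item[key_field])
        placement[physical_partition_for(key, n)][key].append(item)
    return placement


if __name__ == "__main__":
    # Hypothetical documents partitioned by a made-up "customerId" field.
    docs = [{"id": i, "customerId": f"customer-{i % 5}"} for i in range(20)]
    for physical, logical in sorted(place_items(docs, "customerId").items()):
        print(physical, {key: len(found) for key, found in logical.items()})
```

Running it prints, for each physical partition, the logical partitions it holds and how many items each contains. No logical partition ever appears under two physical partitions.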
The main thing you want is a column with high cardinality, meaning a large variety of values, to help distribute your workloads evenly. See Azure's documentation on best practices for more details. We will cover both indexing and partition key selection in more depth in section three of this course, when we get into the practical application.
And finally, let's talk a little about security. As a cloud-based service, Cosmos DB has many of the same security considerations as any other provider. You need to control who has access to your Azure account and who has credentials to use the API, and ensure that sensitive data is properly isolated. The nice thing about Cosmos DB, though, is that it has very sane defaults when it comes to security. Without you taking any action at all, Cosmos DB encrypts all data both at rest and in transit. Cosmos DB supports HTTPS/TLS for all client-to-server interactions. It also includes two different types of credentials for different use cases: master keys and resource tokens.
The former are useful for administrators who need to make significant changes, while the latter can be used by clients with narrower needs.
So there you have it. Consider this your crash course in the world of Cosmos DB. For a more detailed breakdown, please take a look at the Cosmos DB documentation; it will answer many of your questions. Our goal in the following two sections is to go beyond the documentation and get you to actually use Cosmos DB on your own. In section two, we'll dig into how to set up and use the Cosmos DB features that were described here, and in section three we'll walk through creating an actual software service. Good luck and see you there.
Jonathan Bethune is a senior technical consultant working with several companies including TopTal, BCG, and Instaclustr. He is an experienced devops specialist, data engineer, and software developer. Jonathan has spent years mastering the art of system automation with a variety of different cloud providers and tools. Before he became an engineer, Jonathan was a musician and teacher in New York City. Jonathan is based in Tokyo where he continues to work in technology and write for various publications in his free time.