Azure DocumentDb: In today’s database wars, we often find developers and database administrators on opposite sides. Developers often experiment with new libraries, software, and methodologies to write simpler and elegant code, especially when talking about data management, and tend to throw away legacy products quite easily. On the other side, database administrators with skills in the relational space tend to be more conservative and don’t believe that there could be something different from their beloved SQL language.
However, there are some good reasons for embracing the new opportunities presented by new NoSQL databases. In this post, we’d like to clear up some of the biggest misconceptions about NoSQL databases and introduce how developers can use document-based databases, specifically, Microsoft Azure’s DocumentDb.
ERP, Accounting and the relational approach
The 1980s were the pioneering era of Personal Computing (PC), the period when computers first began to show their potential. In the enterprise world, the PC empowered the daily routines of every office by structuring and automating many daily organizational tasks, often through applications such as Enterprise Resource Planning (ERP) and Accounting.
If the ’80s were about PCs, the ’90s were the decade of connecting computers within the enterprise using a LAN. Despite the introduction of the Web and the browser, ERP and Accounting are essentially the same as they were 30 years ago.
Today, we make relational databases responsible for data integrity by giving them the task of structuring data in normalized tables and relations. We assume that if data is contained in a relational database, we will always have full access to it to do whatever we want. In reality, if we could check the implementations of many applications, we would find that the database is not really responsible for data correctness and integrity. Developers tend to use different code for this. But I don’t want to discuss how a relational database can be used to do what it is supposed to do. I don’t want to move ERP into the NoSQL world because I believe that the relational model is better for ERP. I think that this is also true for Accounting software.
Instead, I want to talk about a sort of laziness in how developers approach new technologies and the feeling that we are moving away from relational databases.
The Internet Era
The biggest misconceptions about databases come from the Internet era. It is not about ERPs and Accounting software, which are fine just the way they are. Instead, many of today’s misconceptions about databases come from a whole new class of applications that are being addressed by a combination of new (web-based) and old (relational database) technologies. These applications are still useful today, but there are other possibilities.
For example, suppose you want to implement a (now) classic e-commerce website and you want to design for the purchased order. You create a flash sale event that offers a 50% discount for purchases within an hour. Because you sell worldwide, you could receive up to 10 times the normal number of orders during this time.
You have two options. If this is a sporadic event, you can choose to scale up the database server to handle the increased requests. You will pay more for it, but the success of the campaign should cover the additional expense.
You can also decide to schedule this type of campaign more frequently over time. To optimize the expenses of scaling up, you need to think of a better way for your application to handle this scenario.
Modeling database for internet purchase orders
With a relational database, this is one way to model the purchase order database:
You have an InternetOrders entity table that represents the order, with order total and other order-related information. It refers to a Customer table that represents the user that made the orders. Another table, InternetOrderDetails, contains the rows that represent all products purchased inside a single order, with quantity and price. Each detail references a Products table with unit price and other information related to the order.
To carry out an order, you need to perform (n+1) insertions in tables and (n+1) check constraints to verify that the data is integral. Once a user has submitted the checkout command from the web server, all operations are executed on the backend. Then, the backend service will carry the order to the relational database. This is just a simplification of the process, as a real world order would require additional operations. Still, you can perceive that in a flash sale, the number of database operations will greatly increase and the database operation will suffer.
We can try to develop some hypotheses as to why this is the case. We develop frontend and backend applications where separation is about organizational needs. Therefore, we have control over the data. We already have correct Customer and Product catalogs. We also need to include the current customer and product information inside an InternetOrder, as they can change over time. However, for traceability reasons, we need to do so at the moment of the order. For example, if Customer information changes, we don’t need to modify past orders.
In this scenario, if we use a JSON representation, we can write an InternetOrder in the following way:
Suppose you have a database that can save this document in a single write without separating every single entity (represented by the curly braces) in distinct tables and different insert operations. Also, suppose that you need a daily count of how many T-shirts you sold on a previous day. To do so, you need to aggregate orders on a product basis and discover the total quantity and costs, but you would do it the day after, not while customers are placing their orders. This is what NoSQL databases are good at.
Introducing NoSQL Databases
They use JSON format to persist data in an aggregated format. If the row inside a relational table is an entity, the Data Transfer Object (DTO) is the data aggregation format that JSON prefers.
NoSQL is developed with scalability in mind to optimize the ingestion of web data. NoSQL supports eventual consistency, which means that queries will be consistent with write in a non-predictable time. It is substantially different from the ACID consistency that relational databases implement. You have an unpredictable amount of time when queries will be consistent with writes.
This happens in many scenarios. Think about the Internet of Things. You need to collect lots of data from connected devices all around the world, but you don’t need to analyze it in real time; you don’t need precise data immediately, but perhaps just some statistics. Essentially, you need to be able to reliably collect all data, without any loss.
We say that NoSQL databases are partition-aware. They can subdivide the entire data store into partitions and each one can be distributed in different hosts to scale out the ingestion capacity. This is quite different from relational databases that must scale up because they cannot create partitions to support distributed writes.
That is what NoSQL databases are good at, but it also requires some operational skills, as partitioning implies handling a complex service infrastructure.
In general, developers are afraid to approach a database because they cannot reuse the SQL skills that they probably learned at school. Developers are also focused on code, so they don’t like thinking about operational issues such as distributed services and partitioning. However, they do appreciate a schema-less Db that is resilient to iterative schema changes and that promotes code-first development, without using Object to Relational Mappers.
Introducing Azure DocumentDb and its SQL language
There are many open source implementations of NoSQL databases, which are created as open source contributions. Microsoft has implemented NoSQL databases only in the Azure space; otherwise, they only offer hosted implementations and no on-premise, packaged NoSQL databases.
Azure DocumentDb is the proposition of a NoSQL, JSON document-based database. As a hosted solution, you can interact with it only using the public REST API, other than the Portal. It is a highly reliable service, as Azure DocumentDb is natively deployed with replica support. Every write performed on the primary database is replicated on two secondary distributed replicas, in different fault domains. Every read is distributed over the three copies to avoid bottlenecks. To ensure performance, DocumentDb databases are hosted on machines with SSD drives.
To interact with DocumentDb from the Azure Portal, you need to create a Database Account:
From the Portal, select DocumentDb, where it is a DocumentDb Account:
To create a new account, we need to specify a name, a resource group, and a location. A database account is mainly two things:
- unit of authorization
- unit of consistency
As a unit of authorization, a database account provides two sets of keys. A master key is necessary to perform all read/write operations to the database account. A read-only key can be used to perform read-only queries.
As a unit of consistency, as we can select a global consistency for all databases and collections that it contains. You can select four levels of consistency:
- Strong: The client always sees completely consistent data. It is the consistency level with the slowest reads/writes, but it can be used for critical applications like the stock market, banking, and airline reservations.
- Session: The default consistency level: A client reads its own writes, but other clients reading this same data might see older values.
- Bounded Staleness: The client might see old data, but it can specify a limit for how old that data can be (ex. 2 seconds). Updates happen in the order that they are received. It is similar to the Session consistency level, but speeds up reads while still preserving the order of updates.
- Eventual: The client might see old data for as long as it takes a write to propagate to all replicas. This is for high performance and availability, but a client may sometimes read out-of-date information or see updates that are out of order.
Consistency may also be specified at the query level. Future releases will update this area.
A database account is not a database, while a database account can contain multiple databases. With a recent October 2016 update, portal presents an interface that invites users to create collections and not a database. This is because the notion of a database inside DocumentDb is a sort of placeholder, a namespace. It is now included inside the creation of collections.
Azure DocumentDb Collections
If we have a relational database in mind, we may recall that we can store entities as rows inside tables. But collections are very different.
A table is a static container of homogenous and flat entities. A table ensures that every flat entity is composed of scalar properties as it cannot contain more complex structures, and it guarantees that all entities have the same properties, by name and by type. To create more complex schemas, a relational database uses multiple tables, relations, and referential integrity. Different entities are contained in different tables.
In DocumentDb, collections are schema-less containers of JSON documents. This means that a single collection can contain multiple different entities as it does not ensure any schema. DocumentDb cannot ensure that two distinct instances of the same kind of document, say an Internet Order, contain the same properties because of an error or specification updates.
Multiple collections inside a database are necessary, as a collection is both a unit of partitioning and a unit of throughput. It is a unit of partitioning because it can contain no more that 10Gb of data; with more than 10Gb of data, you must create multiple partitions. It is a unit of throughput because it can perform a limited number of operations. Operations are measured in terms of Resource Units (RU) and each operation can consume a different quantity of RU. Every collection has a quota of Request Units per second that it can spend in operations. This number is selected where a collection is created and, in some situations, can change if needed. This is a sample reference table containing costs in terms of operations performed on the collection:
|Operation||RU Consumed (estimated)|
|Reading a single 1KB document||1|
|Reading a single 2KB document||2|
|Query with a simple predicate for a 1KB document||3|
|Creating a single 1 KB document with 10 JSON properties (consistent indexing)||14|
|Create a single 1 KB document with 100 JSON properties (consistent indexing)||20|
|Replacing a single 1 KB document||28|
|Execute a stored procedure with two create documents||30|
These are only reference values because they depend on the real size of the document handled. An available web tool can help you more precisely calculate expenses to test with your document.
If in a second you try making requests for more than, say, 2500RUs, DocumentDb throttles the request. This is due to the multitenancy nature of DocumentDb. It must guarantee the performance declared when you create a collection; no more, no less. Another quota relates to size: A collection cannot be more than 10Gb. To use more space, you need to create other collections. This is why a collection is a unit of partitioning: You need to create multiple collections to handle all of the space you may need, and then you need to specify how to distribute documents over multiple collections. There are some recent updates in this field because you can use a client-side partition resolver in every query. As of late, you can also support server-side partitioning with partitioned collections.
To create a collection, go to the DocumentDb account:
To create our first documents, we can use Document Explorer (available under the left side navigation menu):
With Document Explorer, you can interactively write JSON documents directly inside a collection.
Here are a few additional details about writing inside of a collection:
- JSON syntax uses curly brackets for objects and squared brackets for arrays.
- IDs are typically GUIDs as there is no tool that can guarantee any number sequence.
- Information is contained inside the object, so for example, you do not need to connect customers or products to decode the product or customer name.
- customer_id and product_id probably refer to other documents: This means that information is not always embedded. Information may be embedded or referenced.
- Calculations can be embedded inside the document.
- Id is always automatically updated as “id” property if not otherwise specified. “id” must be unique inside of a collection.
Now we can try experimenting with Query Explorer. When you click Query Explorer, this opens a new blade that shows the query and the collection over which it is executed. Note that the query is executed inside a collection: It has no knowledge about other collections, so all documents must be contained in that collection.
Also, note that:
- You express a query with a SQL-like language.
- Results are always a JSON array.
- A query is “charged” inside a Request Unit quota up to the collection.
We can perform a more complex query if we look inside inner collections. For example, we can make a join because we need to create a different query focused on order item and not on the order. We want to create a query that lists ordered products independently from the order where they were defined.
DocumentDb is a great opportunity to approach a new database genre because its friction-less approach for developers uses objects that natively map with JSON and can be queried with an SQL-like language. DocumentDb is tailored for new Web and cloud-scale scenarios and for new cloud solutions that use multiple data sources.
In future articles, we’ll explore partitioning in more detail using DocumentDb from the REST API and Sdks.