Azure DocumentDb for the Enterprise Market

In today's database wars, developers and database administrators often find themselves on opposite sides. Developers experiment with new libraries, tools, and methodologies to write simpler, more elegant code, especially when it comes to data management, and they tend to throw away legacy products quite easily. Database administrators with skills in the relational space, on the other hand, tend to be more conservative and find it hard to believe that anything could replace their beloved SQL language.

However, there are some good reasons for embracing the opportunities presented by NoSQL databases. In this post, we'd like to clear up some of the biggest misconceptions about NoSQL databases and show how developers can use a document-based database, specifically Microsoft Azure's DocumentDb.

ERP, Accounting and the relational approach

The 1980s were the pioneering era of the Personal Computer (PC), the period when computers first began to show their potential. In the enterprise world, the PC empowered every office by structuring and automating many daily organizational tasks, often through applications such as Enterprise Resource Planning (ERP) and Accounting.
If the '80s were about PCs, the '90s were the decade of connecting computers within the enterprise over a LAN. Despite the introduction of the Web and the browser, ERP and Accounting software is essentially the same as it was 30 years ago.

Today, we make relational databases responsible for data integrity by giving them the task of structuring data into normalized tables and relations. We assume that if data is contained in a relational database, we will always have full access to it and can do whatever we want with it. In reality, if we could inspect the implementations of many applications, we would find that the database is not really responsible for data correctness and integrity: developers tend to enforce these rules in application code instead. But I don't want to discuss whether relational databases are used for what they are actually designed to do. Nor do I want to move ERP into the NoSQL world, because I believe the relational model is the better fit for ERP, and I think the same is true for Accounting software.

Instead, I want to talk about a certain laziness in how developers approach new technologies, and about the sense that, for a whole class of applications, we are moving away from relational databases.

The Internet Era

The biggest misconceptions about databases come from the Internet era. They are not about ERP and Accounting software, which are fine just the way they are. Instead, they come from a whole new class of applications that are being addressed by a combination of new (web-based) and old (relational database) technologies. These applications still work today, but there are other possibilities.

For example, suppose you want to implement a (by now) classic e-commerce website and you have to design the purchase order. You create a flash sale event that offers a 50% discount on purchases completed within an hour. Because you sell worldwide, you could receive up to 10 times the normal number of orders during that window.
You have two options. If this is a sporadic event, you can choose to scale up the database server to handle the increased load. You will pay more for it, but the success of the campaign should cover the additional expense.
If, instead, you decide to run this type of campaign more frequently over time, you need to think of a better way for your application to handle this scenario and keep the cost of scaling up under control.

Modeling the database for internet purchase orders

With a relational database, this is one way to model the purchase order database:
[Figure: relational model of an internet order]
You have an InternetOrders table that represents the order, with the order total and other order-related information. It references a Customer table that represents the user who placed the order. Another table, InternetOrderDetails, contains the rows that represent all products purchased within a single order, with quantity and price. Each detail row references a Products table with the unit price and other product-related information.
To record an order with n detail rows, you need to perform (n+1) insertions across these tables, each one validated by check and referential constraints to keep the data consistent. Once the user submits the checkout command, the request reaches the web server, the operations are executed on the backend, and the backend service writes the order to the relational database. This is just a simplification of the process, as a real-world order would require additional operations. Still, you can see that during a flash sale the number of database operations greatly increases and database performance will suffer.
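As a rough sketch, and only as an illustration (the table and column names here are assumptions, not taken from the original diagram), the model above could be expressed in SQL like this:

    -- Illustrative schema: names and columns are assumptions for this example.
    CREATE TABLE Customer (
        CustomerId  INT PRIMARY KEY,
        Name        NVARCHAR(100) NOT NULL
    );

    CREATE TABLE Products (
        ProductId   INT PRIMARY KEY,
        Name        NVARCHAR(100) NOT NULL,
        UnitPrice   DECIMAL(10,2) NOT NULL
    );

    CREATE TABLE InternetOrders (
        OrderId     INT PRIMARY KEY,
        CustomerId  INT NOT NULL REFERENCES Customer(CustomerId),
        OrderDate   DATETIME NOT NULL,
        OrderTotal  DECIMAL(10,2) NOT NULL
    );

    CREATE TABLE InternetOrderDetails (
        OrderId     INT NOT NULL REFERENCES InternetOrders(OrderId),
        ProductId   INT NOT NULL REFERENCES Products(ProductId),
        Quantity    INT NOT NULL,
        Price       DECIMAL(10,2) NOT NULL,
        PRIMARY KEY (OrderId, ProductId)
    );

    -- Checking out an order with n detail rows means one insert into
    -- InternetOrders plus n inserts into InternetOrderDetails, each one
    -- validated against its foreign keys: the (n+1) writes described above.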

We can try to develop some hypotheses about why this is the case. We build frontend and backend applications in which the separation reflects organizational needs, so we have full control over the data, and we already maintain correct Customer and Product catalogs. We also need to copy the current customer and product information into an InternetOrder, because that information can change over time and, for traceability reasons, we need to record it as it was at the moment of the order. If customer information changes later, for example, we don't need to modify past orders.
In this scenario, if we use a JSON representation, we can write an InternetOrder in the following way:
[Figure: an InternetOrder represented as a JSON document]
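The original figure is not reproduced here, so the following is only an illustrative sketch of such a document; the property names and values are assumptions made for this post:

    {
      "id": "f3b1d2c4-9a7e-4c1b-8d2f-0a5e6b7c8d9e",
      "order_date": "2016-10-01T10:30:00Z",
      "customer_id": "c-1001",
      "customer_name": "John Doe",
      "lines": [
        {
          "product_id": "p-2001",
          "product_name": "T-shirt",
          "unit_price": 10.0,
          "quantity": 2,
          "line_total": 20.0
        },
        {
          "product_id": "p-2002",
          "product_name": "Cap",
          "unit_price": 5.0,
          "quantity": 1,
          "line_total": 5.0
        }
      ],
      "order_total": 25.0
    }

Note how the customer name and the product details are embedded as they were at the moment of the order, while customer_id and product_id still reference the catalog documents.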
Suppose you have a database that can save this document in a single write, without splitting every entity (each represented by a pair of curly braces) into distinct tables with separate insert operations. Suppose, also, that you need a daily count of how many T-shirts you sold the previous day. To get it, you aggregate orders on a per-product basis to find the total quantities and revenue, but you do it the day after, not while customers are placing their orders. This is what NoSQL databases are good at.
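Jumping ahead to the SQL-like query language introduced later in this post, and reusing the illustrative property names above, the next-day report could be sketched like this (aggregate functions such as SUM were added to the service after this article was written, so the same total can also be computed client-side):

    SELECT VALUE SUM(l.quantity)
    FROM o
    JOIN l IN o.lines
    WHERE l.product_name = "T-shirt"
      AND STARTSWITH(o.order_date, "2016-09-30")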

Introducing NoSQL Databases

The SQL language is the clearest expression of a relational database: a developer writes SQL both to describe the database structure (Data Definition Language, DDL) and to query and manipulate data (Data Manipulation Language, DML). NoSQL is essentially an alternative to relational SQL databases; the name stands for "No (more) SQL" or, better, "Not Only SQL". This kind of database doesn't use tables and rows. Document-oriented NoSQL databases use JSON as the persistence format, and JavaScript becomes the preferred language for interacting with them, writing queries, and performing operations.

They use the JSON format to persist data in an aggregated form: where the row of a relational table represents a single flat entity, a JSON document represents an aggregate, much like a Data Transfer Object (DTO).
NoSQL databases are designed with scalability in mind, to optimize the ingestion of web-scale data. They support eventual consistency, which means that reads become consistent with writes only after a non-predictable amount of time. This is substantially different from the ACID consistency that relational databases implement.
This trade-off is acceptable in many scenarios. Think about the Internet of Things: you need to collect lots of data from connected devices all around the world, but you don't need to analyze it in real time; you don't need precise figures immediately, perhaps just some statistics. Essentially, you need to be able to collect all the data reliably, without any loss.

NoSQL databases are also partition-aware: they can subdivide the entire data store into partitions, and each partition can be placed on a different host to scale out ingestion capacity. This is quite different from relational databases, which must scale up because they cannot easily partition data to support distributed writes.
That is what NoSQL databases are good at, but it also requires some operational skills, as partitioning implies running a more complex service infrastructure.

In general, developers are reluctant to approach a database because they cannot reuse the SQL skills they probably learned at school. Developers are also focused on code, so they don't like thinking about operational issues such as distributed services and partitioning. However, they do appreciate a schema-less database that is resilient to iterative schema changes and that promotes code-first development, without Object-Relational Mappers.

Introducing Azure DocumentDb and its SQL language

There are many implementations of NoSQL databases, most of them created as open source projects. Microsoft has implemented its NoSQL databases only in the Azure space: it offers hosted services rather than on-premises, packaged NoSQL products.
Azure DocumentDb is its proposition for a NoSQL, JSON document-based database. As a hosted solution, you interact with it only through the public REST API (and the SDKs built on top of it), besides the Portal. It is a highly reliable service: Azure DocumentDb is natively deployed with replica support. Every write performed on the primary replica is replicated to two secondary replicas located in different fault domains, and every read is distributed over the three copies to avoid bottlenecks. To ensure performance, DocumentDb databases are hosted on machines with SSD drives.
To interact with DocumentDb from the Azure Portal, you need to create a Database Account:
[Figure: creating a new DocumentDb account]
From the Portal, select DocumentDb and create a new DocumentDb account:
[Figure: the new DocumentDb account blade]
To create a new account, we need to specify a name, a resource group, and a location. A database account is mainly two things:

  • a unit of authorization
  • a unit of consistency

As a unit of authorization, a database account provides two sets of keys. A master key is necessary to perform all read/write operations to the database account. A read-only key can be used to perform read-only queries.
[Figure: DocumentDb account keys]
As a unit of consistency, the account defines a default consistency level for all the databases and collections it contains. You can select among four levels of consistency:

  • Strong: The client always sees completely consistent data. It is the consistency level with the slowest reads/writes, but it can be used for critical applications like the stock market, banking, and airline reservations.
  • Session: The default consistency level. A client reads its own writes, but other clients reading the same data might see older values.
  • Bounded Staleness: The client might see old data, but it can specify a limit for how old that data can be (for example, 2 seconds). Updates are observed in the order in which they were made. It is stronger than the Session level: reads lag behind writes by at most the configured bound, while the order of updates is still preserved.
  • Eventual: The client might see old data for as long as it takes a write to propagate to all replicas. This is for high performance and availability, but a client may sometimes read out-of-date information or see updates that are out of order.

Consistency may also be specified at the query level. Future releases will update this area.
[Figure: selecting the default consistency level]
A database account is not a database: a database account can contain multiple databases. With the October 2016 update, the Portal presents an interface that invites users to create collections rather than databases. This is because the notion of a database inside DocumentDb is a sort of placeholder, a namespace, and creating one is now folded into the creation of collections.

Azure DocumentDb Collections

If we have a relational database in mind, we may recall that we store entities as rows inside tables. Collections are very different.
A table is a static container of homogeneous, flat entities. A table ensures that every entity is composed of scalar properties, as it cannot contain more complex structures, and it guarantees that all entities have the same properties, by name and by type. To create more complex schemas, a relational database uses multiple tables, relations, and referential integrity; different entities are stored in different tables.

In DocumentDb, collections are schema-less containers of JSON documents. This means that a single collection can contain several different kinds of entities, since it does not enforce any schema. DocumentDb cannot even ensure that two distinct instances of the same kind of document, say an internet order, contain the same properties: they may differ because of an error or because the specification has been updated over time.
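As an illustration (these two documents are assumptions made for this post, not taken from the original), the same collection could happily store both of the following order documents, even though the second one carries a property that the first one lacks:

    { "id": "order-1", "type": "InternetOrder", "order_total": 25.0 }
    { "id": "order-2", "type": "InternetOrder", "order_total": 18.0, "coupon_code": "FLASH50" }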

Multiple collections inside a database are often necessary, because a collection is both a unit of partitioning and a unit of throughput. It is a unit of partitioning because it can contain no more than 10GB of data; to store more than 10GB, you must create multiple collections (partitions). It is a unit of throughput because it can perform only a limited number of operations per second. Operations are measured in Request Units (RU), and each operation consumes a different quantity of RUs. Every collection has a quota of Request Units per second that it can spend on operations. This number is selected when a collection is created and, in some situations, can be changed later if needed. This is a sample reference table with estimated costs, in RUs, for operations performed on a collection:

Operation                                                                       RU consumed (estimated)
Reading a single 1 KB document                                                  1
Reading a single 2 KB document                                                  2
Query with a simple predicate returning a 1 KB document                         3
Creating a single 1 KB document with 10 JSON properties (consistent indexing)   14
Creating a single 1 KB document with 100 JSON properties (consistent indexing)  20
Replacing a single 1 KB document                                                28
Executing a stored procedure with two document creates                          30

These are only reference values, because actual costs depend on the real size and shape of the documents involved. A web-based tool is available to estimate expenses more precisely by testing with your own documents.
If, within one second, you issue requests worth more than your provisioned throughput (say, 2,500 RUs), DocumentDb throttles the requests. This is due to the multi-tenant nature of DocumentDb: it must guarantee the performance declared when you created the collection, no more, no less. Another quota relates to size: a collection cannot hold more than 10GB. To use more space, you need to create other collections. This is why a collection is a unit of partitioning: you create multiple collections to obtain all the space you may need, and then you have to specify how documents are distributed over them. There have been recent updates in this area: you can use a client-side partition resolver in every query, and you can now also rely on server-side partitioning with partitioned collections.
To create a collection, go to the DocumentDb account:
[Figure: creating a DocumentDb collection]

Exploring Collections

To create our first documents, we can use Document Explorer (available under the left side navigation menu):
[Figure: selecting Document Explorer]
With Document Explorer, you can interactively write JSON documents directly inside a collection.
[Figure: Document Explorer showing a JSON document]
Here are a few additional details about writing inside of a collection:

  • JSON syntax uses curly brackets for objects and square brackets for arrays.
  • IDs are typically GUIDs, as there is no server-side mechanism that guarantees a sequential number.
  • Information is contained inside the object, so, for example, you do not need to look up customers or products to decode the customer or product name.
  • customer_id and product_id probably refer to other documents: this means that information is not always embedded. Information may be embedded or referenced.
  • Calculations can be embedded inside the document.
  • An "id" property is generated automatically if not otherwise specified, and "id" must be unique inside a collection.

Now we can try experimenting with Query Explorer. Clicking Query Explorer opens a new blade that shows the query and the collection over which it is executed. Note that a query is executed inside a single collection: it has no knowledge of other collections, so all the documents it works on must be contained in that collection.
[Figure: Query Explorer running a simple query]
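The screenshot is not reproduced here; as an illustration, a query of this kind against the order documents sketched earlier (property names are assumptions) and the shape of its response could look like this:

    SELECT c.id, c.customer_name, c.order_total
    FROM c
    WHERE c.order_total > 20

    [
      {
        "id": "f3b1d2c4-9a7e-4c1b-8d2f-0a5e6b7c8d9e",
        "customer_name": "John Doe",
        "order_total": 25.0
      }
    ]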
Also, note that:

  • You express a query with a SQL-like language.
  • Results are always a JSON array.
  • Each query is "charged" against the collection's Request Unit quota.

We can build a more complex query by looking inside the arrays nested in each document. For example, we can use a join when we need a query focused on the order items rather than on the orders: a query that lists the ordered products independently from the order in which they were defined.
[Figure: Query Explorer running a join query]
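The screenshot is not reproduced here either; as an illustration, such a query, using the intra-document JOIN ... IN syntax over the lines array assumed in the earlier example, could look like this:

    SELECT o.id AS order_id, l.product_name, l.quantity, l.line_total
    FROM o
    JOIN l IN o.lines

Each item in the result represents a single order line flattened out of its parent order: the join works within each document's own array, not across collections.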

Conclusion

DocumentDb is a great opportunity to approach a new database genre, because its frictionless approach for developers uses objects that map natively to JSON and can be queried with a SQL-like language. DocumentDb is tailored for new web and cloud-scale scenarios and for cloud solutions that use multiple data sources.

In future articles, we'll explore partitioning in more detail and use DocumentDb from the REST API and the SDKs.

 


Written by

Marco Parenzan

Marco Parenzan is a Research Lead for Microsoft Azure at Cloud Academy. He has been awarded Microsoft MVP status for Microsoft Azure three times. He speaks at major community events in Italy about Azure and .NET development, and he is a community lead for 1nn0va, an official Microsoft community in Pordenone, Italy. He wrote a book on Azure in 2016. He loves IoT and retrogaming.

