Azure DocumentDb for the Enterprise Market

Azure DocumentDb: In today’s database wars, we often find developers and database administrators on opposite sides. Developers often experiment with new libraries, software, and methodologies to write simpler and elegant code, especially when talking about data management, and tend to throw away legacy products quite easily. On the other side, database administrators with skills in the relational space tend to be more conservative and don’t believe that there could be something different from their beloved SQL language.

However, there are some good reasons for embracing the new opportunities presented by new NoSQL databases. In this post, we’d like to clear up some of the biggest misconceptions about NoSQL databases and introduce how developers can use document-based databases, specifically, Microsoft Azure’s DocumentDb.

ERP, Accounting and the relational approach

The 1980s were the pioneering era of Personal Computing (PC), the period when computers first began to show their potential. In the enterprise world, the PC empowered the daily routines of every office by structuring and automating many daily organizational tasks, often through applications such as Enterprise Resource Planning (ERP) and Accounting.
If the ’80s were about PCs, the ’90s were the decade of connecting computers within the enterprise using a LAN. Despite the introduction of the Web and the browser, ERP and Accounting are essentially the same as they were 30 years ago.

Today, we make relational databases responsible for data integrity by giving them the task of structuring data in normalized tables and relations. We assume that if data is contained in a relational database, we will always have full access to it to do whatever we want. In reality, if we could check the implementations of many applications, we would find that the database is not really responsible for data correctness and integrity. Developers tend to use different code for this. But I don’t want to discuss how a relational database can be used to do what it is supposed to do. I don’t want to move ERP into the NoSQL world because I believe that the relational model is better for ERP. I think that this is also true for Accounting software.

Instead, I want to talk about a sort of laziness in how developers approach new technologies and the feeling that we are moving away from relational databases.

The Internet Era

The biggest misconceptions about databases come from the Internet era. It is not about ERPs and Accounting software, which are fine just the way they are. Instead, many of today’s misconceptions about databases come from a whole new class of applications that are being addressed by a combination of new (web-based) and old (relational database) technologies. These applications are still useful today, but there are other possibilities.

For example, suppose you want to implement a (now) classic e-commerce website and you want to design for the purchased order. You create a flash sale event that offers a 50% discount for purchases within an hour. Because you sell worldwide, you could receive up to 10 times the normal number of orders during this time.
You have two options. If this is a sporadic event, you can choose to scale up the database server to handle the increased requests. You will pay more for it, but the success of the campaign should cover the additional expense.
You can also decide to schedule this type of campaign more frequently over time. To optimize the expenses of scaling up, you need to think of a better way for your application to handle this scenario.

Modeling database for internet purchase orders

With a relational database, this is one way to model the purchase order database:
internetorder-relationalmodel
You have an InternetOrders entity table that represents the order, with order total and other order-related information. It refers to a Customer table that represents the user that made the orders. Another table, InternetOrderDetails, contains the rows that represent all products purchased inside a single order, with quantity and price. Each detail references a Products table with unit price and other information related to the order.
To carry out an order, you need to perform (n+1) insertions in tables and (n+1) check constraints to verify that the data is integral. Once a user has submitted the checkout command from the web server, all operations are executed on the backend. Then, the backend service will carry the order to the relational database. This is just a simplification of the process, as a real world order would require additional operations. Still, you can perceive that in a flash sale, the number of database operations will greatly increase and the database operation will suffer.

We can try to develop some hypotheses as to why this is the case. We develop frontend and backend applications where separation is about organizational needs. Therefore, we have control over the data. We already have correct Customer and Product catalogs. We also need to include the current customer and product information inside an InternetOrder, as they can change over time. However, for traceability reasons, we need to do so at the moment of the order. For example, if Customer information changes, we don’t need to modify past orders.
In this scenario, if we use a JSON representation, we can write an InternetOrder in the following way:
json-internetorder
Suppose you have a database that can save this document in a single write without separating every single entity (represented by the curly braces) in distinct tables and different insert operations. Also, suppose that you need a daily count of how many T-shirts you sold on a previous day. To do so, you need to aggregate orders on a product basis and discover the total quantity and costs, but you would do it the day after, not while customers are placing their orders. This is what NoSQL databases are good at.

Introducing NoSQL Databases

SQL language is the best representation of a relational database. A developer writes SQL code to describe the database structure (Data Definition Language – DDL) and database queries (Data Manipulation Language – DML). NoSQL is essentially an alternative to relational SQL databases. The name refers to “No (more) SQL”, or, better, “Not Only SQL”. This kind of database doesn’t use tables and rows. NoSQL databases use JSON as a persistence format, and JavaScript becomes the preferred language for interactions, queries, and performing operations.

They use JSON format to persist data in an aggregated format. If the row inside a relational table is an entity, the Data Transfer Object (DTO) is the data aggregation format that JSON prefers.
NoSQL is developed with scalability in mind to optimize the ingestion of web data. NoSQL supports eventual consistency, which means that queries will be consistent with write in a non-predictable time. It is substantially different from the ACID consistency that relational databases implement. You have an unpredictable amount of time when queries will be consistent with writes.
This happens in many scenarios. Think about the Internet of Things. You need to collect lots of data from connected devices all around the world, but you don’t need to analyze it in real time; you don’t need precise data immediately, but perhaps just some statistics. Essentially, you need to be able to reliably collect all data, without any loss.

We say that NoSQL databases are partition-aware. They can subdivide the entire data store into partitions and each one can be distributed in different hosts to scale out the ingestion capacity. This is quite different from relational databases that must scale up because they cannot create partitions to support distributed writes.
That is what NoSQL databases are good at, but it also requires some operational skills, as partitioning implies handling a complex service infrastructure.

In general, developers are afraid to approach a database because they cannot reuse the SQL skills that they probably learned at school. Developers are also focused on code, so they don’t like thinking about operational issues such as distributed services and partitioning. However, they do appreciate a schema-less Db that is resilient to iterative schema changes and that promotes code-first development, without using Object to Relational Mappers.

Introducing Azure DocumentDb and its SQL language

There are many open source implementations of NoSQL databases, which are created as open source contributions. Microsoft has implemented NoSQL databases only in the Azure space; otherwise, they only offer hosted implementations and no on-premise, packaged NoSQL databases.
Azure DocumentDb is the proposition of a NoSQL, JSON document-based database. As a hosted solution, you can interact with it only using the public REST API, other than the Portal. It is a highly reliable service, as Azure DocumentDb is natively deployed with replica support. Every write performed on the primary database is replicated on two secondary distributed replicas, in different fault domains. Every read is distributed over the three copies to avoid bottlenecks. To ensure performance, DocumentDb databases are hosted on machines with SSD drives.
To interact with DocumentDb from the Azure Portal, you need to create a Database Account:
create-new-documentdb-account
From the Portal, select DocumentDb, where it is a DocumentDb Account:
new-documentdb-account
To create a new account, we need to specify a name, a resource group, and a location. A database account is mainly two things:

  • unit of authorization
  • unit of consistency

As a unit of authorization, a database account provides two sets of keys. A master key is necessary to perform all read/write operations to the database account. A read-only key can be used to perform read-only queries.
documentdb-keys
As a unit of consistency, as we can select a global consistency for all databases and collections that it contains. You can select four levels of consistency:

  • Strong: The client always sees completely consistent data. It is the consistency level with the slowest reads/writes, but it can be used for critical applications like the stock market, banking, and airline reservations.
  • Session: The default consistency level: A client reads its own writes, but other clients reading this same data might see older values.
  • Bounded Staleness: The client might see old data, but it can specify a limit for how old that data can be (ex. 2 seconds). Updates happen in the order that they are received. It is similar to the Session consistency level, but speeds up reads while still preserving the order of updates.
  • Eventual: The client might see old data for as long as it takes a write to propagate to all replicas. This is for high performance and availability, but a client may sometimes read out-of-date information or see updates that are out of order.

Consistency may also be specified at the query level. Future releases will update this area.
documendb-default-consistency
A database account is not a database, while a database account can contain multiple databases. With a recent October 2016 update, portal presents an interface that invites users to create collections and not a database. This is because the notion of a database inside DocumentDb is a sort of placeholder, a namespace. It is now included inside the creation of collections.

Azure DocumentDb Collections

If we have a relational database in mind, we may recall that we can store entities as rows inside tables. But collections are very different.
A table is a static container of homogenous and flat entities. A table ensures that every flat entity is composed of scalar properties as it cannot contain more complex structures, and it guarantees that all entities have the same properties, by name and by type. To create more complex schemas, a relational database uses multiple tables, relations, and referential integrity. Different entities are contained in different tables.

In DocumentDb, collections are schema-less containers of JSON documents. This means that a single collection can contain multiple different entities as it does not ensure any schema. DocumentDb cannot ensure that two distinct instances of the same kind of document, say an Internet Order, contain the same properties because of an error or specification updates.

Multiple collections inside a database are necessary, as a collection is both a unit of partitioning and a unit of throughput. It is a unit of partitioning because it can contain no more that 10Gb of data; with more than 10Gb of data, you must create multiple partitions. It is a unit of throughput because it can perform a limited number of operations. Operations are measured in terms of Resource Units (RU) and each operation can consume a different quantity of RU. Every collection has a quota of Request Units per second that it can spend in operations. This number is selected where a collection is created and, in some situations, can change if needed. This is a sample reference table containing costs in terms of operations performed on the collection:

OperationRU Consumed (estimated)
Reading a single 1KB document1
Reading a single 2KB document2
Query with a simple predicate for a 1KB document3
Creating a single 1 KB document with 10 JSON properties (consistent indexing)14
Create a single 1 KB document with 100 JSON properties (consistent indexing)20
Replacing a single 1 KB document28
Execute a stored procedure with two create documents30

These are only reference values because they depend on the real size of the document handled. An available web tool can help you more precisely calculate expenses to test with your document.
If in a second you try making requests for more than, say, 2500RUs, DocumentDb throttles the request. This is due to the multitenancy nature of DocumentDb. It must guarantee the performance declared when you create a collection; no more, no less. Another quota relates to size: A collection cannot be more than 10Gb. To use more space, you need to create other collections. This is why a collection is a unit of partitioning: You need to create multiple collections to handle all of the space you may need, and then you need to specify how to distribute documents over multiple collections. There are some recent updates in this field because you can use a client-side partition resolver in every query. As of late, you can also support server-side partitioning with partitioned collections.
To create a collection, go to the DocumentDb account:
create-documentdb-collection

Exploring Collections

To create our first documents, we can use Document Explorer (available under the left side navigation menu):
select-document-explorer
With Document Explorer, you can interactively write JSON documents directly inside a collection.
document-explorer-0001
Here are a few additional details about writing inside of a collection:

  • JSON syntax uses curly brackets for objects and squared brackets for arrays.
  • IDs are typically GUIDs as there is no tool that can guarantee any number sequence.
  • Information is contained inside the object, so for example, you do not need to connect customers or products to decode the product or customer name.
  • customer_id and product_id probably refer to other documents: This means that information is not always embedded. Information may be embedded or referenced.
  • Calculations can be embedded inside the document.
  • Id is always automatically updated as “id” property if not otherwise specified. “id” must be unique inside of a collection.

Now we can try experimenting with Query Explorer. When you click Query Explorer, this opens a new blade that shows the query and the collection over which it is executed. Note that the query is executed inside a collection: It has no knowledge about other collections, so all documents must be contained in that collection.
query-explorer-0001
Also, note that:

  • You express a query with a SQL-like language.
  • Results are always a JSON array.
  • A query is “charged” inside a Request Unit quota up to the collection.

We can perform a more complex query if we look inside inner collections. For example, we can make a join because we need to create a different query focused on order item and not on the order. We want to create a query that lists ordered products independently from the order where they were defined.
query-explorer-0002

Conclusion

DocumentDb is a great opportunity to approach a new database genre because its friction-less approach for developers uses objects that natively map with JSON and can be queried with an SQL-like language. DocumentDb is tailored for new Web and cloud-scale scenarios and for new cloud solutions that use multiple data sources.

In future articles, we’ll explore partitioning in more detail using DocumentDb from the REST API and Sdks.

 

Avatar

Written by

Marco Parenzan

Marco Parenzan is a Research Lead for Microsoft Azure in Cloud Academy. He has been awarded three times as a Microsoft MVP on Microsoft Azure. He is a speaker in major community events in Italy about Azure and .NET development and he is a community lead for 1nn0va, an official Microsoft community in Pordenone, Italy. He has written a book on Azure in 2016. He loves IoT and retrogaming.


Related Posts

Avatar
Cloud Academy Team
— July 9, 2020

Which Certifications Should I Get?

The old AWS slogan, “Cloud is the new normal” is indeed a reality today. Really, cloud has been the new normal for a while now and getting credentials has become an increasingly effective way to quickly showcase your abilities to recruiters and companies.With all that in mind, the s...

Read more
  • AWS
  • Azure
  • Certifications
  • Cloud Computing
  • Google Cloud Platform
Alisha Reyes
Alisha Reyes
— July 2, 2020

New Content: AWS, Azure, Typescript, Java, Docker, 13 New Labs, and Much More

This month, our Content Team released a whopping 13 new labs in real cloud environments! If you haven't tried out our labs, you might not understand why we think that number is so impressive. Our labs are not “simulated” experiences — they are real cloud environments using accounts on A...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Joe Nemer
Joe Nemer
— June 19, 2020

Kickstart Your Tech Training With a Free Week on Cloud Academy

Are you looking to make a jump in your technical career? Want to get trained or certified on AWS, Azure, Google Cloud Platform, DevOps, Kubernetes, Python, or another in-demand skill?Then you'll want to mark your calendar. Starting Monday, June 22 at 12:00 a.m. PDT (3:00 a.m. EDT), ...

Read more
  • AWS
  • Azure
  • cloud academy content
  • complimentary access
  • GCP
  • on the house
Joe Nemer
Joe Nemer
— June 12, 2020

Azure Certifications: Our Experts Explain Which Is Best for You

How do you choose an Azure certification? It can be hard to get started when choosing an Azure certification. There are so many to sift through, so many interesting options, and it requires a time commitment to just understand the cert landscape.To help guide you through the select...

Read more
  • AZ-900
  • Azure
  • Certifications
Alisha Reyes
Alisha Reyes
— June 11, 2020

New Content: AZ-500 and AZ-400 Updates, 3 Google Professional Exam Preps, Practical ML Learning Path, C# Programming, and More

This month, our Content Team released tons of new content and labs in real cloud environments. Not only that, but we introduced our very first highly interactive "Office Hours" webinar. This webinar, Acing the AWS Solutions Architect Associate Certification, started with a quick overvie...

Read more
  • AWS
  • Azure
  • DevOps
  • Google Cloud Platform
  • Machine Learning
  • programming
Rebecca Willis
Rebecca Willis
— June 3, 2020

Azure vs. AWS: Which Certification Provides the Brighter Future?

More and more companies are using cloud services, prompting more and more people to switch their current IT position to something cloud-related. The problem is most people only have that much time after work to learn new technologies, and there are plenty of cloud services that you can ...

Read more
  • AWS
  • Azure
  • certification
Alisha Reyes
Alisha Reyes
— June 2, 2020

Blog Digest: 5 Reasons to Get AWS Certified, OWASP Top 10, Getting Started with VPCs, Top 10 Soft Skills, and More

Thank you for being a valued member of our community! We recently sent out a short survey to understand what type of content you would like us to add to Cloud Academy, and we want to thank everyone who gave us their input. If you would like to complete the survey, it's not too late. It ...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • OWASP
  • OWASP Top 10
  • Security
  • VPCs
Alisha Reyes
Alisha Reyes
— May 11, 2020

New Content: Alibaba, Azure Cert Prep: AI-100, AZ-104, AZ-204 & AZ-400, Amazon Athena Playground, Google Cloud Developer Challenge, and much more

This month, our Content Team released 8 new learning paths, 4 courses, 7 labs in real cloud environments, and 4 new knowledge check assessments. Not only that, but we introduced our very first course on Alibaba Cloud, and our expert instructors are working 'round the clock to create 6 n...

Read more
  • alibaba
  • AWS
  • Azure
  • gitops
  • Google Cloud Platform
  • lab playground
  • programming
Alisha Reyes
Alisha Reyes
— May 1, 2020

Introducing Our Newest Lab Environments: Lab Playgrounds

Want to train in a real cloud environment, but feel slowed down by spinning up your own deployments? When you consider security or pricing costs, it can be costly and challenging to get up to speed quickly for self-training. To solve this problem, Cloud Academy created a new suite of la...

Read more
  • AWS
  • Azure
  • Docker
  • Google Cloud Platform
  • Java
  • lab playgrounds
  • Python
Alisha Reyes
Alisha Reyes
— April 30, 2020

Blog Digest: AWS Breaking News, Azure DevOps, AWS Study Guide, 8 Ways to Prevent a Ransomware Attack, and More

  New articles by topicAWS Azure Data Science Google Cloud  Cloud Adoption Platform Updates & New Content Security Women in TechAWSBreaking News: All AWS Certification Exams Now Available Online As an Advanced AWS Technology Partner, C...

Read more
  • AWS
  • Azure
  • blog digest
  • Certifications
  • Cloud Academy
  • programming
  • Security
Alisha Reyes
Alisha Reyes
— April 9, 2020

New on Cloud Academy: AWS Solutions Architect Exam Prep, Azure Courses, GCP Engineer Exam Prep, Programming, and More

Free content on Cloud Academy More and more customers are relying on our technology and content to keep upskilling their people in these months, and we are doing our best to keep supporting them. While the world fights the COVID-19 pandemic, we wanted to make a small contribution to he...

Read more
  • AWS
  • Azure
  • Google Cloud Platform
  • programming
Avatar
Logan Rakai
— April 7, 2020

How to Effectively Use Azure DevOps

Azure DevOps is a suite of services that collaborate on software development following DevOps principles. The services in Azure DevOps are:Azure Repos for hosting Git repositories for source control of your code Azure Boards for planning and tracking your work using proven agil...

Read more
  • Azure
  • DevOps