Azure App Service in action
In the real world, there are generally two possibilities for hosting your web applications or APIs in the cloud. You can follow the classic model a...Learn More
Azure DocumentDb: In today’s database wars, we often find developers and database administrators on opposite sides. Developers often experiment with new libraries, software, and methodologies to write simpler and elegant code, especially when talking about data management, and tend to throw away legacy products quite easily. On the other side, database administrators with skills in the relational space tend to be more conservative and don’t believe that there could be something different from their beloved SQL language.
However, there are some good reasons for embracing the new opportunities presented by new NoSQL databases. In this post, we’d like to clear up some of the biggest misconceptions about NoSQL databases and introduce how developers can use document-based databases, specifically, Microsoft Azure’s DocumentDb.
The 1980s were the pioneering era of Personal Computing (PC), the period when computers first began to show their potential. In the enterprise world, the PC empowered the daily routines of every office by structuring and automating many daily organizational tasks, often through applications such as Enterprise Resource Planning (ERP) and Accounting.
If the ’80s were about PCs, the ’90s were the decade of connecting computers within the enterprise using a LAN. Despite the introduction of the Web and the browser, ERP and Accounting are essentially the same as they were 30 years ago.
Today, we make relational databases responsible for data integrity by giving them the task of structuring data in normalized tables and relations. We assume that if data is contained in a relational database, we will always have full access to it to do whatever we want. In reality, if we could check the implementations of many applications, we would find that the database is not really responsible for data correctness and integrity. Developers tend to use different code for this. But I don’t want to discuss how a relational database can be used to do what it is supposed to do. I don’t want to move ERP into the NoSQL world because I believe that the relational model is better for ERP. I think that this is also true for Accounting software.
Instead, I want to talk about a sort of laziness in how developers approach new technologies and the feeling that we are moving away from relational databases.
The biggest misconceptions about databases come from the Internet era. It is not about ERPs and Accounting software, which are fine just the way they are. Instead, many of today’s misconceptions about databases come from a whole new class of applications that are being addressed by a combination of new (web-based) and old (relational database) technologies. These applications are still useful today, but there are other possibilities.
For example, suppose you want to implement a (now) classic e-commerce website and you want to design for the purchased order. You create a flash sale event that offers a 50% discount for purchases within an hour. Because you sell worldwide, you could receive up to 10 times the normal number of orders during this time.
You have two options. If this is a sporadic event, you can choose to scale up the database server to handle the increased requests. You will pay more for it, but the success of the campaign should cover the additional expense.
You can also decide to schedule this type of campaign more frequently over time. To optimize the expenses of scaling up, you need to think of a better way for your application to handle this scenario.
With a relational database, this is one way to model the purchase order database:
You have an InternetOrders entity table that represents the order, with order total and other order-related information. It refers to a Customer table that represents the user that made the orders. Another table, InternetOrderDetails, contains the rows that represent all products purchased inside a single order, with quantity and price. Each detail references a Products table with unit price and other information related to the order.
To carry out an order, you need to perform (n+1) insertions in tables and (n+1) check constraints to verify that the data is integral. Once a user has submitted the checkout command from the web server, all operations are executed on the backend. Then, the backend service will carry the order to the relational database. This is just a simplification of the process, as a real world order would require additional operations. Still, you can perceive that in a flash sale, the number of database operations will greatly increase and the database operation will suffer.
We can try to develop some hypotheses as to why this is the case. We develop frontend and backend applications where separation is about organizational needs. Therefore, we have control over the data. We already have correct Customer and Product catalogs. We also need to include the current customer and product information inside an InternetOrder, as they can change over time. However, for traceability reasons, we need to do so at the moment of the order. For example, if Customer information changes, we don’t need to modify past orders.
In this scenario, if we use a JSON representation, we can write an InternetOrder in the following way:
Suppose you have a database that can save this document in a single write without separating every single entity (represented by the curly braces) in distinct tables and different insert operations. Also, suppose that you need a daily count of how many T-shirts you sold on a previous day. To do so, you need to aggregate orders on a product basis and discover the total quantity and costs, but you would do it the day after, not while customers are placing their orders. This is what NoSQL databases are good at.
They use JSON format to persist data in an aggregated format. If the row inside a relational table is an entity, the Data Transfer Object (DTO) is the data aggregation format that JSON prefers.
NoSQL is developed with scalability in mind to optimize the ingestion of web data. NoSQL supports eventual consistency, which means that queries will be consistent with write in a non-predictable time. It is substantially different from the ACID consistency that relational databases implement. You have an unpredictable amount of time when queries will be consistent with writes.
This happens in many scenarios. Think about the Internet of Things. You need to collect lots of data from connected devices all around the world, but you don’t need to analyze it in real time; you don’t need precise data immediately, but perhaps just some statistics. Essentially, you need to be able to reliably collect all data, without any loss.
We say that NoSQL databases are partition-aware. They can subdivide the entire data store into partitions and each one can be distributed in different hosts to scale out the ingestion capacity. This is quite different from relational databases that must scale up because they cannot create partitions to support distributed writes.
That is what NoSQL databases are good at, but it also requires some operational skills, as partitioning implies handling a complex service infrastructure.
In general, developers are afraid to approach a database because they cannot reuse the SQL skills that they probably learned at school. Developers are also focused on code, so they don’t like thinking about operational issues such as distributed services and partitioning. However, they do appreciate a schema-less Db that is resilient to iterative schema changes and that promotes code-first development, without using Object to Relational Mappers.
There are many open source implementations of NoSQL databases, which are created as open source contributions. Microsoft has implemented NoSQL databases only in the Azure space; otherwise, they only offer hosted implementations and no on-premise, packaged NoSQL databases.
Azure DocumentDb is the proposition of a NoSQL, JSON document-based database. As a hosted solution, you can interact with it only using the public REST API, other than the Portal. It is a highly reliable service, as Azure DocumentDb is natively deployed with replica support. Every write performed on the primary database is replicated on two secondary distributed replicas, in different fault domains. Every read is distributed over the three copies to avoid bottlenecks. To ensure performance, DocumentDb databases are hosted on machines with SSD drives.
To interact with DocumentDb from the Azure Portal, you need to create a Database Account:
From the Portal, select DocumentDb, where it is a DocumentDb Account:
To create a new account, we need to specify a name, a resource group, and a location. A database account is mainly two things:
As a unit of authorization, a database account provides two sets of keys. A master key is necessary to perform all read/write operations to the database account. A read-only key can be used to perform read-only queries.
As a unit of consistency, as we can select a global consistency for all databases and collections that it contains. You can select four levels of consistency:
Consistency may also be specified at the query level. Future releases will update this area.
A database account is not a database, while a database account can contain multiple databases. With a recent October 2016 update, portal presents an interface that invites users to create collections and not a database. This is because the notion of a database inside DocumentDb is a sort of placeholder, a namespace. It is now included inside the creation of collections.
If we have a relational database in mind, we may recall that we can store entities as rows inside tables. But collections are very different.
A table is a static container of homogenous and flat entities. A table ensures that every flat entity is composed of scalar properties as it cannot contain more complex structures, and it guarantees that all entities have the same properties, by name and by type. To create more complex schemas, a relational database uses multiple tables, relations, and referential integrity. Different entities are contained in different tables.
In DocumentDb, collections are schema-less containers of JSON documents. This means that a single collection can contain multiple different entities as it does not ensure any schema. DocumentDb cannot ensure that two distinct instances of the same kind of document, say an Internet Order, contain the same properties because of an error or specification updates.
Multiple collections inside a database are necessary, as a collection is both a unit of partitioning and a unit of throughput. It is a unit of partitioning because it can contain no more that 10Gb of data; with more than 10Gb of data, you must create multiple partitions. It is a unit of throughput because it can perform a limited number of operations. Operations are measured in terms of Resource Units (RU) and each operation can consume a different quantity of RU. Every collection has a quota of Request Units per second that it can spend in operations. This number is selected where a collection is created and, in some situations, can change if needed. This is a sample reference table containing costs in terms of operations performed on the collection:
|Operation||RU Consumed (estimated)|
|Reading a single 1KB document||1|
|Reading a single 2KB document||2|
|Query with a simple predicate for a 1KB document||3|
|Creating a single 1 KB document with 10 JSON properties (consistent indexing)||14|
|Create a single 1 KB document with 100 JSON properties (consistent indexing)||20|
|Replacing a single 1 KB document||28|
|Execute a stored procedure with two create documents||30|
These are only reference values because they depend on the real size of the document handled. An available web tool can help you more precisely calculate expenses to test with your document.
If in a second you try making requests for more than, say, 2500RUs, DocumentDb throttles the request. This is due to the multitenancy nature of DocumentDb. It must guarantee the performance declared when you create a collection; no more, no less. Another quota relates to size: A collection cannot be more than 10Gb. To use more space, you need to create other collections. This is why a collection is a unit of partitioning: You need to create multiple collections to handle all of the space you may need, and then you need to specify how to distribute documents over multiple collections. There are some recent updates in this field because you can use a client-side partition resolver in every query. As of late, you can also support server-side partitioning with partitioned collections.
To create a collection, go to the DocumentDb account:
To create our first documents, we can use Document Explorer (available under the left side navigation menu):
With Document Explorer, you can interactively write JSON documents directly inside a collection.
Here are a few additional details about writing inside of a collection:
Now we can try experimenting with Query Explorer. When you click Query Explorer, this opens a new blade that shows the query and the collection over which it is executed. Note that the query is executed inside a collection: It has no knowledge about other collections, so all documents must be contained in that collection.
Also, note that:
We can perform a more complex query if we look inside inner collections. For example, we can make a join because we need to create a different query focused on order item and not on the order. We want to create a query that lists ordered products independently from the order where they were defined.
DocumentDb is a great opportunity to approach a new database genre because its friction-less approach for developers uses objects that natively map with JSON and can be queried with an SQL-like language. DocumentDb is tailored for new Web and cloud-scale scenarios and for new cloud solutions that use multiple data sources.
In future articles, we’ll explore partitioning in more detail using DocumentDb from the REST API and Sdks.
Harvard Business Review recently estimated that some 90% of corporate training never gets applied on the job. Given the $200B training industry, that is a staggering amount of waste. One reason for the disconnect? Lack of context.Cloud Academy’s platform was built to make it extraor...
Microsoft Azure is the fastest growing cloud provider. Azure’s revenue grew an incredible 76% in the last quarter of 2018. As more and more businesses move their IT infrastructure to Microsoft’s cloud platform, the demand for Azure professionals keeps rising. Since there are relatively ...
So just what is Heroku? It's a service for developers eager to get their applications online without having to worry about infrastructure details.Metered, pay-as-you-go Cloud Computing services come in all kinds of flavors. Infrastructure as a Service (IaaS) offerings like AWS allow e...
Cloud Computing, like any computing, is a combination of CPU, memory, networking, and storage. Infrastructure as a Service (IaaS) platforms allow you to store your data in either Block Storage or Object Storage formats.Understanding the differences between these two formats - and how ...
The move to the cloud is picking up steam. As such, many corporations are beginning to find themselves supporting a mixture of on-prem apps as well as cloud apps. Users are finding that they need access to this mix of applications as well. As one would expect, this can become a challe...
Google Cloud Platform (GCP) has evolved from being a niche player to a serious competitor to Amazon Web Services and Microsoft Azure. In 2018, research firm Gartner placed Google in the Leaders quadrant in its Magic Quadrant for Cloud Infrastructure as a Service for the first time. In t...
This is the second of a two-part series covering Azure Stack. Our first post provided an introduction to Azure Stack. Why would your organization consider using Azure Stack? What are the key differences between Azure Stack and Microsoft Azure? In this post, we'll begin to answer bot...
Microsoft Ignite 2018 was a big success. Over 26,000 people attended Microsoft’s flagship conference for IT professionals in sunny Orlando, Florida. As usual, Microsoft made a huge number of announcements, ranging from minor to major in importance. To save you the trouble of sifting thr...
Cloud Academy is proud to be a sponsor of the Microsoft Ignite Conference to be held September 24 - 28 in Orlando, Florida. This is Microsoft’s biggest event of the year and is a great way to stay up to date on how to get the most from Microsoft’s products. In this post, I’ll help you p...
One of the main promises of cloud computing is access to nearly endless capacity. However, it doesn’t come cheap. With the introduction of Spot Instances for Amazon Web Services’ Elastic Compute Cloud (AWS EC2) in 2009, spot instances have been a way for major cloud providers to sell sp...
A Comparison of Machine Learning Services on AWS, Azure, and Google CloudArtificial intelligence and machine learning are steadily making their way into enterprise applications in areas such as customer support, fraud detection, and business intelligence. There is every reason to beli...
In on-premises environments, data security is typically a siloed activity, with a company's security team telling the internal technology groups (server administration, database, networking, and so on) what needs to be protected against intrusion.This approach is absolutely a bad idea...