Azure App Service in action
In the real world, there are generally two possibilities for hosting your web applications or APIs in the cloud. You can follow the classic model a...Learn More
Azure DocumentDb: In today’s database wars, we often find developers and database administrators on opposite sides. Developers often experiment with new libraries, software, and methodologies to write simpler and elegant code, especially when talking about data management, and tend to throw away legacy products quite easily. On the other side, database administrators with skills in the relational space tend to be more conservative and don’t believe that there could be something different from their beloved SQL language.
However, there are some good reasons for embracing the new opportunities presented by new NoSQL databases. In this post, we’d like to clear up some of the biggest misconceptions about NoSQL databases and introduce how developers can use document-based databases, specifically, Microsoft Azure’s DocumentDb.
The 1980s were the pioneering era of Personal Computing (PC), the period when computers first began to show their potential. In the enterprise world, the PC empowered the daily routines of every office by structuring and automating many daily organizational tasks, often through applications such as Enterprise Resource Planning (ERP) and Accounting.
If the ’80s were about PCs, the ’90s were the decade of connecting computers within the enterprise using a LAN. Despite the introduction of the Web and the browser, ERP and Accounting are essentially the same as they were 30 years ago.
Today, we make relational databases responsible for data integrity by giving them the task of structuring data in normalized tables and relations. We assume that if data is contained in a relational database, we will always have full access to it to do whatever we want. In reality, if we could check the implementations of many applications, we would find that the database is not really responsible for data correctness and integrity. Developers tend to use different code for this. But I don’t want to discuss how a relational database can be used to do what it is supposed to do. I don’t want to move ERP into the NoSQL world because I believe that the relational model is better for ERP. I think that this is also true for Accounting software.
Instead, I want to talk about a sort of laziness in how developers approach new technologies and the feeling that we are moving away from relational databases.
The biggest misconceptions about databases come from the Internet era. It is not about ERPs and Accounting software, which are fine just the way they are. Instead, many of today’s misconceptions about databases come from a whole new class of applications that are being addressed by a combination of new (web-based) and old (relational database) technologies. These applications are still useful today, but there are other possibilities.
For example, suppose you want to implement a (now) classic e-commerce website and you want to design for the purchased order. You create a flash sale event that offers a 50% discount for purchases within an hour. Because you sell worldwide, you could receive up to 10 times the normal number of orders during this time.
You have two options. If this is a sporadic event, you can choose to scale up the database server to handle the increased requests. You will pay more for it, but the success of the campaign should cover the additional expense.
You can also decide to schedule this type of campaign more frequently over time. To optimize the expenses of scaling up, you need to think of a better way for your application to handle this scenario.
With a relational database, this is one way to model the purchase order database:
You have an InternetOrders entity table that represents the order, with order total and other order-related information. It refers to a Customer table that represents the user that made the orders. Another table, InternetOrderDetails, contains the rows that represent all products purchased inside a single order, with quantity and price. Each detail references a Products table with unit price and other information related to the order.
To carry out an order, you need to perform (n+1) insertions in tables and (n+1) check constraints to verify that the data is integral. Once a user has submitted the checkout command from the web server, all operations are executed on the backend. Then, the backend service will carry the order to the relational database. This is just a simplification of the process, as a real world order would require additional operations. Still, you can perceive that in a flash sale, the number of database operations will greatly increase and the database operation will suffer.
We can try to develop some hypotheses as to why this is the case. We develop frontend and backend applications where separation is about organizational needs. Therefore, we have control over the data. We already have correct Customer and Product catalogs. We also need to include the current customer and product information inside an InternetOrder, as they can change over time. However, for traceability reasons, we need to do so at the moment of the order. For example, if Customer information changes, we don’t need to modify past orders.
In this scenario, if we use a JSON representation, we can write an InternetOrder in the following way:
Suppose you have a database that can save this document in a single write without separating every single entity (represented by the curly braces) in distinct tables and different insert operations. Also, suppose that you need a daily count of how many T-shirts you sold on a previous day. To do so, you need to aggregate orders on a product basis and discover the total quantity and costs, but you would do it the day after, not while customers are placing their orders. This is what NoSQL databases are good at.
They use JSON format to persist data in an aggregated format. If the row inside a relational table is an entity, the Data Transfer Object (DTO) is the data aggregation format that JSON prefers.
NoSQL is developed with scalability in mind to optimize the ingestion of web data. NoSQL supports eventual consistency, which means that queries will be consistent with write in a non-predictable time. It is substantially different from the ACID consistency that relational databases implement. You have an unpredictable amount of time when queries will be consistent with writes.
This happens in many scenarios. Think about the Internet of Things. You need to collect lots of data from connected devices all around the world, but you don’t need to analyze it in real time; you don’t need precise data immediately, but perhaps just some statistics. Essentially, you need to be able to reliably collect all data, without any loss.
We say that NoSQL databases are partition-aware. They can subdivide the entire data store into partitions and each one can be distributed in different hosts to scale out the ingestion capacity. This is quite different from relational databases that must scale up because they cannot create partitions to support distributed writes.
That is what NoSQL databases are good at, but it also requires some operational skills, as partitioning implies handling a complex service infrastructure.
In general, developers are afraid to approach a database because they cannot reuse the SQL skills that they probably learned at school. Developers are also focused on code, so they don’t like thinking about operational issues such as distributed services and partitioning. However, they do appreciate a schema-less Db that is resilient to iterative schema changes and that promotes code-first development, without using Object to Relational Mappers.
There are many open source implementations of NoSQL databases, which are created as open source contributions. Microsoft has implemented NoSQL databases only in the Azure space; otherwise, they only offer hosted implementations and no on-premise, packaged NoSQL databases.
Azure DocumentDb is the proposition of a NoSQL, JSON document-based database. As a hosted solution, you can interact with it only using the public REST API, other than the Portal. It is a highly reliable service, as Azure DocumentDb is natively deployed with replica support. Every write performed on the primary database is replicated on two secondary distributed replicas, in different fault domains. Every read is distributed over the three copies to avoid bottlenecks. To ensure performance, DocumentDb databases are hosted on machines with SSD drives.
To interact with DocumentDb from the Azure Portal, you need to create a Database Account:
From the Portal, select DocumentDb, where it is a DocumentDb Account:
To create a new account, we need to specify a name, a resource group, and a location. A database account is mainly two things:
As a unit of authorization, a database account provides two sets of keys. A master key is necessary to perform all read/write operations to the database account. A read-only key can be used to perform read-only queries.
As a unit of consistency, as we can select a global consistency for all databases and collections that it contains. You can select four levels of consistency:
Consistency may also be specified at the query level. Future releases will update this area.
A database account is not a database, while a database account can contain multiple databases. With a recent October 2016 update, portal presents an interface that invites users to create collections and not a database. This is because the notion of a database inside DocumentDb is a sort of placeholder, a namespace. It is now included inside the creation of collections.
If we have a relational database in mind, we may recall that we can store entities as rows inside tables. But collections are very different.
A table is a static container of homogenous and flat entities. A table ensures that every flat entity is composed of scalar properties as it cannot contain more complex structures, and it guarantees that all entities have the same properties, by name and by type. To create more complex schemas, a relational database uses multiple tables, relations, and referential integrity. Different entities are contained in different tables.
In DocumentDb, collections are schema-less containers of JSON documents. This means that a single collection can contain multiple different entities as it does not ensure any schema. DocumentDb cannot ensure that two distinct instances of the same kind of document, say an Internet Order, contain the same properties because of an error or specification updates.
Multiple collections inside a database are necessary, as a collection is both a unit of partitioning and a unit of throughput. It is a unit of partitioning because it can contain no more that 10Gb of data; with more than 10Gb of data, you must create multiple partitions. It is a unit of throughput because it can perform a limited number of operations. Operations are measured in terms of Resource Units (RU) and each operation can consume a different quantity of RU. Every collection has a quota of Request Units per second that it can spend in operations. This number is selected where a collection is created and, in some situations, can change if needed. This is a sample reference table containing costs in terms of operations performed on the collection:
|Operation||RU Consumed (estimated)|
|Reading a single 1KB document||1|
|Reading a single 2KB document||2|
|Query with a simple predicate for a 1KB document||3|
|Creating a single 1 KB document with 10 JSON properties (consistent indexing)||14|
|Create a single 1 KB document with 100 JSON properties (consistent indexing)||20|
|Replacing a single 1 KB document||28|
|Execute a stored procedure with two create documents||30|
These are only reference values because they depend on the real size of the document handled. An available web tool can help you more precisely calculate expenses to test with your document.
If in a second you try making requests for more than, say, 2500RUs, DocumentDb throttles the request. This is due to the multitenancy nature of DocumentDb. It must guarantee the performance declared when you create a collection; no more, no less. Another quota relates to size: A collection cannot be more than 10Gb. To use more space, you need to create other collections. This is why a collection is a unit of partitioning: You need to create multiple collections to handle all of the space you may need, and then you need to specify how to distribute documents over multiple collections. There are some recent updates in this field because you can use a client-side partition resolver in every query. As of late, you can also support server-side partitioning with partitioned collections.
To create a collection, go to the DocumentDb account:
To create our first documents, we can use Document Explorer (available under the left side navigation menu):
With Document Explorer, you can interactively write JSON documents directly inside a collection.
Here are a few additional details about writing inside of a collection:
Now we can try experimenting with Query Explorer. When you click Query Explorer, this opens a new blade that shows the query and the collection over which it is executed. Note that the query is executed inside a collection: It has no knowledge about other collections, so all documents must be contained in that collection.
Also, note that:
We can perform a more complex query if we look inside inner collections. For example, we can make a join because we need to create a different query focused on order item and not on the order. We want to create a query that lists ordered products independently from the order where they were defined.
DocumentDb is a great opportunity to approach a new database genre because its friction-less approach for developers uses objects that natively map with JSON and can be queried with an SQL-like language. DocumentDb is tailored for new Web and cloud-scale scenarios and for new cloud solutions that use multiple data sources.
In future articles, we’ll explore partitioning in more detail using DocumentDb from the REST API and Sdks.
Microsoft Azure is one of the hottest cloud services on the planet, and it’s growing at a phenomenal rate. This rapid growth has created a huge demand for people who know how to administer and manage Azure implementations.To make it easier for employers to verify the skills of Azure...
Microsoft has offered Azure certification exams for years, but until recently, they were all meant for technical IT professionals. Now non-technical professionals, such as salespeople and managers, can take the new AZ-900 exam to prove their understanding of Azure fundamentals. People w...
Archive StorageArchive Storage offers the lowest storage costs of all Azure storage. Its retrieval costs, however, are higher when compared to Hot and Cool storage. The archive tier of storage is designed for data that can tolerate several hours of latency when being retrieved. It’s a...
Welcome to part one of our series on Azure Storage. Stay tuned for the second part. An Overview of Azure StorageMicrosoft Azure Storage is a cloud-based storage offering that provides multiple storage solutions for organizations. In addition to a massively scalable object store for ...
Harvard Business Review recently estimated that some 90% of corporate training never gets applied on the job. Given the $200B training industry, that is a staggering amount of waste. One reason for the disconnect? Lack of context.Cloud Academy’s platform was built to make it extraor...
Microsoft Azure is the fastest growing cloud provider. Azure’s revenue grew an incredible 76% in the last quarter of 2018. As more and more businesses move their IT infrastructure to Microsoft’s cloud platform, the demand for Azure professionals keeps rising. Since there are relatively ...
So just what is Heroku? It's a service for developers eager to get their applications online without having to worry about infrastructure details.Metered, pay-as-you-go Cloud Computing services come in all kinds of flavors. Infrastructure as a Service (IaaS) offerings like AWS allow e...
Cloud Computing, like any computing, is a combination of CPU, memory, networking, and storage. Infrastructure as a Service (IaaS) platforms allow you to store your data in either Block Storage or Object Storage formats.Understanding the differences between these two formats - and how ...
What are Azure Blueprints?Blueprints, in the traditional sense, are used by architects and engineers to design and build new things. They are used to ensure that the final products are built to specifications and in compliance with certain standards and requirements.Azure Bluepri...
The move to the cloud is picking up steam. As such, many corporations are beginning to find themselves supporting a mixture of on-prem apps as well as cloud apps. Users are finding that they need access to this mix of applications as well. As one would expect, this can become a challe...
As Head of Content at Cloud Academy I work closely with our customers and my domain leads to prioritize quarterly content plans that will achieve the best outcomes for our customers.We started 2018 with two content objectives: To show customer teams how to use Cloud Services to solv...
IoT, or the ‘Internet of Things’, is an intriguing and rapidly growing technology that's bringing significant change to important elements of modern life. According to Gartner, IoT security spending alone is set to reach $1.5 billion during 2018.Like many newly minted terms, the def...