Getting the Most from DocumentDB
An Introduction to Azure DocumentDB
It's been common, if inconsistently applied, knowledge for many years that relational databases are a less-than-ideal fit for some types of software problems. Indeed, entire categories of software development tooling, such as object-relational mappers (ORMs), exist to bridge the gap between highly normalized relational data and in-memory, object-oriented representations. In practice, ORMs can create as much complexity as they alleviate, so developers began looking at the relational database itself as ripe for disruption.
Thus came the rise of NoSQL and databases that eschew the traditional rows/columns/tables/foreign keys metaphor for other choices: JSON document stores, graph databases that represent data and relationships as nodes with connecting edges, key/value stores that act as a glorified hashtable, and others. The wide range of options meant you could choose the right tool for your particular needs, instead of trying to squeeze a relational database square peg into your application's round hole. Solutions like MongoDB, Cassandra, Redis, and Neo4j rose to prominence and became de facto industry standards for developers interested in leveraging the power and flexibility of NoSQL.
While NoSQL was a boon to software developer productivity, the initial product offerings did little to alleviate the administrative burden of managing your database. Server provisioning, backups, data security at rest and in transit... all these challenges (and many more) remained as developers adopted NoSQL in greater numbers. Fortunately for them and all of us, the rise of the cloud and managed database service offerings like Azure DocumentDB brought us the best of both worlds: fast, flexible, infinitely scalable NoSQL with most of the administrative headaches assumed by a dedicated team of experts from Microsoft. You focus on your data and your application, and rely on a 99.99% SLA for the rest!
In this "Introduction to Azure DocumentDB" course, you’ll learn how to use Azure DocumentDB (DocDB) in your applications. You'll create DocDB accounts, databases, and collections. You'll perform ad-hoc and application-based queries, and see how features like stored procedures and MongoDB protocol support can help you. You'll also learn about ideal DocDB use cases and the pricing model. By the end of this course, you’ll have a solid foundation to continue exploring NoSQL and DocumentDB.
An Introduction to Azure DocumentDB: What You'll Learn
| Lecture | What you'll learn |
|---|---|
| Intro | What to expect from this course |
| DocumentDB Overview | A high-level overview of the DocumentDB feature set |
| Overview of Managing DocumentDB | A discussion of DocumentDB features for managing resources, data, scalability, configuration, and so on |
| Creating an Account | Creating a top-level DocDB account in the Azure portal |
| Creating a Collection | Creating and configuring a DocDB collection in the Azure portal |
| Importing Data | Discussion and demonstration of moving data into a DocDB collection |
| Overview of Developing with DocumentDB | A discussion of DocumentDB features from a development point of view |
| SQL Queries | How to author queries in the Azure portal |
| Programming with DocumentDB | Reading and writing data in code, using the .NET SDK |
| Stored Procedures | Authoring DocDB stored procedures and executing them using the DocDB REST API |
| MongoDB Protocol Support | Configuring and using DocDB's MongoDB protocol support |
| Use Cases | A brief discussion of scenarios well-suited for DocDB use |
| Pricing | A review of the DocDB pricing model, and discussion of cost estimation and Total Cost of Ownership |
| Ecosystem Integration | A short review of DocDB integration with other Azure services |
| Summary | Course wrap up |
If you have thoughts or suggestions for this course, please contact Cloud Academy at firstname.lastname@example.org.
Alright. So, we've got a DocumentDB account, and we've added a new collection for that account. Now we need to import some data. There are several different options for importing data into DocumentDB, and I'll cover a few of them now. The easiest one is to click on the Document Explorer button on the left-hand side of the main DocumentDB account view. You'll notice that, in this case, I'm looking at my postal codes collection, and I already have some documents in it. But if I wanted to, I could upload additional JSON documents by clicking on this upload button. This simply allows me to specify up to 100 separate JSON documents from my local machine and upload them to this collection. Note that these documents can be a maximum of two megabytes each, and they must consist of well-formed JSON.
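If you're scripting around those upload limits, a quick pre-flight check can save you a failed upload. Here's a minimal Python sketch (my own helper, not part of any DocumentDB tool) that validates a batch of local files against the constraints just mentioned: at most 100 files, each at most 2 MB, each well-formed JSON.

```python
import json
import os

MAX_DOC_BYTES = 2 * 1024 * 1024  # Document Explorer's 2 MB per-document limit
MAX_BATCH = 100                  # Document Explorer accepts up to 100 files at once

def validate_for_upload(paths):
    """Return (ok, problems) for a batch of local JSON files."""
    problems = []
    if len(paths) > MAX_BATCH:
        problems.append(f"batch has {len(paths)} files; limit is {MAX_BATCH}")
    for path in paths:
        size = os.path.getsize(path)
        if size > MAX_DOC_BYTES:
            problems.append(f"{path}: {size} bytes exceeds the 2 MB limit")
        try:
            with open(path, encoding="utf-8") as f:
                json.load(f)  # must parse as well-formed JSON
        except ValueError as e:
            problems.append(f"{path}: not well-formed JSON ({e})")
    return (not problems, problems)
```

Running this over your files before uploading tells you exactly which documents the portal would reject, rather than discovering it one failure at a time.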
A much more flexible and powerful option for importing data into DocumentDB is the DocumentDB Data Migration Tool. This is an open-source tool created by the DocumentDB team to do exactly what it sounds like: move data into DocumentDB from a variety of sources. You can download this tool for free at the URL specified on the screen. Once you download it and fire it up, you're presented with options for configuring both the source information (the data that you want to import) and the target information (where the data will land in DocDB). For the source, you have a number of options. You can specify local JSON or CSV files, a MongoDB connection, SQL Server, Azure Tables, even DynamoDB from Amazon Web Services. You can also, for that matter, import data from another DocumentDB instance. Each of these source options has additional configuration parameters you might have to specify, but it's very simple and straightforward. Once you specify your source, you move on to the target. Similarly, this is where you configure where you want your data to reside in DocumentDB. Again, you have a few options: you can specify either a partitioned collection or a collection with a single partition. You can also specify a JSON file as the target for local debugging, if you're trying to troubleshoot something with the import.
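Alongside the graphical interface, the Data Migration Tool also ships a command-line version, which is handy for scripted or repeated imports. The sketch below is hypothetical: the `/s.*` (source) and `/t.*` (target) flag names follow the tool's general convention as I recall it, and the connection-string values are placeholders, so verify the exact options against the tool's own help before relying on them.

```shell
REM Hypothetical command-line import of a local JSON file into a collection.
REM Flag names are illustrative -- confirm them with the tool's built-in help.
dt.exe /s:JsonFile /s.Files:zipcodes.json ^
       /t:DocumentDB ^
       /t.ConnectionString:"AccountEndpoint=https://<account>.documents.azure.com:443/;AccountKey=<key>;Database=mydb" ^
       /t.Collection:zipcodes
```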
So in the case of targeting, say, a partitioned collection, you'll have a few different options. You have to configure things like the connection string, the name of the collection you're going to target, and the partition key. You have the ability to target an existing collection or even create a new collection as part of the import process, and some of these options, such as the partition key, the provisioned throughput value, and the ID field, apply only if you're creating a new collection as part of the import. One other thing I'll note is that this tool happens to have been written in .NET, using the DocumentDB .NET SDK, and that SDK supports parallel operations against DocumentDB. So here you have the ability to specify the number of parallel requests you want in flight at any given time during your import run. This allows you to trade off increased performance against potentially greater consumption of request units at any given moment during the import.
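That parallel-requests tradeoff is easy to picture in code. Here's a minimal, language-neutral sketch in Python (the tool itself is .NET; `insert_fn` stands in for whatever SDK call actually writes a document): more workers finish the import sooner, but also burn request units faster at any instant.

```python
from concurrent.futures import ThreadPoolExecutor

def import_documents(docs, insert_fn, parallel_requests=10):
    """Insert documents using up to `parallel_requests` concurrent calls.

    `insert_fn` is a stand-in for a real SDK insert call. Raising
    `parallel_requests` trades throughput for a higher instantaneous
    request-unit consumption rate.
    """
    with ThreadPoolExecutor(max_workers=parallel_requests) as pool:
        # map() preserves input order in its results
        return list(pool.map(insert_fn, docs))
```

With a real SDK call plugged in as `insert_fn`, tuning `parallel_requests` is exactly the knob the migration tool exposes.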
The last option for importing data into DocumentDB that I'd like to cover is the use of Azure Data Factory to import data from a variety of sources into DocDB. This is arguably the most powerful and flexible option, but it's also a bit more complex than some of the others we've looked at. A full treatment of Azure Data Factory is certainly beyond the scope of this course, but I would like to walk you through a brief sample, just to whet your appetite and give you some motivation to dig in further on your own. So, to create a Data Factory job, you would click in the upper-left-hand side and just type Data Factory (you can certainly find it in the menu if you prefer to use the mouse). Click on Data Factory here, and then click Create. This presents you with a very simple UI, where you choose a name for your Data Factory job and specify its resource group and location. I've already created one of these, so I'm going to close out of this and navigate to the one I've already created: back up a little bit, go back into resource groups, and here we are. I've called it testjob. I'm going to show you the preview user interface for configuring Data Factory jobs in a graphical format. Ultimately, all Data Factory jobs, or pipelines, as they're called, are expressed as JSON, so you can author these pipelines using a text editor if you wish, or using the JSON editor directly in the browser. I'm going to use a preview of the graphical user interface, just to give you a sense of what this looks like and make for a slightly more compelling demo.
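Since every pipeline is ultimately JSON, it's worth a glance at roughly what the generated definition looks like. The sketch below is a trimmed, hypothetical copy pipeline from blob storage to a DocumentDB collection; the dataset names are invented, and the exact schema varies by Data Factory version, so treat it as illustrative rather than something to paste in verbatim.

```json
{
  "name": "CopyZipCodesPipeline",
  "properties": {
    "activities": [
      {
        "name": "BlobToDocDb",
        "type": "Copy",
        "inputs":  [ { "name": "ZipCodesCsvBlob" } ],
        "outputs": [ { "name": "ZipCodesCollection" } ],
        "typeProperties": {
          "source": { "type": "BlobSource" },
          "sink": {
            "type": "DocumentDbCollectionSink",
            "nestingSeparator": "."
          }
        }
      }
    ]
  }
}
```

The graphical editor in the demo is just generating a document in this shape on your behalf.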
So I'm presented with the initial set of properties I want to configure for my job. Instead of having this job run on a regular schedule, I'd like to run it right now, so I'll select that option and then click Next. The first thing I need to do is configure my source data. I happen to have an existing data source already: I've taken some JSON data, which again corresponds to United States ZIP codes, converted it into CSV format, and uploaded it to a storage account in my Azure subscription. What I'd like to do is point this job at that CSV blob and use it as the source data that I'm then going to import into DocumentDB. Of course, you have other options; you can copy data from a variety of sources using Data Factory. This is merely one possibility. So I'll click ImportJsonBlobs, which is the name of my account. Yes, I want to use an Azure Blob. Click Next. I'm then presented with the hierarchy of blobs in this storage account, so I'll double-click on my import data container, and you can see I have two files in here. I actually have a JSON file, but I don't want to use that one; I want to use the CSV file, which is already highlighted. So I'll click that and say Choose. You can see that I've selected it up here. The rest of these options are fine, so we'll click through.
Now Azure is chewing through that CSV file, and it's going to present me with an initial sense of what the schema looks like and give me an idea of what I can expect as far as import data. This lets me do a bit of a sanity check: yes, this is the data I'm expecting, or no, something's wrong, I've picked the wrong file. The options here at the top are all fine for a CSV document. Certainly, if you have something different, like a tab-delimited document, or something else that requires more processing, you can insert extra steps along the pipeline; but this, of course, is a very simple one. The one thing I would like to do is change these column names to make them a little more relevant to my data. So I'll click on SCHEMA, and then say Edit. I happen to know the format of this CSV document, so I'm just going to move through it. The first column is the actual ZIP code itself, which is definitely an integer. The second one is latitude... oh, I'm sorry, the second one is actually city. The third one is latitude. The fourth one is longitude. Then we have population. And the last one is the state in the U.S. where this ZIP code lives. That looks correct, so I'll click Next. Now I want to specify my destination. I can pick from a variety of options here, but ultimately I care about DocDB, so I'm going to pick an existing connection that I've previously configured. I'll pick my DocDB output connection and click Next. And it looks like all is well.
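That schema step amounts to assigning names and types to positional CSV columns, which is what turns each row into a proper JSON document on the other side. A minimal Python sketch of the same mapping (column order and names taken from the demo; the function itself is mine, not part of Data Factory):

```python
import csv
import io

def rows_to_documents(csv_text):
    """Turn raw CSV rows into typed, JSON-ready documents.

    Column order as in the demo file: zip, city, latitude, longitude,
    population, state. Numeric columns are coerced to int/float so the
    resulting documents carry proper JSON types, not strings.
    """
    docs = []
    for row in csv.reader(io.StringIO(csv_text)):
        zip_code, city, lat, lon, population, state = row
        docs.append({
            "zip": int(zip_code),
            "city": city,
            "latitude": float(lat),
            "longitude": float(lon),
            "population": int(population),
            "state": state,
        })
    return docs
```

Getting the types right at this stage matters later: a ZIP stored as the string "30301" and one stored as the number 30301 behave differently in DocumentDB queries.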
Okay. So here, I have the option of specifying an existing collection to import into, or creating a new one. In this case, I'm going to create a new one, and I'll call it zipcodes2. The nesting separator doesn't matter in this case, since I'm only going to have a single level of properties in my JSON; but if you happen to have multiple levels, you can specify the separator you want to use for that. So I'll click Next. The next thing we need to do is map the input schema to the output schema. In my case, this is very simple and straightforward, because what I really want is a direct mapping; I'm not interested in changing column names, or property names, or anything like that, and I certainly want to include all of these columns from the input in my output. So I'm basically going to leave everything as is. You have the option to specify a few additional advanced settings; for example, there's support for copying in parallel, so if you want to speed up the import process a bit, you can do that. I've run this job a couple of times, just testing it out, and it took about nine or ten minutes, so we probably won't sit here and wait for it to finish. If we click on the last tab, we get to the summary, and we can see essentially what we're going to do. This gives you a brief description of the source and the destination, the locations where these things reside, as well as some of the settings. We're going to run this right away, as soon as we click Next. So I'll click Next. Now Data Factory is doing some basic validation of the pipeline we've just created. It looks like validation passed, so that's a good sign.
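The nesting separator deserves a quick illustration, since it only matters once you have multi-level JSON. The idea is that a flat source column named, say, "address.city" should become a nested object in the target document. Here's a hypothetical Python sketch of that expansion (the function is mine, for illustration only, and the separator defaults to the "." used in the demo):

```python
def nest_columns(flat_doc, separator="."):
    """Expand separator-delimited keys into nested JSON objects.

    e.g. {"address.city": "Atlanta"} -> {"address": {"city": "Atlanta"}}
    """
    nested = {}
    for key, value in flat_doc.items():
        parts = key.split(separator)
        target = nested
        for part in parts[:-1]:
            # create intermediate objects as needed
            target = target.setdefault(part, {})
        target[parts[-1]] = value
    return nested
```

With only single-level property names, as in the demo, the function (like the real setting) has nothing to do and the document passes through unchanged.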
So now it's actually creating the pipelines, the physical artifacts needed to run this job under the covers, and then it'll kick off the job. How long this takes certainly depends on the amount of processing you have to do, how much data you're moving around, and which regions you're moving it to and from; in my case it took about 10 minutes last time. So we'll let this run for a while, and in a few moments we'll fast forward and take a look at the results. Okay. Just to close the loop on this and show you that I have nothing up my sleeve: our Data Factory import job is completed. We've navigated back to the account view for DocumentDB. Let's click on Document Explorer, and we should see... yes, we now have a new collection called zipcodes2. When I click on that and wait a second, sure enough, we see a bunch of documents in here. I'll just click on a random one. And yes, we see some JSON that corresponds to this document: our ZIP code, city, latitude, longitude, population, and state. We also have an auto-generated ID that DocumentDB created for us when it inserted the document into the collection.
About the Author
Josh Lane is a Microsoft Azure MVP and Azure Trainer and Researcher at Cloud Academy. He’s spent almost twenty years architecting and building enterprise software for companies around the world, in industries as diverse as financial services, insurance, energy, education, and telecom. He loves the challenges that come with designing, building, and running software at scale. Away from the keyboard you'll find him crashing his mountain bike, drumming quasi-rhythmically, spending time outdoors with his wife and daughters, or drinking good beer with good friends.