Azure Data Fundamentals
Microsoft Azure offers services for a wide variety of data-related needs, including ones you would expect like file storage and relational databases, but also more specialized services, such as for text searching and time-series data. In this course, you will learn which services to choose when implementing a data infrastructure on Azure. Two services that are especially important are Azure SQL Database and Azure Cosmos DB.
Identify the most appropriate Azure services for various data-related needs
People who want to learn Azure fundamentals
General knowledge of IT architecture, especially databases
In this lesson, we’ll cover services that don’t fall into the usual categories of basic storage, transactional databases, or NoSQL datastores. These services help you work with data in other ways, such as finding, transforming, and analyzing it.
First up is Azure Data Catalog. Organizations typically have so much data in so many different places that it’s hard to find what you’re looking for. The purpose of Data Catalog is to act as an index to all of those data sources, so you can discover them. Of course, in order for this to work, your employees need to register their data sources in the catalog. The data itself stays where it is, but its location and the metadata about it get added to the catalog. The metadata includes things like column names and data types. Users can also add additional information about a data source, such as a description or some tags.
Once various data sources are registered, people can search the catalog to find what they’re looking for. They still need to open the data using another tool, though, since this is just a catalog.
Another way to deal with pockets of data is to collect it in either a data lake or a data warehouse. These serve two different but related needs.
Azure SQL Data Warehouse gives us a clue about its purpose in the name itself. It’s intended for SQL queries. That also implies that it stores data in structured, relational tables. If you have raw data that’s not in a nicely structured format, then you’ll probably need to process it before you store it in SQL Data Warehouse.
Azure Data Lake Storage, on the other hand, will store any kind of data, whether it’s structured or not. For example, you could store everything from documents to images to social media streams.
Data warehouses are generally used for business reporting, while data lakes are more often used for data analytics and exploration. In fact, one common setup is to process data in the data lake and then export it to the data warehouse.
The two services are designed to work with different types of software, too. SQL Data Warehouse is built on SQL Server, so it works well with that ecosystem of software. Data Lake Storage, in contrast, is built to work with Hadoop. That’s because Hadoop excels at processing unstructured data.
One final difference is that SQL Data Warehouse is certified for compliance with over 20 standards, including HIPAA. Data Lake Storage does not have regulatory compliance. This is another reason why it makes sense to use SQL Data Warehouse to serve data to a wider audience.
Both services are designed for performing massive queries at high speed. With SQL Data Warehouse, you write queries using T-SQL. With Azure Data Lake Storage, you can run queries using Hadoop, but you’ll get higher performance by using Azure Data Lake Analytics. It’s a pay-as-you-go service that lets you write queries using U-SQL, which is a combination of SQL and C#.
If you do want to use Hadoop to process your data, then you can use Azure HDInsight. It supports a wide variety of open-source big data frameworks, including Hadoop, Spark, Hive, Storm, and many others. You don’t even have to use Data Lake Storage if you don’t want to because it also supports Azure Storage.
One difference between Data Lake Analytics and HDInsight is ease of use. With HDInsight, you need to spin up a cluster to run a processing job. So if you need to run jobs quite often, you can leave a cluster running all the time, which could be expensive, or you could spin clusters up and down as you need them, which would be kind of a pain.
One way to make HDInsight work more like Data Lake Analytics is to use yet another service, Azure Data Factory. It lets you create workflows to automate data movement and data transformation. One of its many capabilities is spinning up and down HDInsight clusters as needed, but it can do far more than that.
With Data Factory, you can create data processing pipelines. For example, a pipeline could copy data from SQL Server to Data Lake Storage, run a Spark job on the data using an HDInsight cluster, and store the results in SQL Data Warehouse, all without any human intervention. It can even automate Data Lake Analytics queries and machine learning jobs. It’s a great data processing tool.
One more data analytics tool is Azure Analysis Services. It lets you create data models that make sense of existing data. One of the problems with the multitude of data in organizations is that it can be hard to understand how all of that data relates to the real world. Using a data model is easier than working with the raw data. Analysis Services also makes browsing large amounts of data faster because it uses in-memory caching.
However, end-users don’t browse directly through Analysis Services. Instead, they use one of the supported client tools, such as Power BI, Tableau, or Excel.
And that’s it for Azure Data Services.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).