Azure Data Implementation
Microsoft Azure offers services for a wide variety of data-related needs, including ones you would expect like file storage and relational databases, but also more specialized services, such as for text searching and time-series data. In this course, you will learn how to design a data implementation using the appropriate Azure services. Two services that are especially important are Azure SQL Database and Azure Cosmos DB.
Azure SQL Database is a managed service for hosting SQL Server databases (although it’s not 100% compatible with SQL Server). Even though Microsoft takes care of the maintenance, you still need to choose the right options to scale it and make sure it can survive failures.
Azure Cosmos DB is the first multi-model database that’s offered as a global cloud service. It can store and query documents, NoSQL tables, graphs, and columnar data. To get the most out of Cosmos DB, you need to know which consistency and performance guarantees to choose, as well as how to make it globally reliable.
Learning Objectives
Identify the most appropriate Azure services for various data-related needs
Design an Azure SQL Database implementation for scalability, availability, and disaster recovery
Design an Azure Cosmos DB implementation for cost, performance, consistency, availability, and business continuity
Intended Audience
People who want to become Azure cloud architects
People preparing for a Microsoft Azure certification exam
Prerequisites
General knowledge of IT architecture, especially databases
In this lesson, we’ll cover services that don’t fall into the usual categories of basic storage, transactional databases, or NoSQL datastores. These services help you work with data in other ways, such as finding, transforming, and analyzing it.
First up is Azure Data Catalog. Organizations typically have so much data in so many different places that it’s hard to find what you’re looking for. The purpose of Data Catalog is to act as an index to all of those data sources, so you can discover them. Of course, in order for this to work, your employees need to register their data sources in the catalog. The data itself stays where it is, but its location and the metadata about it get added to the catalog. The metadata includes things like column names and data types. Users can also add further information about a data source, such as a description or some tags.
Once various data sources are registered, people can search the catalog to find what they’re looking for. They still need to open the data using another tool, though, since this is just a catalog.
Another way to deal with pockets of data is to collect them in either a data lake or a data warehouse. These serve two different, but related, needs.
Azure Synapse Analytics (formerly known as SQL Data Warehouse) is intended for SQL queries. That implies that it stores data in structured, relational tables. If you have raw data that’s not in a nicely structured format, then you’ll probably need to process it before you store it in Synapse Analytics.
Azure Data Lake Storage, on the other hand, will store any kind of data, whether it’s structured or not. For example, you could store everything from documents to images to social media streams.
Data warehouses are generally used for business reporting, while data lakes are more often used for data analytics and exploration. In fact, one common setup is to process data in the data lake and then export it to the data warehouse.
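To make the lake-to-warehouse pattern concrete, here’s a small Python sketch of the kind of shaping step involved. It’s purely illustrative: local strings stand in for files in the lake, the sample records and field names are made up, and a real pipeline would typically do this with Spark before bulk-loading the rows into Synapse Analytics.

```python
import csv
import io
import json

# Raw, semi-structured records as they might land in a data lake.
# (Hypothetical sample data; a real job would read from Azure Data
# Lake Storage, usually with Spark.)
raw_events = [
    '{"user": {"id": 1, "name": "Ana"}, "action": "login", "tags": ["web"]}',
    '{"user": {"id": 2, "name": "Ben"}, "action": "purchase", "tags": ["web", "promo"]}',
]

def flatten(line: str) -> dict:
    """Shape one nested JSON record into a flat, relational-style row."""
    event = json.loads(line)
    return {
        "user_id": event["user"]["id"],
        "user_name": event["user"]["name"],
        "action": event["action"],
        "tag_count": len(event.get("tags", [])),
    }

rows = [flatten(line) for line in raw_events]

# Write the structured rows as CSV, the sort of tabular output a
# warehouse table could be bulk-loaded from.
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["user_id", "user_name", "action", "tag_count"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```

The point isn’t the specific transformation; it’s that unstructured or nested data gets flattened into rows and columns somewhere between the lake and the warehouse.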
The two services are designed to work with different types of software, too. Synapse Analytics is built on SQL Server, so it works well with that ecosystem of software. Data Lake Storage, in contrast, is built to work with Hadoop and Spark. That’s because Hadoop and Spark excel at processing unstructured data.
One final difference is that Synapse Analytics is certified for compliance with over 20 standards, including HIPAA. Data Lake Storage doesn’t carry the same set of compliance certifications. This is another reason why it makes sense to use Synapse Analytics to serve data to a wider audience.
Both services are designed for performing massive queries at high speed. With Synapse Analytics, you write queries using T-SQL. With Azure Data Lake Storage, you can run queries using Apache Hadoop or Apache Spark.
If you want to use Spark to process your data, then there are quite a few options. First, Microsoft has now integrated Spark with Synapse Analytics, so you can use Spark to process your data before adding it to your data warehouse.
Two other services you can use are Azure Databricks and Azure HDInsight. Databricks is a managed Spark implementation that was developed by the people who created Apache Spark. HDInsight supports a wide variety of open-source big data frameworks, including Hadoop, Spark, Hive, Storm, and many others.
One difference between Databricks and HDInsight is ease of use. For example, to run a processing job with either service, you need to spin up a cluster, but Azure Databricks can be configured to automatically spin up a cluster when a job runs and shut it down after the job is finished. In contrast, HDInsight doesn’t have a built-in way to spin up a cluster automatically. So if you need to run HDInsight jobs quite often, you can leave a cluster running all the time, which would be expensive, or you could spin clusters up and down as you need them, which would be kind of a pain.
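For instance, when you create a Databricks cluster you can give it an auto-termination timeout, so it shuts itself down after sitting idle. A hedged sketch of a Clusters API create request is below; the field names are real Databricks cluster settings, but the values here are just illustrative.

```json
{
  "cluster_name": "adhoc-analysis",
  "spark_version": "13.3.x-scala2.12",
  "node_type_id": "Standard_DS3_v2",
  "num_workers": 4,
  "autotermination_minutes": 30
}
```

With a spec like this, you don’t pay for the cluster around the clock, and you don’t have to remember to tear it down either.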
One way to make HDInsight work in a more automated fashion is to use yet another service, Azure Data Factory. It lets you create workflows to automate data movement and data transformation. One of its many capabilities is spinning up and down HDInsight clusters as needed, but it can do far more than that.
With Data Factory, you can create data processing pipelines. For example, a pipeline could copy data from SQL Server to Data Lake Storage, run a Spark job on the data using an HDInsight cluster, and store the results in Synapse Analytics, all without any human intervention. It can even automate machine learning jobs. It’s a great data processing tool.
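That kind of pipeline is defined in Data Factory as JSON: a list of activities chained together with dependencies. Here’s a trimmed-down sketch of what the example above might look like. The activity types ("Copy", "HDInsightSpark") and the dependsOn structure are real Data Factory concepts, but the names are made up and the dataset and linked-service details are omitted.

```json
{
  "name": "LakeToWarehousePipeline",
  "properties": {
    "activities": [
      { "name": "CopyFromSqlServer", "type": "Copy" },
      { "name": "RunSparkJob", "type": "HDInsightSpark",
        "dependsOn": [ { "activity": "CopyFromSqlServer", "dependencyConditions": [ "Succeeded" ] } ] },
      { "name": "LoadWarehouse", "type": "Copy",
        "dependsOn": [ { "activity": "RunSparkJob", "dependencyConditions": [ "Succeeded" ] } ] }
    ]
  }
}
```

Each activity waits for the one before it to succeed, so the whole copy-process-load sequence runs end to end without anyone watching it.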
One more data analytics tool is Azure Analysis Services. It lets you create data models that make sense of existing data. One of the problems with the multitude of data in organizations is that it can be hard to understand how all of that data relates to the real world. Using a data model is easier than working with the raw data. Analysis Services also makes browsing large amounts of data faster because it uses in-memory caching.
However, end-users don’t browse directly through Analysis Services. Instead, they use one of the supported client tools, such as Power BI, Tableau, or Excel.
And that’s it for Azure Data Services.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).