Azure Data Implementation
Microsoft Azure offers services for a wide variety of data-related needs, including ones you would expect like file storage and relational databases, but also more specialized services, such as for text searching and time-series data. In this course, you will learn how to design a data implementation using the appropriate Azure services. Two services that are especially important are Azure SQL Database and Azure Cosmos DB.
Azure SQL Database is a managed service for hosting SQL Server databases (although it’s not 100% compatible with SQL Server). Even though Microsoft takes care of the maintenance, you still need to choose the right options to scale it and make sure it can survive failures.
Azure Cosmos DB is the first multi-model database that’s offered as a global cloud service. It can store and query documents, NoSQL tables, graphs, and columnar data. To get the most out of Cosmos DB, you need to know which consistency and performance guarantees to choose, as well as how to make it globally reliable.
Identify the most appropriate Azure services for various data-related needs
Design an Azure SQL Database implementation for scalability, availability, and disaster recovery
Design an Azure Cosmos DB implementation for cost, performance, consistency, availability, and business continuity
People who want to become Azure cloud architects
People preparing for a Microsoft Azure certification exam
General knowledge of IT architecture, especially databases
In this lesson, we’ll cover services that don’t fall into the usual categories of basic storage, transactional databases, or NoSQL datastores. These services help you work with data in other ways, such as finding, transforming, and analyzing it.
First up is Azure Data Catalog. Organizations typically have so much data in so many different places that it’s hard to find what you’re looking for. The purpose of Data Catalog is to act as an index to all of those data sources, so you can discover them. Of course, in order for this to work, your employees need to register their data sources in the catalog. The data itself stays where it is, but its location and the metadata about it get added to the catalog. The metadata includes things like column names and data types. Users can also add more information about a data source, such as a description or some tags.
Once various data sources are registered, people can search the catalog to find what they’re looking for. They still need to open the data using another tool, though, since this is just a catalog.
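To make the register-then-search idea concrete, here is a minimal Python sketch of a toy catalog. It is not the Data Catalog API: the `DataSource` and `ToyCatalog` classes, the source names, and the locations are all hypothetical, invented for illustration. The point is only that the catalog holds metadata (location, column names, data types, description, tags) while the data stays where it is.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class DataSource:
    """Metadata about a registered source; the data itself is not copied."""
    name: str
    location: str
    columns: dict[str, str]  # column name -> data type
    description: str = ""
    tags: set[str] = field(default_factory=set)


class ToyCatalog:
    """Hypothetical in-memory stand-in for a data catalog."""

    def __init__(self) -> None:
        self._sources: dict[str, DataSource] = {}

    def register(self, source: DataSource) -> None:
        # Registering stores only the metadata, keyed by source name.
        self._sources[source.name] = source

    def search(self, term: str) -> list[str]:
        # Case-insensitive substring match over name, description,
        # column names, and tags. Returns matching source names, sorted.
        term = term.lower()
        hits = []
        for source in self._sources.values():
            haystack = [source.name, source.description,
                        *source.columns, *source.tags]
            if any(term in text.lower() for text in haystack):
                hits.append(source.name)
        return sorted(hits)


catalog = ToyCatalog()
catalog.register(DataSource(
    name="sales_orders",
    location="sqlserver://example-host/sales",  # hypothetical location
    columns={"order_id": "int", "customer_email": "nvarchar"},
    description="Orders from the web store",
    tags={"sales", "pii"},
))
catalog.register(DataSource(
    name="clickstream",
    location="https://example.invalid/logs",  # hypothetical location
    columns={"session_id": "string", "url": "string"},
    tags={"web"},
))

print(catalog.search("customer"))  # ['sales_orders']
print(catalog.search("web"))       # ['clickstream', 'sales_orders']
```

Notice that a search only returns the names of matching sources. Just like with the real service, you would still open the data itself with another tool.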
Another way to deal with pockets of data is to collect it all in either a data lake or a data warehouse. These serve two different, but related, needs.
Data warehouses store data in structured, relational tables, while data lakes store any kind of data, whether it’s structured or not. For example, you could store everything from documents to images to social media streams.
Data warehouses are generally used for business reporting, while data lakes are more often used for data analytics and exploration. In fact, one common setup is to process data in a data lake and then export it to a data warehouse. Both types of services are designed for performing massive queries at high speed.
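The "process in a data lake, then export to a data warehouse" setup can be sketched in a few lines of Python. This is only an illustration under loud assumptions: a list of raw text lines stands in for files in a data lake, Python's built-in `sqlite3` module stands in for a data warehouse, and the record fields and the `extract_purchases` function are made up for the example.

```python
import json
import sqlite3

# Stand-in for a data lake: raw records of any shape, including junk.
raw_lake_records = [
    '{"customer": "ana", "event": "purchase", "amount": 30.0}',
    '{"customer": "ben", "event": "page_view"}',
    '{"customer": "ana", "event": "purchase", "amount": 12.5}',
    'not valid json',
]


def extract_purchases(lines):
    """Keep only well-formed purchase events as (customer, amount) rows."""
    rows = []
    for line in lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip records that cannot be parsed
        if not isinstance(record, dict):
            continue
        if record.get("event") == "purchase" and "amount" in record:
            rows.append((record["customer"], float(record["amount"])))
    return rows


# Stand-in for a data warehouse: a structured, relational table.
warehouse = sqlite3.connect(":memory:")
warehouse.execute(
    "CREATE TABLE purchases (customer TEXT NOT NULL, amount REAL NOT NULL)"
)
warehouse.executemany(
    "INSERT INTO purchases VALUES (?, ?)",
    extract_purchases(raw_lake_records),
)

# A business-reporting style query against the structured table.
report = warehouse.execute(
    "SELECT customer, SUM(amount) FROM purchases "
    "GROUP BY customer ORDER BY customer"
).fetchall()
print(report)  # [('ana', 42.5)]
```

The lake side accepts anything, while the warehouse side enforces a schema, which is why the cleanup step sits between them.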
Azure Data Lake Storage is built on top of Azure Blob Storage, and it provides the additional capabilities needed for a modern data lake. Its most important feature is that it’s compatible with Hadoop and Spark, which are the most popular open-source software systems for doing data analytics.
Azure Synapse Analytics (formerly known as SQL Data Warehouse) offers an interesting mix of data warehouse and data lake capabilities. If you need a data warehouse, you can create a SQL pool, which lets you run SQL queries on structured, relational tables. If you want a data lake, then you can create a Spark pool, which lets you use Spark to query both structured and unstructured data.
Spark has become so popular that Microsoft has many services that let you use Spark for data analytics. In addition to Data Lake Storage and Synapse Analytics, you can also use Azure Databricks and Azure HDInsight. Databricks is a managed Spark implementation that was developed by the people who created Apache Spark. HDInsight supports a wide variety of open-source big data frameworks, including Hadoop, Spark, Hive, Storm, and many others.
One difference between Databricks and HDInsight is ease of use. For example, to run a processing job with either service, you need to spin up a cluster, but Azure Databricks can be configured to automatically spin up a cluster when a job runs and shut it down after the job is finished. In contrast, HDInsight doesn’t have a built-in way to spin up a cluster automatically. So if you need to run HDInsight jobs quite often, you can leave a cluster running all the time, which would be expensive, or you could spin clusters up and down as you need them, which would be kind of a pain.
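To see why leaving a cluster running all the time gets expensive, here is a rough back-of-the-envelope calculation in Python. Every number in it is a hypothetical placeholder, not real Azure pricing, and the two function names are invented for this sketch. Swap in your own rates and job counts.

```python
def always_on_cost(hourly_rate: float, days: int) -> float:
    """Cost of leaving one cluster running around the clock."""
    return hourly_rate * 24 * days


def on_demand_cost(hourly_rate: float, days: int, jobs_per_day: int,
                   job_hours: float, startup_hours: float) -> float:
    """Cost of spinning a cluster up for each job and shutting it down after.

    Each job pays for its own run time plus the cluster startup time.
    """
    return hourly_rate * (job_hours + startup_hours) * jobs_per_day * days


# Hypothetical inputs, chosen only to illustrate the comparison.
rate = 2.0       # currency units per cluster-hour (assumed)
days = 30
jobs = 4         # jobs per day (assumed)
run_time = 0.5   # hours per job (assumed)
startup = 0.25   # hours to spin up a cluster (assumed)

print(always_on_cost(rate, days))                           # 1440.0
print(on_demand_cost(rate, days, jobs, run_time, startup))  # 180.0
```

With these made-up numbers, the on-demand approach is much cheaper, but the gap narrows as jobs become more frequent or longer, so it's worth running the numbers for your own workload.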
One way to make HDInsight work in a more automated fashion is to use yet another service, Azure Data Factory. It lets you create workflows to automate data movement and data transformation. One of its many capabilities is spinning up and down HDInsight clusters as needed, but it can do far more than that.
With Data Factory, you can create data processing pipelines. For example, a pipeline could copy data from SQL Server to Data Lake Storage, run a Spark job on the data using an HDInsight cluster, and store the results in Synapse Analytics, all without any human intervention. It can even automate machine learning jobs. It’s such a useful tool that Microsoft even includes a stripped-down version of it in Synapse Analytics.
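The pipeline idea, a fixed sequence of steps that runs without anyone stepping in, can be sketched in plain Python. This is not how Data Factory pipelines are actually defined, and it does not call any Azure service. The step functions, the `run_pipeline` helper, and the sample data are all hypothetical stand-ins for the copy, Spark, and store stages in the example above.

```python
from typing import Callable, Optional

# A step takes the pipeline context and returns an updated copy of it.
Step = Callable[[dict], dict]


def copy_from_sql_server(ctx: dict) -> dict:
    # Stand-in for copying rows from SQL Server into Data Lake Storage.
    return {**ctx, "raw_rows": [3, 1, 2]}


def run_spark_job(ctx: dict) -> dict:
    # Stand-in for a Spark job on an HDInsight cluster; here it just sorts.
    return {**ctx, "processed": sorted(ctx["raw_rows"])}


def store_in_warehouse(ctx: dict) -> dict:
    # Stand-in for loading results into Synapse Analytics.
    return {**ctx, "stored": len(ctx["processed"])}


def run_pipeline(steps: list[Step], ctx: Optional[dict] = None) -> dict:
    """Run each step in order, passing the context from one to the next.

    An exception in any step stops the run, so later steps never see
    partial results.
    """
    ctx = dict(ctx or {})
    completed = []
    for step in steps:
        ctx = step(ctx)
        completed.append(step.__name__)
    ctx["completed_steps"] = completed
    return ctx


result = run_pipeline(
    [copy_from_sql_server, run_spark_job, store_in_warehouse]
)
print(result["completed_steps"])
print(result["stored"])  # 3
```

Each step only depends on what the previous step put in the context, which is what lets the whole thing run end to end with no human intervention.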
One more data analytics tool is Azure Analysis Services. It lets you create data models that make sense of existing data. One of the problems with the multitude of data in organizations is that it can be hard to understand how all of that data relates to the real world. Using a data model is easier than working with the raw data. Analysis Services also makes browsing large amounts of data faster because it uses in-memory caching.
However, end users don’t browse directly through Analysis Services. Instead, they use one of the supported client tools, such as Power BI, Tableau, or Excel.
And that’s it for Azure Data Services.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).