This course is a quick introduction to Microsoft’s Azure Synapse Analytics. It covers serverless SQL pools, dedicated SQL pools, Spark pools, and Synapse Pipelines. It does not cover more advanced topics, such as optimization and security, as those will be covered in another course.
- Create and use serverless SQL pools, dedicated SQL pools, Spark pools, and Synapse Pipelines
- Anyone who would like to learn the basics of using Azure Synapse Analytics
- Experience with databases
- Experience with SQL (not mandatory)
- A Microsoft Azure account is recommended if you want to do the demos yourself (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
When databases first became popular, they were primarily used for transaction processing (and that’s still the case today). But managers also needed to analyze data and create reports, which is difficult to do when the data resides in numerous databases across an organization. So data warehouses were created to collect data from a wide variety of sources, and they were designed specifically for reporting. The language used to query a data warehouse is normally SQL.
More recently, there has been a growing need to analyze unstructured data, such as documents and images, as well. Data lakes were created to collect all kinds of data in one place, and they were designed for big data analytics. The most common way to process data in a data lake is to use Apache Spark, which is an open-source analytics engine for big data.
Microsoft has offered a variety of separate solutions to meet these two different needs over the years, but now it has come out with a new analytics service that works with both structured and unstructured data. It’s called Azure Synapse Analytics. If you need a data warehouse, you can create a dedicated SQL pool, which lets you run SQL queries on structured, relational tables. If you want a data lake, then you can create a Spark pool, which lets you use Spark to query both structured and unstructured data.
Not only does Synapse Analytics provide data warehouses and data lakes, but it also provides a sophisticated tool for getting data from other Azure services and transforming it. Azure Synapse Pipelines is a stripped-down version of another Azure service called Data Factory that lets you create data processing pipelines. For example, a pipeline could copy data from Azure Cosmos DB to a Spark pool, run a Spark job that creates statistics about the data, and store the results in a SQL pool, all without any human intervention.
There are a couple of different approaches you could take to building a pipeline: one using code and the other without using code. I’ll give you a high-level overview of each method, starting with the “using code” method.
First, you’d create a Synapse workspace, which is a secure area where you can do your analytics work. Within that workspace, you’d create a linked service that would connect to your data in Cosmos DB. Then you’d create a Spark pool and a SQL pool. Next, you’d create a notebook containing code that copies the data from Cosmos DB to the Spark cluster. Then you’d add analytics code to the notebook to create statistics about the data. Finally, you’d add code to store the results in the SQL pool.
If you only needed to run this code once, then you’d just run it, but if you needed to run this code on a regular basis, then you could create a pipeline and schedule it to run the notebook every night at 11 PM, for example.
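To make those notebook steps a bit more concrete, here’s a minimal Spark SQL sketch of the first two: loading the Cosmos DB data into the Spark pool and computing some statistics. The linked service name (`CosmosDbLinked`), container name (`Orders`), and column names are hypothetical, and it assumes Synapse Link (the analytical store) is enabled on the container:

```
-- Sketch of a Synapse notebook cell (Spark SQL).
-- "CosmosDbLinked" and "Orders" are placeholder names.

-- Step 1: expose the Cosmos DB analytical store as a Spark table.
CREATE TABLE IF NOT EXISTS orders
USING cosmos.olap
OPTIONS (
  spark.synapse.linkedService 'CosmosDbLinked',
  spark.cosmos.container 'Orders'
);

-- Step 2: create statistics about the data.
SELECT category,
       COUNT(*)    AS order_count,
       AVG(amount) AS avg_amount
FROM orders
GROUP BY category;
```

The final step, writing the results into the dedicated SQL pool, is typically done with the Synapse-provided Spark-to-SQL connector from a Scala or Python cell rather than from Spark SQL itself.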
Okay, now here’s how it would work with the no-code method. You’d still create a linked service for Cosmos DB and a SQL pool in your workspace. Then, instead of creating a notebook with code in it, you’d create a pipeline with a data flow in it.
A data flow is a graphical representation of the activities you want to perform. This shows a simple one with activities for loading data from Cosmos DB, transforming it, and storing the results in a SQL pool. You’d need to configure each one of these activities, but you wouldn’t need to write any code. You also wouldn’t need to create a Spark pool to run these activities because the Synapse Pipelines service would take care of that for you. After getting everything ready, you’d schedule the pipeline to run the data flow at specific times.
Before we go on, I should tell you a bit more about Spark and SQL pools because if you’re not careful, they can cost you a lot of money. Both of these types of pools are clusters of virtual machines. In the example shown here, the code in the notebook would run on the Spark pool cluster and would store the results on the SQL pool cluster.
If you’ve ever accidentally left a cluster of virtual machines running, you’ll know that it can get very expensive very quickly. Microsoft provides some mechanisms to help you avoid that. First, if you enable Autoscale when you create a Spark pool, you specify the minimum and maximum number of VMs in the cluster. The pool then automatically scales up and down within that range based on the requirements of the workloads you run on it. You can also enable Auto-pause, which will stop the cluster if it has been inactive for a given period of time. By default, it auto-pauses after 15 minutes.
SQL pools have similar features, but the underlying architecture is very different. First, it’s important to know the difference between dedicated and serverless SQL pools. Dedicated SQL pool is the new name for Azure SQL Data Warehouse, which is a service that’s been around for quite a few years. It provides both a compute cluster and storage.
When you create a dedicated SQL pool, you specify how many DWUs (or Data Warehouse Units) to allocate. DWUs set the amount of CPU, memory, and I/O in the compute cluster. You can only increase or decrease the number of DWUs manually because there’s no autoscaling feature. Storage space is provided by Azure Storage, so it scales independently from the compute cluster. If you don’t need to run a SQL pool all the time, you can manually pause it when it’s not in use. When it’s paused, you won’t pay for the compute cluster, but you’ll still pay for the storage being used by the data warehouse.
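As a sketch of what manual scaling looks like, a dedicated SQL pool can be resized with a single T-SQL statement run against the master database of its server. The pool name `mydwpool` and the target size `DW300c` here are placeholders:

```
-- Hypothetical T-SQL, run against the master database.
-- Scales the dedicated SQL pool "mydwpool" to 300 DWUs.
-- Only compute changes; storage scales independently.
ALTER DATABASE mydwpool
MODIFY (SERVICE_OBJECTIVE = 'DW300c');
```

Pausing and resuming, by contrast, are done from the Azure portal, PowerShell, or the REST API rather than from T-SQL.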
Serverless SQL pools are actually very different from dedicated SQL pools. The only thing they have in common is that they both let you run SQL queries. Serverless SQL pools don’t have their own storage, and they don’t have access to data in dedicated SQL pools either.
So what can you use them for? Well, they can query data in other places. For example, if you have files in Azure Storage that are in CSV, Parquet, or JSON format, you can query them using a serverless SQL pool. You can also query Azure Open Datasets, which are public datasets containing information on topics like weather and genetics. Surprisingly, if you’ve used a Spark pool to create an external table in Azure Storage, then you can use a serverless SQL pool to query it, even if the Spark pool has been shut down.
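For example, here’s a minimal serverless SQL pool query over Parquet files in Azure Storage, using `OPENROWSET`. The storage account, container, and folder names are placeholders:

```
-- Hypothetical serverless SQL pool query; the URL is a placeholder.
SELECT TOP 10 *
FROM OPENROWSET(
    BULK 'https://mystorageaccount.dfs.core.windows.net/mycontainer/sales/*.parquet',
    FORMAT = 'PARQUET'
) AS rows;
```

Because you pay per query for the data processed, filtering early and querying columnar formats like Parquet tends to keep costs down.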
Serverless SQL pools are different from the other types of pools in a couple of ways. First, you don’t have to pay for the compute resources in them. You only have to pay for the amount of data processed by your queries. Second, you don’t even have to create them because when you create a Synapse workspace, it’ll automatically create a serverless SQL pool as well.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).