Introduction to Azure Synapse Analytics

Spark Pool Demo

Overview
Difficulty
Intermediate
Duration
28m
Students
936
Ratings
4.8/5
Description

This course is a quick introduction to Microsoft’s Azure Synapse Analytics. It covers serverless SQL pools, dedicated SQL pools, Spark pools, and Synapse Pipelines. It does not cover more advanced topics, such as optimization and security, because these topics will be covered in another course.

Learning Objectives

  • Create and use serverless SQL pools, dedicated SQL pools, Spark pools, and Synapse Pipelines

Intended Audience

  • Anyone who would like to learn the basics of using Azure Synapse Analytics

Prerequisites

  • Experience with databases
  • Experience with SQL (not mandatory)
  • A Microsoft Azure account is recommended if you want to do the demos yourself (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
Transcript

Okay, in this demo, I'm going to show you how to create and use a Spark pool. First, we'll go to Manage, then Apache Spark pools, then New. It doesn't really matter what you call it here.

Now, for the node size, it defaults to Medium, which is eight virtual cores and 64 gigs of memory. You can switch it to Small if you want to save money and don't need one that big. Autoscale is enabled by default, and what this means is you'll have a minimum of three nodes and a maximum of 10, and it will automatically scale based on how busy your cluster is.

Now, even if you switch it to Disabled, you can't go below three nodes, so you're going to have three either way. I'll put it back on Autoscale. Also, there seems to be something wrong with the price estimator here; it always comes out to zero, but anyway.

Under Additional settings, there's something I wanted to show you: the auto-pause feature. As I mentioned before, if this is enabled, then after a given number of minutes of inactivity, it will automatically pause the cluster. It's set to 15 minutes right now.
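If you'd rather script the same settings than click through the portal, the Python management SDK can create the pool. This is only a hedged sketch: the exact SDK surface (azure-mgmt-synapse's `big_data_pools.begin_create_or_update` and its model classes) is an assumption, and all the angle-bracketed names are placeholders you'd replace with your own values.

```python
# Hypothetical sketch of creating a Spark pool with the azure-mgmt-synapse
# SDK, mirroring the portal options shown in the demo. Requires the
# azure-identity and azure-mgmt-synapse packages and valid Azure credentials.
from azure.identity import DefaultAzureCredential
from azure.mgmt.synapse import SynapseManagementClient
from azure.mgmt.synapse.models import (
    AutoPauseProperties,
    AutoScaleProperties,
    BigDataPoolResourceInfo,
)

# Placeholders -- substitute your own subscription, group, and names.
client = SynapseManagementClient(DefaultAzureCredential(), "<subscription-id>")

pool = BigDataPoolResourceInfo(
    location="eastus",
    node_size="Small",                  # Medium (8 vCores / 64 GB) is the portal default
    node_size_family="MemoryOptimized",
    # Autoscale between 3 and 10 nodes, as in the demo's defaults.
    auto_scale=AutoScaleProperties(enabled=True, min_node_count=3, max_node_count=10),
    # Auto-pause after 15 idle minutes.
    auto_pause=AutoPauseProperties(enabled=True, delay_in_minutes=15),
)

poller = client.big_data_pools.begin_create_or_update(
    "<resource-group>", "<workspace-name>", "<pool-name>", pool
)
poller.result()  # blocks until provisioning completes
```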

Rather than actually creating this pool, we're going to do it a different way, so I'll cancel this. I just wanted to show you how you would create one. If we go back to Home and then Learn, this first item automatically creates a pool for you called Sample Spark. So we'll do that. It takes a minute or two, so I'll fast-forward.

Okay, it's done. So it brought us to a notebook and this is similar to the SQL script that I showed you before. If you're familiar with Jupyter notebooks, this will seem very familiar. So these notebooks let you enter some code, but they also show you the results of the code and the results get stored in the notebook.

So this notebook was created by somebody else, someone at Microsoft I'm assuming, and they ran all the code in this notebook. And so the results are actually in here. This example uses the taxi trip data again.

This first cell just loads the data into the Spark pool, so there's no output, but it shows that the command was executed earlier and took about 28 seconds to run. If you want to rerun the code, you can click the Run button in the cell. I'll show you that in a minute.

The next cell does have some output. It prints the schema of the dataset. The third cell calculates the average trip distance and the total trip distance for one passenger trips, two passenger trips, etc.

If we go to Chart, it's actually pretty messed up. To fix it, click View options, change the Key to passenger count, change Values to average trip distance, click Apply, and click View options again. That looks better.

If you're wondering why we don't see a big spike for three passenger trips like we did in the dedicated SQL pool demo, it's because this script loaded data from different dates than the other script did.

The last cell displays the same information but uses different libraries to create the graph: Matplotlib and Seaborn. The orange line at the bottom is the average trip distance. Since the numbers for the total trip distance are way bigger than the average distances, the average just looks like a straight line at the bottom.

You might be saying, "Wait, those numbers on the side aren't very big, so what are you talking about?" Well, notice the 1e6 at the top. That means 10 to the sixth power, which is one million. So these numbers are actually one million, two million, and so on.
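To see where that 1e6 scale factor comes from, here's a hedged Matplotlib-only sketch with made-up numbers (the real cell also uses Seaborn and the actual taxi data): when one series is in the millions, the shared y-axis gets labeled in units of 1e6 and the small series flattens against zero.

```python
import matplotlib
matplotlib.use("Agg")  # draw off-screen; no display needed
import matplotlib.pyplot as plt

# Made-up numbers purely for illustration -- not the real taxi data.
passenger_counts = [1, 2, 3, 4, 5]
total_dist = [4.2e6, 1.1e6, 3.0e5, 1.5e5, 8.0e4]  # millions of miles in total
avg_dist = [2.9, 3.1, 3.0, 3.2, 3.3]              # just a few miles each

fig, ax = plt.subplots()
ax.plot(passenger_counts, total_dist, label="total trip distance")
ax.plot(passenger_counts, avg_dist, label="average trip distance")
ax.set_xlabel("passenger count")
ax.legend()
# Because the totals are in the millions, Matplotlib prints a "1e6"
# offset above the y-axis and the average line hugs zero, as in the demo.
fig.savefig("trip_distance.png")
```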

Okay, so let's remove the total trip distance so we can see the details for the average trip distance. We can do that by removing this line here. Now, if we were to rerun just this cell, we'd get an error, because it depends on the results of the previous cells, and we didn't run those cells in this session. So to make this work, we need to click Run all at the top. This is going to take a while, so I'll fast-forward.

Okay, it's done. Now we can see the details of the average trip distance. It's the same as we saw before with the bar chart above.

Now, suppose we wanted to make another change to the code in this cell. For example, let's say we only wanted to show the total trip distance. Since we've already run all the previous cells in this notebook, we can just rerun this cell now. There, that worked and it was really quick too. And that's it for this demo.

About the Author
Guy Hummel
Azure and Google Cloud Content Lead
Students
111,712
Courses
67
Learning Paths
87

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).