Azure Stream Analytics (ASA) is Microsoft’s service for real-time data analytics. Some examples include stock trading analysis, fraud detection, embedded sensor analysis, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they are much more valuable if they run in real time. For example, if you can detect credit card fraud immediately after it happens, then you are much more likely to prevent the credit card from being misused again.
Although you could run streaming analytics using Apache Spark or Storm on an HDInsight cluster, it’s much easier to use ASA. First, Stream Analytics manages all of the underlying resources. You only have to create a job, not manage a cluster. Second, ASA uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. That’s not the case with Spark or Storm.
In this course, you will follow hands-on examples to configure inputs, outputs, and queries in ASA jobs. This includes ingesting data from Event Hubs and writing results to Data Lake Store. You will also learn how to scale, monitor, and troubleshoot analytics jobs.
- Create and run a Stream Analytics job
- Use time windows to process streaming data
- Scale a Stream Analytics job
- Monitor and troubleshoot errors in Stream Analytics jobs
- Anyone interested in Azure’s big data analytics services
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 50 minutes of high-definition video
- Many hands-on demos
The github repository for this course is at https://github.com/cloudacademy/azure-stream-analytics.
The jobs we’ve run in this course have been fairly small, so we haven’t had to worry about the cloud resources that were deployed to run them. If you need to run a larger job, then you may need to make some changes so that it’ll run properly.
The first adjustment to consider is the number of Streaming Units to allocate to the job. An SU represents a certain capacity of CPU, memory, and I/O. The most important of these is memory because Stream Analytics does all of its processing in memory rather than on disk. If your job runs out of memory, then it will fail.
To prevent this from happening, you should create an alert that tells you if the SU utilization goes above 80%. Unfortunately, you can’t increase the number of SUs while a job is running, so ideally you should allocate enough SUs before you put the job into production. Otherwise when it starts to run out of memory, you’ll have to stop the job, increase the number of SUs, and restart it.
OK, so how do we do that? It’s very simple. Go into the fraud detection job and then select Scale from the Configure menu. You can either move the slider or enter a number in the box at the right. However, it will only let you set it to certain numbers. For example, if you try to change it to 8, it won’t let you but you can change it to 6. Then click Save. That’s it. That’s all you have to do.
The big question is how can you figure out how many SUs you should allocate to a job? Well, it all comes down to how well you can parallelize your job. If you’re going to run a large job, then ideally, you’ll want to make it embarrassingly parallel. This is when you can split all aspects of the job into multiple, parallel tasks, so you can give these tasks to multiple workers that all run at the same time.
For example, if you get 10 workers to process a job, then it will be roughly 10 times faster than using only 1 worker. That sounds simple enough, but the tricky part is that you need to configure your job in such a way that you can split it into multiple pieces.
To achieve that, you need to do three things. First, the input must be split into n partitions. Second, the query must be written to act on those n partitions. Third, the output must written to n partitions. Note that the number of partitions has to be the same all the way across. If all three of these conditions are true, then the job is embarrassingly parallel and you can easily scale it up using streaming units.
The way you create partitions on the input side is to set the partition count when you create the Event Hub or IoT Hub.
If your query will group the data using a GROUP BY clause, then your input code will also need to set a partition key for each piece of data. For example, if we were to set a partition key on the input data for our earlier sensor example, then we would modify our query to look like this.
PartitionId is a special column that Stream Analytics adds to match the partition ID of the data that comes from the Event Hub. So first, we tell it to partition the input stream by the PartitionId column. Then we add the PartitionId column to the GROUP BY clause. To make this work, our code that handles sending data to the Event Hub has to tell it to use the dspl column as the partition key.
Recall that in this scenario, dspl represents a sensor name. Suppose our incoming data includes records from 40 sensors. And suppose that we create 4 partitions on the Event Hub. If we set the partition key to dspl, then each partition will contain records from 10 sensors, so each of the 4 workers can process this query on one quarter of the sensors.
Finally, we need to output the data to the same number of partitions. However, it depends on which Azure service you’re outputting to. If it’s Power BI, SQL Database, or SQL Data Warehouse, then you’re out of luck because they don’t support partitioning. If it’s Event Hub, IoT Hub, or Cosmos DB, then you need to set the partition key. You can do that when you add an output stream to a job. Here’s where you set it.
Outputting to one of these services is the easiest because they support partitioning, but they don’t require you to set the partition key.
OK, now back to our question of deciding how many SUs to allocate. There’s one more thing you need to know first. 6 SU represents the full capacity of a single computing node. So, if your job isn’t parallelizable, then 6 SU will give you your maximum performance. However, if you can break your query up into multiple steps, then Stream Analytics will put each step on its own 6 SU node. So for a job that isn’t parallelizable, try running it with 6 SU times the number of steps in the query. If the SU utilization is low, then try running it with fewer SUs to save money.
If the job is parallelizable, and it’s too big to run on 6 SUs, then partition the input and output, and use PARTITION BY in your query. Then multiply the number of partitions by 6 SU and try running the job with that many SUs. If the SU utilization is now too low, then reduce the number of SUs. To optimize performance, make sure that the number of partitions is evenly divisible by the number of nodes. For example, if you have 4 partitions, then choose either 2 or 4 nodes, but not 3, because that would result in the 4 partitions being unevenly split across 3 nodes.
And that’s it for scaling.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).