Using Azure Stream Analytics
The course is part of these learning paths
Azure Stream Analytics (ASA) is Microsoft’s service for real-time data analytics. Some examples include stock trading analysis, fraud detection, embedded sensor analysis, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they are much more valuable if they run in real time. For example, if you can detect credit card fraud immediately after it happens, then you are much more likely to prevent the credit card from being misused again.
Although you could run streaming analytics using Apache Spark or Storm on an HDInsight cluster, it’s much easier to use ASA. First, Stream Analytics manages all of the underlying resources. You only have to create a job, not manage a cluster. Second, ASA uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. That’s not the case with Spark or Storm.
In this course, you will follow hands-on examples to configure inputs, outputs, and queries in ASA jobs. This includes ingesting data from Event Hubs and writing results to Data Lake Store. You will also learn how to scale, monitor, and troubleshoot analytics jobs.
- Create and run a Stream Analytics job
- Use time windows to process streaming data
- Scale a Stream Analytics job
- Monitor and troubleshoot errors in Stream Analytics jobs
- Anyone interested in Azure’s big data analytics services
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 50 minutes of high-definition video
- Many hands-on demos
The github repository for this course is at https://github.com/cloudacademy/azure-stream-analytics.
OK, now let’s run a query to do what we originally wanted, which is to get a list of sensors with a temperature of more than 100 degrees over a 30-second period. It might seem like we could just change the WHERE clause to “WHERE temp > 100” and we’d be done, except that we need to handle the “over a 30-second period” requirement. How would we do that? Well, since the sensors typically send readings about every 5 seconds, we need to average them. That’s easy enough. We can just change “temp” to “Avg(temp)”.
That’s a good start, but we also need to tell it what temperature readings to average. To do that, we need to use a GROUP BY clause. First, we need to group by sensor, which is called “dspl” in this data, so put in “GROUP BY dspl” above WHERE. It really doesn’t like that. There are red lines all over the place, indicating that there are syntax problems.
In this case, it’s mainly because you can’t use WHERE with GROUP BY. You have to use HAVING instead. They mean the same thing, but when you use GROUP BY, you have to use HAVING. So change WHERE to HAVING.
That got rid of most of the red, but it’s still complaining about something. If you put your mouse pointer over the red line, it’ll tell you what the problem is. It says “Exactly one temporal window is expected”. By “temporal”, it means “time-based”. This is where the 30-second period comes in.
Fixing this is actually a bit more complicated than it seems because Stream Analytics supports three different types of temporal windows. Tumbling windows are the most straightforward and they’re what you’ll probably use most often. In this example, the input stream is broken into 5-second windows. So the first window contains three records, the second one contains two, and so on.
Hopping windows are similar, but they overlap with each other. In this example, each window is 10 seconds long, but a new window is created every 5 seconds. This is called the hop size. In this case, there is always a 5-second overlap between two adjacent windows. This means that some data records can be in more than one window, such as these ones.
Sliding windows are kind of like hopping windows except that the hop size is not fixed. Stream Analytics finds every single unique window of the length you specify. By unique, I mean that each window has a different set of data records in it. For example, these two windows are the same except that the second one doesn’t have this record in it. Every window is still the same size, though. In this case, they’re all 10 seconds long.
Why would you want to use sliding windows? Well, for example, suppose you wanted to find out if at least 5 sensors recorded a temperature of over 100 degrees during a 10-second period. If you used a tumbling window, then it would look like this. The highest number of hot sensors in a 10-second period would be 4. If you used a sliding window, though, then it would find a 10-second period with 5 hot sensors.
OK, let’s get back to writing the query. Which type of window should we use? Let’s try a tumbling window first. At the end of the GROUP BY line, before dspl, add TumblingWindow, bracket, then the unit of time we want, which is seconds, so type “second”, then comma and the number of seconds, which is 30 in this case, a closing bracket, and a comma. There, now the red line is gone.
But wait, now there’s a red line under time. What’s the problem there? Well, when you use GROUP BY, you can only select columns and aggregate functions that are in the GROUP BY clause. That would be these two items. The TumblingWindow function doesn’t count because you can’t specify that in a SELECT statement. So the only column in our SELECT statement that it will accept is dspl. It doesn’t have red lines under temp or humidity because it only shows one error at a time. So if we fix the current error by removing “time”, it’ll move on to the next error, which is “temp”. We can fix that one by changing it to the average temperature.
OK, now we need to get rid of humidity. Remember to remove the comma as well. Alright, now the syntax checker is finally happy again, but wait, our results will no longer have a timestamp, but they won’t be very useful without that. So if we can’t select the time column, then how do we do it? We can use a system property called System.Timestamp. Add that to the beginning of the SELECT statement and remember to add a comma.
It still doesn’t like that. Why not? Because you have to give it a name. Let’s call it “EndTime” because the timestamp assigned to the tumbling window will be for the end of the window. That is, each timestamp shown will be for the end of a 30-second period.
Alright, now click Test. It only came back with five results, one for each sensor. That doesn’t seem right. Take a close look at the timestamps. Not only are they all the same, but they’re from 1970! The data is from 2016, so what’s going on?
Well, we didn’t tell it where to get the timestamp. You see, Stream Analytics doesn’t know that the time column contains the timestamp it should use for each record. If you don’t tell it where to find the timestamp, then System.Timestamp will refer to the time when the data record arrived. For example, if we were using an Event Hub, then the record would be timestamped when it arrived at the Event Hub. In our case, we uploaded sample data from a file, so the system didn’t know what timestamp it should use. So it used the default, which is 12:01 on January 1st, 1970.
OK, then how do we tell it where to find the timestamp? Simple. We use a TIMESTAMP BY clause. Put it after FROM InputStream. Since the timestamp we want to use is in the time column, put “time” after “TIMESTAMP BY”. Alright, let’s give it another try.
That’s better. Now there are 245 rows and the timestamps are correct. You can tell that the numbers in the temperature column are averages because some of them have decimals. To make the output very clear, change the name of the Temperature column to Average Temperature.
How much of a difference do you think it would make it we used a sliding window instead of a tumbling window? Let’s find out.
Wow, it came back with more than 10 times as many rows. That’s pretty incredible, but it makes sense if you think about it. If you look through the output, you’ll see that there are lots of rows that are almost the same. For example, look at these two rows for sensorA. The first one’s window ends at 20:47:55 and the second one’s window ends at 20:47:57, just two seconds later. That didn’t happen with the tumbling windows because those windows have to be at least 30 seconds apart.
Save this, just in case you want to come back to it later.
And that’s it for time windows.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).