Using Azure Stream Analytics
Azure Stream Analytics (ASA) is Microsoft’s service for real-time data analytics. Some examples include stock trading analysis, fraud detection, embedded sensor analysis, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they are much more valuable if they run in real time. For example, if you can detect credit card fraud immediately after it happens, then you are much more likely to prevent the credit card from being misused again.
Although you could run streaming analytics using Apache Spark or Storm on an HDInsight cluster, it’s much easier to use ASA. First, Stream Analytics manages all of the underlying resources. You only have to create a job, not manage a cluster. Second, ASA uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. That’s not the case with Spark or Storm.
In this course, you will follow hands-on examples to configure inputs, outputs, and queries in ASA jobs. This includes ingesting data from Event Hubs and writing results to Data Lake Store. You will also learn how to scale, monitor, and troubleshoot analytics jobs.
Learning Objectives
- Create and run a Stream Analytics job
- Use time windows to process streaming data
- Scale a Stream Analytics job
- Monitor and troubleshoot errors in Stream Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- SQL experience (recommended)
- Microsoft Azure account (recommended); sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account
This Course Includes
- 50 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/azure-stream-analytics.
When something goes wrong with a job, it can often be difficult to know where to start looking. Fortunately, Stream Analytics provides many tools to help with troubleshooting.
The most common problems are: connectivity issues with inputs or outputs, issues with input data, and issues with your query.
The easiest way to diagnose connectivity issues is to use the Test feature. You can get to it in a variety of ways, but if you click on the arrow in the Inputs box, it will show you a list of your inputs. In the menu on the right, you can select Test. I’m running the call generator right now, but I’m going to stop it and then test the connectivity. OK, let’s see what happens when I run the test. It says it was successful. That’s weird, isn’t it? Well, not really, because it tested the connection to the Event Hub, not to the data source. In a situation like this, you could see if data is flowing into the Event Hub by using the Service Bus Explorer.
If you think you might be having issues with the input data and you’re able to stop the job, then stop it, change the query to “SELECT * FROM InputStream”, and run a test by sampling the data. That way you can see what the input data looks like.
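Once you have a sample of the input, a quick script can help spot records with missing or null fields. The following is a minimal sketch in Python, assuming the sample has been saved as a JSON array. The field names (CallRecTime, SwitchNum, CallingIMSI) and the helper summarize_sample are hypothetical and only stand in for whatever your input stream actually contains.

```python
import json
from collections import Counter

# Hypothetical sample of input events, standing in for data sampled from the
# job's input. The field names and values are assumptions for illustration.
SAMPLE = """
[
  {"CallRecTime": "2024-01-01T10:00:00Z", "SwitchNum": "US", "CallingIMSI": "466920403025604"},
  {"CallRecTime": "2024-01-01T10:00:01Z", "SwitchNum": "UK"},
  {"CallRecTime": null, "SwitchNum": "US", "CallingIMSI": "466923300507919"}
]
"""

# Fields the query is assumed to rely on (hypothetical).
EXPECTED_FIELDS = ("CallRecTime", "SwitchNum", "CallingIMSI")


def summarize_sample(events, expected_fields):
    """Count how many events are missing (or have null for) each expected field."""
    missing = Counter()
    for event in events:
        for field in expected_fields:
            if event.get(field) is None:
                missing[field] += 1
    return dict(missing)


events = json.loads(SAMPLE)
report = summarize_sample(events, EXPECTED_FIELDS)
print(f"{len(events)} events sampled; missing or null fields: {report}")
```

If a field the query depends on shows up in the report, that is a good hint that the problem is in the data rather than in the query.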
Debugging your queries usually takes a bit more work. One technique is to reduce the query to a simpler one, test it, and then build it back up, testing at every step. One common problem is having a WHERE clause that filters out every input record, so there are no outputs. Another problem, unique to streaming jobs, occurs when event timestamps are earlier than the job start time. In that case, all of the input records will be dropped.
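The two problems just described can be illustrated with a small simulation. This is only a simplified model in Python, not how the Stream Analytics engine is implemented: it counts how many events survive the start-time check and then the filter, which mirrors the idea of testing a query one step at a time. The event fields (ts, switch) and the function diagnose are hypothetical.

```python
from datetime import datetime, timezone


def diagnose(events, predicate, job_start):
    """Report how many events survive each stage of a simplified pipeline."""
    # Stage 1: events timestamped before the job start are treated as dropped.
    on_time = [e for e in events if e["ts"] >= job_start]
    # Stage 2: apply the filter that plays the role of the WHERE clause.
    matched = [e for e in on_time if predicate(e)]
    return {
        "input": len(events),
        "after_start_time": len(on_time),
        "after_where": len(matched),
    }


UTC = timezone.utc
job_start = datetime(2024, 1, 1, 12, 0, tzinfo=UTC)

# Hypothetical events; the first one is older than the job start time.
events = [
    {"ts": datetime(2024, 1, 1, 11, 59, tzinfo=UTC), "switch": "US"},
    {"ts": datetime(2024, 1, 1, 12, 1, tzinfo=UTC), "switch": "US"},
    {"ts": datetime(2024, 1, 1, 12, 2, tzinfo=UTC), "switch": "UK"},
]

# A filter with a case mismatch removes every record, so nothing is output.
broken = diagnose(events, lambda e: e["switch"] == "us", job_start)
# The corrected filter lets the matching on-time record through.
fixed = diagnose(events, lambda e: e["switch"] == "US", job_start)

print("broken filter:", broken)
print("fixed filter: ", fixed)
```

Seeing the count fall to zero at a particular stage tells you whether to look at the timestamps or at the filter condition.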
Microsoft also provides a very handy tool for examining all of the stages in your job. It’s the “Job diagram” in the “Support + Troubleshooting” menu. The fraud detection job is quite simple, so the diagram just shows the input, the query step, and the output. If you hover over each of them, it will give you more details on what’s happening with them. It shows lots of metrics. On the query step, it shows the query itself as well as a few metrics. You can even see what the partitioning looks like by clicking “Expand all”.
In the graph below, you can also select particular metrics and see how they change over time. If you choose Replay instead of Live, you can see what happened with the job during an earlier time period.
In the last lesson, we saw how to create alerts to notify us when something unexpected happens with one or more of the metrics. Stream Analytics also keeps logs. Activity logs are always available, but they only include entries for high-level operations. To get more details, you can enable diagnostic logs, which are in the Monitoring menu.
You have to click “Turn on diagnostics”. You can give it any name. Then you have to decide where to send the logs: to storage, to an event hub, or to Log Analytics.
There are two types of logs. Execution logs contain entries about events that happen during job execution, especially errors, such as connectivity errors and data processing errors. This is a good one to enable. Authoring logs contain entries related to job authoring operations, such as creating a job, adding inputs and outputs, adding the query, and starting and stopping the job. In addition, you can enable logging of all metrics, which could be helpful.
You should also set the retention period for how long to keep the logs. If you leave it at 0, then it will actually keep the logs forever, which could take up a lot of space and increase your costs.
To view the logs, you need to go into the service where you’re sending them. I set the destination to a storage account, so that’s where I’d need to look.
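Once the logs are in a storage account, you will probably want to filter them rather than read them line by line. Here is a rough sketch in Python that pulls error entries out of the execution logs. It assumes the logs have been downloaded as one JSON record per line, with time, category, level, and operationName fields and a properties field that holds a nested JSON string. That layout, the sample messages, and the helper names are assumptions for illustration, so check them against your own log files.

```python
import json


def make_line(time, category, level, operation, message):
    """Build one hypothetical log line; properties is a nested JSON string."""
    return json.dumps({
        "time": time,
        "category": category,
        "level": level,
        "operationName": operation,
        "properties": json.dumps({"Message": message}),
    })


# Made-up sample records, standing in for lines read from a downloaded log file.
LOG_LINES = [
    make_line("2024-01-01T12:00:00Z", "Authoring", "Informational",
              "Start Job", "Job started"),
    make_line("2024-01-01T12:00:05Z", "Execution", "Error",
              "Receive Events", "Could not deserialize input event"),
    make_line("2024-01-01T12:00:09Z", "Execution", "Informational",
              "Send Events", "Output batch written"),
]


def execution_errors(lines):
    """Return (time, operation, message) for each error in the execution logs."""
    errors = []
    for line in lines:
        record = json.loads(line)
        if record.get("category") == "Execution" and record.get("level") == "Error":
            details = json.loads(record.get("properties", "{}"))
            errors.append(
                (record["time"], record["operationName"], details.get("Message", ""))
            )
    return errors


errors = execution_errors(LOG_LINES)
for time, operation, message in errors:
    print(f"{time}  {operation}: {message}")
```

The same filter could be pointed at the authoring logs by changing the category it checks.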
If you have a problem that’s difficult to troubleshoot, you should go to “Diagnose and solve problems”. First, it tells you if there are any problems with the Azure service itself. Then, it has suggestions for how to solve common problems. Finally, if you still can’t resolve your issue, then you can submit a support ticket with Microsoft.
And that’s it for troubleshooting.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).