Azure Stream Analytics (ASA) is Microsoft’s service for real-time data analytics. Some examples include stock trading analysis, fraud detection, embedded sensor analysis, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they are much more valuable if they run in real time. For example, if you can detect credit card fraud immediately after it happens, then you are much more likely to prevent the credit card from being misused again.
Although you could run streaming analytics using Apache Spark or Storm on an HDInsight cluster, it's much easier to use ASA. First, Stream Analytics manages all of the underlying resources. You only have to create a job, not manage a cluster. Second, ASA uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. Writing equivalent jobs for Spark or Storm typically requires Scala, Java, or Python.
In this course, you will follow hands-on examples to configure inputs, outputs, and queries in ASA jobs. This includes ingesting data from Event Hubs and writing results to Data Lake Store. You will also learn how to scale, monitor, and troubleshoot analytics jobs.
Learning Objectives
- Create and run a Stream Analytics job
- Use time windows to process streaming data
- Scale a Stream Analytics job
- Monitor and troubleshoot errors in Stream Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 50 minutes of high-definition video
- Many hands-on demos
Resources
The GitHub repository for this course is at https://github.com/cloudacademy/azure-stream-analytics.
The amount of data produced every day has increased exponentially. At first, companies tried to deal with this massive amount of data by running batch jobs, but that approach stopped working for a couple of reasons. First, the batch jobs kept getting longer and longer, and the amount of processing power needed to run the jobs kept increasing. Second, and more importantly, companies had applications that needed data to be processed in real time.
Some examples include stock trading analysis, embedded sensor analysis, fraud detection, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they were much more valuable if they ran in real time. For example, if you can detect credit card fraud immediately after it happens, you are much more likely to prevent the credit card from being misused again.
Streaming analytics takes a bit more work to set up, though. Here’s how it’s typically architected on Azure. First, you have applications, devices, and gateways that generate events. Some examples are power meters, retail websites, and gaming apps on smartphones.
To aggregate all of the data streams for a particular analytics job, you would funnel them into either Event Hubs or IoT Hub. These services can ingest millions of events per second. IoT Hub has more features for Internet of Things devices than Event Hubs does, but it's also far more expensive. Another option is to ingest data from Azure Blob Storage.
In many cases, you'll need to combine the streaming data with static data, also known as reference data, which Stream Analytics can read from Blob Storage. For example, you may need to combine a user's activities on a website with their account information. Stream Analytics then performs a series of transformations on all of this data, typically summarizing it and looking for particular patterns or relationships. If you need to run a machine learning algorithm on the data, then your analytics job can call an Azure Machine Learning web service.
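As a rough sketch, joining a stream with reference data might look something like this in Stream Analytics Query Language. The input names (clickstream, userAccounts), the output name (outputStore), and the column names are all hypothetical, standing in for whatever inputs and outputs you configure on the job:

```sql
-- Enrich website click events with account info from reference data.
-- 'clickstream' is a streaming input (e.g. from Event Hubs);
-- 'userAccounts' is a reference data input loaded from Blob Storage.
SELECT
    c.UserId,
    u.AccountType,
    c.PageUrl,
    c.EventTime
INTO outputStore
FROM clickstream c TIMESTAMP BY EventTime
JOIN userAccounts u
    ON c.UserId = u.UserId
```

Note that a join against reference data is a plain JOIN; it's only stream-to-stream joins that require a time-bound condition in the ON clause.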
After all of the data processing is done, the job needs to do something with the results. They could be archived to a storage service such as Data Lake Store, written to a database such as Cosmos DB, sent to a real-time dashboard in Power BI, or sent to an automation service that acts on the information that was discovered.
Microsoft also offers alternatives to Stream Analytics. For example, you could spin up an HDInsight cluster and run either Spark or Storm jobs on your data. This is a perfectly viable alternative, but there are some pretty big differences.
First, with HDInsight, you have to create and manage the cluster yourself. With Stream Analytics, you only have to create a job and don’t have to worry about managing the underlying resources. Spark and Storm are open source software and are part of the Hadoop ecosystem, so any code you write can easily be run on-premises or on other cloud platforms. In contrast, Stream Analytics is a Microsoft proprietary technology, so the code is not portable.
On the other hand, there is an advantage to Stream Analytics' coding approach. It uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. With Spark and Storm, you typically write your jobs in Scala, Java, or Python, which are more difficult to learn for many data analysts.
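To illustrate how familiar the language feels, here's a sketch of a query that counts events per device over five-minute tumbling windows. Apart from TIMESTAMP BY and the windowing function, it's essentially standard T-SQL; the input, output, and column names are hypothetical:

```sql
-- Count readings per device in non-overlapping 5-minute windows.
-- 'sensorInput' and 'alertsOutput' stand in for the job's configured
-- input and output; System.Timestamp() is the end of each window.
SELECT
    DeviceId,
    COUNT(*) AS EventCount,
    System.Timestamp() AS WindowEnd
INTO alertsOutput
FROM sensorInput TIMESTAMP BY ReadingTime
GROUP BY DeviceId, TumblingWindow(minute, 5)
```

Anyone who has written a GROUP BY aggregation in SQL can read this query; the only new concepts are the event-time clause and the window function in the GROUP BY.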
And that’s it for the overview.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).