Running a More Complex Job

Contents

  • Introduction (1m 47s)
  • Overview (3m 9s)
  • Using Azure Stream Analytics
  • Scaling (5m 37s)
  • Conclusion
Difficulty: Intermediate
Duration: 50m
Students: 3036
Rating: 4.8/5
Description

Azure Stream Analytics (ASA) is Microsoft’s service for real-time data analytics. Some examples include stock trading analysis, fraud detection, embedded sensor analysis, and web clickstream analytics. Although these tasks could be performed in batch jobs once a day, they are much more valuable if they run in real time. For example, if you can detect credit card fraud immediately after it happens, then you are much more likely to prevent the credit card from being misused again.

Although you could run streaming analytics using Apache Spark or Storm on an HDInsight cluster, it’s much easier to use ASA. First, Stream Analytics manages all of the underlying resources. You only have to create a job, not manage a cluster. Second, ASA uses Stream Analytics Query Language, which is a variant of T-SQL. That means anyone who knows SQL will have a fairly easy time learning how to write jobs for Stream Analytics. That’s not the case with Spark or Storm.

In this course, you will follow hands-on examples to configure inputs, outputs, and queries in ASA jobs. This includes ingesting data from Event Hubs and writing results to Data Lake Store. You will also learn how to scale, monitor, and troubleshoot analytics jobs.

Learning Objectives

  • Create and run a Stream Analytics job
  • Use time windows to process streaming data
  • Scale a Stream Analytics job
  • Monitor and troubleshoot errors in Stream Analytics jobs

Intended Audience

  • Anyone interested in Azure’s big data analytics services

Prerequisites

This Course Includes

  • 50 minutes of high-definition video
  • Many hands-on demos

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azure-stream-analytics.



Transcript

So far, we’ve only been getting input data from a file and outputting to the screen. Now I’ll show you how to use a real processing architecture for your jobs.

 

Suppose you work for a telephone company and you’ve been asked to create an analytics job that detects a particular type of fraudulent phone call. This type of fraud occurs when two calls are made from two different countries at around the same time, but using the same phone identity.

 

Since we don’t have access to a real stream of phone call data (well, at least, I don’t), we’re going to use a program that generates data about random calls. We’ll feed this data into an event hub, which will act as the input for a Stream Analytics job. Finally, we’ll send the output to Data Lake Store.

 

First, we need to create an event hub. In the Azure Portal, click "Create a resource", Internet of Things, and Event Hubs. Before we can create an event hub, the portal asks us to create an Event Hubs namespace. Then we’ll add an event hub to that namespace.

 

The namespace name must be globally unique, so use something different from the one I use. I’ll call mine “ehns-guy”. Click on “Pricing tier”. It’s set to Standard right now, which is the more expensive option. However, for this quick demo, it will cost almost nothing, regardless of whether we choose Basic or Standard. I’m going to choose Basic anyway, because we don’t need Standard for this scenario.

 

For the resource group, use the one you used last time, and also use the same location you used last time.

 

If you need to ingest data more quickly than 1 MB/s, then you can increase the number of Throughput Units. It will cost more, though, so don’t do it unless you need to. We definitely don’t need more than 1 MB/s for this example job, so leave it at 1.

 

Now check “Pin to dashboard” and click Create. It’ll take a little while to deploy, so I’ll fast forward.

 

OK, now click the “Add Event Hub” button. Call it “fraud-detection”. The partition count is used for parallel processing in large jobs. I’ll talk about it more in the scaling lesson. We don’t need to increase it for this job, so leave it at the default. Message Retention is greyed out because I chose the Basic pricing tier, which doesn’t have an option to increase message retention. The same goes for the Capture feature, which lets you stream data directly to Blob Storage or Data Lake Store. Click Create.

 

Alright, now that the event hub is ready, we need to allow our phone call generator to send data to it. Click on the fraud-detection hub. Then click “Shared access policies” in the Settings menu. Click Add. Call it “policy-manage”, check the Manage box, and click Create.

 

Now click on the policy and then click the copy button next to the connection string for the primary key.

 

Now in the GitHub repository you downloaded, go to the DataGenerators folder and then the TelcoGeneratorWin folder and edit the telcodatagen.exe.config file. There are two config files, so make sure you edit this one.

 

Where it says, “ENTER YOUR KEY”, remove that and paste your connection string. Then, at the end of the connection string, delete the part that says, “EntityPath=fraud-detection”. Make sure you delete the semicolon too. Then, in the line above it, where it says, “ENTER YOUR EVENT HUB NAME”, remove that and replace it with “fraud-detection”. Then save the file.

 

Now click here and copy the path so you don’t have to type it. Then open a command window. I’m doing it by using Windows-key-R to open the Run dialog box, and then typing “cmd”. Now type “cd “ and Ctrl-V to paste the pathname. To start the call generator, type “telcodatagen 1000 .2 2”. The first argument is the number of call records to generate per hour. The second is the probability of fraud. The third is the number of hours to run the program. You can copy this command from the course.md file at the base of the GitHub repository, if you want.

 

Alright, it’s generating lots of call records, so now we can create a Stream Analytics job to process them. Back in the portal, click "Create a resource", Data + Analytics, Stream Analytics job. Call it “fraud-detection”. Set the resource group and location, if they’re not already correct, then check “Pin to dashboard” and click Create.

 

OK, now click Inputs and then “Add stream input”. You can choose from three types of input sources: Event Hub, IoT Hub, and Blob storage. We’re using an Event Hub, so choose that.

 

For the input alias, type “CallStream”. We’re going to refer to this name in our query, so make sure you type it correctly. With the “Select Event Hub from your subscriptions” option, it automatically fills in the other fields, which is great. If you haven’t created any other Event Hubs, then the Event Hub namespace and name fields should be correct, but you’ll need to change “Event Hub policy name” to the policy we created earlier, which was “policy-manage”. Click Save. It automatically ran a connection test, which succeeded. Close this blade.

 

Now that we’ve defined a data source, it’s time to write the query to process it. Click “Edit query”. First, we need to tell it to get some sample data from the input stream. Click on the three dots next to CallStream and select “Sample data from input”. Set the duration to 3 minutes so it will gather 3 minutes’ worth of data. Click OK. You can see that it’s sampling data. It stores this data while you’re in the query window, but as soon as you close it, the sample data will be discarded.

 

OK, now to make sure that it read the sample data properly and also to see what fields are available for your query, change YourInputAlias to CallStream. Then click Test.
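For reference, after that edit the default pass-through query in the editor looks roughly like this (the output alias is still the editor’s placeholder at this point, since we haven’t defined an output yet):

    -- Default pass-through query with the input alias changed to CallStream.
    -- Testing it against the sampled data returns every field in the call records.
    SELECT
        *
    INTO
        [YourOutputAlias]
    FROM
        CallStream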

 

It should come back with plenty of rows. Now we’re ready to put in a query. This one will take a lot of typing, so you can copy and paste it from the course.md file in the GitHub repository.

 

First, we select these six columns and discard the rest. The first one is System.Timestamp. Recall that we either need to tell it which column to use as the timestamp or let it default to the time when the data record was received by the event hub. Here, we’ve told it to use the CallRecTime column. We’ve actually specified that twice, but that’s because we’ve specified the input stream twice, so we need to tell it which timestamp to use in both cases.

 

This is a self-join because we’re joining an input stream with itself. We create a list of call records that have the same CallingIMSI (that is, the same phone identifier) and timestamps within 5 seconds of each other. Then in that list, we find ones where the two calls were from different switches. Since these switches are in different countries, it wouldn’t be possible in most cases for that to happen, which suggests fraud.

 

By the way, this DATEDIFF function is different from the standard SQL version. This DATEDIFF is specific to the Stream Analytics Query Language, and it must appear in the ON...BETWEEN clause. The first parameter is the time unit, which is seconds in this case. The next two arguments are the sources for the join.
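To tie the last few paragraphs together, here is a sketch of the fraud-detection query being described. It follows the details given above (the six selected columns, the self-join on CallingIMSI, TIMESTAMP BY CallRecTime, and DATEDIFF in the ON...BETWEEN clause), but column names such as CallingNum and SwitchNum, and the exact 1-to-5-second window, are assumptions based on the sample generator’s output, so check them against the version in course.md.

    SELECT
        System.Timestamp AS WindowTime,              -- event time, taken from CallRecTime via TIMESTAMP BY
        CS1.CallingIMSI,                             -- the shared phone identity
        CS1.CallingNum AS CallingNum1,
        CS2.CallingNum AS CallingNum2,
        CS1.SwitchNum AS Switch1,                    -- switch that handled the first call
        CS2.SwitchNum AS Switch2                     -- switch that handled the second call
    FROM
        CallStream CS1 TIMESTAMP BY CallRecTime
    JOIN
        CallStream CS2 TIMESTAMP BY CallRecTime      -- self-join: the same input stream twice
    ON
        CS1.CallingIMSI = CS2.CallingIMSI            -- same phone identity
        AND DATEDIFF(ss, CS1, CS2) BETWEEN 1 AND 5   -- calls within 5 seconds of each other
    WHERE
        CS1.SwitchNum != CS2.SwitchNum               -- but handled by different switches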

 

OK, now click Test. Since the call generator we’re using creates lots of call records that meet the fraud detection criteria, it comes back with lots of rows. For example, this row has a call from a switch in Germany and a call from a switch in Australia.

 

Click Save, which will save the query, but not the sample data. Then close the query window. It’s still showing the original query, but if you do a browser refresh, then it will show the new one.

 

Now that we’ve configured the input and the query, it’s time to configure the output. We’re going to send the results to Data Lake Store. Open the portal in another browser tab and click "Create a resource", Data + Analytics, Data Lake Store. Give it a globally unique name. I’ll call mine “dlsguy”. Choose the same resource group and location as before, if possible. There are only a few locations where Data Lake Store is available, so if you have to choose a different location from the one you chose before, that’s fine. Check “Pin to dashboard” and click Create.

 

Go back to the other browser tab and click on Outputs. Then click Add and select Data Lake Store. Call it “FraudResults”. Check that the account name is the name of the Data Lake Store you just created. The “Path prefix pattern” is where you want it to put your results. The example they show is “cluster1/logs/{date}/{time}”. For this example, we’ll use something a lot simpler, so just put in “fraud/”.

 

Now you have to authorize Stream Analytics to access your Data Lake Store, so click Authorize. Then click Save. It tested the connection and it worked. Close the Outputs blade.

 

Alright, we’re finally ready to run this job. First, make sure the call generator is still running. OK, now go back to the portal and click Start. It asks if we want to run it now or in the future. Leave it on Now and click Start.

 

This time, instead of reading sample data, it’s reading the input stream from the event hub directly. To see the output, go back to the other browser tab and click “Data explorer”. If it says there aren’t any items yet, then click the Refresh button until the fraud folder shows up. Now click on the folder. It contains a JSON file. Click on it. This is the same kind of data as we saw when we ran the test, except it’s in JSON format.

 

When you’re done looking at the data, go back to the call generator and hit Ctrl-C until it stops. Then go to the browser tab with your Stream Analytics job and click Stop.

 

And that’s it for this lesson.

About the Author

Students: 191580
Courses: 98
Learning Paths: 169

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).