Big Data 2.0 - IoT as Your New Operational Data Source

Contents

Main Presentation (43m 26s)

The course is part of this learning path

Main Presentation
Difficulty: Intermediate
Duration: 43m
Students: 119
Ratings: 3/5
Description

A large part of the value provided by IoT deployments comes from data. However, getting this data into the existing data landscape is often overlooked. In this session, we will start by introducing the existing big data solutions that can be part of your data landscape. We will then look at how you can easily ingest IoT data into traditional BI systems like data warehouses, or into big data stores like data lakes.

Once the data is ingested, we'll see how your data analysts can gain new insights about your existing data by augmenting your Power BI reports with IoT data. Looking back at historical data from a new angle is a common scenario.

Finally, we'll see how to run real-time analytics on IoT Data to power real-time dashboards or take actions with Azure Stream Analytics and Logic Apps. By the end of this course, you'll have an understanding of all the related data components of the IoT reference architecture.

Learning Objectives

  • Understand the challenges of big data and the kinds of data IoT systems produce
  • Understand the challenges of ingesting and storing IoT data and how it can fit into your existing data landscape
  • Learn how using IoT as an operational data source for your analytical system can unlock insights and actions across the organization

Intended Audience

This course is intended for anyone looking to improve their understanding of Azure IoT and its benefits for organizations.

Prerequisites

To get the most out of this course, you should already have a working knowledge of Microsoft Azure.

Transcript

Hello, and welcome to Big Data 2.0: IoT as your New Operational Data Source. The Internet of Things is more than just smart and connected devices. The amount and type of data they collect can truly transform your business and operations, but only if you make that data accessible. In this presentation, we will see how you can integrate IoT into your analytics landscape. My name is Christopher Maneu. I'm a senior cloud advocate at Microsoft, focusing on IoT solutions. If you're in a hurry or want to share a quick version of this presentation, do not hesitate to visit this link to access the short version of this talk. However, if you want to see how to operationalize your IoT data through demos, continue watching.

Deploying and connecting thousands of IoT devices comes with a price tag, and organizations doing it expect to get some value out of it. In IoT solutions, most of the value comes from two components. First, the digital twin: it's the digital representation of the actual product. Through the life of the IoT system, this representation can be rearranged to solve various problems. If you want to learn more about this, I encourage you to check out the other sessions within this learning path.

The value of IoT deployments is also largely provided by the analytics we can run over IoT data. Analyzing that data and making sense of it by converting it into useful information is critical to the success of any IoT project. If you don't have any previous analytics experience, we've got you covered. In this presentation, we will start by quickly introducing the challenges of big data and the kinds of data IoT systems produce. Then, we will talk about the challenges of ingesting and storing IoT data, and how it can fit into your existing data estate. Finally, we'll see how using IoT as an operational data source for your analytical systems can unlock insights and actions across the organization.

Let's start with a primer on big data analytics and data management, and have a look at the variety of data IoT systems can generate. First, IoT generates telemetry data. This is mainly JSON documents, even if, from time to time, we encounter other data formats, including column-encoded ones. Then we have all the metadata associated with the data itself, the IoT device, or the system. These values may change slowly over time. This metadata can also be a precious source of information for your analytics, and it can be enriched in the process, be it with reference data or other sources.

The diversity of IoT-generated data can be immense: image and video streams, but also many other formats. IoT gateways like Azure IoT Hub facilitate file upload from IoT devices, enabling the collection of all kinds of unstructured data. When working with data, one of the first questions we have to answer is where to store it. I presume you're already familiar with operational databases: these are the relational databases behind most of the business applications out there. The way they are designed allows software developers to efficiently do a lot of writes and small reads while the user is interacting with the business application. While these databases work well for applications, they are not suited for analytical workloads. That's why we encounter other kinds of databases, like, for example, a SQL data warehouse, where we model the data for analysis and reporting using several techniques such as denormalization.

Let's see an example of what's different between an operational database and an analytical database. Here, you have an order table from a commerce application. Each order has only one line, and that line is updated when the order status changes. Now, let's have a look at the analytical counterpart. Even if we still have an order table, things are quite different.

First, we have new tables, DimCustomer and DimDate. These tables will allow data analysts to explore data across different axes, or dimensions. Then, if we have a closer look at our analytical order table, we can spot that one of the orders has multiple rows. The way this database is structured allows us to capture historical data. That way, data analysts can easily answer questions like: which customers are most affected by shipping delays? Analyzing the past is a common scenario in the IoT space, and if you want to predict future events, this historical data is the foundation for most of the machine learning work you may want to start.
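To make the difference concrete, here is a minimal sketch of what such a dimensional model could look like. The exact schema isn't shown in the presentation, so the table and column names below (DimCustomer, DimDate, FactOrder, OrderStatus, and so on) are illustrative assumptions only.

```sql
-- Hypothetical analytical schema: two dimensions plus a fact table that
-- keeps one row per order status change, so history is preserved.
CREATE TABLE DimCustomer (
    CustomerKey  INT PRIMARY KEY,
    CustomerName NVARCHAR(100),
    City         NVARCHAR(50)
);

CREATE TABLE DimDate (
    DateKey  INT PRIMARY KEY,   -- e.g. 20240115
    FullDate DATE,
    [Year]   INT,
    [Month]  INT
);

CREATE TABLE FactOrder (
    OrderKey      INT IDENTITY PRIMARY KEY,
    OrderId       INT,          -- the same order can appear several times
    CustomerKey   INT REFERENCES DimCustomer (CustomerKey),
    StatusDateKey INT REFERENCES DimDate (DateKey),
    OrderStatus   NVARCHAR(20), -- 'Created', 'Shipped', 'Delayed', ...
    Amount        DECIMAL(10, 2)
);
```

Because each status change adds a new row to FactOrder, a question like "which customers are most affected by shipping delays?" becomes a simple join and group-by across the dimensions.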

In order to move data from an operational database to an analytical database, we tend to use a process called ETL: extract, transform, and load. There are a lot of technical solutions for this, from SQL Server Integration Services to Azure Data Factory, or even custom scripts. This process is mainly done in batch mode during the night. It can be quite simple, just executing a bunch of select queries on your operational store and inserting the results into the analytical database, or it can be quite complex, with a lot of transformations and lookups.
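As a sketch of the simple end of that spectrum, the nightly batch below copies yesterday's order changes from a hypothetical operational table into the FactOrder table sketched earlier. The source table, column names, and lookup logic are assumptions made for the example, not the exact process used in the demos.

```sql
-- Nightly ETL batch: extract yesterday's order changes from the operational
-- store and load them into the analytical fact table.
INSERT INTO FactOrder (OrderId, CustomerKey, StatusDateKey, OrderStatus, Amount)
SELECT
    o.OrderId,
    c.CustomerKey,
    CONVERT(INT, FORMAT(o.StatusChangedAt, 'yyyyMMdd')) AS StatusDateKey,
    o.OrderStatus,
    o.Amount
FROM OperationalDb.dbo.Orders AS o
JOIN DimCustomer AS c
    ON c.CustomerName = o.CustomerName   -- dimension lookup
WHERE o.StatusChangedAt >= DATEADD(DAY, -1, CAST(GETDATE() AS DATE))
  AND o.StatusChangedAt <  CAST(GETDATE() AS DATE);
```

In practice, a tool like SQL Server Integration Services or Azure Data Factory would orchestrate this extract and load rather than a single cross-database query.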

Generally speaking, big data refers to systems that produce such a large and diverse quantity of data that traditional BI systems cannot analyze it. We define big data systems with three adjectives, called the three Vs. First, volume: it's simply the amount of data stored to be analyzed, and IoT solutions produce a high volume of data. Then, variety: the changing and unstructured nature of the data to be analyzed. As opposed to traditional data, where everything is well organized in table format, big data systems can analyze semi-structured data like JSON documents or unstructured data like image and video files; the wide range of existing IoT devices can produce an equally wide range of data. And last, velocity: the frequency at which your system generates data. In the IoT space, it can range from one metric per hour up to several data points per second.

As we saw earlier, most analytical databases are data warehouse databases. These databases are deployed on one and only one server, either physical or virtual. That means your analytical power is bounded by the limits of that server's resources, mainly compute, memory, and storage. Over time, you will have more historical data to dig into, you will ingest data at a higher pace, and maybe you will also have more data analysts who want to dig into this data. When you reach a certain point, the only option you have is to grow that server. Even if that is easier when the server is a virtual one, we all know that you cannot grow that kind of server indefinitely. With increasing data velocity, we had to come up with another architecture: the big data architecture.

In big data systems, we use a specific architecture. The first component is shared storage. Doubling the specs of a server can more than double its cost, so instead of using one very high-end server, we use several commodity servers grouped as a cluster, and we distribute the workload across them. To ensure coordination between these nodes, we need a control node. When you connect your analytical tool to the big data cluster, you usually open a connection to this control node. This node will parse your query, distribute the work among the compute nodes, and aggregate the results before sending them back to you.

In the future, if you need more power, you can simply add more compute nodes to your cluster. On the opposite side of things, some big data systems allow you to reduce the number of nodes if you have less analytical work to do. That's the case for solutions like Azure-hosted Hadoop (HDInsight) or Azure Synapse Analytics, the newest analytics solution. So there are multiple ways to store and process data for your analytical needs: whether you're using a data warehouse or big data systems, you can land your IoT data in your data landscape. Through the rest of this presentation, we will see different ways to achieve this.

When you start thinking about how to do it, two questions emerge. In most IoT architectures, IoT data goes through a central point, an IoT gateway like Azure IoT Hub, that can dispatch incoming state and telemetry to downstream storage. The first question you may ask yourself is: where can we store our IoT data? Then, your IoT data will probably need to be refined and structured in a specific way so you can run your analytics on it, so you will also need a place to process this IoT data. We've seen how IoT and big data share the same challenges regarding the type of data they handle.

We will now see how integrating IoT data as your new operational data source is facilitated by Azure solutions. There are three main challenges regarding data ingestion. First, we need to prepare the data so the analytics can find and use it. Then, we need to aggregate data to the right level: having millisecond resolution on a temperature sensor can be useful, but depending on what you're monitoring, it may not be the right fit. And finally, we may have to duplicate data into different, specialized storage and analysis systems to unlock specific analytics scenarios.

Now, let's have a look at this simple IoT architecture. On the left, we find our IoT gateway; on the right, our analytics and reporting tool, Power BI. Throughout this presentation, we will build upon this simple architecture by adding services and features. Let's start with raw storage. We will store all incoming telemetry data in its raw format directly into a data lake. This data will unlock any analytics and machine learning scenarios that may arise in the future.

On the other side, data analysts want harmonized data, a cooked meal if you will. That's why this data will also be processed and ingested into an analytical data store, here a data warehouse. To do this, we will use the Azure Stream Analytics service. As you will see in the next demo, with Stream Analytics you can easily process and store data into several data stores, including a SQL data warehouse, without having to deploy any complex service.

In very specific cases, some IoT data can also be stored in a classical SQL database; this is also something you can achieve with Stream Analytics. Because cooked data often loses the golden nuggets that our data scientists are looking for, our current architecture, storing both raw and prepared data, enables a wide spectrum of analytical needs. In this demo, we'll see how you can easily ingest IoT data into a data lake, and process it so it can be stored in the data warehouse.

Now, we will see how you can output all your incoming IoT messages to the data lake. This is actually a feature directly embedded within IoT Hub. To do this, you go to the portal and scroll down to message routing. What you will see here is that you can add custom endpoints, so you can send all the incoming messages directly from IoT Hub to several destinations, including Event Hubs, Service Bus queues and topics, and storage. So we'll go there, click Add, and select Storage. Then we need to name the endpoint; let's say a data lake endpoint. And then we need to pick a container in an existing Azure storage account. You can actually send your messages to a classic storage account or to a data lake.

For this demo, we will choose the data lake, and I will create a new container for the IoT raw data. I go ahead and click Create, and then select my newly created container. Now, you have several settings. The first ones are the batch frequency and the chunk size window: IoT Hub keeps a small buffer of all the incoming messages, and either when the batch frequency is reached, or when the chunk window's maximum file size is reached, IoT Hub will actually write to the data lake. I will leave the default settings and move on. The next thing is the encoding. IoT Hub can create two types of files within your data lake. The first one is the JSON file format, which is easy to manipulate with a lot of different services and even your own code. The second one is Avro, a specific format widely used in big data environments, alongside the Parquet format. And the last thing we need to set up is the file name format.

If you want to change the way your files are organized within your data lake, you can do so there. I will select JSON and click Create. The custom endpoint is now created, but I actually need to do an extra step to route all my events to the data lake. For this, I need to go back to the Routes tab, where it's clearly indicated that you need to add a route to direct all your messages to your newly created custom endpoint. I do this by clicking Add, typing a name for the route to the data lake, and selecting an existing endpoint; as you can see, I can find my new custom endpoint here, along with the data source.

So, what kind of data do I want to store within my data lake? By default, I will select the device telemetry messages, which are all the messages sent from the devices to IoT Hub. But you can also, if you need to, route the device lifecycle events, everything happening in the lifecycle of the IoT device, so you can analyze them as well. And I will make sure that I select 'Enable the route', so the route will be active as soon as I click Save. One of the cool things is that you can also have a routing query, which means you can decide on specific criteria, like the type of device or a custom property set on the device, if you want to route these messages to a specific endpoint.
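For reference, a routing query is a small SQL-like condition evaluated against each message's properties and body. The names below (a deviceType application property and a co2 field in a JSON body) are hypothetical, purely to illustrate the syntax.

```sql
-- Hypothetical routing query: only route air-quality telemetry
-- whose reported CO2 value is above a threshold.
deviceType = 'air-quality' AND $body.co2 > 1000
```

Note that routing on the message body requires the device to declare a JSON content type and UTF-8 encoding on its messages.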

For the demo, I will just leave the value 'true' here, so all the messages are sent to my custom endpoint. I just have to click Save, and the route is added. We just saw how you can easily store all the incoming IoT data within the data lake. Now, we'll see how you can store the same data within a data warehouse. But before that, we need to take one thing into account, which is data volume. IoT devices can output several events per second, and this number can grow depending on the type of IoT devices and also their number. However, you don't want to store all these events within your data warehouse; you could definitely overload it within minutes. What we want most of the time is an aggregated view of this data within the data warehouse. We will see how we can use Azure Stream Analytics to store aggregated data directly from IoT Hub into your existing data warehouse.

We are back in the Azure portal, on a Stream Analytics job. To create a Stream Analytics job, you really need at least three different things. The first thing is an input: it's basically where you get your streaming data from. But as you can see here, I've also added another kind of input, reference data, and we'll see why I'm using two different sources of data within this Stream Analytics job. The next thing you need is an output; here, I've already created a data warehouse output. And last but not least, your query. It's basically a SQL query that allows you to aggregate the data from your different sources and then write the results to your output.

As you can see here, you can easily preview the data from your different inputs, but also test your query on real data, or on sample data you can upload directly into the portal. In this specific query, I'm doing very simple things: I'm averaging my sensor data, and I'm also combining my actual sensor data with device reference data. It's a common scenario in IoT where the message you receive just contains a technical ID, but within your data warehouse you actually need to know that the sensor is set up in this specific room of this specific building. That's exactly why I'm doing a join between the raw data from IoT Hub and my reference data held in a SQL database. The last thing I'm doing here is using the Stream Analytics windowing functions to aggregate all this incoming data over a period of two minutes.
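The exact query isn't readable in the recording, but a minimal sketch of this pattern could look like the following. The input, output, and column names (iothub-input, sensor-reference, dw-output, deviceId, co2, eventTime) are assumptions made for illustration.

```sql
-- Enrich telemetry with reference data and aggregate it over two-minute
-- tumbling windows before writing it to the data warehouse output.
SELECT
    ref.Building,
    ref.Room,
    AVG(tel.co2)       AS AvgCo2,
    System.Timestamp() AS WindowEnd
INTO [dw-output]
FROM [iothub-input] AS tel TIMESTAMP BY tel.eventTime
JOIN [sensor-reference] AS ref
    ON tel.deviceId = ref.DeviceId
GROUP BY ref.Building, ref.Room, TumblingWindow(minute, 2)
```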

So we only add one entry to the data warehouse every two minutes. We just saw that integrating IoT data into your data analytics landscape is possible. Now, let's see some concrete examples of how to get insights and actions from your IoT data. One of the most straightforward ways to use your IoT data is to augment existing dashboards with it. Here you can see a Power BI report made by the IoT department of the city of Contoso.

On top of several traditional data sources, you now have, at the bottom of the report, new data coming directly from an air quality IoT deployment. Let's dig into how this report is created. The IoT data comes from two distinct data sources. Air quality alerts are stored in an Azure SQL database by a Stream Analytics job; this job only outputs a new row in the Azure SQL database when one of the sensor values goes above a threshold. The graph on the bottom right is a bit different: it's a graph of the sensor readings with an interval of five minutes.

Here, all the processing is actually done in Time Series Insights; we are just exposing this data within Power BI. And as you can see, you can still have all the classic data sources of a Power BI report, like here, a PostgreSQL database. In this demo, we will see how you can use IoT data in Power BI to unlock new insights. When your IoT data is already ingested in your data warehouse, displaying this information within Power BI is very easy. You just need to click on Get Data, select SQL Server, and then type the name of your server and the name of the data warehouse. By clicking OK, Power BI will connect to your data warehouse, and then you can see all the tables and load this information into Power BI. I've already done this previously for this demo.

So I've already loaded the air quality measurements from my data warehouse and displayed the information in the graph on the lower left part of the screen. I've also connected to another simple SQL data source, from the business software managing the room occupancy. As you can see in this graph, I have the room occupancy for a specific room, and also the air quality for that same room. If we look at the room occupancy, the data looks as expected: we have no occupancy at all during the evening and the night, and normal occupancy during the day, even if Wednesday seems to be a higher occupancy day. If we look at the air quality graph, again we see a pattern that we should expect, which is lower air quality during the day than during the night, and again, on Wednesday, a slightly higher CO2 level. But one of the great things you can extract from IoT data is insights. With a few clicks in Power BI, you can actually correlate your existing operational data with IoT data to uncover new insights.

Let's try to do this with these two datasets. I will first add a new graph here, where I again add the data from the IoT system: I use the date and display the CO2 on this graph, so it's basically the same as the one below. Now, what I will do is add, on the same graph, the room occupancy data coming from my operational database and see what happens. So here we go: I take room occupancy and drag it onto the secondary values.

Now I have the two graphs together, and I can see some interesting patterns. As you can see, we would expect that during the night, when the occupancy is zero, the CO2 level should decline, but it's actually increasing on Tuesday evening. Another pattern we can see is that on Wednesday evening we should see a decrease in the CO2 level, but it's actually increasing a bit. These patterns could not be identified by looking at the IoT data and your existing operational data separately, but because we've put these two datasets together, we can uncover some insights. Here, maybe we need to go check out that room; maybe there's an issue with the air recycling system.

Now, in this architecture, all our reporting is based on stream data ingested all day long via Azure Stream Analytics. In some architectures, we also see this processing happening in batch mode, once a day. You can implement it with an ETL process over the raw data stored in the data lake, with tools like Azure Data Factory or the data pipeline features of Synapse Analytics. Data stored within the data lake can then be organized. In various scenarios, you can also use this architecture to refine raw data into a more structured and enriched form, also stored in the data lake. In this demo, we will see how you can easily query historical data with Azure Synapse Analytics. Let's go back to our air quality scenario.

We saw in the first demo that our data is stored in two places: all the data is stored as is within the data lake, and the aggregated data is stored in the data warehouse. Now, imagine the air quality standards are about to change: from an hourly average maximum value, we're switching to a single-measurement maximum value, and the threshold value is changing as well. So we need to use the raw data stored in the data lake to answer that question. However, that data is stored in a raw form. We will see how we can use Azure Synapse Analytics to easily tap into that raw data.

Let's jump into the Azure Synapse Analytics workspace. From there, you can simply create a SQL script where you can tap into the data stored directly within your data lake. With this first query, I'm just reading a bunch of JSON files and getting the raw data as one document per line. You don't see it here, but I'm actually tapping into a huge number of directories. The data is still unstructured, so I can start using the JSON parsing features of the SQL language to easily process these JSON files and get a nicely formatted table. From there, I can easily build a query where I do my filtering, like what I'm doing here. And as you can see, I'm able to execute a query over a huge number of files, tapping into all the historical data I have within my data lake, within seconds.
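The script itself isn't legible on screen, but a minimal sketch of this pattern with Synapse serverless SQL could look like the following. The storage URL, container name, and JSON field names are assumptions for illustration.

```sql
-- Read the raw JSON lines written by IoT Hub into the data lake, parse the
-- fields we need, and filter on the new single-measurement threshold.
SELECT
    JSON_VALUE(doc, '$.deviceId')                      AS DeviceId,
    CAST(JSON_VALUE(doc, '$.ozone') AS FLOAT)          AS Ozone,
    CAST(JSON_VALUE(doc, '$.eventTime') AS DATETIME2)  AS EventTime
FROM OPENROWSET(
        BULK 'https://contosodatalake.dfs.core.windows.net/iot-raw-data/**',
        FORMAT = 'CSV',
        FIELDTERMINATOR = '0x0b',
        FIELDQUOTE = '0x0b'
    ) WITH (doc NVARCHAR(MAX)) AS rows
WHERE CAST(JSON_VALUE(doc, '$.ozone') AS FLOAT) > 100;
```

The CSV format with an unused field terminator is simply a common trick to read each JSON document as a single text column before parsing it with JSON_VALUE.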

Here, in this demonstration, I'm using the SQL on-demand, or serverless, offering, which means that I'm not provisioning any specific hardware before executing my query; I'm basically billed for the query itself. That's why it may take a little bit of time to actually execute that query. But if your usage is more constant and you need predictable performance, you can provision your own dedicated SQL pool within Synapse Analytics. Here, you can see that my new threshold has only been crossed three times in the past year, on two different devices.

We just saw that Azure Synapse Analytics can easily process raw data from a data lake, but it can also be used to store data. In fact, all the demos you've seen so far don't use a standalone Azure SQL data warehouse, but Azure Synapse Analytics as a whole. Using it in your architectures allows you to reuse your existing data warehouse skillsets and apply them at a whole new level. Data engineers are mostly asked to work on historical data, but with Power BI and Azure, they can also easily work on real-time analytics. By using Stream Analytics, you can process, aggregate, and filter data coming from IoT Hub in real time and output this data to several destinations. For example, you can use the streaming data tile within a Power BI dashboard to display this data in real time. No code, no extra service to manage: you just have to write a SQL query.

In this demo, we'll see how we can create a real-time dashboard with Power BI. In order to create a real-time data source for Power BI, we need to jump back to the Azure portal, into our Stream Analytics job, and create a new output. When you click on the Outputs tab, you can add several outputs to the same existing job, and here we can select Power BI. The first thing we need to do is authorize the Azure subscription to access your Power BI workspace. Once connected, you need to set an output alias that you will use within your query, for example, 'powerbi'. Then you need to set a dataset name, we can say 'real-time IoT', and a table name. Then, click Save. To actually see this new data source within Power BI, we first need to start the Stream Analytics job.
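For context, the part of the Stream Analytics query that feeds this output could look like the sketch below, which emits one aggregated row every five seconds to the 'powerbi' alias. The input and column names are assumptions.

```sql
-- Send a rolling ozone average to the Power BI streaming dataset
-- every five seconds, for the dashboard tile.
SELECT
    AVG(ozone)         AS AvgOzone,
    System.Timestamp() AS WindowEnd
INTO [powerbi]
FROM [iothub-input] TIMESTAMP BY eventTime
GROUP BY TumblingWindow(second, 5)
```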

Now, let's switch to Power BI. Go to your workspace and, under New, click Dashboard. Type a dashboard name, like 'real-time air quality'. Within the dashboard editor, you can click on Edit, Add a tile, and scroll down to Custom Streaming Data. Click Next, and then you will see the new dataset configured within Azure Stream Analytics, our live IoT data. I select it and click Next. Now, I can customize the visualization. I will leave the default visualization type, Card, and add a field value here. As you can see, I can pick any of the output columns from my SQL query, and here, for example, I will output the average ozone value.

I click Next, leave the title as it is, and click Apply. And as you can see, every five seconds the average ozone value actually changes. Hopefully, yes, we can see the value has changed. It's not exactly real time, because within the Azure Stream Analytics query we used a windowing function with an output every five seconds; this allows us to get near real-time data into Power BI. Azure Stream Analytics can process your IoT data in real time, both for your analytics store and for real-time dashboard needs. But why limit yourself to capturing and displaying data? With Stream Analytics, you're only one step away from executing your own code or, even better, a workflow. Here, the same Azure Stream Analytics job is also launching the execution of an Azure Logic App, a no-code, serverless workflow service within Azure, allowing us to execute actions in a wide range of applications, like sending a message to a specific channel in Teams.

In this demo, we will see how to trigger real-time actions with Azure Stream Analytics and Logic Apps. To send real-time updates to Microsoft Teams, we will actually use two Azure products. The first one is Azure Stream Analytics, like we saw in the previous demo; we will use it to get data from IoT Hub and then launch a new workflow execution with Logic Apps. The second is Logic Apps itself.

Let's start with Stream Analytics. I've created a simple job with one input, the stream data from IoT Hub, and one output. There's no Logic Apps output in Stream Analytics, so I'm using a Service Bus queue between my Stream Analytics job and my Logic App. And then, a simple query, getting all the data from IoT Hub, computing the average, maximum, and minimum values over a two-minute time window, and then filtering: I have some criteria here, a hard-coded maximum ozone threshold. However, as we saw in the first demo, I could have used reference data instead and made that threshold a bit more dynamic.
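A minimal sketch of that alerting query could look like the following; the input and output aliases and the threshold value are assumptions for illustration.

```sql
-- Aggregate ozone readings over two-minute windows and only emit a message
-- to the Service Bus queue when the hard-coded threshold is exceeded.
SELECT
    deviceId,
    AVG(ozone)         AS AvgOzone,
    MAX(ozone)         AS MaxOzone,
    MIN(ozone)         AS MinOzone,
    System.Timestamp() AS WindowEnd
INTO [alerts-queue]
FROM [iothub-input] TIMESTAMP BY eventTime
GROUP BY deviceId, TumblingWindow(minute, 2)
HAVING AVG(ozone) > 100
```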

Let's switch to the Logic Apps designer. I've just created a brand new Logic App, and what we can do here is start with a trigger; in my case, when a new message is received in a Service Bus queue. I connect to one of my Service Bus queues, the one receiving the alerts, and I will check for a new message every minute. Once this is done, I can add a new step to process this message. The Logic Apps designer can connect to a huge number of online services, and you can search for them directly here. As I've already used the Microsoft Teams connector in the past, it's showing up in the recent connectors, so let's go ahead and click on Microsoft Teams. Here, you can see a list of actions you can take within Microsoft Teams directly from Logic Apps.

There are several ways to send a message. In our case, we'll post what we call an adaptive card, which is basically a way to send rich graphical information as a card directly within a Teams conversation. I will go ahead and select 'Post your own adaptive card as the Flow bot to a channel'. I can now pick my team, and then select the channel; here, I've created a dedicated channel for these incoming alerts. I can now add a new parameter to actually output the message, and here I will copy-paste the content of the adaptive card, which is basically a JSON document. I can then replace some values with values coming directly from the incoming message, click Save, and my Logic App is ready.

Now, we can switch to Microsoft Teams and see our alerts coming in in real time. As you can see, you can personalize the whole display of the card and show any information you want. We now have a complete IoT architecture with analytics solutions. As you can see, we've added several services, like Time Series Insights to analyze time-stamped data, but we've also heavily reused Azure Stream Analytics to enable both real-time analytics and actions within Power BI and Microsoft Teams, as well as a historical data warehouse with Azure Synapse Analytics. Actually, when you have this kind of architecture, with a hot path managing real-time messaging and alerts and a cold path enabling historical analysis, this is what we call a Lambda architecture. You can find more information about the Lambda architecture in the documentation linked at the end of this presentation.

What we saw today in this presentation regarding analytics is only the tip of the iceberg. When you're doing IoT, you can also do some analytics at the edge, thanks to Azure SQL Edge, a lightweight version of Azure SQL, and to Azure Stream Analytics running on edge devices. You can enable a wide range of analytical scenarios, both in the cloud and at the edge.

For links to the relevant documentation, resources, and demos used in this presentation, check out aka.ms/iot40/resources. If you're interested in the materials for this video recording, they can be found on GitHub at aka.ms/iot40. If you enjoyed the session and are interested in the other topics covered in the IoT learning path, you can find them all at aka.ms/iotlp. We covered quite a few topics in this session, and I would like to remind you that we have created a collection of modules on the Microsoft Learn platform which pertain to the topics in this presentation. These allow you to interactively learn how to implement an architecture with IoT Hub, create and use a data warehouse with Azure Synapse Analytics, or explore and organize time-stamped data with Time Series Insights. Go check out this collection at aka.ms/iot40/learn. This presentation and the Learn modules can help guide you on a path to official certification. If you're interested in obtaining an accreditation that can help you stand out as a certified Microsoft Azure IoT developer, we recommend looking at the AZ-220 certification. You can find details on the topics covered and schedule an exam today at aka.ms/iot40/certification. You can also find related data analytics and data management certifications from Microsoft. Thank you again for attending this session. Cheers.

About the Author

This open-source content has been provided by Microsoft Learn under the Creative Commons public license. This content is subject to copyright.