Designing Data Flows in Azure
Data Flow Basics
Designing a Data Flow Solution
The course is part of this learning path
This Designing Data Flows in Azure course will enable you to implement the best practices for data flows in your own team. Starting from the basics, you will learn how data flows work from beginning to end. Though we do recommend an idea of what data flows are and how they are used, this course contains some demonstration lectures to really make sure you have got to grips with the concept. By better understanding the key components available in Azure to design and deploy efficient data flows, you will be allowing your organization to reap the benefits.
This course is made up of 19 comprehensive lectures including an overview, demonstrations, and a conclusion.
- Review the features, concepts, and requirements that are necessary for designing data flows
- Learn the basic principles of data flows and common data flow scenarios
- Understand how to implement data flows within Microsoft Azure
- IT professionals who are interested in obtaining an Azure certification
- Those looking to implement data flows within their organizations
- A basic understanding of data flows and their uses
Related Training Content
For more training content related to this course, visit our dedicated MS Azure Content Training Library.
With our Data Factory deployed, we can launch the user interface and create an actual pipeline. What this pipeline will ultimately do is take data from our Blob Storage and copy that data into a SQL database. So let's get started on the pipeline creation. To launch the Data Factory user interface, click on Author & Monitor from the Data Factory itself. What this will do is open the user interface in a separate tab. Now remember, you need to do this in either the Edge browser or in Google Chrome. As you can see here, our user interface has popped up, and it's actually a pretty sleek looking tool here. So to begin the actual pipeline creation, we'll click Create pipeline. Now, as you can see here, there's a lot going on on the screen. My personal opinion is that the Data Factory UI is a little bit too busy. They could probably break this stuff out a little bit better but it's what we have right now. So let's just go with it. The first thing we're going to do here in this General tab down here is give our pipeline a name. So we're going to give our pipeline a name. We'll call it CopyFromBlob. So our pipeline itself is called CopyFromBlobtoSQL. Now, what we need to do is specify an activity that this pipeline is going to perform. So we're gonna go under Move & Transform here, unless this has changed, there we go, and we're going to take the Copy Data activity and drag it into our pipeline designer surface here. After dragging the Copy Data activity to our pipeline designer surface, we need to give our activity a name. So what I'm going to do is I'm going to call this ActivityBlobToSQL. So now what we have is we have a pipeline called CopyFromBlobToSQL. And then what I've done is I named the actual activity ActivityBlobToSQL so we can differentiate what we're working with throughout these next few demonstrations. After providing a name for our copy activity, we need to provide a source and a sink. The source is where our data is going to come from, the sink is where we're going to copy it to.
The source is essentially going to be our blob storage while the sink will be the SQL database. So to configure our source, we're going to click on the Source tab here, and before we do that, you can see here Source has a little one superscript as does Sink. And this is telling us there's some activity that we need to do in this section. So we'll go ahead and click Source. And what we're going to do here is we're going to create a new dataset. That dataset is going to be Blob Storage because we're pouring from Slob Storage. So we'll select Azure Blob Storage here and then click Finish. What this does is open a new tab for the Blob dataset. In the General section here, we need to give the dataset a name. So we'll call this SourceBlob. After providing a name for our dataset, we can see here that connection is asking for some information. So let's click Connection here. In the Connection window, we're going to create a linked service. Essentially what this is going to do is link the pipeline to our actual blob storage down to the file that we're going to pull into our SQL database. So to do this, we're going to click New to create a new linked service, and then we have go give our new linked service a name. So we'll just call this MyLinkedService. After providing a name we need to select our storage account. And I'm using mydatasource9878 for this demo. And then once we've selected our storage account we can click Test connection. After we confirm that we can connect to our storage account we can click finish. Now what this is going to do is deploy this linked service that allows Data Factory to talk back to our actual storage account. Once the linked service is created, we're taken back to the page for our dataset settings. From here we're going to specify the file that we're going to pull into SQL. To do that, we click Browse next to File path. You can see here it's blank right now. We're going to fill this in by browsing. So click Browse here and then we'll double-click my data which was the container, the input which was the folder and then the data.txt file. After doing so, it fills in this information. Now, once we've specified this data.txt file, we need to make sure that Data Factory is seeing the file for what it is.
So what we wanna do is confirm that it sees the correct file format. To do that we just scroll down here and we can see under File Format Settings that it indeed sees it as text format and sees it as comma delimited. Now, if the source file uses different rows or columns, what we could do is click on Detect Text Format. What this would do is cause the Copy Data tool to detect the file format and delimiters again. If it still doesn't see them correctly, we could also override them. In this case, we're good. So what we'll do here is we'll preview our data to make sure it's reading the file. And on your screen here, you'll see there are two rows in our file, Tom Mitchell, and Jen Smith with two properties. We'll close the data preview and next what we need to do is import the schema. So we'll select Schema and then we'll click Import Schema. And here again we'll see that the application has detected two columns in the source data file. Now, the reason we're importing the schema is so that we can map columns from the source data to the sink data which is essentially the database. So now that we've imported our schema, what we can do is hop back over to our pipeline and then browse to the Source tab. What we wanna do is confirm that our source dataset is still our SourceBlob, which it is, and then what we'll do is we'll preview our data here to make sure everything still looks good, and it is. So we'll close this out. And then next we'll configure the sink. The sink is essentially the destination where we're pouring this data to. To do this, we click on Sink, and again here you'll notice a little superscript here, one, telling us that there's some activity that needs to be done. And then once we're in the sink tab, we'll click new to create a new sink dataset. Now, our sink dataset is going to be an Azure SQL database.
So we'll just search for SQL. Now, as you can see here Azure SQL Database is an option so we'll select it and then click Finish. As was the case with all the other resources that we've deployed so far, we're asked for a name. So we're going to call this SQL dataset after providing our name or go into the Connection tab to create the linked service that allows the SQL to bring that data in. As we did previously for our source, we'll create our new linked service for our destination by clicking New. Now what this new linked service is going to do is it's going to provide the connection string that Data Factory is going to use to connect to the SQL database at runtime. So for this new linked service, we'll give it a name and we'll call this SQLDBLinkedSvc. After giving it a name, we have to select our SQL server that we've deployed and here's SQL server 9878. This is the server that hosts the database that we're pouring the data into. After selecting our server name, we select the actual database. Next we have to specify an authentication type. You have a couple of different options here, Service Principal, Managed Identity or SQL Authentication depending on what you're using for your authentication. In our demo here we're using SQL authentication so we'll provide the user name and password to connect to the SQL server. Once we provide it our user name and password, we can test the connection to ensure good connectivity. And we can see here that the connection was successful. So we can go ahead and click Finish to complete the creation of our new linked service. With the linked service created, we have to select the actual table where we're going to pull this data into. So we've got the SQLDBLinkedSvc that specifies the SQL server as well as the SQL database. We now have to tell it what table to use within that database. So in the dropdown box for Table, we select our [dbo].[emp]. The [dbo].[emp] is a table that was created in the database as part of the setup for this demonstration.
After specifying the table, we then have to pull in the schema. So we'll import the schema just as we did on the source information. So what we're doing here is we're pulling the schema in from the database so we can match it up with the source information. Now you'll see here we pulled in three columns from our database. We pulled in an ID, a First Name and a Last Name. So these are the columns that exist in our SQL database that we're going to copy data to. Now the identity column here is a column in the SQL database that isn't going to receive any data from our source file. So what we're going to do here is we're going to delete this column. So what we're left with is First Name and Last Name. So what we're going to do next is configure our mapping. We're going to map the column names from our source data to those in our SQL database. If you remember back, the source data included two columns, Prop_0 and Prop_1. We're going to map those columns to First Name, Last Name in our database. To do that, we're going to go back to our pipeline tab and first we wanna make sure that our Sink Dataset is our SQL Dataset, which it is. There's no reason that would have changed. And then what we're going to do is click on the Mapping tab here. Then from here we'll import the schemas, then you'll see here that the mapping is seeing Prop_0 and Prop_1 from the initial source data and it also sees the First Name, Last Name. And what we're doing here is we're seeing that it automatically configures the mapping for us. And this is based on the order in which the fields were found.
So now that we've confirmed that the mapping is in place, what we wanna do is validate the pipeline. To do so, we simply click Validate. The message that we receive here tells us that our pipeline has been validated. It tells us that no errors were found. So it looks like our pipeline is good. If we click on the arrows up at the top here, we can see the code for our pipeline as well. With our pipeline validated, we can debug and publish the pipeline. So before publishing our pipeline we want to debug it. We wanna make sure everything runs as expected. Now, we know we've done the validation but we want to actually see the process run and make sure from beginning to end everything works as it's supposed to. So to debug the pipeline from within the pipeline tab, we click Debug. And what this does is provide the status of the pipeline in the Output tab down the bottom. It tells us it's in progress at this point. And what we can do is look at the glasses here for details. And you can see here it gives us some information about the duration of the actual process that was run as part of the pipeline. We can see the start time, the duration, how many files were read, how many rows were read and how many rows were written. So we can close this detail window and see that our debug was successful. Now once the pipeline runs successfully we can publish it. To publish the pipeline, we simply click Publish All in the top toolbar. Now what this is going to do is publish everything in the pipeline, the datasets, the pipelines, everything that we created in Data Factory. So we go ahead and click Publish All. You'll see here we receive a message saying it's published successfully and at this point our pipeline is active, it's live. Now there's two ways we can trigger the pipeline.
We can trigger it manually or we can trigger it on a schedule. What we're going to do here is trigger the pipeline manually. To trigger our pipeline manually, and essentially this is to test it to make sure it looks good in a production setting, to trigger this manually we're going to select Trigger on the toolbar here. And then we get two options, New or Trigger Now. We'll go ahead and trigger it now and it's going to tell us that this pipeline is going to run based on the last published configuration, which is fine because that's what we want. So we'll click Finish. And what this is going to do is trigger the pipeline to run manually. Now what we can do is select the Monitor button on the far left here to view the activity. We can see the Manual trigger was successful and it took 19 seconds. Now, under actions, you have a few different options. You can view the activity runs and you can rerun the actual trigger. If we click on the View Activity Runs, we'll see the activities that happened as part of this pipeline. We'll see there was a Copy activity type, when it started, the duration and it if succeeded or not. Then there's other options here, Input, Output and then we can view details. And the details essentially tell us what we were told before during the validation is that it's reading one file two rows and writing two rows. And then it gives us some information on the duration itself. So let's go back to our pipeline runs here and we can see we're good. So now that we know that the pipeline run was successful and that it created the two rows in our database, we can now trigger the pipeline on a schedule. So what we're going to do here is send a trigger to run our pipeline every minute until a specified end date and time. To create our schedule, we can click the Author button here or the little pencil and then from our pipeline tab, we can click Trigger. And instead of triggering now like we did when we manually run the pipeline, we're going to click New/Edit. And then from here it's to allow us to add a trigger.
So from the Choose trigger dropdown box we simply click New. We'll give our trigger a name and we'll call it EveryMinute and then we have our schedule type here. A scheduled trigger is a trigger that invokes a pipeline based on a typical wall clock schedule. Every so many minutes, every so many hours et cetera. A tumbling window trigger operates on a periodic interval. It also retains state. An event based trigger is essentially a trigger that responds to a specific event. We're going to use the schedule type now as you can see here, we specify the start date in UTC time. So this is essentially immediately. And for recurrence we're going to schedule it to run every minute. And then an End Date, we're going to specify. And what we'll do here is we'll select our day and then we'll schedule it just to end a few minutes from now and we'll apply it. Now once we specify the date and time here, we can see that it's activated. And what's going to happen here is we're going to activate this new trigger once it's published. So we'll click Next and before we click Finish here, we're getting a warning telling us, "Make sure to "Publish" for the trigger after clicking "Finish". So we'll Finish this and then from here we'll click Publish All. And this is going to deploy our changes to our Data Factory pipeline. So at this point, our schedule trigger is now configured. And as you can see here it's deploying.
So our changes have been published to our pipeline. So what we're going to do here is click on Monitor over on the left and then we'll refresh this every so often and we should see the new activity showing up within the Monitoring section here. And already we're seeing the first triggered pipeline run. And it tells us it's triggered by the Every Minute trigger. So as you can see, what we've essentially done is create a mini data flow using Azure Data Factory to pull data from Azure Blob Storage to a SQL database. We've configured that pipeline to pull in data manually and we've also configured it to run on a schedule.
About the Author
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long an interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.