In this course, we're going to review the features, concepts, and requirements that are necessary for designing data flows and how to implement them in Microsoft Azure. We’re also going to cover the basics of data flows, common data flow scenarios, and what all is involved in designing a typical data flow.
Learning Objectives
- Understand key components that are available in Azure that can be used to design and deploy data flows
- Know how the components fit together
Intended Audience
This course is intended for IT professionals who are interested in earning Azure certification and for those who need to work with data flows in Azure.
Prerequisites
To get the most from this course, you should have at least a basic understanding of data flows and what they are used for.
All right, welcome back. So, now that we have our pipeline and data flows defined, what we're going to do is build the data flow out so that it takes movies listed in a moviesdb.csv file. This file is stored in my storage account and what it's going to do is take the information from that CSV and it's going to aggregate the ratings of comedies between certain years, and then it's going to write that data back into the storage. Now, before we get started there, what we'll do is we'll bounce back over here. And what I want to do is go into the storage account. Right here, this's my storage 1872. This is where I have this CSV file already stored.
I don't want to waste your time walking you through how to create a storage account and how to upload a file to blob storage, that seems kind of pointless. So, what we're going to do here is open up my storage and we'll open the containers here. Now within our storage account here, we do have a container called sample data and this sample data naming convention matches the tutorial we were working through from Microsoft's site. If we open up sample data, you can see here, we have a moviesdb.csv file. Basically, it's just a list of movies in a comma-separated list. So, our data flow is going to take the information from this CSV and pour it in and do some transformations.
So, let's bounce back out to my data factory here. And now you'll notice here, we have an option to add a source for our data flow and that's what we're going to do here. We'll click 'Add source' and we'll call our source moviesdb, just to reflect the name of the CSV. Now, what we're going to do for dataset, we don't have a dataset here, so we need to create a new dataset, and this is the source for our data flow. So, we'll click 'New'. And for this exercise we're going to select the Azure Data Lake storage gen2 option and we'll continue. And now since it's a CSV, this is the format type of our data, it's a delimited text CSV, so we'll select delimited text and we'll next it or continue, I should say. And what we'll do here is we need to provide a name for our dataset. So, we'll call it moviesdb again to reflect what we're working with. And then what we'll do for the linked service is create a new linked service.
And since we're working with Azure Data Lake storage gen2, we'll just call this adls gen2. Now, in this demonstration and the underlying tutorial we're working through, Microsoft tells us that we're using the account key method for this tutorial. So, we're going to leave the authentication method set to account key. Now, we're working in the Burke's batteries subscription, and then we have to select the storage account name we were working with and this is where our CSV file lives. And if you remember back it was my storage 1872, and what we can do here is test the connection. It tells us it's successful. So, we'll go ahead and create the new linked service. Now at this point we need to define where we're pulling this information from.
Basically, we need to provide the path to our CSV file. Now this file path is going to be in the context of our container, and remember the container was called sample data in our storage account. So, basically we're going to go into sample data. We didn't have a directory, the file itself, the CSV. If you remember, if we bounce over here was just moviesdb.csv and it was right inside sample data. So, we leave directory empty and we provide the file name. Now this CSV file does have a header row in the file. And what I'll do here is I'll just bounce down to the actual file so you can see what it looks like here. So, this CSV has a movie title, genres, year, rating, rotten tomato. So, it has this different information in this CSV.
So, it does have a header row here. So, what we're going to do is check the box for first row as header and then we're going to import the schema from the connection or the store, which is basically our CSV file. So, we'll go ahead and 'OK'. Okay, so now what we can do now that we've created this dataset, which is what we were doing up here, we can select data preview and what this will do is provide me with a snapshot of my information. So, we'll go ahead and refresh and we can see our data here. So, what this does is verifies that our transformation is properly configured. So, now that we know we have some good transformation going, we could start adding some filters.
And to do that, we go over next to our moviesdb here and if we click the plus here, this allows us to add a transformation and that transformation is going to be a filter. So, I can search for filter or I can scroll down and find it. For this demo, we'll just do fil. Now we can see we have the filter row modifier. Now, what this is going to do is filter on years. So, we'll call it FilterYears. Now what we're going to do is filter on the incoming moviesdb stream and then we have to tell it what we want to filter on. Now, I'm not going to type this whole filter out because basically it's a reject's command, but what I'm going to do is paste it in and I'll explain what we're doing here, and what we'll do is we'll open the expression builder.
And there we go. And this expression is actually pretty self explanatory. You're looking for movies that were created or published I guess between 1910 and 2000 and that their genres include comedy or are like comedy. Basically, that's the equivalent of direct, it's a comedy. So, you use the two integer year expression greater than or equal to 1910. You're joining that with... You're joining that with the two integer year, less than or equal to 2000 and then you're joining that with genres of comedy. Now before we save and finish this, we can actually refresh this here for a data preview. And what this will do is fetch the data and tell us if our expression is working the way it's supposed to. All right,
so since we now know the expression is working, we'll go ahead and finish it. And then what we can do is click 'Data Preview' to make sure it actually pulls only the information we're looking for. So, we'll go ahead and refresh and now you can see we're only pulling movies between the years we defined and movies that have comedy listed as a genre. So with that, let's take a break and I'll see you over in the next demonstration where we'll pick up where we left off.
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long an interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.