Using Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
ADLS is designed for big data analytics in a Hadoop environment. It is compatible with Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.
Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.
In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimize I/O.
- Get data into and out of ADL Store
- Use the five layers of security to protect data in ADL Store
- Use ADL Analytics to process data in a data lake
- Troubleshoot errors in ADL Analytics jobs
- Anyone interested in Azure’s big data analytics services
- Database experience
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 37 minutes of high-definition video
- Many hands-on demos
The github repository for this course is at https://github.com/cloudacademy/azure-data-lake.
Once you’ve loaded some data into your data lake and secured it, you can start analyzing and processing it. One way to do this is to connect the data lake to a Hadoop cluster running on HDInsight. Data Lake Store is compatible with the Hadoop Distributed File System (or HDFS), so any Hadoop applications that use HDFS can use Data Lake Store without any modification. If you’re writing a new application or are able to modify an existing one, then you can get better performance by using the new Azure Data Lake Filesystem.
You don’t have to use Hadoop to analyze your data, though. You can also use Azure Data Lake Analytics. It’s a service that automatically provisions resources to run data processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It doesn’t use a standard Hadoop implementation, though. It uses a combination of Hadoop and internal Microsoft technology. It also uses a different language called U-SQL. It’s a combination of SQL and C#.
Let’s do a simple processing task to see how all of this works. Go to the Azure portal and click New, “Data + Analytics”, then “Data Lake Analytics”. The name has to be globally unique again. I’ll call mine “dlaguy”. Change the subscription if you need to. For the resource group, you can use the one you created in the last lesson. Then set your location to the same one you used for your data lake store. That’s not mandatory, but it will give you the best performance. Now click “Data Lake Store” and select the data lake you created before. Leave the pricing package on Pay-as-You-Go. Then check “Pin to dashboard” and click Create. I’ll fast forward to when it’s done.
Alright, now remember the search log we uploaded? We’re going to run a U-SQL job that will select the timestamp and the search string for all records that have en-gb for the language.
Click “New job”. You can call it whatever you want. Let’s go with “en-gb searches”. Now copy and paste this code from the github readme for this course.
In a U-SQL script, you create what are called RowSets. A RowSet, of course, is just a set of rows. In this script, @searchlog and @out are RowSets. @searchlog is created by extracting these seven columns from the SearchLog.tsv file. That’s actually all of the columns in the file. It didn’t have any headers, so we’ve named the columns here. We’ve also specified the datatype for each column. The final part of the EXTRACT statement is “USING Extractors.Tsv”. This is a built-in U-SQL function. It knows how to parse the data in a TSV file, so you don’t have to write any code to do that.
Then we create a RowSet called @out that selects the Time and Query columns from the @searchlog RowSet that was created above. We also add a WHERE clause that says to only select the rows where the language is en-gb.
Finally, we output the @out RowSet to a file called SearchLog_output.tsv using another built-in function called Outputters.Tsv.
OK, now click Submit. On the left, you can see what’s happening with the job. It’s preparing right now. I’ll fast forward. Now it’s done.
Here it give some details on the job. For example, it shows the estimated cost of the job. This was such a small job that it didn’t cost anything. Nice! It also shows the total duration of the job here.
Check out the graph. It shows what the job did. It used SearchLog.tsv as the data source. Then it extracted 2 rows. Vertex essentially means piece of work. The extract task was so small that it didn’t need to divide it into multiple parallel pieces of work. The PodAggregate step normally merges the results from multiple workers into one. In this case, there was only one worker, so there wasn’t much to do here. Finally, it saved the results in SearchLog_output.tsv.
We can take a look at the results in a couple of different ways. The easiest way is to just click on the output file in the job graph. It looks like the job did what it was supposed to. It extracted the timestamp and the query. We didn’t extract the language column, so we can’t easily tell whether it extracted the records with a language of en-gb, but we can double-check by looking at the input data again.
The other way to see the data is to go to the Data tab. There’s the input file. These are the two records it extracted and the language for both is en-gb.
If you want to copy the output to somewhere else, there are a few options. The easiest is to download it to your desktop. Go back to the output file. Now click Download. Then you have to click “Start download now”.
That was easy, but if you need to copy the output to anywhere else, it takes more work. One common requirement is to copy from Data Lake Store to SQL Data Warehouse. Fortunately, you can do that by creating an external data source in SDW and using Polybase to load the data from ADLS.
To export data to Azure SQL Database or Azure Storage Blobs, you can use the Hadoop tools on an HDInsight cluster. Apache Sqoop lets you copy data to and from SQL Database and Apache DistCp lets you copy data to and from Storage Blobs.
If you want to copy to a different service, such as Cosmos DB, then you’ll have to use Azure Data Factory. It’s a very flexible data pipeline tool, as you can see from this list of supported sources and sinks (that is, destinations).
It takes significantly more work, though. For example, here’s a Python script that copies data from one folder to another in Azure Blob Storage. It also performs some other activities related to the copy, but it’s a pretty long script regardless. So if you need to copy from Data Lake Store to another service, it’ll take a while to set up the first time.
And that’s it for this lesson.
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).