Using Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
ADLS is designed for big data analytics in a Hadoop environment. It is compatible with Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.
Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.
In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimize I/O.
Learning Objectives
- Get data into and out of ADL Store
- Use the five layers of security to protect data in ADL Store
- Use ADL Analytics to process data in a data lake
- Troubleshoot errors in ADL Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Database experience
- SQL experience (recommended)
- Microsoft Azure account (recommended; sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 37 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake.
In the last lesson, the processing job ran without any problems, but that won’t always be the case. Finding and fixing problems is usually pretty easy, though, because the vast majority of compilation errors fall into three categories.
First, U-SQL keywords must be in upper-case. For example, if you type “select” in lower-case, you’ll get a compilation error. The second common problem is specifying an input file that doesn’t exist. This is usually due to mistyping the filename or path. The third is using an invalid C# expression.
Let’s intentionally generate an error to see what happens. Bring up the script from the job we ran. If you’re still viewing the output file, then close it, and select the Script tab. If you’re not still there, then go into your Data Lake Analytics account and click “View all jobs”. Then click on the job and go to the Script tab.
We can modify this script and run another job by clicking “Reuse script”. Let’s make FROM lower-case. Now click Submit.
It won’t generate the error immediately. It usually takes 5 or 10 seconds. There, it failed. Now there’s an error tab. It says there’s a syntax error on line 9. Down here, it puts three hash marks where it found the error in the script.
To fix it, click “Reuse script” again and change “from” back to upper-case. Then submit it again. There, it worked again.
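The upper-case keyword rule is simple enough to check before submitting a job. Here is a minimal sketch in Python, not part of the course and not an ADLA feature. The keyword list is a small illustrative subset, the function name is made up, and the sample script and its file paths are hypothetical stand-ins for the one in the demo. It is also naive: C# expressions inside a script can legitimately contain lower-case words such as "from" or "select", so treat any hit as a hint rather than a definite error.

```python
import re

# Illustrative subset only; U-SQL has many more keywords than these.
USQL_KEYWORDS = {"SELECT", "FROM", "WHERE", "EXTRACT", "USING", "OUTPUT"}

def find_lowercase_keywords(script):
    """Return (line_number, word) pairs for keywords not written in upper-case."""
    problems = []
    for line_number, line in enumerate(script.splitlines(), start=1):
        # Blank out double-quoted string literals so file paths are not flagged.
        code = re.sub(r'"[^"]*"', '""', line)
        for word in re.findall(r"[A-Za-z_]+", code):
            if word.upper() in USQL_KEYWORDS and word != word.upper():
                problems.append((line_number, word))
    return problems

# Hypothetical script with the same mistake as in the demo: a lower-case "from".
SAMPLE_SCRIPT = '''@searchlog =
    EXTRACT UserId int,
            Region string
    from "/input/SearchLog.tsv"
    USING Extractors.Tsv();

OUTPUT @searchlog
    TO "/output/SearchLog.csv"
    USING Outputters.Csv();
'''

if __name__ == "__main__":
    for line_number, word in find_lowercase_keywords(SAMPLE_SCRIPT):
        print(f"line {line_number}: '{word}' should be '{word.upper()}'")
```

Run against the sample, this reports the lower-case "from" on line 4. Once that word is changed back to upper-case, it reports nothing.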
OK, let’s create a more subtle problem. Instead of outputting the @out RowSet, let’s output @searchlog. That won’t give us what we want, but it should still compile and run, right? Well, let’s see.
OK, it failed. The error is “This statement is dead code.” It put the hash marks here. It’s saying that we created the @out RowSet, but we didn’t do anything with it. All of the RowSets you create have to eventually lead to an output. That is, you don’t have to output every RowSet, but each one has to be fed into another one until eventually the last one is output. In this case, we didn’t output @out and we didn’t feed it into another RowSet, so it’s dead code. I won’t bother fixing this one, but let’s have a look at something else.
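One way to picture the dead-code rule is as a reachability check: start from the RowSets that are output and walk backwards through the RowSets they read from. Anything you never reach is dead code. The Python sketch below illustrates that idea only; it is not how the U-SQL compiler is implemented. The function name is made up, and it assumes that @out is built from @searchlog, which matches the demo but isn't spelled out in this lesson.

```python
def find_dead_rowsets(dependencies, outputs):
    """Return the RowSets that never feed, directly or indirectly, into an output.

    dependencies maps each RowSet name to the list of RowSets it reads from.
    outputs lists the RowSets named in OUTPUT statements.
    """
    live = set()
    stack = list(outputs)
    while stack:
        rowset = stack.pop()
        if rowset in live:
            continue
        live.add(rowset)
        # Everything an output reads from is also live.
        stack.extend(dependencies.get(rowset, []))
    return sorted(set(dependencies) - live)

# Assumed shape of the demo script: @out is built from @searchlog.
DEMO_DEPENDENCIES = {"@searchlog": [], "@out": ["@searchlog"]}

if __name__ == "__main__":
    print(find_dead_rowsets(DEMO_DEPENDENCIES, ["@out"]))        # original script
    print(find_dead_rowsets(DEMO_DEPENDENCIES, ["@searchlog"]))  # modified script
```

With @out as the output, both RowSets are live and the result is an empty list. With @searchlog as the output, @out is never reached, so it comes back as the one dead RowSet, which is the situation the compiler complained about.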
If you click on “All jobs”, you’ll get a list of your jobs and you can easily see which ones succeeded and which ones failed. If you go back to the Analytics dashboard, it’ll show you more.
Down here, it shows how many jobs succeeded and how many failed. The graph shows the number of jobs by date. Up here, it shows information about running jobs. I don’t have any jobs running right now, so these are all zero. It also shows the estimated cost of the jobs you’ve run.
Alright, one more thing. By default, ADLA archives your jobs for 30 days, so you can go back and review them or reuse them. But after 30 days, they expire and get deleted. That’s because archiving them incurs storage fees, so deleting them is a cost-saving measure.
If you want to keep your jobs for longer than 30 days, go to the ADLA dashboard, and then click on Properties in the Settings menu. Under “Days to retain job queries”, you can change it to a number between 1 and 180.
You can also reduce the maximum number of jobs that can run at the same time (it’s set to 20 by default) and lower the maximum number of AUs to something less than 250. AU stands for “Analytics Unit” and it’s essentially one small virtual machine. So, by default, your jobs can use up to 250 VMs at the same time.
If you go back to the list of jobs, you can see that all of them ran with just one AU. When you create a new job, you can set the number of AUs to something higher if it’s a big job.
And that’s it for this lesson.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).