Using Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
ADLS is designed for big data analytics in a Hadoop environment. It is compatible with the Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.
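To make "use your Azure data lake as the filesystem" a bit more concrete: Hadoop tools address a Data Lake Store account through an `adl://` URI. Here is a minimal Python sketch that builds one. The helper name `adl_uri` is made up for illustration, and "datalakeguy" is simply the sample account name used later in this lesson.

```python
def adl_uri(account: str, path: str = "/") -> str:
    """Build the adl:// URI that Hadoop tools use for a Data Lake Store path."""
    # Normalise the path so it always has exactly one leading slash.
    clean_path = "/" + path.lstrip("/")
    return f"adl://{account}.azuredatalakestore.net{clean_path}"


if __name__ == "__main__":
    # Sample account name from this lesson; yours will be different.
    print(adl_uri("datalakeguy", "SearchLog.tsv"))
```

You would pass a URI like this to a Hadoop job wherever it expects an input or output path.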
Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.
In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimize I/O.
Learning Objectives
- Get data into and out of ADL Store
- Use the five layers of security to protect data in ADL Store
- Use ADL Analytics to process data in a data lake
- Troubleshoot errors in ADL Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Database experience
- SQL experience (recommended)
- Microsoft Azure account (recommended; sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 37 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake.
Before you can do anything with the data lake, you need to get data into it. There are many different ways to do that. The easiest way is to simply upload files or folders from your desktop. You can also import data from other Azure services, such as Storage Blob, SQL Database, HDInsight (which is Azure’s Hadoop cluster service), and Stream Analytics, although in most of those cases, you’ll have to use other tools to copy the data.
Finally, if you have a huge amount of data to ingest, you may want to use either the Azure Import/Export service, which lets you ship hard drives to Microsoft, or ExpressRoute, which lets you create a private connection between your infrastructure and an Azure data center.
Let’s go through the process of creating a data lake and uploading some data. You can do this from the Azure portal, from PowerShell, or from the Azure CLI. We’re going to use the Azure portal, so open it in your browser.
In the menu on the left, click New and then “Data + Analytics”. Here’s “Data Lake Store”. Enter a name here. It has to be globally unique, so you’ll have to use a different name than I do. I’ll call mine “datalakeguy”. If you have multiple subscriptions, then choose the one you want. For the resource group, you can either create a new one or use an existing one. Of course, if you just created a trial account, then you won’t have any existing resource groups yet, so you’ll have to create a new one. I’ll do that too.
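If you want to sanity-check a candidate name before typing it into the portal, a small Python sketch like the one below can help. It assumes the naming rule is 3 to 24 characters, lowercase letters and digits only, so treat the portal's own validation message as the final word. The function name is hypothetical, and a local check like this can't tell you whether the name is globally unique; only Azure can.

```python
import re

# Assumed naming rule: 3-24 characters, lowercase letters and digits only.
_NAME_RULE = re.compile(r"[a-z0-9]{3,24}")


def is_valid_account_name(name: str) -> bool:
    """Return True if the name matches the assumed account naming rule."""
    return _NAME_RULE.fullmatch(name) is not None


if __name__ == "__main__":
    for candidate in ("datalakeguy", "DataLakeGuy", "dl"):
        print(candidate, is_valid_account_name(candidate))
```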
Now choose a location where the service will run. Data Lake Store isn’t supported in every region yet, so choose the supported region that’s closest to you. Leave the pricing package on Pay-as-You-Go.
Then click on “Encryption settings”. There are three options: don’t use encryption at all, use keys managed by Data Lake Store, or use your own keys. If you have regulatory compliance requirements, then you might need to use your own keys, but otherwise, using keys managed by Data Lake Store is a safe and easy option. After you’ve created a Data Lake Store account, you can’t change the encryption setting, so you’ll want to get this right. For this example, choose the default, and click OK.
Check the “Pin to dashboard” box so it’ll be easy to get to the data lake in the future. Then click Create. It says “Deployment in progress”. It will take a little while to finish, so I’ll fast forward.
Once it’s ready, click “Data explorer”. This shows you the files and folders in the data lake. Of course, we haven’t uploaded any data yet, so it’s empty. If you wanted to create a folder, you’d click “New folder”, but we don’t have to, so let’s just upload a file right here.
First we need a file to upload. In the GitHub repository for this course, there’s a file called SearchLog.tsv. Open it. This is one of Microsoft’s sample data files. It contains some imaginary records showing search terms that people typed into a search engine. Notice that it doesn’t have any headers showing what’s in each of the columns, so we might have to add some later. Click the Raw button. Then right-click and select “Save As”.
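Since the file has no header row, you have to supply the column names yourself whenever you read it. The Python sketch below shows one way to do that locally. The column names are an assumption taken from Microsoft's U-SQL tutorial for this sample file, and the two rows in `SAMPLE` are made up to match the tab-separated shape; they are not real records from SearchLog.tsv.

```python
import csv
import io

# Column names assumed from Microsoft's U-SQL tutorial for SearchLog.tsv.
COLUMNS = ["UserId", "Start", "Region", "Query", "Duration", "Urls", "ClickedUrls"]

# Two made-up rows in the same tab-separated shape (not real data from the file).
SAMPLE = (
    "1\t2/1/2012 10:00:00 AM\ten-us\texample query\t60\ta.com;b.com\ta.com\n"
    "2\t2/1/2012 11:00:00 AM\ten-gb\tanother query\t30\tc.com\tNULL\n"
)


def read_search_log(stream):
    """Yield each headerless TSV row as a dict keyed by COLUMNS."""
    reader = csv.DictReader(stream, fieldnames=COLUMNS, delimiter="\t")
    for row in reader:
        yield row


if __name__ == "__main__":
    for row in read_search_log(io.StringIO(SAMPLE)):
        print(row["UserId"], row["Region"], row["Query"])
```

To try it on the downloaded file, you could pass `open("SearchLog.tsv", newline="", encoding="utf-8")` instead of the in-memory sample.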
Now go back to the Azure portal and click Upload. Click the folder icon on the right and select the SearchLog.tsv file you just downloaded. Then click the “Add selected files” button. It should upload quickly because it’s a small file. Now close this blade to go back to the Data Explorer.
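If you'd rather script the upload than click through the portal, Data Lake Store also exposes a WebHDFS-compatible REST endpoint. The sketch below only prepares the request in Python; it doesn't send it. It assumes you already have an Azure AD bearer token from somewhere else, and the function name and the "TOKEN" placeholder are purely illustrative. Sending the request and handling any redirect or error response is left out.

```python
import urllib.parse
import urllib.request


def build_upload_request(account, remote_path, data, token):
    """Prepare (but do not send) a WebHDFS CREATE request for Data Lake Store."""
    # Percent-encode the remote path, keeping "/" as the separator.
    quoted_path = urllib.parse.quote(remote_path.lstrip("/"))
    query = urllib.parse.urlencode({"op": "CREATE", "overwrite": "true"})
    url = f"https://{account}.azuredatalakestore.net/webhdfs/v1/{quoted_path}?{query}"
    return urllib.request.Request(
        url,
        data=data,
        method="PUT",
        headers={"Authorization": f"Bearer {token}"},
    )


if __name__ == "__main__":
    # "TOKEN" is a placeholder, not a real credential.
    request = build_upload_request("datalakeguy", "/SearchLog.tsv", b"sample bytes", "TOKEN")
    print(request.get_method(), request.full_url)
```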
You might need to click the refresh button to see the file. If you want to look at the contents of the file, you can click on it. It looks good.
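You can do the same check from a script. The WebHDFS-style `LISTSTATUS` operation returns a JSON listing of a folder, and the Python sketch below pulls the names out of it. The response body here is a hypothetical example in the standard WebHDFS shape, with a made-up file length, rather than output captured from a real account.

```python
import json

# Hypothetical LISTSTATUS response body in the standard WebHDFS shape.
SAMPLE_RESPONSE = """
{"FileStatuses": {"FileStatus": [
    {"pathSuffix": "SearchLog.tsv", "type": "FILE", "length": 12345}
]}}
"""


def list_names(response_body: str) -> list:
    """Return the entry names from a WebHDFS LISTSTATUS JSON response."""
    statuses = json.loads(response_body)["FileStatuses"]["FileStatus"]
    return [entry["pathSuffix"] for entry in statuses]


if __name__ == "__main__":
    names = list_names(SAMPLE_RESPONSE)
    print("SearchLog.tsv" in names)
```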
In the next lesson, I’ll show you how to secure this data, so if you’re ready, then go to the next video.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).