Using Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
ADLS is designed for big data analytics in a Hadoop environment. It is compatible with Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.
Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.
In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimise I/O.
Learning Objectives
- Get data into and out of ADL Store
- Use the five layers of security to protect data in ADL Store
- Use ADL Analytics to process data in a data lake
- Troubleshoot errors in ADL Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Database experience
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 37 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake.
Data Lake Store provides five different layers of security: authentication, access control, network isolation, data protection, and auditing.
For authentication, it uses Azure Active Directory to verify a user’s identity. Users must be defined in the directory to access ADLS.
For access control, ADLS has two layers: roles and access control lists, or ACLs.
There are four basic roles. An owner of a data lake has full control over all the other accounts. A reader, in contrast, can only view details about other users, such as what role is assigned to each user. A contributor can manage some aspects of an account, such as deployments and alerts, but they can’t add or remove roles. A user access administrator can do that, though. You might be wondering why the user access administrator role is needed, considering that an owner can add and remove roles too. Well, an owner has full control over the data as well, so the user access administrator role lets you hand off role management without also handing over the data.
Permissions for data, that is, files and folders, are handled using access control lists. Each entry in an ACL specifies the read, write, and execute permissions for a user or group. Each access control list specifies these permissions for the owning user, the owning group, named users (such as “alice” in this case), and everyone else, also known as “other”. In this example, everyone has read, write, and execute permissions.
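To make the structure of an ACL concrete, here is a minimal Python sketch of the example above. It is only an illustration, not the ADLS API: the AclEntry class and its to_text method are hypothetical names, and the text form it prints follows the common POSIX-style "scope:name:rwx" convention, which I'm assuming here for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AclEntry:
    """One ACL entry (hypothetical model, not an ADLS type)."""
    scope: str     # "user", "group", or "other"
    name: str      # empty string means the owning user or owning group
    read: bool
    write: bool
    execute: bool

    def to_text(self) -> str:
        # Render the three permissions as an rwx-style string.
        bits = (
            ("r" if self.read else "-")
            + ("w" if self.write else "-")
            + ("x" if self.execute else "-")
        )
        return f"{self.scope}:{self.name}:{bits}"


# The example from the lesson: everyone has read, write, and execute.
acl = [
    AclEntry("user", "", True, True, True),        # owning user
    AclEntry("group", "", True, True, True),       # owning group
    AclEntry("user", "alice", True, True, True),   # named user
    AclEntry("other", "", True, True, True),       # everyone else
]

acl_text = ",".join(entry.to_text() for entry in acl)
print(acl_text)
```

Each entry carries exactly the three permissions described above, and the list has one entry per kind of principal.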
What’s a little bit confusing is that the owner for an ACL is not exactly the same as the owner role that I mentioned on the last slide. The owning user is the person who created this file or folder. However, if a user has the owner role for account management purposes, then they are automatically added as an owner of every file and folder as well. In other words, every file or folder has multiple owners: the person who created the file or folder and all of the people who have the owner role.
Let’s look at an example. If you’re not still viewing the SearchLog.tsv file, then open it again. Then click Access. First, it says what permissions I have on this file. I think this is a mistake in the message, because it says folder instead of file. Anyway, I have read, write, and execute permissions on this file.
It also says that I have superuser privileges on this account. That’s because I have the owner role for this data lake. That makes me a superuser, so I automatically have owner permissions on every file and folder. I can also change any of the permissions.
Under Owners, I’m listed twice. The first one is the owning user. I’m the owning user because I created this file. The second one is the owning group. This is kind of weird because there isn’t an Active Directory group with my user ID as the name. ADLS always sets the owning group to the user ID of the person who created the file, which isn’t the way it should work, but that seems to be the way it is. You can change the owning group if you have permissions to do that, but unfortunately, you can’t do that in the portal. You have to use the command line to do it.
Then, under “Assigned permissions”, it doesn’t have any entries. If we had given permissions to specific users or groups, then those entries would show up here.
Finally, it shows the permissions for everyone else. In this case, they don’t have any permissions, so they won’t be able to access this file at all.
Speaking of permissions, what exactly do they mean? Read and write are pretty obvious. But what about execute? Can you actually run a file in the data lake? No, you can’t. In fact, the execute permission doesn’t mean anything for a file. Folders are a different story, though.
To list the contents of a folder, you need to have both read and execute permissions on the folder. Similarly, to create new files or folders in a folder, you have to have both write and execute on the folder. So, if you’re going to give any kind of permission to a folder, you have to give execute permission too, or it won’t work.
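The folder rules are easy to express as code. This is a small sketch, assuming permissions are written as rwx-style strings such as "r-x"; the two function names are hypothetical and exist only for this illustration.

```python
def can_list_folder(perms: str) -> bool:
    """Listing a folder's contents needs both read and execute."""
    return "r" in perms and "x" in perms


def can_create_in_folder(perms: str) -> bool:
    """Creating files or subfolders needs both write and execute."""
    return "w" in perms and "x" in perms


for perms in ("r--", "r-x", "rw-", "-wx"):
    print(perms, can_list_folder(perms), can_create_in_folder(perms))
```

Notice that read alone or write alone on a folder accomplishes nothing, which is why execute always has to come along.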
OK, let’s give it a try. Click on Add. If you don’t have any users or groups, then you’ll have to create one in Azure Active Directory. If you have a lot of users and groups, then it’ll be easiest to type the name of the one you want in the search field. I’m going to search for one I created called “The Other Guy”. Click on it and then click Select.
Now you need to set the permissions. Let’s give read and write access. There’s no need to give execute access, of course, since this is a file. Then click OK. Now you can see that “The Other Guy” shows up under “Assigned permissions”.
So how do we change the permissions for everyone else? That’s easy. You can just check the boxes. You can do that for any of these entries, so we could also change the permissions for The Other Guy, if we wanted to. When you’re done making changes, click Save.
Be aware that it’s not enough to give a user or group permissions on an individual file. They also have to have permissions on all of the folders leading up to that file. So, in this example, The Other Guy wouldn’t be able to see the SearchLog.tsv file unless I also gave him read and execute permissions on the root folder.
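Here is a rough sketch of that rule, checking every folder on the way down to the file. The function names and the dictionary of permissions are hypothetical, and the folder requirement defaults to read and execute as described above. It is a parameter so you can experiment with other rules.

```python
from pathlib import PurePosixPath
from typing import Mapping


def has_all(perms: str, required: str) -> bool:
    """True if every required flag appears in the rwx-style string."""
    return all(flag in perms for flag in required)


def can_reach_file(
    path: str,
    perms_by_path: Mapping[str, str],
    folder_needs: str = "rx",
    file_needs: str = "r",
) -> bool:
    """Check the file and every folder leading up to it.

    Paths missing from perms_by_path are treated as having no permissions.
    """
    target = PurePosixPath(path)
    for folder in target.parents:
        if not has_all(perms_by_path.get(str(folder), "---"), folder_needs):
            return False
    return has_all(perms_by_path.get(str(target), "---"), file_needs)


# Permissions as seen by one user, such as The Other Guy.
file_only = {"/": "---", "/SearchLog.tsv": "rw-"}
with_root = {"/": "r-x", "/SearchLog.tsv": "rw-"}

print(can_reach_file("/SearchLog.tsv", file_only))
print(can_reach_file("/SearchLog.tsv", with_root))
```

With permissions on the file alone, the check fails at the root folder. Once the root folder grants read and execute, it succeeds.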
OK, now for some best practices. If you need to add entries for multiple individual users, it’s usually much better to add an entry for a group that has the right members in it. So, for example, if you wanted the Engineering team to have access to this file, then you would just assign permissions to the Engineering group instead of all of the individual Engineering users. Aside from this being easier to set up and maintain, it’s often necessary because you can only add nine custom entries to an ACL.
It’s also a good idea to set default ACLs. When you apply default ACLs to a folder, then every file or folder that gets created in it will have those ACLs too. Default ACLs have exactly the same structure as regular ACLs. I’ll show you an example.
Go back to the root folder and click “New folder”. I’ll call it “example”. Then click on it and click Access. Now click Add. I’ll add The Other Guy again. Then check Read and Execute.
Notice that there are some extra options because this is a folder. First, you can choose whether to add these permissions to only this folder or to this folder and all of its children. This is kind of weird because this folder doesn’t have any children, so it won’t make a difference which of these options we select.
Below that, you can choose whether to just set the access permissions for this folder, or make this a default ACL entry, or do both. Let’s do both. Then click OK.
The new access entry should show up below. To see the default ACL entry, click on Advanced. There it is. Now go back to the folder and create a subfolder called “test”. Then go into the test folder and click Access. The Other Guy has read and execute permissions here too because of the default ACL entry. Also, if you click Advanced, you’ll see that this folder also has the same default ACL entry, so if you create anything under this folder, it will have permissions for The Other Guy too. That’s because default ACLs are passed along to their children when the children are created. They aren’t copied to the children if the children already exist, though. OK, that’s it for access control.
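If the inheritance behavior is hard to picture, this simplified Python model may help. It is a sketch under my own assumptions, not how ADLS is implemented: the Folder class is hypothetical, ACLs are reduced to a name-to-permissions dictionary, and files are left out. The point is only that default entries are copied at creation time and never applied to children that already exist.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Folder:
    """Toy folder with an access ACL and a default ACL (hypothetical model)."""
    name: str
    access_acl: Dict[str, str] = field(default_factory=dict)
    default_acl: Dict[str, str] = field(default_factory=dict)
    children: Dict[str, "Folder"] = field(default_factory=dict)

    def create_subfolder(self, name: str) -> "Folder":
        # A new child gets the parent's default entries as its access ACL
        # and also as its own default ACL, so they keep passing down.
        child = Folder(
            name,
            access_acl=dict(self.default_acl),
            default_acl=dict(self.default_acl),
        )
        self.children[name] = child
        return child


example = Folder("example")
old = example.create_subfolder("old")        # created before any default ACL

# Add The Other Guy as both an access entry and a default entry.
example.access_acl["The Other Guy"] = "r-x"
example.default_acl["The Other Guy"] = "r-x"

test = example.create_subfolder("test")      # created after the default ACL
nested = test.create_subfolder("nested")     # inherits through "test"

print(old.access_acl)
print(test.access_acl)
print(nested.access_acl)
```

The "old" folder ends up with no entry for The Other Guy because it existed before the default ACL was set, while "test" and "nested" both pick it up.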
The third layer of security is network isolation. You can actually set up a firewall just for your data lake. Exit the data explorer and then select Firewall in the Settings menu. The only thing this firewall allows you to do is restrict access to particular IP addresses. To do that, just fill in these blanks to create a rule. For the name, you can use whatever you want. Then you put in the first IP address and the last IP address in the range you want to allow access from.
To add your own IP address as a rule, click “Add client IP”. The start and end IPs are the same because it’s only your own IP address and not a range.
When you’re done adding rules, set “Enable firewall” to On. If you want to access your data lake using other Azure services, such as Data Lake Analytics, then you have to set “Allow access to Azure services” to On. Now that you know how the firewall works, I recommend turning it off to avoid any potential access issues while you’re learning how to use ADLS.
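A firewall rule of this kind is just a name plus a start and end IP address. The sketch below shows how such a range check could work using Python's standard ipaddress module. The FirewallRule class, the rule names, and the addresses (taken from ranges reserved for documentation) are all made up for illustration, and this is not how ADLS evaluates rules internally.

```python
import ipaddress
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class FirewallRule:
    """An allowed IP range (hypothetical model of a rule)."""
    name: str
    start_ip: str
    end_ip: str

    def allows(self, client_ip: str) -> bool:
        ip = ipaddress.ip_address(client_ip)
        start = ipaddress.ip_address(self.start_ip)
        end = ipaddress.ip_address(self.end_ip)
        return start <= ip <= end


def is_allowed(client_ip: str, rules: Iterable[FirewallRule]) -> bool:
    """A client gets through if any rule's range contains its address."""
    return any(rule.allows(client_ip) for rule in rules)


rules = [
    FirewallRule("office", "203.0.113.0", "203.0.113.255"),
    # A single client IP: the start and end addresses are the same.
    FirewallRule("my-client-ip", "198.51.100.7", "198.51.100.7"),
]

print(is_allowed("203.0.113.42", rules))
print(is_allowed("198.51.100.7", rules))
print(is_allowed("198.51.100.8", rules))
```

The second rule mirrors what “Add client IP” creates: a range that contains exactly one address.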
The fourth layer of security is data protection. ADLS supports encryption of data both at rest and in transit. Data in transit is always encrypted using HTTPS. The encryption of data at rest is configured when you create your data lake, as you saw when we did that earlier. By default, encryption of data at rest is enabled and it’s managed by ADLS. If you want to do your own key management, you can, but you have to decide that when you create the data lake. You can’t change it later.
Click on Encryption to see which configuration option you chose. As you can see, mine is encrypted using keys managed by Azure Data Lake Store, because I chose the default option when I created this data lake.
The fifth layer of security is auditing. ADLS logs all account management activities. To see them, click “Activity log”. Since we haven’t done much yet, there’s only one entry. It was added when we created this data lake.
That’s it for security. In the next lesson, we’ll actually do something with the data we uploaded, so if you’re ready, then go to the next video.
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).