Using Azure Data Lake Store and Analytics
Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.
ADLS is designed for big data analytics in a Hadoop environment. It is compatible with Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.
Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.
In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimize I/O.
Learning Objectives
- Get data into and out of ADL Store
- Use the five layers of security to protect data in ADL Store
- Use ADL Analytics to process data in a data lake
- Troubleshoot errors in ADL Analytics jobs
Intended Audience
- Anyone interested in Azure’s big data analytics services
Prerequisites
- Database experience
- SQL experience (recommended)
- Microsoft Azure account recommended (sign up for a free trial at https://azure.microsoft.com/free if you don’t have an account)
This Course Includes
- 37 minutes of high-definition video
- Many hands-on demos
The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake.
I hope you enjoyed learning about Azure Data Lake Store and Analytics. Let’s do a quick review of what you learned.
ADLS is primarily used to do data analytics on both structured and unstructured data in a Hadoop ecosystem, while Azure SQL Data Warehouse is primarily used for business reporting on structured data in a SQL Server ecosystem.
You can import data into ADLS from a wide variety of sources, including other Azure services, such as Storage Blob, SQL Database, HDInsight, and Stream Analytics, as well as from your local infrastructure, either by doing file uploads or by using the Import/Export or ExpressRoute services.
ADLS has five layers of security. For authentication, it uses Azure Active Directory to verify a user’s identity. For access control, it has roles for user management and access control lists (ACLs) to restrict access to data. If you need to assign the same permissions to multiple users, create ACL entries for groups rather than for individuals. Also, to ensure consistent permissions, set default ACLs on folders whenever possible.
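To make the advice about groups and default ACLs concrete, here is a minimal sketch in Python. It is a toy model, not the ADLS API: the `Node` class, the `can_read` helper, and the user and group names are all hypothetical, and real ACL evaluation also involves owners, masks, and "other" entries that are left out here.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A file or folder in a toy data lake (illustrative only, not the ADLS API)."""
    name: str
    is_folder: bool = False
    access_acl: dict = field(default_factory=dict)   # principal -> permissions, e.g. "r-x"
    default_acl: dict = field(default_factory=dict)  # folders only: template for new children
    children: dict = field(default_factory=dict)

    def create_child(self, name, is_folder=False):
        """New children start with a copy of the parent's default ACL."""
        child = Node(name, is_folder, access_acl=dict(self.default_acl))
        if is_folder:
            # Subfolders keep the default ACL so it reaches deeper levels too.
            child.default_acl = dict(self.default_acl)
        self.children[name] = child
        return child


def can_read(node, user, groups_of):
    """True if the user, or any group the user belongs to, has 'r' on the node."""
    principals = {user} | groups_of.get(user, set())
    return any("r" in node.access_acl.get(p, "") for p in principals)


# Hypothetical users and group, for illustration only.
groups_of = {"alice": {"analysts"}, "bob": {"analysts"}, "carol": set()}

sales = Node("sales", is_folder=True)
sales.default_acl["analysts"] = "r-x"  # one group entry instead of one entry per user

report = sales.create_child("2017-q1.csv")
archive = sales.create_child("archive", is_folder=True)
```

In this model, adding a new analyst means changing group membership only; no ACL on any file or folder has to be touched, and every item created under `sales` gets the same permissions.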
To achieve network isolation, you can use the firewall to restrict access to particular IP addresses. ADLS handles data protection by always encrypting data in flight using HTTPS and giving you a choice of key management solutions to encrypt data at rest. The final layer of security is auditing. All account management activities are logged for later inspection.
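The firewall idea can be sketched with Python's standard `ipaddress` module. This only illustrates the allow-list logic; it is not how ADLS implements its firewall. Modeling each rule as a (start, end) address pair is an assumption, and the addresses come from ranges reserved for documentation.

```python
import ipaddress


def is_allowed(client_ip, rules):
    """Return True if client_ip falls inside any (start, end) range in rules."""
    addr = ipaddress.ip_address(client_ip)
    return any(
        ipaddress.ip_address(start) <= addr <= ipaddress.ip_address(end)
        for start, end in rules
    )


# Hypothetical rules: one office range and one single build server.
RULES = [
    ("203.0.113.10", "203.0.113.50"),
    ("198.51.100.7", "198.51.100.7"),
]
```

A request from `203.0.113.42` would pass this check, while one from `198.51.100.8` would be rejected before authentication or ACLs are even considered.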
There are two ways to process data. First, you can use a Hadoop cluster on HDInsight. ADLS is compatible with HDFS, so you can access it either that way or through its native filesystem.
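As a small illustration of pointing a Hadoop job at the data lake, the helper below builds an `adl://` URI of the kind a job can be given in place of an HDFS path. The function name, the account name, and the file path are hypothetical.

```python
def adl_uri(account, path):
    """Build an adl:// URI for a path in a Data Lake Store account."""
    return f"adl://{account}.azuredatalakestore.net/{path.lstrip('/')}"


# Hypothetical account and path, for illustration only.
input_uri = adl_uri("mydatalake", "/clickstream/2017/01.csv")
```

On a cluster that has access to the account, a URI like this would typically be passed wherever the job expects its input or output location.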
Second, you can use Azure Data Lake Analytics, which brings clusters up and down to run jobs. It uses U-SQL, which is a combination of SQL and C#.
To export data, you use different tools for different destinations. To export to SQL Data Warehouse, create an external data source and use PolyBase. If you’re exporting from an HDInsight cluster, use Apache Sqoop for Azure SQL Database and Apache DistCp for Azure Storage Blobs. For all other destinations, use Azure Data Factory.
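The export rules above amount to a small decision table, which can be written down as a function. This is only a restatement of the guidance in this summary; the function name and the destination labels are made up for the example.

```python
def export_tool(destination, from_hdinsight=False):
    """Pick an export tool following the rules of thumb from this course."""
    if destination == "sql_data_warehouse":
        return "PolyBase (with an external data source)"
    if from_hdinsight and destination == "azure_sql_database":
        return "Apache Sqoop"
    if from_hdinsight and destination == "azure_storage_blob":
        return "Apache DistCp"
    # Everything else falls through to the general-purpose option.
    return "Azure Data Factory"
```

For example, `export_tool("azure_storage_blob", from_hdinsight=True)` selects DistCp, while the same destination without an HDInsight cluster falls back to Azure Data Factory.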
The most common U-SQL errors are: not writing U-SQL keywords in upper case, specifying an input file that doesn’t exist, and writing invalid C# expressions.
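The first of these errors is easy to catch before submitting a job. The sketch below is a naive, hypothetical pre-flight check written in Python: it knows only a handful of keywords, ignores text inside double-quoted strings, and will produce false positives (for example, on a variable that happens to be named like a keyword). It is not a U-SQL parser.

```python
import re

# A small, incomplete sample of U-SQL keywords, for illustration.
KEYWORDS = {"SELECT", "FROM", "WHERE", "EXTRACT", "USING", "OUTPUT", "TO"}


def find_miscased_keywords(script):
    """Return (line number, word) pairs for keywords not written in upper case."""
    # Blank out string literals so words inside quotes are not flagged.
    code = re.sub(r'"[^"\n]*"', '""', script)
    hits = []
    for lineno, line in enumerate(code.splitlines(), start=1):
        for word in re.findall(r"[A-Za-z_]+", line):
            if word.upper() in KEYWORDS and word != word.upper():
                hits.append((lineno, word))
    return hits


# Hypothetical script with two deliberate mistakes ("from" and "select").
sample = '''@rows =
    EXTRACT name string, visits int
    from "/input/from_logs.tsv"
    USING Extractors.Tsv();

@result = select name FROM @rows;
'''

problems = find_miscased_keywords(sample)
```

Each entry in `problems` points at a line to fix; the `from` inside the quoted file path is deliberately left alone.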
To optimize your performance, use high-speed storage and networking for data ingestion; increase the number of Analytics Units for bigger jobs; use files that are at least 256MB in size; keep files that can’t be processed in parallel less than 2GB in size; and partition time series data so you can process subsets of it.
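Two of these guidelines are easy to express in code. The sketch below, in Python, checks a file size against the 256MB and 2GB thresholds and builds a date-based folder path for time series data. The function names and the year/month/day layout are assumptions chosen for the example, not something ADLS requires.

```python
from datetime import datetime

MB = 1024 ** 2
GB = 1024 ** 3


def size_advice(size_bytes, splittable=True):
    """Compare a file size with the guidelines from this course."""
    if size_bytes < 256 * MB:
        return "too small: combine files to reach at least 256MB"
    if not splittable and size_bytes >= 2 * GB:
        return "too large: keep files that can't be processed in parallel under 2GB"
    return "ok"


def partition_path(root, timestamp):
    """Place time series data in year/month/day folders (hypothetical layout)."""
    return f"{root.rstrip('/')}/{timestamp:%Y/%m/%d}"
```

With a layout like this, a job that only needs March 2017 can read `/logs/2017/03` instead of scanning the whole data set.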
Now you know how to get data into and out of ADL Store; use the five layers of security to protect data in ADL Store; use ADL Analytics to process data in a data lake; and troubleshoot errors in ADL Analytics jobs.
To learn more about Azure Data Lake Store and Analytics, you can read Microsoft’s documentation. Also watch for new big data courses on Cloud Academy, because we’re always publishing new courses.
If you have any questions or comments, please let us know by clicking the “Report an issue” button below. Thanks and keep on learning!
About the Author
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).