
I/O Optimization

This course is part of the Big Data Analytics on Azure learning path.

Contents

Introduction
  1. Introduction (Free, 2m 4s)
  2. Overview (Free, 1m 24s)
Using Azure Data Lake Store and Analytics
  4. Security (11m 19s)
Conclusion

Overview

Difficulty: Beginner
Duration: 37m
Students: 471
Rating: 4.8/5

Description

Azure Data Lake Store (ADLS) is a cloud-based repository for both structured and unstructured data. For example, you could use it to store everything from documents to images to social media streams.

ADLS is designed for big data analytics in a Hadoop environment. It is compatible with Hadoop Distributed File System (HDFS), so you can run your existing Hadoop jobs by simply telling them to use your Azure data lake as the filesystem.

Alternatively, you can use Azure Data Lake Analytics (ADLA) to do your big data processing tasks. It’s a service that automatically provisions resources to run processing jobs. You don’t have to figure out how big to make a cluster or remember to tear down the cluster when a job is finished. ADLA will take care of all of that for you. It is also simpler to use than Hadoop MapReduce, since it includes a language called U-SQL that brings together the benefits of SQL and C#.

In this course, you will follow hands-on examples to import data into ADLS, then secure, process, and export it. Finally, you will learn how to troubleshoot processing jobs and optimize I/O.

Learning Objectives

  • Get data into and out of ADL Store
  • Use the five layers of security to protect data in ADL Store
  • Use ADL Analytics to process data in a data lake
  • Troubleshoot errors in ADL Analytics jobs

Intended Audience

  • Anyone interested in Azure’s big data analytics services

Prerequisites

  • Database experience
  • SQL experience (recommended)
  • Microsoft Azure account (recommended; sign up for a free trial at https://azure.microsoft.com/free if you don’t have one)

This Course Includes

  • 37 minutes of high-definition video
  • Many hands-on demos

Resources

The GitHub repository for this course is at https://github.com/cloudacademy/azure-data-lake.

Transcript

In the last lesson, we saw how to add more processing power when running a job. That will help in many situations, but it won’t solve every performance problem. That’s because I/O can be a bottleneck.

When you’re transferring large amounts of data from your local infrastructure to the Azure cloud, there are three potential bottlenecks. First is the speed of your storage. Ideally, your local data should be on SSDs rather than spinning disks, and on storage arrays rather than individual disks.

Second, you should have a high-speed internal network. In particular, the network interface cards on your local machines should be as fast as possible.

Third, the network connection between your local infrastructure and the Azure cloud should be fast. If it’s a major bottleneck, then consider using a dedicated link with Azure ExpressRoute.

If your data source is also in Azure, then put it in the same region as the data lake, if possible.

Finally, configure your data ingestion tools for maximum parallelization.
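
Exactly how you do that depends on your ingestion tool, but as a rough sketch (not taken from the course), here’s how a parallel upload might look in Python with the azure-datalake-store SDK’s multi-threaded uploader. The store name, credentials, and paths are placeholders.

    from azure.datalake.store import core, lib, multithread

    # Authenticate with a service principal (tenant, client ID, and secret are placeholders)
    token = lib.auth(tenant_id='<tenant-id>',
                     client_id='<client-id>',
                     client_secret='<client-secret>')

    # Connect to the Data Lake Store account (the store name is a placeholder)
    adls = core.AzureDLFileSystem(token, store_name='mydatalakestore')

    # Upload a local directory, using many worker threads to parallelize the transfer
    multithread.ADLUploader(adls,
                            lpath='/local/data',
                            rpath='/data',
                            nthreads=64,
                            overwrite=True)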

The next area to look at is how the datasets in your data lake are structured. When your data is being processed, there is a per-file overhead, so having lots of small files can hurt the performance of the job. If possible, your files should be at least 256 MB in size.
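
If you’re stuck with lots of small files, one option (not covered in the course) is to compact them before loading them into the data lake. Here’s a minimal Python sketch that concatenates small CSV files into roughly 256 MB chunks, assuming the files have no headers and can simply be appended to one another; the paths are placeholders.

    import glob
    import os

    TARGET_SIZE = 256 * 1024 * 1024  # aim for roughly 256 MB per combined file

    os.makedirs('combined', exist_ok=True)

    chunk_index = 0
    current_size = 0
    out = None

    for path in sorted(glob.glob('small_files/*.csv')):
        # Start a new combined file once the current one reaches the target size
        if out is None or current_size >= TARGET_SIZE:
            if out is not None:
                out.close()
            chunk_index += 1
            out = open(f'combined/part_{chunk_index:04d}.csv', 'wb')
            current_size = 0
        with open(path, 'rb') as f:
            data = f.read()
        out.write(data)
        current_size += len(data)

    if out is not None:
        out.close()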

There are circumstances where you don’t want your files to be too big, though. If your files can’t be processed in parallel, then you should keep them below 2 GB in size. Images and binaries are examples of data that can’t easily be processed in parallel.

If you’re processing time series data, such as server metrics or stock market prices, then you can make your queries run faster if you partition the data by time period. That way, your queries can read in only the specific subset of the time series that’s needed. For example, you could have a folder structure that looks like this.
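
A typical year/month/day hierarchy (the folder and file names below are placeholders, not taken from the course) might be:

    /server_metrics/2016/11/01/metrics_2016_11_01.csv
    /server_metrics/2016/11/02/metrics_2016_11_02.csv
    /server_metrics/2016/12/01/metrics_2016_12_01.csv

A query that only needs December’s data can then read just the /server_metrics/2016/12 folder instead of scanning the entire dataset.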

And that’s it for I/O optimization.

About the Author

Students: 16,740
Courses: 41
Learning paths: 22

Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).