Microsoft Azure Data Lake Store: an Introduction

The Azure Data Lake Store service provides a platform for organizations to park – and process and analyse – vast volumes of data in any format. Find out how.

With increasing volumes of data to manage, enterprises are looking for infrastructure models that help them apply analytics to their big data, or simply store it for as-yet-undetermined future use. In this post, we’re going to discuss Microsoft’s entry into the data lake market, Azure Data Lake, and in particular, Azure Data Lake Store.

What is a data lake?

In simple terms, a data lake is a repository for large quantities and varieties of both structured and unstructured data in their native formats. The term was coined by James Dixon, CTO of Pentaho, as a contrast to what he called “data marts”, which handle data reporting and analysis by identifying “the most interesting attributes, and to aggregate” them. The problems with this approach are that “only a subset of the attributes is examined, so only pre-determined questions can be answered,” and that “data is aggregated, so visibility into the lowest levels is lost.”
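To make Dixon's two objections concrete, here is a minimal Python sketch. The event records and the function names (`build_mart`, `revenue_by_hour`) are invented for illustration and do not come from any Azure or Pentaho API. It shows that a pre-aggregated summary answers only the question it was built for, while the raw records can still answer a question nobody planned for.

```python
from collections import defaultdict

# Hypothetical raw sales events, kept in their original, detailed form.
raw_events = [
    {"region": "EU", "product": "A", "amount": 10.0, "hour": 9},
    {"region": "EU", "product": "B", "amount": 5.0, "hour": 14},
    {"region": "US", "product": "A", "amount": 7.5, "hour": 9},
]


def build_mart(events):
    """Aggregate revenue per region, the one question this 'mart' was designed for."""
    totals = defaultdict(float)
    for event in events:
        totals[event["region"]] += event["amount"]
    return dict(totals)


def revenue_by_hour(events):
    """A later, unplanned question that needs the lowest-level records."""
    totals = defaultdict(float)
    for event in events:
        totals[event["hour"]] += event["amount"]
    return dict(totals)


mart = build_mart(raw_events)
print(mart)                         # per-region totals only; hour and product are gone
print(revenue_by_hour(raw_events))  # still answerable because the raw events were kept
```

Once only `mart` is retained, the hourly breakdown cannot be reconstructed, which is the loss of "visibility into the lowest levels" described above.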

A data lake, on the other hand, maintains data in their native formats and handles the three Vs of big data (Volume, Velocity, and Variety) while providing tools for analysis, querying, and processing. It removes many of the restrictions of a typical data warehouse system by offering effectively unlimited space, no fixed limit on file size, schema on read, and various ways to access data (including programming, SQL-like queries, and REST calls).

With the emergence of Hadoop (including HDFS and YARN), the benefits of a data lake – previously available only to the most resource-rich companies like Google, Yahoo, and Facebook – became a practical reality for just about anyone. Organizations that had been generating and gathering data on a large scale, but had struggled to store and process it in a meaningful way, now have more options.

Azure Data Lake

Azure Data Lake is the new kid on the data lake block from Microsoft Azure. Here is some of what it offers:

  • The ability to store and analyse data of any kind and size.
  • Multiple access methods including U-SQL, Spark, Hive, HBase, and Storm.
  • Built on YARN and HDFS.
  • Dynamic scaling to match your business priorities.
  • Enterprise-grade security with Azure Active Directory.
  • Managed and supported with an enterprise-grade SLA.

Azure Data Lake can, broadly, be divided into three parts:

  • Azure Data Lake store – The Data Lake store provides a single repository where organizations upload data of just about infinite volume. The store is designed for high-performance processing and analytics from HDFS applications and tools, including support for low latency workloads. In the store, data can be shared for collaboration with enterprise-grade security.
  • Azure Data Lake analytics – Data Lake analytics is a distributed analytics service built on Apache YARN that complements the Data Lake store. The analytics service can handle jobs of any scale instantly with on-demand processing power and a pay-as-you-go model that’s very cost effective for short-term or on-demand jobs. It includes U-SQL, a language that unifies the benefits of SQL with the expressive power of user code, running on a scalable distributed runtime.
  • Azure HDInsight – Azure HDInsight is a full stack Hadoop Platform as a Service from Azure. Built on top of Hortonworks Data Platform (HDP), it provides Apache Hadoop, Spark, HBase, and Storm clusters.

We’ve already been introduced to HDInsight in this series. Now we will discuss Azure Data Lake Store, which is still in preview.

Azure Data Lake Store

According to Microsoft, Azure Data Lake store is a hyper-scale repository for big data analytics workloads and a Hadoop Distributed File System (HDFS) for the cloud. It…

  • Imposes no fixed limits on file size.
  • Imposes no fixed limits on account size.
  • Allows unstructured and structured data in their native formats.
  • Allows massive throughput to increase analytic performance.
  • Offers high durability, availability, and reliability.
  • Is integrated with Azure Active Directory access control.

Some have compared Azure Data Lake store with Amazon S3 but, beyond the fact that both provide unlimited storage space, the two really don’t have all that much in common. If you want to compare S3 to an Azure service, you’ll get better mileage with the Azure Storage Service. Azure Data Lake store, on the other hand, provides an integrated analytics service and places no limits on file size. Here’s a nice illustration:
[Image: Azure Data Lake store diagram. Courtesy: Microsoft]

Azure Data Lake store can handle any data in their native format, as is, without requiring prior transformations. Data Lake store does not require a schema to be defined before the data is uploaded, leaving it up to the individual analytic framework to interpret the data and define a schema at the time of the analysis. Being able to store files of arbitrary size and formats makes it possible for Data Lake store to handle structured, semi-structured, and even unstructured data.
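As a rough illustration of "schema on read", the Python sketch below keeps two raw payloads exactly as they arrived and only applies a schema when they are read. Everything here is an assumption made for the example: the sample data, the helper names, and the `{column: type}` shape of the schema. It does not use any Data Lake store API.

```python
import csv
import io
import json

# Raw payloads stored untouched, in two different native formats (made-up data).
raw_csv = "id,temp\n1,21.5\n2,19.0\n"
raw_json_lines = '{"id": 3, "temp": 22.1, "site": "roof"}\n{"id": 4, "temp": 18.4}\n'


def read_csv_with_schema(text, schema):
    """Apply a {column: type} schema to CSV text at read time."""
    reader = csv.DictReader(io.StringIO(text))
    return [{col: cast(row[col]) for col, cast in schema.items()} for row in reader]


def read_json_lines_with_schema(text, schema):
    """Apply the same kind of schema to newline-delimited JSON at read time."""
    rows = []
    for line in text.splitlines():
        if not line.strip():
            continue
        record = json.loads(line)
        # Fields outside the schema (such as "site") are simply ignored by this reader.
        rows.append({col: cast(record[col]) for col, cast in schema.items()})
    return rows


# The schema belongs to the analysis, not to the stored data.
reading_schema = {"id": int, "temp": float}
readings = (
    read_csv_with_schema(raw_csv, reading_schema)
    + read_json_lines_with_schema(raw_json_lines, reading_schema)
)
print(readings)
```

A different analysis could read the same stored text with another schema, for example `{"id": str}`, without the data having to be reloaded or transformed first.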

Azure Data Lake store file system (adl://)

Azure Data Lake Store can be accessed from Hadoop (available with an HDInsight cluster) using the WebHDFS-compatible REST APIs. In addition, Azure Data Lake store introduces a new file system called AzureDataLakeFilesystem (adl://), which is optimized for performance and available in HDInsight. Data in the Data Lake store is addressed using a URI of the form:

adl://<data_lake_store_name>.azuredatalakestore.net
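The following Python sketch shows how such addresses can be assembled. The store name `mystore`, the file paths, and the helper names are placeholders invented for the example. The `/webhdfs/v1` prefix and the `op` query parameter follow the general WebHDFS REST convention and are assumed here to apply to the store's WebHDFS-compatible endpoint. No network request is made.

```python
from urllib.parse import urlencode, urlsplit

ADLS_SUFFIX = "azuredatalakestore.net"


def _normalize(path):
    """Make sure the path starts with a single leading slash."""
    return path if path.startswith("/") else "/" + path


def adl_uri(store_name, path="/"):
    """Build an adl:// URI for a path in the given store."""
    return f"adl://{store_name}.{ADLS_SUFFIX}{_normalize(path)}"


def webhdfs_url(store_name, path, operation):
    """Build a WebHDFS-style HTTPS URL (assumed endpoint layout, see note above)."""
    query = urlencode({"op": operation})
    return f"https://{store_name}.{ADLS_SUFFIX}/webhdfs/v1{_normalize(path)}?{query}"


uri = adl_uri("mystore", "/logs/2016/01/events.csv")
parts = urlsplit(uri)
print(uri)
print(parts.scheme, parts.hostname, parts.path)
print(webhdfs_url("mystore", "/logs", "LISTSTATUS"))
```

Splitting the URI back into scheme, host, and path shows that the store name lives in the host part, so the same path can be pointed at another store by changing only that component.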

Azure Data Lake store security

Azure Data Lake store uses Azure Active Directory (AAD) for authentication and Access Control Lists (ACLs) to manage access to your data. Azure Data Lake benefits from all AAD features including Multi-Factor Authentication, conditional access, role-based access control, application usage monitoring, security monitoring and alerting. Azure Data Lake store supports the OAuth 2.0 protocol for authentication within the REST interface. Similarly, Data Lake store provides access control by supporting POSIX-style permissions exposed by the WebHDFS protocol.
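The two mechanisms can be sketched in Python as follows. This is a simplified illustration, not the service's actual authorization logic: the token value and URL are placeholders, the helper names are invented, and the permission check models only the classic owner/group/other bits, leaving out named ACL entries. The request object is built but never sent.

```python
import stat
import urllib.request


def build_authorized_request(url, access_token):
    """Build (but do not send) a request carrying an OAuth 2.0 bearer token."""
    return urllib.request.Request(
        url, headers={"Authorization": f"Bearer {access_token}"}
    )


def describe_permission(octal_string):
    """Turn an octal permission string such as '750' into 'rwxr-x---'."""
    mode = int(octal_string, 8)
    return stat.filemode(mode)[1:]  # drop the leading file-type character


def can_read(octal_string, is_owner, in_group):
    """Simplified POSIX-style read check: owner bits, then group bits, then others."""
    mode = int(octal_string, 8)
    if is_owner:
        return bool(mode & stat.S_IRUSR)
    if in_group:
        return bool(mode & stat.S_IRGRP)
    return bool(mode & stat.S_IROTH)


request = build_authorized_request(
    "https://mystore.azuredatalakestore.net/webhdfs/v1/logs?op=LISTSTATUS",
    "<access-token-from-AAD>",  # placeholder, obtained through an OAuth 2.0 flow
)
print(request.get_header("Authorization"))
print(describe_permission("750"))
print(can_read("750", is_owner=False, in_group=True))
print(can_read("750", is_owner=False, in_group=False))
```

Authentication (the bearer token) establishes who the caller is, and the permission bits then decide what that identity may do with a given file or folder.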

Azure Data Lake store pricing

Data Lake Store is currently available in the East US 2 region and offers preview pricing rates (excluding outbound data transfer):
[Image: Azure Data Lake store preview pricing table]

Conclusion

Azure Data Lake is an important new part of Microsoft’s ambitious cloud offering. With Data Lake, Microsoft provides a service to store and analyze data of any size at an affordable cost. In related posts, we will learn more about Data Lake Store, Data Lake Analytics, and HDInsight.

Written by

Chandan Patra

Cloud Computing and Big Data professional with 10 years of experience in pre-sales, architecture, design, build and troubleshooting with best engineering practices. Specialities: Cloud Computing - AWS, DevOps(Chef), Hadoop Ecosystem, Storm & Kafka, ELK Stack, NoSQL, Java, Spring, Hibernate, Web Service

