Azure Data Lake Analytics and Big Data: an Introduction

Azure Data Lake Analytics simplifies big data processing by managing the underlying Azure infrastructure for you, so that you can focus on your code and analytics logic.

We’ve previously discussed Azure Data Lake and Azure Data Lake Store. That post should provide you with a good foundation for understanding Azure Data Lake Analytics, a very new part of the Data Lake portfolio that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob storage.

Azure Data Lake Analytics

According to Microsoft, Azure Data Lake Analytics lets you:

Analyze data of any kind and of any size.
Speed up and sharpen your development and debug cycles.
Use the new U-SQL processing language built especially for big data.
Rely on Azure’s enterprise-grade SLA.
Pay only for the processing resources you actually need and use.
Benefit from the YARN-based technology extensively tested at Microsoft.

Azure Data Lake Analytics allows users to focus on code and analytics logic without having to worry about the intricacies of hardware setup, management, and operation in a distributed environment. Data Lake Analytics works with various Azure data sources, such as Azure Blob storage and Azure SQL Database. However, using Azure Data Lake Analytics with data kept in Azure Data Lake Store provides the best performance for big data workloads. An image from Microsoft Azure shows the various technologies that combine to make Data Lake work:

(Image courtesy: Microsoft Azure)

The Azure Data Lake Analytics process

In basic terms, here are the steps for setting up an Azure Data Lake Analytics operation:

Create a Data Lake Analytics account.
Prepare the source data. You should have either an Azure Data Lake Store account or Azure Blob storage account.
Develop a U-SQL script.
Submit a job (U-SQL script) to your Data Lake Analytics account. The job reads from the source data, processes the data as instructed in the U-SQL script, and then saves the output to either a Data Lake Store or Blob storage account.

Before getting started, it’s good to be aware of these details:

At this time, Azure Data Lake Analytics is available only in the East US 2 region.
Your Azure Data Lake Analytics and Azure Data Lake Store accounts must be in the same region. This is especially significant regarding Data Lake Store, as that’s where the job metadata and audit logs are kept.
For file-based input and output, Azure Data Lake Analytics supports only Azure Data Lake Store and Azure Blob storage.

An Introduction to U-SQL:

Along with Data Lake, Microsoft introduced Azure U-SQL. In Microsoft’s own words:

Azure Data Lake Analytics includes U-SQL, a language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database.

U-SQL combines the power of SQL and C# with a high-level abstraction of parallelism and distributed programming. It processes data of any kind and any size. Unlike Hive, which uses a SQL-like syntax (HQL) and works only with structured data, U-SQL handles both structured and unstructured data.
A U-SQL query might look like this:

// Count employees per country and city, and keep the ten largest groups.
@Result =
    SELECT country, city, COUNT(*) AS NumberOfEmployees
    FROM @Employees
    GROUP BY country, city
    ORDER BY NumberOfEmployees DESC, country, city
    FETCH FIRST 10 ROWS;

Look at the query. SELECT, COUNT, FROM, GROUP BY, ORDER BY, etc., certainly use SQL syntax, but the data types, which appear in the EXTRACT statement that follows, are C# types.
What does @Employees mean? It turns out to be:

// @INPUT_EMPLOYEES is a string variable, declared elsewhere in the script, that holds the input file path.
@Employees =
    EXTRACT emp_id int,
            name string,
            city string,
            salary int,
            country string,
            phone_numbers int
    FROM @INPUT_EMPLOYEES
    USING Extractors.Text(delimiter : '\t', quoting : true, encoding : Encoding.Unicode);

The rowset @Employees is extracted from a file using Extractors.Text. In the other direction, you can use Outputters to write a result in a desired format, such as CSV.
The above example shows how U-SQL, in its simplest form, can:

Extract data from your source. The built-in Extractors library is used to read and schematize the input file.
Transform using SQL and/or custom user-defined operators.
Output the result either into a file or into a U-SQL table to store it for further processing.
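
The three stages above can be put together in a single script. The following is a minimal sketch rather than a definitive implementation: both file paths and the five-column layout of the input file are assumptions made for illustration. It uses the built-in Extractors.Tsv and Outputters.Csv for tab-separated input and comma-separated output.

```
// Hypothetical input and output paths; replace them with paths in your own store.
DECLARE @INPUT_EMPLOYEES string = "/input/employees.tsv";
DECLARE @OUTPUT_RESULT string = "/output/employees_by_city.csv";

// Extract: read the tab-separated file and give each column a name and a C# type.
// The schema must match the columns actually present in the file.
@Employees =
    EXTRACT emp_id int,
            name string,
            city string,
            salary int,
            country string
    FROM @INPUT_EMPLOYEES
    USING Extractors.Tsv();

// Transform: count employees per country and city.
@Result =
    SELECT country,
           city,
           COUNT(*) AS NumberOfEmployees
    FROM @Employees
    GROUP BY country, city;

// Output: write the rowset as CSV, largest groups first.
OUTPUT @Result
TO @OUTPUT_RESULT
ORDER BY NumberOfEmployees DESC
USING Outputters.Csv();
```

Note that nothing is read or written until the OUTPUT statement ties the rowsets to a destination; the intermediate assignments only describe how each rowset is derived.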

Readers might notice the similarity between Pig Latin script and U-SQL. As in Pig scripts, each U-SQL expression is assigned to a variable, which is then used in further processing. You can also deploy and register your custom code as an assembly in a U-SQL metadata catalog. This allows you, or anyone else, to reuse the code in future scripts. To do so, a script needs to include REFERENCE ASSEMBLY <assembly_name>.
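
As a rough illustration of registering and reusing code, the sketch below is split into two scripts. Every name in it is hypothetical: the database EmployeeDb, the assembly EmployeeHelpers, the DLL path, and the C# static method EmployeeHelpers.Formatting.NormalizeCity are placeholders, not part of any real library.

```
// Script 1: register a compiled C# assembly in the metadata catalog.
// Assumes the DLL has already been uploaded to the (hypothetical) path below.
CREATE DATABASE IF NOT EXISTS EmployeeDb;
USE DATABASE EmployeeDb;

CREATE ASSEMBLY IF NOT EXISTS EmployeeHelpers
FROM "/assemblies/EmployeeHelpers.dll";
```

```
// Script 2: reference the registered assembly by name and call into it.
REFERENCE ASSEMBLY EmployeeDb.EmployeeHelpers;

// @Employees is the rowset produced by the EXTRACT statement shown earlier.
@Cleaned =
    SELECT emp_id,
           EmployeeHelpers.Formatting.NormalizeCity(city) AS city   // hypothetical C# method
    FROM @Employees;

OUTPUT @Cleaned
TO "/output/cleaned_employees.csv"
USING Outputters.Csv();
```

Keeping registration in its own script means it runs once, while any number of later scripts can reference the assembly.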

The power of U-SQL goes beyond the simple query given above. U-SQL can also handle:

Operations over a set of files with patterns.
Using Partitioned Tables.
Federated Queries against Azure SQL DB.
Encapsulating your U-SQL code with Views, Table-Valued Functions, and Procedures.
SQL Windowing Functions.
Programming with C# User-defined Operators (custom extractors, processors).
Complex Types (MAP, ARRAY).
Using U-SQL in data processing pipelines.
U-SQL in a lambda architecture for IoT analytics.
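
To make two of these items more concrete, here is a small sketch that reads a set of files matched by a path pattern and then applies SQL windowing functions. The folder layout, the four-column file schema, and the output path are assumptions chosen for illustration.

```
// Hypothetical layout: one file per day, for example /input/2016/01/31/employees.tsv.
// The {date:...} parts of the path are matched against each file name and
// surfaced as the extra "virtual" column named date.
@Employees =
    EXTRACT emp_id int,
            name string,
            city string,
            salary int,
            date DateTime
    FROM "/input/{date:yyyy}/{date:MM}/{date:dd}/employees.tsv"
    USING Extractors.Tsv();

// Windowing functions add per-city aggregates and ranks without collapsing rows.
@Ranked =
    SELECT emp_id,
           name,
           city,
           salary,
           SUM(salary) OVER(PARTITION BY city) AS CitySalaryTotal,
           RANK() OVER(PARTITION BY city ORDER BY salary DESC) AS SalaryRankInCity
    FROM @Employees
    WHERE date >= DateTime.Parse("2016-01-01");   // only files dated 2016 or later

OUTPUT @Ranked
TO "/output/ranked_employees.csv"
USING Outputters.Csv();
```

Filtering on the virtual column restricts the rowset to rows from files whose path dates satisfy the predicate.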

Azure Data Lake Analytics pricing:

The Azure Data Lake Analytics query service is currently in preview, and its pricing model will change after release. To understand the pricing, you first need to know what an Analytics Unit (AU) and a completed job are.

An Analytics Unit (AU) is a compute container that is assigned to execute code in parallel.
A job is considered completed if:
- The job has run to completion and executed the script correctly.
- The job has failed due to an unhandled exception in user code.
Any job that fails during compilation is not considered a completed job and will not incur any cost.

Other standard charges, such as transactions and data transfer, are not included in Analytics pricing.

Each Azure Data Lake Analytics account has configurable quotas limiting the number of AUs that can be assigned to jobs and the number of concurrent jobs. You can increase these quotas by contacting Microsoft.

Conclusion

Azure Data Lake Analytics is an exciting space in which to explore and apply big data technologies. Data Lake technologies are built for the cloud and employ Microsoft’s user-friendly and simple approach to technology.