We’ve previously discussed Azure Data Lake and Azure Data Lake Store. That post should provide you with a good foundation for understanding Azure Data Lake Analytics, a very new part of the Data Lake portfolio that allows you to apply analytics to the data you already have in Azure Data Lake Store or Azure Blob storage.
According to Microsoft, Azure Data Lake Analytics lets you:
Azure Data Lake Analytics allows users to focus on code and analytics logic without having to worry about the intricacies of hardware setup, management, and operation in a distributed environment. Data Lake Analytics works with various Azure data sources, such as Azure Blob storage and Azure SQL Database. However, using Azure Data Lake Analytics with data kept in Azure Data Lake Store provides the best performance for big data workloads. An image from Microsoft Azure shows the various technologies that combine to make Data Lake work:
(Image courtesy: Microsoft Azure)
In basic terms, here are the steps for setting up an Azure Data Lake Analytics operation:
Before getting started, it’s good to be aware of these details:
Along with Data Lake, Microsoft introduced Azure U-SQL. In Microsoft’s own words:
Azure Data Lake Analytics includes U-SQL, a language that unifies the benefits of SQL with the expressive power of your own code. U-SQL’s scalable distributed query capability enables you to efficiently analyze data in the store and across relational stores such as Azure SQL Database.
U-SQL combines the power of SQL and C# with a high-level abstraction of parallelism and distributed programming. U-SQL can process data of any kind and any size. Unlike Hive, which uses a SQL-like syntax (HQL) and will only work with structured data, U-SQL works with any kind of data: structured and unstructured.
A U-SQL query might look like this:
@Result =
    SELECT country, city, COUNT(*) AS NumberOfEmployees
    FROM @Employees
    GROUP BY country, city
    ORDER BY NumberOfEmployees DESC, country, city
    FETCH FIRST 10 ROWS;
Look at the query. SELECT, COUNT, FROM, GROUP BY, ORDER BY, etc., certainly use SQL syntax, but the data types follow the C# format.
What does @Employees mean? It turns out to be:
@Employees =
    EXTRACT emp_id int,
            name string,
            city string,
            salary int,
            country string,
            phone_numbers int
    FROM @INPUT_EMPLOYEESS
    USING Extractors.Text(delimiter: '\t', quoting: true, encoding: Encoding.Unicode);
The rowset @Employees is extracted from a file using Extractors.Text. In the other direction, you can use Outputters to write a result in a desired format, such as CSV.
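To make the Outputters point concrete, here is a minimal sketch of a complete script that extracts the employee file, aggregates it, and writes the result as CSV. The variable names @InputFile and @OutputFile and both file paths are hypothetical; the column list is taken from the EXTRACT statement above.

```usql
// Hypothetical input and output locations in the store; adjust to your account.
DECLARE @InputFile string = "/input/employees.tsv";
DECLARE @OutputFile string = "/output/employees_by_city.csv";

// Read the tab-separated file into a rowset (schema as in the example above).
@Employees =
    EXTRACT emp_id int,
            name string,
            city string,
            salary int,
            country string,
            phone_numbers int
    FROM @InputFile
    USING Extractors.Text(delimiter: '\t', quoting: true, encoding: Encoding.Unicode);

// Count employees per country and city.
@Result =
    SELECT country, city, COUNT(*) AS NumberOfEmployees
    FROM @Employees
    GROUP BY country, city;

// Write the aggregated rowset as CSV with a header row, largest groups first.
OUTPUT @Result
TO @OutputFile
ORDER BY NumberOfEmployees DESC
USING Outputters.Csv(outputHeader: true);
```

Here the ordering is applied on the OUTPUT statement rather than inside the SELECT, which avoids the need for a FETCH clause when you want every row written out.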
The above example shows how U-SQL, in its simplest form, can:
Readers might notice the similarity between Pig Latin scripts and U-SQL. As in Pig scripts, each U-SQL expression is assigned to a variable that is used in further processing. You can also deploy and register your code as an assembly in the U-SQL metadata catalog. This allows you, or anyone else, to reuse the code in future scripts by referencing it with REFERENCE ASSEMBLY <assembly_name>.
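As a rough illustration of the register-and-reuse flow, the sketch below registers a compiled C# library and calls one of its methods from a query. The assembly name EmployeeHelpers, the DLL path, the Formatting.NormalizeCity method, and the file paths are all hypothetical, invented for this example; only the CREATE ASSEMBLY and REFERENCE ASSEMBLY statements themselves are part of U-SQL.

```usql
// One-time step: register a compiled C# library in the catalog.
// The assembly name and DLL path are hypothetical.
CREATE ASSEMBLY IF NOT EXISTS [EmployeeHelpers]
FROM "/assemblies/EmployeeHelpers.dll";

// Make the assembly's types available to this script.
REFERENCE ASSEMBLY [EmployeeHelpers];

// Hypothetical input file with a reduced two-column schema.
@Employees =
    EXTRACT emp_id int,
            city string
    FROM "/input/employee_cities.tsv"
    USING Extractors.Tsv();

// Call a (hypothetical) static C# method from the registered assembly.
@Cleaned =
    SELECT emp_id,
           EmployeeHelpers.Formatting.NormalizeCity(city) AS city
    FROM @Employees;

OUTPUT @Cleaned
TO "/output/employees_cleaned.csv"
USING Outputters.Csv();
```

Once the assembly is registered, later scripts only need the REFERENCE ASSEMBLY line, not the CREATE ASSEMBLY step.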
The power of U-SQL goes beyond the simple query given above. U-SQL can also handle:
The Azure Data Lake Analytics query service is currently in preview, and its pricing model will change after release. Before looking at pricing, we need to understand what an Analytics Unit (AU) and a completed job are.
Other standard charges like transactions and data transfer are excluded from Analytics pricing:
Azure Data Lake Analytics is an exciting space in which to explore and run big data workloads. Data Lake technologies are built for the cloud and reflect Microsoft’s user-friendly, simple approach to technology.