Azure HDInsight
1h 5m

In this course, we review the features, concepts, and requirements needed to design data flows, and show how to implement them in Microsoft Azure. We also cover the basics of data flows, common data flow scenarios, and what is involved in designing a typical data flow.

Learning Objectives

  • Understand the key Azure components that can be used to design and deploy data flows
  • Know how those components fit together

Intended Audience

This course is intended for IT professionals who are interested in earning an Azure certification and for those who need to work with data flows in Azure.


Prerequisites

To get the most from this course, you should have at least a basic understanding of data flows and what they are used for.


In this lecture, I want to talk a bit about HDInsight. Azure HDInsight is another offering that will often be part of a data flow. It’s built for big data analysis: a fully managed, open-source analytics service in the cloud, designed for enterprises, that allows for fast, easy, and cost-effective processing of large amounts of data.

HDInsight supports many scenarios, including ETL, data warehousing, machine learning, and even IoT. It also supports multiple cluster types, such as Hadoop, Spark, and more. In a data flow, HDInsight plays the role of the engine, if you will: it powers the transformations that need to be completed.

HDInsight integrates with many Azure data stores and services, including Azure Cosmos DB, Data Lake Storage, Blob Storage, Event Hubs, and Data Factory.

Because an HDInsight cluster can’t be paused, you might want to create a cluster only when analysis or transformation work is required, and then delete it when the work is complete. You would normally automate this, and it’s typically something you do when processing data infrequently, in batches.
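The create-then-delete pattern described above can be sketched as a context manager. This is a minimal illustration, not the Azure SDK: `create_cluster`, `delete_cluster`, and `run_job` are hypothetical callables that you would back with whatever automation tooling you already use. The only point it makes is that deletion sits in a `finally` block, so the cluster is removed even if the job fails.

```python
from contextlib import contextmanager
from typing import Callable, Iterator


@contextmanager
def ephemeral_cluster(
    name: str,
    create_cluster: Callable[[str], None],
    delete_cluster: Callable[[str], None],
) -> Iterator[str]:
    """Create a cluster, hand its name to the caller, and always delete it."""
    create_cluster(name)  # hypothetical: provision the HDInsight cluster
    try:
        yield name
    finally:
        # Runs on success and on failure, so a failed job does not
        # leave a running cluster behind.
        delete_cluster(name)


def run_batch(name, create_cluster, delete_cluster, run_job):
    """Provision, run one batch job, tear down. All callables are hypothetical."""
    with ephemeral_cluster(name, create_cluster, delete_cluster) as cluster:
        return run_job(cluster)


if __name__ == "__main__":
    # Stand-in callables that only record what would have happened.
    events = []
    result = run_batch(
        "nightly-etl",
        create_cluster=lambda n: events.append(f"create {n}"),
        delete_cluster=lambda n: events.append(f"delete {n}"),
        run_job=lambda n: events.append(f"run job on {n}") or "done",
    )
    print(events, result)
```

In a real pipeline the three callables would wrap your provisioning and job-submission steps; the ordering guarantee (create, run, always delete) is the part worth keeping.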


When designing a data flow that incorporates an HDInsight cluster, the programming languages that you use in other parts of the data flow are going to be dictated, in part, by the cluster type that you deploy. For example, if you use Spark, you might consider Scala, Python, or Java. That takes us back to one of the initial questions we touched on earlier in the course: what skill sets do I have in-house? When designing a data flow that includes HDInsight, be sure to choose technologies and services that your staff can support.
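One way to make the skill-set question concrete is to compare what each cluster type lets you write against what your team already knows. The sketch below is illustrative only: the Spark entry uses the languages mentioned in this lecture, the Hadoop entry is an assumption (Java MapReduce) added for contrast, and the table is not an exhaustive list of what HDInsight supports. The function and variable names are made up for this example.

```python
from typing import Dict, Set

# Illustrative, non-exhaustive mapping of cluster type to languages.
# Spark languages are the ones named in the lecture; the Hadoop entry
# is an assumption (Java MapReduce) added for contrast.
CLUSTER_LANGUAGES: Dict[str, Set[str]] = {
    "Spark": {"Scala", "Python", "Java"},
    "Hadoop": {"Java"},
}


def supportable_clusters(team_skills: Set[str]) -> Dict[str, Set[str]]:
    """Return cluster types the team can support, with the overlapping languages."""
    matches: Dict[str, Set[str]] = {}
    for cluster, languages in CLUSTER_LANGUAGES.items():
        overlap = languages & team_skills  # languages the team already knows
        if overlap:
            matches[cluster] = overlap
    return matches


if __name__ == "__main__":
    # A team that knows Python and SQL matches only the Spark entry here.
    print(supportable_clusters({"Python", "SQL"}))  # prints {'Spark': {'Python'}}
```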

About the Author

Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skill set that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.

In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.

In his spare time, Tom enjoys camping, fishing, and playing poker.