Data Flow Basics
1h 5m

In this course, we're going to review the features, concepts, and requirements that are necessary for designing data flows, and how to implement them in Microsoft Azure. We're also going to cover the basics of data flows, common data flow scenarios, and what's involved in designing a typical data flow.

Learning Objectives

  • Understand key components that are available in Azure that can be used to design and deploy data flows
  • Know how the components fit together

Intended Audience

This course is intended for IT professionals who are interested in earning Azure certification and for those who need to work with data flows in Azure.


To get the most from this course, you should have at least a basic understanding of data flows and what they are used for.


In this lecture, we're going to talk about the basics of data flows, and about common data flow scenarios. 

These basic concepts are important because businesses need to know where they are performing well and where they are performing poorly. They need to ask tough questions about where the business is and where it's headed, and the answers can be pulled from the data the business collects. Additional questions about the business may also come up in the future. As such, it's critical to keep raw data around for some time, so any future questions can be answered.

The concept of data flow encompasses the initial ingestion of data, any required transformations, storage, and then, ultimately, analysis. Data flow is essentially about what needs to happen with data in order to meet business requirements and how it can be used to answer questions about the business. As such, it’s important to get it right.  Without the right data flow, a business may not be able to find answers that it needs today, let alone answers it might need in the future.

So, let’s talk a bit about data flow scenarios. I want to talk a bit about Batch Processing, Stream Processing, and Hybrid. I also want to discuss ETL vs ELT and “Schema on read” vs “schema on write”. By better understanding common data flow scenarios, you can be better prepared to design and document data flows in production.

Batch processing is useful in cases where a large amount of data moves quickly into a system but doesn't need to be analyzed immediately. Data that accumulates and is then processed on a schedule is considered to be batch processed. Generally speaking, such batch processing jobs inherently produce delays – because the data is processed on a schedule, the results are not real-time.

An example of batch processing would be a case where data comes in over the course of a day. Think of a retail store at the mall. Throughout the day, sales data comes in fast and furious. At the end of the day, the retailer might run a batch process that analyzes all the sales data for the day.

The retailer will typically throw a bunch of compute power at the data that needs to be processed in a batch at the end of the day, or maybe even overnight.  The data is processed, transformed into a usable format, and then stored for later access. The data flows from ingestion, to transformation, to storage – it’s a pretty typical case of batch processing.
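The end-of-day retail scenario above can be sketched in a few lines. This is a minimal illustration, not a production pipeline – the record shapes and the `run_daily_batch` helper are hypothetical stand-ins for a real scheduled job:

```python
# Hypothetical day's worth of accumulated raw sale records:
# (store_id, product, amount)
raw_sales = [
    ("store-01", "shoes", 59.99),
    ("store-01", "socks", 4.99),
    ("store-02", "shoes", 59.99),
]

def run_daily_batch(sales):
    """Process a full day's sales in one scheduled pass (batch processing):
    the whole data set is transformed and aggregated together."""
    totals = {}
    for store_id, _product, amount in sales:
        totals[store_id] = round(totals.get(store_id, 0.0) + amount, 2)
    return totals

# The batch job runs once, at end of day (or overnight), over everything
# that accumulated during the day -- results are inherently delayed.
daily_totals = run_daily_batch(raw_sales)
```

Note that nothing here is real-time: the first sale of the morning isn't reflected in any result until the scheduled job runs that night.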

Stream processing of data refers to the analysis of data as it’s received. While it may not quite be real-time processing, delays are typically sub-second. As data comes into the pipeline, analysis is performed and results are generally available within seconds, or even milliseconds.
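To contrast with the batch case, here's a minimal sketch of the stream-processing idea: each event is handled as it arrives, and a result is available immediately after each one. The event values and the `stream_totals` generator are illustrative, not a real streaming engine:

```python
def stream_totals(events):
    """Process each event as it's received (stream processing):
    an up-to-date result is yielded immediately per event, rather
    than waiting for a scheduled job over the whole data set."""
    total = 0.0
    for amount in events:
        total = round(total + amount, 2)
        yield total  # result available right after this event arrives

# Simulated event stream; in production this would be fed by something
# like an event hub or message queue rather than an in-memory list.
results = list(stream_totals([10.0, 2.5, 7.25]))
```

The key difference from the batch sketch is when results appear: here, after every event, instead of once at the end.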

Processing continuous data in-stream requires technology that can handle it. Approaches like the Lambda architecture are available to provide this type of stream processing. The Lambda architecture, incidentally, supports both batch and stream processing.

Stream processing, and even batch processing for that matter, will often lead to an eventual ETL process or an ELT process. ETL refers to “extract, transform, and load”, while ELT refers to “extract, load, and transform”. Both concepts refer to the process of obtaining data, extracting it from its source, and transforming it. The transformed data is then stored for analysis. ETL transforms the data before loading and storing it. ELT loads collected data before transforming it, which usually means data can be handled at greater scale.

We'll talk about ETL and ELT in more detail later. For this lecture, I just wanted to make sure you understand where they fit into the overall data flow process.

In a hybrid processing data flow scenario, the data that's processed, used, and stored is generally distributed between cloud and on-premises systems. As such, the data flow itself will often travel from on-prem to the cloud, and maybe even vice versa. It's therefore important to think about bandwidth – especially when dealing with a data flow that moves data from on-prem up to the cloud. It's important to assess the size of the pipe, the latency, and the cost implications of data ingress/egress to and from the cloud.

A common hybrid scenario is one in which data is ingested from applications and systems that reside on-prem, and then stored, processed, and maybe even transformed, in the cloud. When designing a hybrid data flow that sources data from on-prem devices, applications, forms, or really any other type of on-prem source, a key consideration is how the connection from the on-prem environment to the cloud is constructed. While a site-to-site VPN might be sufficient, latency issues may dictate that ExpressRoute be considered instead.

Both patterns, ETL and ELT, are commonly used to form most data flows in production. As mentioned previously, the ETL (or extract, transform, and load) process begins with extracting data from one or more systems. The extracted data might be in the same format, or it can be in all different formats. Once the data is extracted, it is transformed into some usable format – typically in memory. The data is then stored so that it can be queried later. As such, ETL is considered a "schema on write" approach, because the data is first transformed into some standard format BEFORE it's written to storage.
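The ETL / schema-on-write sequence can be sketched as three small functions. This is a toy illustration under assumed record shapes – the `extract`, `transform`, and `load` helpers and the sample sources are all hypothetical:

```python
import json
import os
import tempfile

def extract(sources):
    """Extract: pull raw records from one or more (hypothetical) sources,
    possibly in different formats."""
    for source in sources:
        yield from source

def transform(records):
    """Transform: normalize records into one standard shape, in memory,
    BEFORE anything is written -- this is the 'schema on write' step."""
    for r in records:
        yield {"name": r["name"].strip().lower(), "amount": float(r["amount"])}

def load(records, path):
    """Load: write the already-transformed records to storage."""
    with open(path, "w") as f:
        for r in records:
            f.write(json.dumps(r) + "\n")

# Two sources with slightly different raw conventions.
source_a = [{"name": " Alice ", "amount": "10"}]
source_b = [{"name": "BOB", "amount": "2.5"}]

out = os.path.join(tempfile.mkdtemp(), "clean.jsonl")
load(transform(extract([source_a, source_b])), out)  # E -> T -> L
```

Because the transform sits between extract and load, only clean, schema-conforming rows ever reach storage.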

The ELT process, or (extract, load, and transform), essentially replicates data to storage as-is. Only after the data is written to storage, is it transformed into a usable format. This is referred to as “schema on read” because there is no schema enforced on the data during initial ingestion. Instead, the data is transformed AFTER it’s been stored and while it’s being used.
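For contrast, here's the same toy data handled ELT-style: raw records land in storage untouched, and the schema is applied only when the data is read. Again, the record shapes and helper names are assumptions for illustration:

```python
# Stands in for cheap raw storage, e.g. a data lake -- rows land as-is.
raw_store = []

def load_raw(records):
    """Load: ingest records untouched. With no transform in the path,
    ingestion can keep pace with however fast data is written."""
    raw_store.extend(records)

def read_transformed():
    """Transform on read ('schema on read'): raw rows are normalized
    only at query time, after they've already been stored."""
    return [
        {"name": r["name"].strip().lower(), "amount": float(r["amount"])}
        for r in raw_store
    ]

load_raw([{"name": " Alice ", "amount": "10"},
          {"name": "BOB", "amount": "2.5"}])
rows = read_transformed()  # schema applied here, not at ingestion
```

Compare this with the ETL sketch: the same normalization happens, but it has moved from the write path to the read path, which is what lets ingestion scale independently.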

When considering ETL vs ELT, the main driver generally comes down to scale. ETL requires data to be transformed before it can be loaded, which means significant compute power is needed up front. This can negatively impact the ability to process large amounts of data.

Using ELT, instead, separates data ingestion from the transformation process. What this does is allow huge amounts of data to be ingested and loaded BEFORE it is transformed. By breaking the process apart in ELT, it’s possible to ingest lots more data than with ETL. You can essentially ingest data as fast as it’s written.


Ultimately, if scale is a concern, ELT is the preferred strategy over ETL.

About the Author

Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.

In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.

In his spare time, Tom enjoys camping, fishing, and playing poker.