Designing Data Flows in Azure
Data Flow Basics
Designing a Data Flow Solution
This Designing Data Flows in Azure course will enable you to implement the best practices for data flows in your own team. Starting from the basics, you will learn how data flows work from beginning to end. Though we do recommend having an idea of what data flows are and how they are used, this course contains some demonstration lectures to make sure you have got to grips with the concept. By better understanding the key components available in Azure to design and deploy efficient data flows, you will help your organization reap the benefits.
This course is made up of 19 comprehensive lectures including an overview, demonstrations, and a conclusion.
Learning Objectives
- Review the features, concepts, and requirements that are necessary for designing data flows
- Learn the basic principles of data flows and common data flow scenarios
- Understand how to implement data flows within Microsoft Azure
Intended Audience
- IT professionals who are interested in obtaining an Azure certification
- Those looking to implement data flows within their organizations
Prerequisites
- A basic understanding of data flows and their uses
Related Training Content
For more training content related to this course, visit our dedicated MS Azure Content Training Library.
In this lecture, let's talk a bit about data flow scenarios. Specifically, I want to talk about batch processing, stream processing, and hybrid processing. I also want to discuss ETL vs ELT and schema on read versus schema on write. By better understanding common data flow scenarios, you can be better prepared to design and document data flows in production.
Batch processing is useful in cases where a large amount of data moves into a system quickly. Data such as this, processed on a schedule, is considered to be batch processed. Generally speaking, batch processing jobs inherently introduce delays: because the data is processed on a schedule, it's not real-time. An example of batch processing would be a case where data comes in over the course of a day. Think of a retail store at the mall. Throughout the day, sales data comes in fast and furious. At the end of the day, the retailer might run a batch process that analyzes all the sales data for the day. The retailer will typically throw a bunch of compute power at the data that needs to be processed in a single batch at the end of the day or maybe even overnight.
The data is processed, transformed into a usable format, and then stored for later access. The data flows from ingestion, to transformation, to storage. It's a pretty typical case of batch processing.
Stream processing of data refers to the analysis of data as it's received. While it may not quite be real-time processing, delays are typically sub-second. As data comes into the pipeline, analysis is performed and results are generally available within seconds or even milliseconds. Processing continuous data in-stream requires technology that can handle this. Approaches like the Lambda architecture are available to provide this type of stream processing. The Lambda architecture, incidentally, supports both batch and stream processing.
Stream processing or even batch processing, for that matter, will often lead to an eventual ETL process or an ELT process. ETL refers to extract, transform, and load, while ELT refers to extract, load, and transform. Both concepts refer to the process of extracting data from its source, transforming it, and storing it for later analysis. They differ in the order of the steps. ETL transforms the data before loading and storing it. ELT loads collected data before transforming it, which usually means data can be handled at greater scale.
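To make the batch versus stream distinction above a bit more concrete, here is a minimal Python sketch. It is not taken from the lecture: the `SALES` events, the store names, and both function names are hypothetical, and a real pipeline would read from a queue or storage service rather than an in-memory list.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, Tuple

# Hypothetical sales events as (store_id, amount). Illustrative data only.
Sale = Tuple[str, float]

SALES = [("mall-1", 20.0), ("mall-2", 5.0), ("mall-1", 12.5), ("mall-2", 7.5)]


def batch_totals(sales: Iterable[Sale]) -> Dict[str, float]:
    """Batch: the whole day's data is collected first, then processed in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
    return dict(totals)


def stream_totals(sales: Iterable[Sale]) -> Iterator[Dict[str, float]]:
    """Stream: results are updated and emitted as each event arrives."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
        yield dict(totals)  # a fresh snapshot is available after every event
```

Both functions arrive at the same final totals. The difference is when results become available: `batch_totals` returns once, after all the data is in, while `stream_totals` yields an updated snapshot after every sale.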
In a hybrid processing data flow scenario, data that's processed, used, and stored is generally distributed among cloud and on-prem systems. As such, the data flow itself will often travel from on-prem to the cloud and maybe even vice versa. That makes it important to think about bandwidth, especially when dealing with a data flow that moves data from on-prem up to the cloud. It's important to assess the size of the pipe, the latency, and the cost implications of data ingress and egress to and from the cloud. A common hybrid scenario is one in which data is ingested from applications and systems that reside on-prem and then stored and processed, and maybe even transformed, in the cloud. When designing a hybrid data flow that sources data from on-prem devices, applications, forms, or really any other type of on-prem source, a key consideration is how the connection from the on-prem environment to the cloud is constructed. While a site-to-site VPN might be sufficient, latency issues may dictate that an ExpressRoute be considered instead.
Whether you use ETL or ELT, both patterns are commonly used to form most data flows in production. As mentioned previously, the ETL, or extract, transform, and load, process begins with extracting data from one or more systems. The extracted data might be in the same format or it can be in all different formats. Once the data is extracted, it's transformed into some usable format, typically in memory. The data is then stored, so that it can be queried later. As such, ETL is considered a schema-on-write approach because the data is first transformed into some standard format or schema before it's written to storage. The ELT, or extract, load, and transform, process essentially replicates data to storage as-is. Only after the data is written to storage is it transformed into a usable format.
This is referred to as schema on read because there is no schema enforced on the data during initial ingestion. Instead, the data is transformed after it's been stored and while it's being used. When considering ETL vs ELT, the main driver generally comes down to scale. Using ETL requires data to be transformed before it can be loaded. This means that there's lots of compute power needed. As such, this can negatively impact the ability to process large amounts of data. Using ELT, instead, separates data ingestion from the transformation process. What this does is allow huge amounts of data to be ingested and loaded before it's transformed. By breaking the process apart in ELT, it's possible to ingest lots more data than with ETL. You can essentially ingest data as fast as it's written. Ultimately, if scale is a concern, ELT is the preferred strategy over ETL.
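The ordering difference between ETL (schema on write) and ELT (schema on read) can be sketched in a few lines of Python. This is an illustration under assumptions, not the lecture's own implementation: the `RAW_RECORDS` data, the target schema, and every function name are hypothetical, and plain lists stand in for real storage.

```python
import json
from typing import Dict, List

# Hypothetical raw records from two source systems with different formats.
RAW_RECORDS = [
    '{"sku": "A1", "price": "19.99"}',
    '{"SKU": "b2", "Price": 5}',
]


def to_schema(raw: str) -> Dict[str, object]:
    """Transform one raw record into an assumed standard schema: sku (str), price (float)."""
    record = {key.lower(): value for key, value in json.loads(raw).items()}
    return {"sku": str(record["sku"]).upper(), "price": float(record["price"])}


def etl(raw_records: List[str]) -> List[Dict[str, object]]:
    """ETL, schema on write: every record is transformed before it is stored."""
    store = []
    for raw in raw_records:
        store.append(to_schema(raw))  # transform happens before the write
    return store


def elt_load(raw_records: List[str]) -> List[str]:
    """ELT load step: data is written to storage as-is, with no schema enforced."""
    return list(raw_records)


def elt_read(store: List[str]) -> List[Dict[str, object]]:
    """ELT, schema on read: the schema is applied only when the data is queried."""
    return [to_schema(raw) for raw in store]
```

Note that `elt_load` does no parsing at all, which is why ingestion can keep pace with the incoming data, while `etl` pays the transformation cost on every record before anything is stored. Calling `elt_read(elt_load(RAW_RECORDS))` produces the same rows as `etl(RAW_RECORDS)`, just at a later point in the flow.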
About the Author
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.