The Data Lifecycle | DAL4 A5.1 |
The concept of data flow involves transforming, storing, and analyzing data in order to answer questions about your organization. It helps you to understand where your organization is performing well, and where you could improve. It can also give you an insight into what the future of your organization might look like.
In this video course, you'll learn the basics of data flow, including the data lifecycle, and look at some common data flow scenarios.
In this lecture, let's talk a bit about data flow scenarios. Specifically, I want to talk about batch processing, stream processing, and hybrid. I also want to discuss ETL vs ELT and schema on read versus schema on write. By better understanding common data flow scenarios, you can be better prepared to design and document data flows in production.
Batch processing is useful in cases where there is a large amount of data quickly moving into a system. Data such as this, processed on a schedule, is considered to be batch processed. Generally speaking, batch processing jobs inherently produce delays, and because the data is processed on a schedule, it's not real-time. An example of batch processing would be a case where data comes in over the course of a day. Think of a retail store at the mall. Throughout the day, sales data comes in fast and furious. At the end of the day, the retailer might run a batch process that analyzes all the sales data for the day. The retailer will typically throw a bunch of compute power at the data, which is processed in a single batch at the end of the day or maybe even overnight.
The data is processed, transformed into a usable format, and then stored for later access. The data flows from ingestion, to transformation, to storage. It's a pretty typical case of batch processing.
Stream processing of data refers to the analysis of data as it's received. While it may not quite be real-time processing, delays are typically sub-second. As data comes into the pipeline, analysis is performed and results are generally available within seconds or even milliseconds. Processing continuous data in-stream requires technology that can handle this. Technologies like Lambda are available to provide this type of stream processing. Lambda, incidentally, supports both batch and stream processing.
Stream processing, or even batch processing for that matter, will often lead to an eventual ETL process or an ELT process. ETL refers to extract, transform, and load, while ELT refers to extract, load, and transform. Both refer to the process of extracting data from its source, transforming it, and storing it for later analysis. They differ in the order of the last two steps. ETL transforms the data before loading and storing it. ELT loads collected data before transforming it, which usually means data can be handled at greater scale.
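To make the batch versus stream distinction a bit more concrete, here is a minimal Python sketch built on the retail example. The record shape (a store name and a sale amount) and the function names `batch_totals` and `stream_totals` are assumptions made up for illustration; they are not from any particular product. The batch function waits for the whole day's data before producing a result, while the stream function emits an updated total as each sale arrives.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Hypothetical sale record: (store name, sale amount).
Sale = Tuple[str, float]


def batch_totals(sales: List[Sale]) -> Dict[str, float]:
    """Batch: the full day's sales are collected first, then processed in one pass."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
    return dict(totals)


def stream_totals(sales: Iterable[Sale]) -> Iterator[Tuple[str, float]]:
    """Stream: update the running total per sale and emit it immediately."""
    totals: Dict[str, float] = defaultdict(float)
    for store, amount in sales:
        totals[store] += amount
        yield store, totals[store]


# Illustrative data only.
day_of_sales = [("mall", 20.0), ("outlet", 5.0), ("mall", 12.5)]

end_of_day = batch_totals(day_of_sales)       # one result, available after the run
running = list(stream_totals(day_of_sales))   # one result per incoming sale
```

Both functions do the same arithmetic. What changes is when results become available: once, after the scheduled run, or continuously, as data arrives.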
In a hybrid processing data flow scenario, data that's processed, used, and stored is generally distributed among cloud and on-prem systems. The data flow itself will often travel from on-prem to the cloud and maybe even vice versa. As such, it's important to think about bandwidth, especially when dealing with a data flow that moves data from on-prem up to the cloud. It's important to assess the size of the pipe, the latency, and the cost implications of data ingress and egress to and from the cloud.
A common hybrid scenario is one in which data is ingested from applications and systems that reside on-prem and then stored, processed, and maybe even transformed in the cloud. When designing a hybrid data flow that sources data from on-prem devices, applications, forms, or really any other type of on-prem source, a key consideration is how the connection from the on-prem environment to the cloud is constructed. While a site-to-site VPN might be sufficient, latency issues may dictate that an ExpressRoute connection be considered instead.
ETL and ELT are the two patterns used to form most data flows in production. As mentioned previously, the ETL process, or extract, transform, and load, begins with extracting data from one or more systems. The extracted data might all be in the same format, or it can be in many different formats. Once the data is extracted, it's transformed into some usable format, typically in memory. The data is then stored so that it can be queried later. As such, ETL is considered schema on write, because the data is first transformed into some standard format or schema before it's written to storage. The ELT process, or extract, load, and transform, essentially replicates data to storage as-is. Only after the data is written to storage is it transformed into a usable format.
This is referred to as schema on read because there is no schema enforced on the data during initial ingestion. Instead, the data is transformed after it's been stored and while it's being used.
When considering ETL vs ELT, the main driver generally comes down to scale. Using ETL requires data to be transformed before it can be loaded. This means that a lot of compute power is needed, which can limit the ability to process large amounts of data. Using ELT, instead, separates data ingestion from the transformation process. What this does is allow huge amounts of data to be ingested and loaded before it's transformed. By breaking the process apart in ELT, it's possible to ingest far more data than with ETL. You can essentially ingest data as fast as it's written. Ultimately, if scale is a concern, ELT is the preferred strategy over ETL.
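As a rough illustration of schema on write versus schema on read, the Python sketch below runs the same two raw records through both patterns. Everything in it is a made-up assumption for the sake of the example: the record fields, the target schema of `sku` and `price`, and the function names. In-memory lists stand in for real storage.

```python
import json
from typing import Dict, List

# Hypothetical raw records from two source systems with different shapes.
RAW = [
    '{"sku": "A1", "price": "19.99"}',
    '{"item": "B2", "cost": 5}',
]


def transform(raw: str) -> Dict[str, object]:
    """Normalize one raw record into the common schema {"sku": str, "price": float}."""
    rec = json.loads(raw)
    sku = rec.get("sku", rec.get("item"))
    price = rec.get("price", rec.get("cost"))
    return {"sku": str(sku), "price": float(price)}


def etl(raw_records: List[str]) -> List[Dict[str, object]]:
    """ETL, schema on write: transform first, so storage only holds conforming rows."""
    return [transform(r) for r in raw_records]


def elt_load(raw_records: List[str]) -> List[str]:
    """ELT load step: copy records to storage as-is, with no schema applied."""
    return list(raw_records)


def elt_read(stored: List[str]) -> List[Dict[str, object]]:
    """ELT, schema on read: apply the schema only when the data is queried."""
    return [transform(r) for r in stored]


warehouse = etl(RAW)      # transformed before it is stored
lake = elt_load(RAW)      # stored untouched, so ingestion does no transformation work
queried = elt_read(lake)  # transformed later, at read time
```

Notice that `elt_load` does no transformation work at all, which is why ingestion in ELT can keep up with large volumes. The cost of `transform` is deferred to read time rather than removed.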