The Data Lifecycle | DAL4 A5.1
The concept of data flow involves transforming, storing, and analyzing data in order to answer questions about your organization. It helps you to understand where your organization is performing well, and where you could improve. It can also give you an insight into what the future of your organization might look like.
In this video course, you'll learn the basics of data flow, including the data lifecycle, and look at some common data flow scenarios.
Before designing a data flow, you need to determine what its requirements are. Determining data flow requirements means asking, and answering, lots of different questions. You need to ask these questions up front because when you transform your data, you are often going to keep only the information that you care about and dump the rest. If you don't think everything through, you might very well dump data that you don't care about now but might care about later down the road. In this lecture, we're going to discuss some of the key questions and considerations when determining data flow requirements.
So, what questions need to be asked? Well, for starters, you need to understand the source or sources of the data and what the data is, exactly. You'll need to determine what format the data is in. Is it coming from another database? Is it XML? Is it JSON? Is it unstructured data, such as images? As you can see, there are lots of questions about data format.
After sorting out the source and format of the data, you need to identify any security-related issues. You need to identify who needs to access the data and what they can do with it. Access controls will need to be kept in place throughout the lifecycle of the data. This is most important when pulling data from multiple sources, because when you do, the source data is likely protected.
Dumping all this data into a shared repository suddenly makes it available to people who might not need access to it. As such, it's critical that you understand the security requirements of the source data, and that you retain that security once the data has been collected.
Along with security considerations, you also need to account for auditing requirements. You need to determine if there are auditing or logging requirements for the collected data. If there are auditing requirements on the data, you might be able to use metadata, such as creation dates, batch data, and modified dates, to track the data as it's loaded. Logging requirements, if there are any, can be addressed with operation counts, insertion counts, execution times, and even success or failure information for data loads.
Storage retention requirements are also an important consideration. When designing a data flow, you need to determine if you need to, or want to, store the data forever. You also need to determine if the data needs to be stored in its raw format. The answer to this question is often yes, because if you only keep data that's been transformed or processed, you may find yourself with unanswerable business questions later, because the data that answers those questions was trimmed during the processing stage. You'll also want to determine whether there are retention policies that affect how long data is retained. Are there certain policies or laws that require data to be retained indefinitely or for a specific amount of time? It's an important consideration.
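To make the auditing and logging ideas above a little more concrete, here is a minimal sketch in Python. It is not taken from the course: the function name load_batch, the LoadLog record, and the underscore-prefixed metadata fields are all hypothetical, and the "target" is just an in-memory list standing in for whatever store you actually load into.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import time
import uuid


@dataclass
class LoadLog:
    """Summary of one batch load, kept to satisfy logging requirements."""
    batch_id: str
    rows_read: int
    rows_inserted: int
    seconds: float
    succeeded: bool


def load_batch(raw_rows, target):
    """Append raw_rows (a list of dicts) to target, tagging each row with audit metadata."""
    batch_id = str(uuid.uuid4())  # hypothetical batch identifier
    started = time.perf_counter()
    inserted = 0
    succeeded = True
    try:
        for row in raw_rows:
            loaded_at = datetime.now(timezone.utc).isoformat()
            target.append({
                **row,  # raises TypeError if a row is not a mapping
                "_batch_id": batch_id,
                "_created_at": loaded_at,
                "_modified_at": loaded_at,
            })
            inserted += 1
    except Exception:
        # Record the failure instead of losing it; rows loaded so far stay in target.
        succeeded = False
    elapsed = time.perf_counter() - started
    return LoadLog(batch_id, len(raw_rows), inserted, elapsed, succeeded)
```

Calling load_batch([{"id": 1}, {"id": 2}], target) leaves two tagged rows in target and returns a LoadLog with the insertion count, execution time, and success flag, which is the kind of information the logging requirements described above usually ask for.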
Another important consideration is GDPR. Does GDPR even apply? Is there certain data that needs to be removed? Is there a requirement that certain data be removable in the future? These are all important questions to ask.
Another key consideration when designing data flows is metadata. If you are going to be analyzing the data eventually, what metadata, if any, is necessary to facilitate that analysis? Additionally, you should consider what business intelligence tools are going to be used to analyze that data, because this will often determine, in some part, how that data is stored and in what format.
As you work through all the data flow design requirements, it's important that you also consider the existing skillsets that are available. Because there are numerous technologies that come into play when designing and deploying a data flow, it makes sense to consider whether you have the in-house skillsets to support what you're designing.
Think about it: if you're going to design a data flow that leverages Python or T-SQL, what good is it if there's nobody in the organization who knows Python or T-SQL? When deciding on what technologies to leverage in your data flow design, think about your team and what skillsets they possess. You might also consider using pre-existing packages that you already have. For example, if you're already using SQL Server Analysis Services and SSIS packages, you probably already have data flows and controls defined. It might make more sense to just bring those pre-existing solutions to the cloud, instead of reinventing the wheel with technologies that are foreign to your team.
As you define all the requirements for your data flow design, think about the interdependencies of all the different components. This is important because you're going to need to make all these interdependent technologies work together, much like the conductor of a symphony. If you can't make the chosen technologies work together, you have no data flow.
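One way to reason about those interdependencies is to write them down as a dependency graph and check that a valid run order exists. The sketch below is only an illustration, not something from the course: the component names are hypothetical, and it assumes Python 3.9 or later for the standard library graphlib module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical components; each one maps to the set of components it depends on.
dependencies = {
    "ingest_raw": set(),
    "apply_access_controls": {"ingest_raw"},
    "transform": {"apply_access_controls"},
    "load_warehouse": {"transform"},
    "refresh_reports": {"load_warehouse"},
}


def run_order(deps):
    """Return the components in an order that respects every dependency."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as err:
        # args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular dependency between components: {err.args[1]}") from err
```

For the chain above, run_order(dependencies) starts with ingest_raw and ends with refresh_reports. If two components end up depending on each other, the function raises a ValueError, which is a cheap early warning that the chosen technologies cannot be orchestrated as designed.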