Determining Data Flow Requirements
Difficulty
Intermediate
Duration
1h 5m
Students
2163
Ratings
4.6/5
Description

In this course, we’re going to review the features, concepts, and requirements that are necessary for designing data flows and how to implement them in Microsoft Azure. We’re also going to cover the basics of data flows, common data flow scenarios, and what is involved in designing a typical data flow.

Learning Objectives

  • Understand key components that are available in Azure that can be used to design and deploy data flows
  • Know how the components fit together

Intended Audience

This course is intended for IT professionals who are interested in earning Azure certification and for those who need to work with data flows in Azure.

Prerequisites 

To get the most from this course, you should have at least a basic understanding of data flows and what they are used for.

Transcript

Before designing a data flow, you need to determine what the data flow requirements are. Doing so means asking, and answering, lots of questions. You need to ask them up front because when you transform your data, you are often going to keep only the information that you care about and discard the rest. If you don’t think everything through, you might very well discard data that you don’t care about now but will need later down the road.

In this lecture, we’re going to discuss some of the key questions and considerations when determining data flow requirements. So, what questions need to be asked?

Well, for starters, you need to understand the source or sources of the data and what the data is, exactly. You’ll need to determine what format the data is in. Is the data coming from another database? Is it XML? JSON? Is it unstructured data or images?

After sorting out the source and format of the data, you need to identify any security-related issues. You need to identify who needs to access the data and what they can do with it. Access controls will need to be kept in place throughout the lifecycle of the data. This is most important when pulling data from multiple sources, because the source data is likely protected. Dumping all this data into a shared repository can suddenly make data available to people who should not have access to it. As such, it’s critical that you understand the security requirements of the source data, and that you retain that security once the data has been collected.
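To make the idea of retaining security concrete, here is a minimal sketch in Python. It assumes that each record carries a set of roles allowed to read it, and that the shared repository filters on that set. The `SourceRecord` class, the role names, and the sample data are all hypothetical and are not tied to any Azure service.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceRecord:
    """A record pulled from a source, tagged with who may read it (hypothetical model)."""
    source: str
    payload: dict
    allowed_roles: frozenset


def load_into_shared_store(records):
    """Copy records into a shared list, keeping each record's access tag attached."""
    return list(records)


def readable_by(store, role):
    """Return only the payloads that the given role is allowed to read."""
    return [r.payload for r in store if role in r.allowed_roles]


# Hypothetical sample data from two protected sources.
store = load_into_shared_store([
    SourceRecord("hr_db", {"employee": "A", "salary": 100}, frozenset({"hr"})),
    SourceRecord("sales_db", {"order": 17}, frozenset({"hr", "sales"})),
])
```

In this sketch, a caller with the `sales` role sees only the sales record, even though both records sit in the same store. A real design would more likely lean on the access controls of the storage platform itself than on application code like this.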

Along with security considerations, you also need to account for auditing requirements. You need to determine if there are any auditing or logging requirements for the collected data. If there are auditing requirements on the data, you might be able to use metadata, such as creation dates, batch dates, modified dates, and such, to track the data as it’s loaded. Logging requirements, if there are any, can be addressed with operation counts, insertion counts, execution times, and success or failure information for data loads.
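As a rough illustration of the auditing and logging ideas above, the following Python sketch stamps each loaded row with audit metadata and returns a log entry for the load. The function name `load_batch`, the `_batch_id` and `_loaded_at` field names, and the use of a plain list as the destination are assumptions made for the example.

```python
import time
from datetime import datetime, timezone


def load_batch(rows, sink, batch_id):
    """Append rows to sink with audit metadata and return a log entry for the load."""
    started = time.perf_counter()
    inserted = 0
    succeeded = True
    error = None
    try:
        for row in rows:
            # Audit metadata: which batch loaded the row, and when (UTC).
            loaded_at = datetime.now(timezone.utc).isoformat()
            sink.append({**row, "_batch_id": batch_id, "_loaded_at": loaded_at})
            inserted += 1
    except Exception as exc:
        # Record the failure instead of raising, so the load is always logged.
        succeeded = False
        error = str(exc)
    return {
        "batch_id": batch_id,
        "operation_count": len(rows),
        "insertion_count": inserted,
        "execution_seconds": time.perf_counter() - started,
        "succeeded": succeeded,
        "error": error,
    }
```

Calling `load_batch([{"id": 1}, {"id": 2}], sink, "b1")` with an empty list as `sink` would leave two stamped rows in the list and return a log entry with an insertion count of 2. If a row cannot be loaded, the entry reports the failure and how many rows made it in before the error.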

Storage retention requirements are also an important consideration.

When designing a data flow, you need to determine if you need to, or want to, store the data forever. You also need to determine if the data needs to be stored in its raw format. The answer to this question is often “yes” because if you only keep data that’s been transformed or processed, you may find yourself with unanswerable business questions later, because the data that answers them was trimmed during the processing stage. 

You’ll also want to determine whether there are retention policies that affect how long data is retained. Are there certain policies or laws that require data to be retained indefinitely or for a specific amount of time? It’s an important consideration. Another important consideration is GDPR. Does GDPR apply? Is there certain data that needs to be removed? Is there a requirement that certain data be removable in the future? These are all important questions to ask about data storage as it relates to designing a data flow.
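The retention questions above can be sketched as a pair of small Python checks. The categories and retention periods in `RETENTION_DAYS` are made-up values for illustration only, and `subject_id` is a hypothetical field that identifies the person a record is about.

```python
from datetime import date, timedelta

# Hypothetical retention periods, in days, per data category.
# None means the data is kept indefinitely.
RETENTION_DAYS = {"web_logs": 90, "invoices": 3650, "raw_events": None}


def is_expired(category, created_on, today):
    """Return True when a record has outlived the retention period for its category."""
    days = RETENTION_DAYS[category]
    if days is None:
        return False
    return today - created_on > timedelta(days=days)


def erase_subject(records, subject_id):
    """Return the records with every entry for one data subject removed."""
    return [r for r in records if r["subject_id"] != subject_id]
```

The point of the sketch is that removal is only possible if the stored data keeps the fields needed to decide what to remove, here a creation date and a subject identifier. That is worth settling before the data flow is built.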

Another key consideration when designing data flows is metadata. If you are going to be analyzing the data eventually, what metadata, if any, is necessary to facilitate that analysis? Additionally, you should consider what business intelligence tools are going to be used to analyze the data, because this will often determine, in some part, how the data is stored and what format it is stored in.

As you work through all the data flow design requirements, it’s important that you also consider existing skillsets that are available. Because there are numerous technologies that come into play when designing and deploying a data flow, it makes sense to consider whether you have the in-house skillsets to support what you are designing.

Think about it. If you are going to design a data flow that leverages Python or T-SQL, what good is it if there is nobody in the organization who knows Python or T-SQL? When deciding on what technologies to leverage in your data flow design, think about your team and what skillsets they possess. You might also consider using pre-existing packages that you already have. For example, if you are already using SQL Server Analysis Services and SSIS packages, you probably already have data flows and controls defined. It might make more sense to just bring those pre-existing solutions to the cloud, instead of reinventing the wheel with technologies that are foreign to your team.

 

As you define all the requirements for your data flow design, think about the interdependencies of all the different components. This is important because you are going to need to make all these interdependent technologies work together, much like the conductor of a symphony orchestra. If you can’t make the chosen technologies work together, you have no data flow.

About the Author
Students
90312
Courses
89
Learning Paths
56

Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.

In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.

In his spare time, Tom enjoys camping, fishing, and playing poker.