Designing Data Flows in Azure
This Designing Data Flows in Azure course will enable you to implement the best practices for data flows in your own team. Starting from the basics, you will learn how data flows work from beginning to end. Although we recommend having a basic idea of what data flows are and how they are used, this course includes demonstration lectures to make sure you get to grips with the concepts. By better understanding the key components available in Azure for designing and deploying efficient data flows, you will help your organization reap the benefits.
This course is made up of 19 comprehensive lectures including an overview, demonstrations, and a conclusion.
Learning Objectives
- Review the features, concepts, and requirements that are necessary for designing data flows
- Learn the basic principles of data flows and common data flow scenarios
- Understand how to implement data flows within Microsoft Azure
Intended Audience
- IT professionals who are interested in obtaining an Azure certification
- Those looking to implement data flows within their organizations
Prerequisites
- A basic understanding of data flows and their uses
Before designing a data flow, you need to determine what the data flow requirements are. Determining those requirements means asking, and answering, lots of different questions. You need to ask them up front because when you transform your data, you are often going to keep only the information that you care about and dump the rest. If you don't think everything through, you might very well dump data that you don't care about now but might care about later down the road. In this lecture, we're going to discuss some of the key questions and considerations when determining data flow requirements.
So, what questions need to be asked? Well, for starters, you need to understand the source or sources of the data and what the data is, exactly. You'll need to determine what format the data is in. Is it coming from another database? Is it XML? Is it JSON? Is it unstructured, such as images? As you can see, there are lots of questions about data format.
After sorting out the source and format of the data, you need to identify any security-related issues. You need to identify who needs to access the data and what they can do with it. Access controls will need to be kept in place throughout the lifecycle of the data. This is most important when pulling data from multiple sources, because when you do, the source data is likely protected.
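Going back to the format questions above, a first pass at "Is it XML? Is it JSON? Is it unstructured?" can be sketched in a few lines of Python. This is only an illustration, not part of the course material: the function name `sniff_format` and the three labels it returns are assumptions made for this example, and a real data flow would usually rely on source metadata rather than content sniffing.

```python
import json
import xml.etree.ElementTree as ET


def sniff_format(payload: bytes) -> str:
    """Return a rough label for a raw payload: 'json', 'xml', or 'unknown'.

    Hypothetical helper for illustration only.
    """
    try:
        text = payload.decode("utf-8")
    except UnicodeDecodeError:
        # Not valid UTF-8 text; likely binary content such as an image.
        return "unknown"

    stripped = text.strip()
    if not stripped:
        return "unknown"

    # Only JSON objects and arrays are treated as structured JSON here.
    if stripped[0] in "{[":
        try:
            json.loads(stripped)
            return "json"
        except json.JSONDecodeError:
            pass

    if stripped[0] == "<":
        try:
            ET.fromstring(stripped)
            return "xml"
        except ET.ParseError:
            pass

    # Plain text, malformed documents, and anything else fall through.
    return "unknown"
```

Anything labelled `unknown` is a prompt to go back and ask the source owner what the data actually is, rather than a final answer.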
Dumping all this data into a shared repository suddenly makes it available to people who might not need access to it. As such, it's critical that you understand the security requirements of the source data, and that you retain that security once the data has been collected.
Along with security considerations, you also need to account for auditing requirements. You need to determine if there are auditing or logging requirements for the collected data. If there are auditing requirements on the data, you might be able to use metadata, such as creation dates, batch dates, modified dates, and such, to track the data as it's loaded. Logging requirements, if there are any, can be addressed with operation counts, insertion counts, stuff like execution times, and even success or failure information for data loads.
Storage retention requirements are also an important consideration. When designing a data flow, you need to determine if you need to, or want to, store the data forever. You also need to determine if the data needs to be stored in its raw format. The answer to this question is often yes, because if you only keep data that's been transformed or processed, you may find yourself with unanswerable business questions later, because the data that answers those questions was trimmed during the processing stage. You'll also want to determine whether there are retention policies that affect how long data is retained. Are there certain policies or laws that require data to be retained indefinitely or for a specific amount of time? It's an important consideration.
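The auditing metadata and load-logging ideas discussed above can be sketched in Python. This is a minimal illustration under stated assumptions, not an Azure API: the function `load_batch`, the in-memory list used as a sink, and every field name are hypothetical. It stamps each record with a batch ID and creation and modified timestamps, keeps the raw record untouched, and returns a log entry with operation counts, insertion counts, execution time, and success or failure.

```python
import time
import uuid
from datetime import datetime, timezone


def load_batch(records, sink):
    """Append records to sink with audit metadata and return a load log entry.

    Hypothetical helper; `sink` is a plain list standing in for a real store.
    """
    batch_id = str(uuid.uuid4())
    started = time.perf_counter()
    inserted = 0
    failed = 0

    for record in records:
        try:
            if not isinstance(record, dict):
                raise TypeError("record must be a dict")
            now = datetime.now(timezone.utc).isoformat()
            sink.append({
                "raw": record,          # untransformed record, kept as-is
                "batch_id": batch_id,   # ties the row back to this load
                "created_at": now,
                "modified_at": now,
            })
            inserted += 1
        except TypeError:
            # Count the failure and continue with the rest of the batch.
            failed += 1

    return {
        "batch_id": batch_id,
        "operation_count": inserted + failed,
        "insertion_count": inserted,
        "failure_count": failed,
        "execution_seconds": time.perf_counter() - started,
        "succeeded": failed == 0,
    }
```

Calling `load_batch([{"id": 1}, {"id": 2}], sink)` with an empty list as `sink` leaves two wrapped records in it and returns a log entry that could be written to whatever logging store the requirements call for.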
Another important consideration is GDPR. Does GDPR even apply? Is there certain data that needs to be removed? Is there a requirement that certain data be removable in the future? All important questions to ask.
Another key consideration when designing data flows is metadata. If you are going to be analyzing the data eventually, what metadata, if any, is necessary to facilitate that analysis? Additionally, you should consider what business intelligence tools are going to be used to analyze that data, because this will often determine, in some part, how that data is stored and in what format it's stored.
As you work through all the data flow design requirements, it's important that you also consider the existing skillsets that are available. Because there are numerous technologies that come into play when designing and deploying a data flow, it makes sense to consider whether you have the in-house skillsets to support what you're designing.
Think about it: if you're going to design a data flow that leverages Python or T-SQL, what good is it if there's nobody in the organization who knows Python or T-SQL? When deciding on what technologies to leverage in your data flow design, think about your team and what skillsets they possess. You might also consider using pre-existing packages that you already have. For example, if you're already using SQL Server Analysis Services and SSIS packages, you probably already have data flows and controls defined. It might make more sense to just bring those pre-existing solutions to the cloud, instead of reinventing the wheel with technologies that are foreign to your team.
As you define all the requirements for your data flow design, think about the interdependencies of all the different components. This is important because you're going to need to make all these interdependent technologies work together, much like the conductor of a symphony orchestra. If you can't make the chosen technologies work together, you have no data flow.
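One way to reason about those interdependencies is to write them down and let a topological sort tell you in what order the components have to come together. The sketch below uses Python's standard `graphlib` module (Python 3.9 or later). The helper `plan_order` and the component names in the usage note are made up for illustration and do not refer to specific Azure services.

```python
from graphlib import CycleError, TopologicalSorter


def plan_order(dependencies):
    """Return components ordered so each one follows everything it depends on.

    `dependencies` maps a component name to the set of components it needs.
    Hypothetical helper for illustration only.
    """
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(
            f"circular dependency between components: {exc.args[1]}"
        ) from exc
```

For a simple chain such as `{"ingest": set(), "raw_storage": {"ingest"}, "transform": {"raw_storage"}, "reporting": {"transform"}}`, `plan_order` returns the components from `ingest` through to `reporting`. A `ValueError` signals that two or more components depend on each other in a loop, which is worth catching at design time.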
About the Author
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long and interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.