Designing Data Flows in Azure
Data Flow Basics
Designing a Data Flow Solution
The course is part of these learning pathsSee 2 more
This Designing Data Flows in Azure course will enable you to implement the best practices for data flows in your own team. Starting from the basics, you will learn how data flows work from beginning to end. Though we do recommend an idea of what data flows are and how they are used, this course contains some demonstration lectures to really make sure you have got to grips with the concept. By better understanding the key components available in Azure to design and deploy efficient data flows, you will be allowing your organization to reap the benefits.
This course is made up of 19 comprehensive lectures including an overview, demonstrations, and a conclusion.
- Review the features, concepts, and requirements that are necessary for designing data flows
- Learn the basic principles of data flows and common data flow scenarios
- Understand how to implement data flows within Microsoft Azure
- IT professionals who are interested in obtaining an Azure certification
- Those looking to implement data flows within their organizations
- A basic understanding of data flows and their uses
Related Training Content
For more training content related to this course, visit our dedicated MS Azure Content Training Library.
When designing data flows, Azure SQL Data warehouse is likely to be situated at the end of the line, where ingested and processed data is stored, so that it can be later analyzed. When processed data is ready for analysis, SQL Data Warehouse uses PolyBase to query the big data stores where the processed data lives and uses standard T-SQL queries to pull the data into SQL Data Warehouse. This scaled-out version of SQL server focuses on providing Enterprise Data Warehouse functionality by leveraging massively parallel processing or MPP. While it does provide SQL compatibility, Enterprise Data Warehouse doesn't do everything that SQL can do. For example, Azure SQL data warehouse doesn't support foreign key constraints. That said, it's not really meant to be used like SQL Server is. What SQL Data Warehouse does is store data into relational tables using columnar storage. Doing so vastly reduces the data storage costs while improving query performance. Data that is stored in SQL Data Warehouse can be analyzed at massive scale. When compared to other, traditional, database systems, SQL Data Warehouse analysis queries finish in seconds or minutes, instead of hours or days. When designing a data warehouse with Azure SQL Data Warehouse, begin by loading data into a service such as Azure Data Lake Store or Azure Blob storage. From there, you'll want to use PolyBase to load the data into a staging table in SQL Data Warehouse. Next, choose a table distribution strategy. You can configure distributed tables or replicated tables.
A distributed table appears as a single table. However, its rows are actually stored across 60 different distributions and they are distributed with a hash or round-robin algorithm. A replicated table, however, makes full copy of the table accessible on each Compute node, of which there can be as many as 60. By replicating a table, you remove the need to actually transfer data among the compute nodes prior to a join or aggregation. Replicated tables work best in cases where the table size is less than two gig compressed. Next, you'll need to index your table so it can be read quickly. You can leverage one of three types of indexing depending on your needs. The heap type of index is good for staging and temporary tables, as well as small tables with small lookups.
A Clustered Index is better for tables with up to 100 million rows and for tables with more than 100 million rows, but with only one or two columns that are heavily used. A Clustered Columnstore Index, or CCI, is best for large tables with more than 100 million rows. You might also want to partition a table when a large fact table is involved. A large fact table would be one with more than 1 billion rows. In almost all cases, a partition key should be based on date. It's important to note, however, that you do not over-partition, especially in cases where you are using a clustered columnstore index. While it's not necessary, incrementally loading the data minimizes disruption to business users. If you do choose to incrementally load data, ensure that you allocate larger resource classes to loading your data. Microsoft recommends using PolyBase and ADF V2 for automating ELT pipelines into SQL Data Warehouse. It's important to remember that if you know in advance what the primary operations and queries are going to be run in the data warehouse, you can build a design that prioritizes the data warehouse architecture for such operations.
About the Author
Tom is a 25+ year veteran of the IT industry, having worked in environments as large as 40k seats and as small as 50 seats. Throughout the course of a long an interesting career, he has built an in-depth skillset that spans numerous IT disciplines. Tom has designed and architected small, large, and global IT solutions.
In addition to the Cloud Platform and Infrastructure MCSE certification, Tom also carries several other Microsoft certifications. His ability to see things from a strategic perspective allows Tom to architect solutions that closely align with business needs.
In his spare time, Tom enjoys camping, fishing, and playing poker.