Exploring Data Lineage

In this course, users will explore the suite of tools available in Microsoft Purview for registering and scanning data sources, connecting a business glossary, searching the data catalog, and customizing metadata with enrichments and classifications. In addition, this course will review some of the management and administrative functionality in Purview, including creating roles, managing authorizations, and using the Apache Atlas API for custom implementations. This course will also review deployment best practices and network security considerations. By completing this course, users will have a strong understanding of the suite of functionality currently available in Purview and how these tools support a larger governance initiative within an organization.  

Learning Objectives

  • Provision and install Microsoft Purview
  • Create and manage a role
  • Register and scan data sources
  • Create a business glossary
  • Enrich metadata with classifications
  • Review data lineage tooling
  • Understand deployment best practices
  • Take network security considerations into account

Intended Audience

This course is designed for individuals who are responsible for setting up, monitoring, or exploring data catalog and governance programs within their organization.  


Prerequisites

To get the most from this course, you should have some familiarity and experience with governance tooling as well as a basic understanding of the Azure portal.


Exploring Data Lineage. Data lineage is broadly understood as the lifecycle of data, spanning from its origin to where it moves over time across the data estate. It is used for different kinds of backward-looking scenarios such as troubleshooting, tracing root cause in data pipelines, and debugging. Lineage is also used for data quality analysis, compliance, and 'what if' scenarios, often referred to as impact analysis.

Lineage is represented visually to show how data moves from source to destination, including how the data was transformed. Given the complexity of most enterprise data environments, these views can be hard to understand without some consolidation or masking of peripheral data points. The Microsoft Purview data catalog connects with other data processing, storage, and analytics systems to extract lineage information. The information is combined to represent a generic, scenario-specific lineage experience in the catalog.

A data estate may include systems that perform data extraction, transformation, analytics, and visualization. Each of these systems captures rich static and operational metadata that describes the state and quality of the data within the system's boundary. The goal of lineage in the data catalog is to extract the movement, transformation, and operational metadata from each data system at the lowest grain possible. Lineage is represented as a graph. Typically, it contains source and target entities in data storage systems that are connected by a process invoked by a compute system. Data systems connect to the data catalog to generate and report a unique object that references the physical object of the underlying data system, such as a SQL stored procedure or a notebook.
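The graph model described above can be sketched in a few lines of Python. This is only an illustration, not how Purview stores lineage: the `LineageGraph` class, its method names, and the node identifiers such as `table:raw_sales` are all hypothetical. It shows source and target entities connected through a process, and how walking the graph forward answers an impact analysis question while walking it backward answers a root cause question.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: datasets and processes are nodes, data flow is an edge.

    Hypothetical structure for illustration only.
    """

    def __init__(self):
        self._downstream = defaultdict(set)  # node -> nodes it feeds
        self._upstream = defaultdict(set)    # node -> nodes it reads from

    def add_process(self, process, inputs, outputs):
        """Connect source entities to target entities through a process."""
        for source in inputs:
            self._downstream[source].add(process)
            self._upstream[process].add(source)
        for target in outputs:
            self._downstream[process].add(target)
            self._upstream[target].add(process)

    @staticmethod
    def _walk(start, edges):
        """Breadth-first traversal returning every node reachable from start."""
        seen = set()
        queue = deque([start])
        while queue:
            node = queue.popleft()
            for nxt in edges.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        seen.discard(start)
        return seen

    def impact_of(self, node):
        """Forward-looking: everything that could be affected by a change to node."""
        return self._walk(node, self._downstream)

    def origins_of(self, node):
        """Backward-looking: everything that node was derived from."""
        return self._walk(node, self._upstream)


# Hypothetical data estate: a load job feeds a cleaned table, which feeds a report.
graph = LineageGraph()
graph.add_process("proc:load_sales", ["table:raw_sales"], ["table:clean_sales"])
graph.add_process(
    "proc:build_report",
    ["table:clean_sales", "table:customers"],
    ["report:sales_dashboard"],
)

print(sorted(graph.impact_of("table:raw_sales")))
print(sorted(graph.origins_of("report:sales_dashboard")))
```

Keeping both an upstream and a downstream index makes either direction of traversal a simple breadth-first search, at the cost of storing each edge twice.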

High-fidelity lineage, with additional metadata such as ownership, is captured to show the lineage between source and target entities in a human-readable format. Lineage is a critical feature of the Microsoft Purview data catalog to support quality, trust, and audit scenarios. The goal of a data catalog is to build a robust framework where all the data systems within our environment can naturally connect and report lineage. Once the metadata is available, the data catalog can bring together the metadata provided by data systems to power data governance use cases.
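As a rough illustration of what "reporting lineage" can look like for a custom integration, the sketch below builds a process entity in the style of the Apache Atlas v2 entity format, where a process lists its `inputs` and `outputs` by qualified name. The helper names, the qualified names, the use of the generic `DataSet` and `Process` types, and the owner value are assumptions chosen for the example; a real Purview account would use its own type names, and the payload would still need to be submitted through an authenticated API call, which is not shown here.

```python
import json


def dataset_ref(type_name, qualified_name):
    """Reference an existing catalog entity by its unique qualified name."""
    return {
        "typeName": type_name,
        "uniqueAttributes": {"qualifiedName": qualified_name},
    }


def build_process_entity(name, qualified_name, inputs, outputs, owner=None):
    """Build an Atlas-style process entity linking source and target entities.

    Hypothetical helper: the dictionary layout follows the Atlas v2 entity
    format, but the values are illustrative only.
    """
    attributes = {
        "name": name,
        "qualifiedName": qualified_name,
        "inputs": inputs,
        "outputs": outputs,
    }
    if owner is not None:
        # Extra metadata such as ownership makes the lineage easier to read.
        attributes["owner"] = owner
    return {"entity": {"typeName": "Process", "attributes": attributes}}


# Hypothetical stored procedure that reads one table and writes another.
payload = build_process_entity(
    name="usp_load_clean_sales",
    qualified_name="example://sales-db/procedures/usp_load_clean_sales",
    inputs=[dataset_ref("DataSet", "example://sales-db/tables/raw_sales")],
    outputs=[dataset_ref("DataSet", "example://sales-db/tables/clean_sales")],
    owner="data-engineering",
)

print(json.dumps(payload, indent=2))
```

Referencing inputs and outputs by qualified name, rather than by an internal identifier, lets a data system describe lineage using names it already knows.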


About the Author

Steve is an experienced Solutions Architect with over 10 years of experience serving customers in the data and data engineering space. He has a proven track record of delivering solutions across a broad range of business areas that increase overall satisfaction and retention. He has worked across many industries, both public and private, and found many ways to drive the use of data and business intelligence tools to achieve business objectives. He is a persuasive communicator, presenter, and quite effective at building productive working relationships across all levels in the organization based on collegiality, transparency, and trust.