Registering Sources and Scanning Metadata

In this course, users will explore the suite of tools available in Microsoft Purview for registering and scanning data sources, connecting a business glossary, searching the data catalog, and customizing metadata with enrichments and classifications. In addition, this course will review some of the management and administrative functionality in Purview, including creating roles, managing authorizations, and using the Apache Atlas API for custom implementations. This course will also review deployment best practices and network security considerations. By completing this course, users will have a strong understanding of the suite of functionality currently available in Purview and how these tools support a larger governance initiative within an organization.  

Learning Objectives

  • Provision and install Microsoft Purview
  • Create and manage a role
  • Register and scan data sources
  • Create a business glossary
  • Enrich metadata with classifications
  • Review data lineage tooling
  • Understand deployment best practices
  • Take network security considerations into account

Intended Audience

This course is designed for individuals who are responsible for setting up, monitoring, or exploring data catalog and governance programs within their organization.  


Prerequisites

To get the most from this course, you should have some familiarity and experience with governance tooling as well as a basic understanding of the Azure portal.


Register Sources and Scan Metadata. Purview supports registering a number of Azure-based and non-Azure-based sources. These sources are added to collections, which are used to organize assets and sources according to our business's flow. Collections are also the tool used to manage access across Microsoft Purview. Here are some of the data sources currently available in Purview. To register a source, select Register from the data map, then select an available source.
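To make the relationship between sources and collections concrete, here is a minimal in-memory sketch in Python. It is not the Purview API: the `Collection` and `Source` classes, the parent/child nesting, and the example names ("Contoso", "Finance", "sales-db") are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Source:
    """A registered data source (hypothetical model, not a Purview type)."""
    name: str
    kind: str  # free-text label for the source type; values here are illustrative


@dataclass
class Collection:
    """A named grouping of sources, optionally nested under a parent."""
    name: str
    parent: Optional["Collection"] = None
    sources: list = field(default_factory=list)

    def register(self, source: Source) -> None:
        """Add a source to this collection."""
        self.sources.append(source)

    def path(self) -> str:
        """Return the collection's position in the hierarchy, root first."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return "/".join(reversed(parts))


root = Collection("Contoso")
finance = Collection("Finance", parent=root)
finance.register(Source("sales-db", "AzureSqlDatabase"))

print(finance.path())                      # Contoso/Finance
print([s.name for s in finance.sources])   # ['sales-db']
```

Because access is managed per collection, grouping sources this way also decides who can see them, which is why the choice of collection matters at registration time.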

Registering adds the source to the selected collection. After data sources are registered in our Microsoft Purview account, the next step is to scan them. The scanning process establishes a connection to the data source and captures technical metadata such as names, file sizes, and columns. It also extracts the schema for structured data sources, applies classifications to schemas, and applies sensitivity labels if our Purview data map is connected to the Microsoft Purview compliance portal. A scan can be triggered to run immediately or scheduled to run periodically to keep our Purview account up to date. Microsoft Purview is secure by default.
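As a rough illustration of what a scan collects, the sketch below reads a small CSV document and records its name, size, and columns, then tags columns whose names match a pattern. It only mimics the idea described above; the function `scan_csv`, the `CLASSIFICATION_PATTERNS` table, and the "Email Address" label are hypothetical, and real Purview classifications are considerably richer than a column-name match.

```python
import csv
import io
import re

# Hypothetical classification rules: label -> pattern tested against column names.
CLASSIFICATION_PATTERNS = {
    "Email Address": re.compile(r"e[-_]?mail", re.IGNORECASE),
}


def scan_csv(name, content):
    """Collect technical metadata and a simple schema from CSV text."""
    # The first row of the CSV is treated as the schema (column names).
    header = next(csv.reader(io.StringIO(content)), [])
    columns = [
        {
            "name": col,
            "classifications": [
                label
                for label, pattern in CLASSIFICATION_PATTERNS.items()
                if pattern.search(col)
            ],
        }
        for col in header
    ]
    return {
        "name": name,
        "size_bytes": len(content.encode("utf-8")),
        "columns": columns,
    }


asset = scan_csv("customers.csv", "id,email,city\n1,a@example.com,Oslo\n")
for column in asset["columns"]:
    print(column["name"], column["classifications"])
```

Only metadata leaves the function: the column names, the size, and the labels. The row values themselves are not kept, which mirrors the distinction between technical metadata and the underlying data.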

No passwords or secrets are stored directly in Purview, so we'll need to choose an authentication method for our sources. There are a few possible ways to authenticate our Purview account to a source, but not every method is supported for every data source. These authentication methods include Managed Identity, Service Principal, SQL Authentication, Account Key, and Basic Authentication. When scanning a source, we can scan the entire data source or choose only specific entities, such as folders or tables. The available options depend on the source we're scanning and can be defined for both one-time and scheduled scans. A scan rule set determines the kinds of information a scan will look for when it's running against one of our sources.
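Scoping a scan and then applying a scan rule set can be thought of as two filters applied in sequence. The sketch below is an illustration under that assumption, not a description of how Purview implements scans; `ScanRuleSet`, `select_scope`, `apply_rule_set`, and the file paths are hypothetical names.

```python
from dataclasses import dataclass
from pathlib import PurePosixPath


@dataclass(frozen=True)
class ScanRuleSet:
    """Hypothetical rule set: only the file extensions the scan should read."""
    file_types: frozenset


def select_scope(paths, selected_folders):
    """Keep paths under a selected folder; an empty selection means the whole source."""
    if not selected_folders:
        return list(paths)
    folders = [PurePosixPath(f) for f in selected_folders]
    return [
        p for p in paths
        if any(folder in PurePosixPath(p).parents for folder in folders)
    ]


def apply_rule_set(paths, rule_set):
    """Keep only the files whose extension the rule set includes."""
    return [
        p for p in paths
        if PurePosixPath(p).suffix.lower() in rule_set.file_types
    ]


all_paths = ["/raw/sales/2024.csv", "/raw/sales/notes.txt", "/raw/hr/staff.csv"]
scoped = select_scope(all_paths, ["/raw/sales"])
to_scan = apply_rule_set(scoped, ScanRuleSet(frozenset({".csv"})))
print(to_scan)  # ['/raw/sales/2024.csv']
```

Keeping the two steps separate reflects the text: the scope says where the scan looks, while the rule set says what it looks for once it is there.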

The rules available depend on the kind of source we're scanning, but include things like the file types to scan and the classifications to apply. There are over 200 built-in system classifications to select from, and we can also create custom classifications. For a catalog to know whether a file, table, or container was deleted, it compares the last scan output against the current scan output. For example, suppose that the last time we scanned an Azure Data Lake Storage Gen2 account, it included a folder named Folder 1. When the same account is scanned again, Folder 1 is missing. Therefore, the catalog assumes the folder has been deleted.
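The comparison in the Folder 1 example amounts to a set difference between two scan outputs. Here is a minimal sketch, assuming each scan output is simply a list of asset paths; the function name `detect_deleted` and the paths are hypothetical.

```python
def detect_deleted(previous_scan, current_scan):
    """Return assets present in the previous scan but absent from the current one."""
    return sorted(set(previous_scan) - set(current_scan))


previous = ["/Folder1", "/Folder2", "/Folder2/data.csv"]
current = ["/Folder2", "/Folder2/data.csv"]

print(detect_deleted(previous, current))  # ['/Folder1']
```

Note that the result depends on both scans covering the same scope: anything left out of the current scan would look deleted under this comparison.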


About the Author

Steve is an experienced Solutions Architect with over 10 years of experience serving customers in the data and data engineering space. He has a proven track record of delivering solutions across a broad range of business areas that increase overall satisfaction and retention. He has worked across many industries, both public and private, and found many ways to drive the use of data and business intelligence tools to achieve business objectives. He is a persuasive communicator and presenter, and is quite effective at building productive working relationships across all levels of the organization based on collegiality, transparency, and trust.