Training a Classifier

Data protection is probably the central area of concern regarding system security. The proliferation of online systems means tension between data privacy and usability. The key to a usable but safe data environment is knowing what level of protection needs to be applied to different data, that is, how to classify data. In the past, this has been a predominantly manual and subjective exercise. As data volumes have expanded exponentially, there is a need for automated data classification systems. This course looks at the data classification technologies available through the Microsoft 365 compliance portal.

Learning Objectives

  • Get an overview of document protection and data classification
  • Learn how to create a sensitive information type
  • Learn how to implement Exact Data Matching
  • Learn about trainable classifiers
  • See how to view classified data with Content Explorer

Intended Audience

  • Students working towards the MS-101 Microsoft 365 Mobility and Security exam
  • Those wanting to learn about data classification and how it's implemented in the Microsoft 365 compliance environment



I intended this lecture to be a demonstration, but for a couple of reasons, that's not going to be the case. Microsoft says you need an E5 subscription to use the trainable classifiers functionality, but you can use the Compliance E5 add-on trial instead, which is the option I chose. When I installed the add-on and went into trainable classifiers, there was a message saying it would take 7 to 14 days to assess my OneDrive, Exchange, and Teams content.

The create a trainable classifier button was disabled while my files were being assessed using the predefined classifiers. After about ten days, the process had finished, and the create a trainable classifier button was enabled. However, when I went to create a trainable classifier, Microsoft Purview told me I needed the compliance add-on I already had. This issue is now with Microsoft support. As we'll see, creating a trainable classifier takes a long time even in a best-case scenario, so a live demonstration may never have been realistic. Let's press on with the non-demo demonstration, where I show you the steps for creating a trainable classifier. From a user interface point of view, the only things missing are filling in a name and description field and selecting a SharePoint folder.

Say you're working in a charitable organization that helps dogs with gambling addiction. You need to create a dog gambling classifier to determine if any dogs working in the charity are passing triggering images around. The process for teaching Microsoft 365 to recognize gambling dogs or any other complex and ill-defined data type is the same as for most machine learning exercises. You first need to define the machine learning model, and then you train the model.

The first step is to collate 50 to 500 good examples of the content you want to identify. This is referred to as positive seed data: positive, because they are examples of what you're looking for, and seed, as they are the starting point for the machine learning process. The seed data must be placed in an online SharePoint folder so it can be crawled by the machine learning algorithm. With the seed data in place, create the classifier, giving it a name and a description. Next, select the SharePoint location where the seed data is stored. This involves adding the SharePoint site, then selecting the folder containing the seed data. Once you've added the seed location, click Next and then click Create trainable classifier. Processing the seed data typically takes up to two hours, but Microsoft says to allow up to 24 hours for seed processing to complete.
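Before uploading, it can help to sanity-check the size of the seed set against the 50 to 500 range mentioned above. The following is a minimal local sketch, not part of Microsoft 365 or any Microsoft API; the function names and messages are hypothetical, and it assumes the seed files sit in a local folder before being copied to SharePoint.

```python
from pathlib import Path

# Bounds quoted in the lecture for positive seed data.
SEED_MIN = 50
SEED_MAX = 500


def check_seed_count(n_files: int) -> str:
    """Return 'ok' or a short hint when the seed set is outside 50..500."""
    if n_files < SEED_MIN:
        return f"too few: add at least {SEED_MIN - n_files} more"
    if n_files > SEED_MAX:
        return f"too many: remove at least {n_files - SEED_MAX}"
    return "ok"


def count_files(folder: Path) -> int:
    """Count regular files directly inside a local staging folder (hypothetical layout)."""
    return sum(1 for p in folder.iterdir() if p.is_file())
```

For example, `check_seed_count(count_files(Path("seed_staging")))` would report whether a local staging folder (a made-up name) holds a usable number of samples.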

Now that you've told Microsoft 365 what your classifier "looks" like, you need to train the model. The training step involves more positive samples, different from the seed data, plus negative samples, that is, examples of what the classified content is not. When it comes to training the model, the more samples, the better. In fact, Microsoft says, "prepare up to 10,000 positive and negative samples." As with the seed data, you need to place the samples in a SharePoint folder so they can be crawled by the ML model.
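Because the training samples should differ from the seed data, one way to catch accidental duplicates before uploading is to compare file contents by hash. This is only an illustrative local check under the assumption that "different" means "not byte-identical"; the function names are hypothetical and nothing here calls Microsoft 365.

```python
import hashlib


def content_digest(data: bytes) -> str:
    """SHA-256 digest of a file's raw bytes."""
    return hashlib.sha256(data).hexdigest()


def overlapping_samples(seed: dict[str, bytes], training: dict[str, bytes]) -> list[str]:
    """Return names of training files whose content also appears in the seed set.

    Both arguments map a file name to its bytes (an assumed in-memory layout).
    """
    seed_digests = {content_digest(blob) for blob in seed.values()}
    return sorted(
        name for name, blob in training.items()
        if content_digest(blob) in seed_digests
    )
```

An empty result means no training file is a byte-for-byte copy of a seed file; it says nothing about near-duplicates.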

The training folder must not be the seed folder. After the model has evaluated whether each sample is relevant to the classifier you're trying to define, you need to review how the model has classified the samples. You "tell" the model whether you agree or disagree with its assessment of each sample datum. The classifier accuracy score updates dynamically for every 30 file predictions you review. Once you've finished reviewing your sample data, or you think you've reviewed enough (after all, reviewing up to 10,000 samples is no small task), you can analyze the review process.
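To make the review loop concrete, here is a small sketch of a score that refreshes after every 30 reviewed predictions. How Microsoft actually computes the classifier accuracy score is not described in this lecture, so the formula below (the share of predictions the reviewer agreed with, as a percentage) is an assumption, and the function name is hypothetical.

```python
REVIEW_BATCH = 30  # the lecture says the score updates every 30 reviewed predictions


def running_accuracy(agreements: list[bool], batch: int = REVIEW_BATCH) -> list[float]:
    """Return the cumulative agreement percentage after each full batch of reviews.

    agreements[i] is True when the reviewer agreed with the model's prediction.
    Assumed formula: 100 * agreed / reviewed. Not Microsoft's documented method.
    """
    scores: list[float] = []
    agreed = 0
    for reviewed, agrees in enumerate(agreements, start=1):
        agreed += int(agrees)
        if reviewed % batch == 0:
            scores.append(100.0 * agreed / reviewed)
    return scores
```

With fewer than 30 reviews the list is empty, which mirrors the idea that the score only moves once a full batch has been reviewed.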

The analysis will give you a classifier accuracy score and tell you if the model is stable enough to be published for use. There are two points I want to emphasize here. First, you don't need to have 10,000 training samples. As I said, the more the better, but sample quality and diversity are just as important. Just to be clear, by diversity, I mean samples on a spectrum from very good examples of the classifier, through samples that could be interpreted either way, to samples that are clearly not the classifier. The other point is that accuracy isn't cut and dried, but is measured as the probability, expressed as a percentage, that the model's prediction is correct.

About the Author

Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile and desktop technologies, he believes good quality, reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.