Data protection is probably the central area of concern in system security. The proliferation of online systems creates tension between data privacy and usability. The key to a usable but safe data environment is knowing what level of protection needs to be applied to different data, that is, how to classify data. In the past, this has been a predominantly manual and subjective exercise. As data volumes have grown rapidly, there is a need for automated data classification systems. This course looks at the data classification technologies available through the Microsoft 365 compliance portal.

Learning Objectives

  • Get an overview of document protection and data classification
  • Learn how to create a sensitive information type
  • Learn how to implement Exact Data Matching
  • Learn about trainable classifiers
  • See how to view classified data with Content Explorer

Intended Audience

  • Students working towards the MS-101 Microsoft 365 Mobility and Security exam
  • Those wanting to learn about data classification and how it's implemented in the Microsoft 365 compliance environment



While this course is titled classifying data, it's probably more correct to think of it as classifying and labeling information. Data itself is content agnostic: at a binary level, it has no meaning. Traditionally, that has meant protecting data based on its location after someone has classified its contents. In this way, "special" data is treated like any other precious item. We put it in a "special" place like a safe or a vault. While this is a tried-and-true data protection method, it has several significant drawbacks, especially when dealing with large volumes of data. First of all, it's completely manual. Someone has to view the data, decide its classification, and then move it to a "special" location. Here we hit another issue.

Do we have enough vaults or locations for the different protected data types? Or will we just mix and mingle sensitive data and put it all in the same place? You're sensitive, you're sensitive, and you're sensitive. I don't care that you're sensitive for different reasons – just get in there. Another issue with protected locations or vaults is that it is an all-or-nothing affair. Once the data is in the protected location, users need access to the location to view it, which is the whole point of the exercise – right? Well, yes and no. Sometimes you want the protection to be a little bit porous. What do I mean by that? Data masking, such as credit card masking, is a good example of this. Enough of the credit card number is visible to a call center operator to confirm whether it belongs to a user without exposing all the credit card details.
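To make the masking idea concrete, here is a minimal sketch in Python. The function name `mask_card_number` and the choice to leave the last four digits visible are assumptions for illustration, not how any particular product implements masking.

```python
def mask_card_number(card_number: str, visible: int = 4, mask_char: str = "*") -> str:
    """Mask every digit except the last `visible` ones.

    Separators such as spaces and hyphens are kept, so the masked
    value keeps the familiar shape of a card number.
    """
    total_digits = sum(ch.isdigit() for ch in card_number)
    digits_to_mask = max(total_digits - visible, 0)

    masked = []
    for ch in card_number:
        if ch.isdigit() and digits_to_mask > 0:
            masked.append(mask_char)
            digits_to_mask -= 1
        else:
            masked.append(ch)
    return "".join(masked)


# A well-known test card number, not a real account.
print(mask_card_number("4111 1111 1111 1111"))  # prints "**** **** **** 1111"
```

The operator sees enough to confirm the card with the customer, while the full number never leaves the protected store.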

Credit card masking exemplifies the next step in the evolution of data classification. Classification can be automated based on the data's format. So that's things like credit cards, phone numbers, social security numbers, etc. This removes much of the manual leg work for easily identified data types that conform to well-defined formats. 
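As a rough illustration of format-based classification, the sketch below finds credit card candidates with a regular expression and then confirms them with the Luhn checksum that card numbers use. The pattern and the function names are my own assumptions; real classifiers apply more rules than this.

```python
import re

# Candidate pattern (an assumption for this sketch): 13 to 19 digits,
# optionally separated by single spaces or hyphens.
CARD_CANDIDATE = re.compile(r"\b\d(?:[ -]?\d){12,18}\b")


def luhn_valid(digits: str) -> bool:
    """Return True if the digit string passes the Luhn checksum."""
    total = 0
    for index, ch in enumerate(reversed(digits)):
        value = int(ch)
        if index % 2 == 1:  # double every second digit, counting from the right
            value *= 2
            if value > 9:
                value -= 9
        total += value
    return total % 10 == 0


def find_card_numbers(text: str) -> list[str]:
    """Return the digit strings that look like cards and pass the checksum."""
    found = []
    for match in CARD_CANDIDATE.finditer(text):
        digits = re.sub(r"[ -]", "", match.group())
        if luhn_valid(digits):
            found.append(digits)
    return found


sample = "Card: 4111 1111 1111 1111, order ref 1234 5678 9012 3456."
print(find_card_numbers(sample))  # prints ['4111111111111111']
```

The order reference has the right shape but fails the checksum, which is why a format check plus a validation step produces fewer false positives than a pattern alone.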

Microsoft 365 and Azure have greatly expanded data classification and, by extension, data protection through two new technologies. The first is labeling. Instead of using a file's location to signify something special about it, Microsoft uses a label embedded in the document, file, or email. Unlike file attributes linked to the file system, using an embedded label means the document's "sensitivity" becomes an intrinsic attribute. If you're anything like me, you don't believe in magic and may be wondering how labeling technically works. Office documents and many other Microsoft document types are stored as packages of XML files. Labels are embedded as XML nodes within the file.
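If you want to see the "no magic" part for yourself, here is a hedged sketch of reading XML metadata out of an Office-style package. It builds a tiny ZIP in memory with a `docProps/custom.xml` part and reads the custom properties back. The property name `ExampleLabel_Name` and its value are made up for illustration, and I'm assuming label metadata surfaces as custom document properties; check a labeled file of your own to see the actual names used.

```python
import io
import zipfile
import xml.etree.ElementTree as ET

CUSTOM_PROPS_PART = "docProps/custom.xml"
NS = {
    "cp": "http://schemas.openxmlformats.org/officeDocument/2006/custom-properties",
    "vt": "http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes",
}

# Hypothetical label metadata, for illustration only.
SAMPLE_CUSTOM_XML = """<?xml version="1.0" encoding="UTF-8"?>
<Properties xmlns="http://schemas.openxmlformats.org/officeDocument/2006/custom-properties"
            xmlns:vt="http://schemas.openxmlformats.org/officeDocument/2006/docPropsVTypes">
  <property fmtid="{D5CDD505-2E9C-101B-9397-08002B2CF9AE}" pid="2" name="ExampleLabel_Name">
    <vt:lpwstr>Confidential</vt:lpwstr>
  </property>
</Properties>
"""


def build_sample_package() -> bytes:
    """Create a minimal ZIP that stands in for an Office document."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w") as archive:
        archive.writestr(CUSTOM_PROPS_PART, SAMPLE_CUSTOM_XML)
    return buffer.getvalue()


def read_custom_properties(package: bytes) -> dict[str, str]:
    """Return the string-valued custom properties stored in the package."""
    with zipfile.ZipFile(io.BytesIO(package)) as archive:
        if CUSTOM_PROPS_PART not in archive.namelist():
            return {}
        root = ET.fromstring(archive.read(CUSTOM_PROPS_PART))

    properties = {}
    for prop in root.findall("cp:property", NS):
        value = prop.find("vt:lpwstr", NS)
        properties[prop.get("name")] = value.text if value is not None else ""
    return properties


print(read_custom_properties(build_sample_package()))
```

Because the metadata lives inside the file, it travels with the document when it is copied or emailed, which is the point of labeling over location-based protection.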

The other significant step forward for data classification is using predefined information formats and machine learning. Out of the box, Microsoft 365 includes many predefined sensitive information types.

These sensitive info types cover many common information parsing scenarios you might encounter, like the aforementioned credit card, but also addresses, driver's licenses, and even medical terms. 

Sensitive information types are really just another way of classifying data based on format. In fact, the definition of the address information type says as much. "The primary resources are the patterns of address formats used in a given country." Please take notice of the mapping confidence level. Each of these sensitive information types has a confidence level, which I'm assuming is a way for Microsoft to say, "in our experience, this data classification based on formatting will be right x percent of the time." You might think medium isn't great, but speaking from personal and painful experiences cleaning up addresses in customer databases over the years, medium seems about right. 
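One way to picture a confidence level is as a pattern match that gets more trustworthy when supporting evidence appears nearby. The sketch below is my own simplified illustration of that idea, not Microsoft's rule definitions: the pattern, the keyword list, the 50-character window, and the "low"/"high" labels are all assumptions.

```python
import re
from dataclasses import dataclass

# Primary pattern: something shaped like a US social security number.
SSN_PATTERN = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")

# Hypothetical supporting evidence and proximity window.
SUPPORTING_KEYWORDS = ("ssn", "social security")
PROXIMITY_CHARS = 50


@dataclass
class Detection:
    value: str
    confidence: str


def detect_ssn_like(text: str) -> list[Detection]:
    """Flag SSN-shaped values, with higher confidence when a keyword is nearby."""
    lowered = text.lower()
    detections = []
    for match in SSN_PATTERN.finditer(text):
        start = max(match.start() - PROXIMITY_CHARS, 0)
        end = match.end() + PROXIMITY_CHARS
        window = lowered[start:end]
        has_support = any(keyword in window for keyword in SUPPORTING_KEYWORDS)
        detections.append(Detection(match.group(), "high" if has_support else "low"))
    return detections


print(detect_ssn_like("Employee SSN: 123-45-6789"))
print(detect_ssn_like("Part number 123-45-6789 is back in stock"))
```

The same string of digits scores differently depending on its surroundings, which is why a format match alone rarely deserves full confidence.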

The next iteration in automatic data classification is trainable classifiers. Trainable classifiers use machine learning to train the data classification algorithm to recognize complex data types. For those unfamiliar with machine learning, it involves teaching a software model to recognize a particular thing, like getting a computer to recognize images of cats. A machine learning model is a combination of an algorithm and data. The model is built in three steps: teaching, training, and refinement. Using the cat scenario, the teaching phase tells the model what you are looking for by showing it a small but representative sample set of cat images. The training phase involves many more samples: cat images, images that aren't cats, and images that could be interpreted as cats.

The model classifies each image as a cat, not a cat, or ambiguous. The model's predictions are then scored as correct or not. At this stage, the machine learning model can be published for use. Realistically, models aren't going to be 100% correct all of the time, so the final maintenance phase involves ongoing refinement of the model, correcting and re-training it as required. Azure and Google both have image recognition services, and a computer's ability to correctly identify a picture of a kitten seems fantastic. The thing is, the image recognition algorithms have been trained on or exposed to millions of examples of cats, with a few dogs thrown in, and rewarded for correct answers, kind of like Pavlov's dog training in the 21st century. As smart as computers may seem, they don't know anything, and it's the brute force approach of big data and machine learning that gives the software the veneer of intelligence.
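To show the train-then-classify loop on something smaller than cat pictures, here is a toy text classifier. It is a naive Bayes model with add-one smoothing, which I've swapped in purely for illustration; it is not the algorithm behind Microsoft's trainable classifiers. The class name, the sample documents, and the 0.8 threshold are all assumptions.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    return text.lower().split()


class TinyClassifier:
    """Naive Bayes over word counts, with a three-way outcome."""

    def __init__(self) -> None:
        self.word_counts = {True: Counter(), False: Counter()}
        self.doc_counts = {True: 0, False: 0}

    def train(self, text: str, is_match: bool) -> None:
        self.word_counts[is_match].update(tokenize(text))
        self.doc_counts[is_match] += 1

    def probability(self, text: str) -> float:
        """Estimated probability that the text belongs to the matching class."""
        vocabulary = set(self.word_counts[True]) | set(self.word_counts[False])
        total_docs = sum(self.doc_counts.values())
        log_scores = {}
        for label in (True, False):
            total_words = sum(self.word_counts[label].values())
            score = math.log(self.doc_counts[label] / total_docs)
            for word in tokenize(text):
                # Add-one smoothing so unseen words don't zero out the score.
                count = self.word_counts[label][word] + 1
                score += math.log(count / (total_words + len(vocabulary)))
            log_scores[label] = score
        return 1 / (1 + math.exp(log_scores[False] - log_scores[True]))

    def classify(self, text: str, threshold: float = 0.8) -> str:
        p = self.probability(text)
        if p >= threshold:
            return "match"
        if p <= 1 - threshold:
            return "no match"
        return "ambiguous"


classifier = TinyClassifier()
for doc in (
    "this agreement is made between the parties",
    "the parties agree to the terms of this contract",
    "termination of the agreement requires written notice",
):
    classifier.train(doc, is_match=True)
for doc in (
    "lunch menu for the team offsite on friday",
    "weekly status update for the project team",
    "reminder to submit your timesheet on friday",
):
    classifier.train(doc, is_match=False)

print(classifier.classify("the parties agree to this agreement"))  # prints "match"
print(classifier.classify("team lunch on friday"))                 # prints "no match"
print(classifier.classify("the agreement for the team"))           # prints "ambiguous"
```

The ambiguous results are where refinement happens: a person reviews them, corrects the label, and feeds them back in with `train`.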

Luckily, you don't have to start from scratch when training data classifiers. There are pre-trained types you can use as is or as starting points to be trained further. These pre-trained classifiers cover a range of documents, from invoices and financial reports through to employment, IP, and legal agreements. Images can also be classified into categories such as adult, racy, gory, and harassment. There is a caveat with image classification. If you recall, I said labels are embedded in a document as an XML node. JPEG, PNG, GIF, and BMP image files aren't in XML format. Only images sent through email and Microsoft Teams channels are subject to labeling, as the enclosing message is tagged rather than the image itself.

About the Author

Hallam is a software architect with over 20 years' experience across a wide range of industries. He began his software career as a Delphi/Interbase disciple but changed his allegiance to Microsoft with its deep and broad ecosystem. While Hallam has designed and crafted custom software utilizing web, mobile, and desktop technologies, he believes good-quality, reliable data is the key to a successful solution. The challenge of quickly turning data into useful information for digestion by humans and machines has led Hallam to specialize in database design and process automation. Showing customers how to leverage new technology to change and improve their business processes is one of the key drivers keeping Hallam coming back to the keyboard.