In this course, we look at how to create custom sensitive information types with tools like Exact Data Match classification, keyword dictionaries, and document fingerprinting.
- Learn how to create and manage custom sensitive information types
- Learn how to use the Exact Data Match classifier
- Learn how to implement Document fingerprinting
- Learn how to create and utilize a keyword dictionary
- This course is designed for anyone looking to keep their data safe within Microsoft 365 using sensitive information types
- A basic understanding of sensitive information types and the key components
- Understanding Sensitive Information Types in Microsoft 365
There are two ways to create a custom sensitive information type. First, you could copy and adjust an existing built-in sensitive information type. This method copies the entire sensitive info type, and as such you may need to do some finessing to make it function properly for your specific use case. However, in situations where the data you are classifying is completely different from what the existing classifiers cover, you can create your own custom sensitive information type from scratch. So, let's go over how to do both as they follow a similar path. If we want to create our own custom sensitive info type, we need to navigate to the data classification solution within Microsoft Purview. We then click on 'Sensitive info types' and we can see all of our current info types. To create our own, we simply click on 'Create sensitive info type' and it will bring us right into the creation tool.
So, let's say that our use case for this would be employee ID numbers. So, we can name this Employee ID Numbers and provide a description stating that it detects employee numbers for our organization. This way we can quickly see what this info type will cover. Once that's set, we can click 'Next' and be brought into the meat and potatoes of this process, the patterns. Patterns define how the sensitive information type detects the information it's designed to detect. We're actually going to create multiple patterns for the same information type to differentiate the confidence levels of found matches. So, let's start off by clicking 'Create pattern' and start creating our first pattern. Here we have the confidence level, the primary element, character proximity, supporting elements, and additional checks. If you are not familiar with these components, like I mentioned earlier, I have linked another course in the course materials section, alongside documentation for you to review, that provides a basis for everything that we're doing in this course.
We're going to make a low confidence pattern first. So, we choose low confidence and then add our primary element. As you can see, there are a few options for primary elements, but since our employee ID numbers are a single letter followed by a hyphen and then six digits, we're going to be using a regular expression. So, I'm going to put in my expression for my employee ID numbers which looks like this. Without getting into too much detail about regex, this depicts the format of our employee ID numbers: a single letter followed by a hyphen and then six digits. For more information on regular expressions, I have linked documentation in the course material section for you to review. Anyway, once that is set, we don't need to add any expression validators, so we click 'Done' and we're good to go. Since this is going to be a low confidence pattern, we won't be adding any supporting elements.
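The transcript does not show the expression that was typed in, so the following is only a minimal sketch of what a regex for "a single letter, a hyphen, then six digits" might look like. The pattern, the `EMPLOYEE_ID` name, and the `find_employee_ids` helper are all assumptions made for illustration, and the sketch uses Python's `re` module rather than the Purview engine, whose regex behavior may differ.

```python
import re

# Hypothetical expression (not taken from the demo): one letter, a hyphen,
# then exactly six digits. The word boundaries stop it from matching
# inside longer tokens such as "AB-1234567".
EMPLOYEE_ID = re.compile(r"\b[A-Za-z]-\d{6}\b")


def find_employee_ids(text: str) -> list[str]:
    """Return every substring of text that has the employee ID format."""
    return EMPLOYEE_ID.findall(text)


sample = "Badge A-123456 was issued; part no. AB-1234567 was not."
print(find_employee_ids(sample))  # prints ['A-123456']
```

Because this pattern checks only the format, anything else in a document that happens to share the format would also match, which is why it suits a low confidence pattern.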
So, from here all we do is click 'Create' and the low confidence pattern has been set. Effectively, this pattern ensures that we get a positive match from anything that is formatted as a single letter followed by a hyphen, followed by six digits. While not necessarily likely, it is possible to get false positives, but it at least ensures that every single employee ID number will be found. If we want to increase the likelihood of an accurate positive match, we can do that with higher confidence levels. For that, we simply need to add supporting elements to the sensitive information type. So, just like we did before, we're going to click 'Create pattern', only this time instead of low confidence, we're going to make a medium confidence pattern. We can add the primary element, only this time instead of having to type out our expression, we can click on existing regular expression, type in employee ID number, and click on that to automatically pull it in. To create a more accurate pattern, we now add supporting elements.
So, we click down here on 'Add supporting elements' and we're going to make a keyword list. Since we're looking for employee IDs, whenever an employee ID number is found, it's likely to be near words like employee or ID, and as such, we can use a keyword list as supporting evidence for this sensitive info type. So, I can name this employee ID number keywords and start adding in words that may be found near an employee ID number. So, I'll put in words like employee, ID number, ID, and more to make sure we get a decent range of keywords. Also, it's worth noting that I entered this list in the case insensitive section. This is important as capitalization may vary across the organization, and entering the list here ensures that these keywords will be found regardless of capitalization. With that, we can click 'Done' and move into proximity. I won't be adjusting this, but this segment of the pattern sets how close the supporting elements must be to the primary element for the pattern to count as a match.
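As a rough illustration of what a case insensitive keyword list does, here is a small Python sketch. The `KEYWORDS` list holds only the three keywords named above (the demo list contained more), the `keyword_hits` helper is hypothetical, and matching on whole words is an assumption of this sketch rather than a statement about how Purview matches keywords internally.

```python
import re

# The three keywords named in the walkthrough; the real list had more entries.
KEYWORDS = ["employee", "ID number", "ID"]


def keyword_hits(text: str, keywords: list[str] = KEYWORDS) -> list[str]:
    """Return the keywords that occur in text as whole words, ignoring case."""
    hits = []
    for keyword in keywords:
        # re.escape keeps any special characters in a keyword literal.
        pattern = r"\b" + re.escape(keyword) + r"\b"
        if re.search(pattern, text, flags=re.IGNORECASE):
            hits.append(keyword)
    return hits


print(keyword_hits("EMPLOYEE id: A-123456"))  # prints ['employee', 'ID']
```

Under the whole-word assumption, a variant such as "IDs" would not count as a hit for the keyword "ID", so plural or abbreviated forms would need their own entries in the list.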
Effectively, if the primary element is found, it then checks whether supporting elements are also found within 300 characters to determine if it's indeed a positive match. You can change this to anywhere in the document, but since it's incredibly likely that you'll see words like employee and ID near the actual ID number, I think this proximity is plenty for this confidence level. So, we can click on 'Create' and we now have a medium confidence level pattern. We aren't going to create a high confidence pattern as it's the same process with more supporting elements, so we're just going to click on 'Next' and choose the recommended confidence level for this info type when someone is making a compliance policy. I'm going to set this as medium for now, but we can always come back and change this if we need to. With that, we hit 'Next', validate this information, and hit 'Create'. And like that, we now have our employee ID sensitive information type. Now let's quickly test it to ensure the accuracy of the pattern. In order to test the pattern, we need to make a document that contains this information.
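The proximity check described above can be sketched in a few lines of Python. This is a simplified model, not Purview's implementation: the regex is a hypothetical stand-in for the one used in the demo, the keyword pattern covers only the three keywords named earlier, and measuring the 300 characters from the edges of the ID match is an assumption.

```python
import re

EMPLOYEE_ID = re.compile(r"\b[A-Za-z]-\d{6}\b")  # hypothetical expression
KEYWORD = re.compile(r"\b(?:employee|ID number|ID)\b", re.IGNORECASE)
PROXIMITY = 300  # characters, the value left unchanged in the walkthrough


def classify(text: str) -> list[tuple[str, str]]:
    """Label each ID 'medium' if a keyword lies within PROXIMITY characters, else 'low'."""
    keyword_spans = [m.span() for m in KEYWORD.finditer(text)]
    results = []
    for match in EMPLOYEE_ID.finditer(text):
        start, end = match.span()
        # A keyword counts if any part of it falls inside the window
        # that extends PROXIMITY characters either side of the ID.
        near = any(
            k_end >= start - PROXIMITY and k_start <= end + PROXIMITY
            for k_start, k_end in keyword_spans
        )
        results.append((match.group(), "medium" if near else "low"))
    return results


text = "Employee ID: A-123456 " + "x" * 400 + " B-654321"
print(classify(text))  # prints [('A-123456', 'medium'), ('B-654321', 'low')]
```

The second ID is reported as low because the only keywords sit more than 300 characters before it, which mirrors why widening or narrowing the proximity setting changes which matches reach the medium confidence level.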
So, I'm going to make a quick Word document with some fake ID numbers in it. You can see here that I have five ID numbers, each following the pattern of a single letter, followed by a hyphen, then followed by six digits. But since these are just the ID numbers with no supporting evidence, the info type would only find a positive match with the low confidence pattern we created earlier. To showcase how the medium confidence pattern would work, I'm going to go ahead and add some random information here, and then put some more ID numbers with the label of employee IDs afterwards. If you recall, the supporting evidence we added included the word employee, so this should register as a positive match for our medium confidence pattern. The reason I added some fluff here was for the proximity feature of the patterns. Had I just thrown in the word employee, all of these would be detected with the medium confidence pattern. But since I added more than 300 characters between the top section and the bottom section, the top section should identify as low confidence while the bottom section should identify as medium confidence. Now that we have that set up, we can save this document and test our info type. We can click into our employee ID numbers sensitive information type and go up to the right corner to click on 'Test'.
From here, we can upload a file to test this info type with, so we go ahead and upload our employee ID numbers test document and then click 'Test'. It will take some time to process, but once it completes, it will show you how many matches it found. This one shows 10 low confidence matches and five medium confidence matches, and even shows the supporting elements alongside the medium confidence matches. We can see that the supporting element was the word employee, but not the word ID. That's because the label in the document uses the plural, IDs, and we never added the plural form to our keyword list, so we should go back and adjust that to get a more accurate model. We don't need to do that just now as it's just a demo, but this is effectively the process for testing these info types. Once we're done, we hit 'Finish' and we can make any additional changes we need to the pattern or just leave it as is, and now we have a fully functioning sensitive information type for our employee ID numbers.
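To make the test result easier to reason about, here is a small Python simulation of the same experiment. Everything in it is illustrative: the regex, the three-entry keyword pattern, the whole-word matching, the sample ID values, and the filler text are assumptions, and the `run_test` helper is hypothetical rather than anything Purview exposes.

```python
import re

EMPLOYEE_ID = re.compile(r"\b[A-Za-z]-\d{6}\b")  # hypothetical expression
KEYWORD = re.compile(r"\b(?:employee|ID number|ID)\b", re.IGNORECASE)
PROXIMITY = 300  # characters


def run_test(text: str) -> dict:
    """Count matches per pattern and collect the keywords that supported them."""
    keywords = [(m.start(), m.end(), m.group().lower()) for m in KEYWORD.finditer(text)]
    low = medium = 0
    supporting = set()
    for match in EMPLOYEE_ID.finditer(text):
        low += 1  # the low confidence pattern needs only the primary element
        nearby = {
            word
            for k_start, k_end, word in keywords
            if k_end >= match.start() - PROXIMITY and k_start <= match.end() + PROXIMITY
        }
        if nearby:  # the medium confidence pattern also needs a nearby keyword
            medium += 1
            supporting |= nearby
    return {"low": low, "medium": medium, "supporting": supporting}


# Five bare IDs, more than 300 characters of filler, then five labeled IDs.
top = "\n".join(f"A-10000{i}" for i in range(1, 6))
filler = "Lorem ipsum dolor sit amet. " * 15  # 420 characters, no keywords
bottom = "Employee IDs\n" + "\n".join(f"B-20000{i}" for i in range(1, 6))
document = top + "\n" + filler + "\n" + bottom

print(run_test(document))  # prints {'low': 10, 'medium': 5, 'supporting': {'employee'}}
```

In this model the counts line up with the demo: all ten IDs satisfy the low confidence pattern, only the five near the label satisfy the medium one, and the plural "IDs" does not count as the keyword "ID".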
Lee has spent most of his professional career learning as much as he could about PC hardware and software while working as a PC technician with Microsoft. Once COVID-19 hit, he moved into a customer training role with the goal of getting as many people as possible prepared for remote work using Microsoft 365. Being both Microsoft 365 certified and a self-proclaimed Microsoft Teams expert, Lee continues to expand his knowledge by working through the wide range of Microsoft certifications.