Classifying & Protecting Data



Amazon Macie was launched in the summer of 2017, much to the delight of cloud security engineers. Amazon Macie is a powerful security and compliance service that provides an automatic method to detect, identify, and classify data within your AWS account. Macie currently supports Amazon S3 storage; however, support for other storage systems will be developed and added over time. Backed by machine learning, Macie can actively review your data as different actions are taken within your AWS account. Machine learning spots access patterns and analyzes user behaviour using CloudTrail event data to alert on any unusual or irregular activity. Any findings are presented within a dashboard, which can trigger alerts, allowing you to quickly resolve any potential threat of exposure or compromise to your data.

This course will dive into all elements of the service, discussing its many features and customizable elements so that you can get the most out of its capabilities.

Learning Objectives

By the end of this course you will be able to:

  • Provide an understanding and awareness of what Amazon Macie is and what it’s used for
  • Provide an explanation of each configurable component of the service to allow you to gain maximum benefit from Macie’s capabilities
  • Understand how the service can provide a customizable approach to maintaining compliance
  • Understand how, through automation and machine learning, Amazon Macie detects and categorizes S3 content to identify potential security threats and exposures

Intended Audience

The content of this course is centered around security and compliance. As a result, this course is beneficial to those in the following roles, or their equivalent:

  • Cloud Security Architects
  • Compliance Managers
  • Cloud Administrators
  • Cloud Support & Operations


As a prerequisite of this course you should have an understanding and awareness of:

  • Amazon S3
  • AWS CloudTrail



Resources Referenced

Regular Expressions (Regex)

Lecture Transcript

Hello and welcome to this lecture where I'll be explaining how Amazon Macie makes its decisions on data classification through AWS CloudTrail logs and Amazon S3 actions. 

Data stored on Amazon S3 within your AWS account is classified by Macie, which determines its level of business sensitivity and criticality. Every data object within your Amazon S3 buckets automatically receives a perceived level of risk based on this classification process. The data values shown in the dashboard, discussed earlier, are all driven by this classification and risk assessment. So what are these categories of classification that Amazon Macie uses?

There are four categories for classification, which can be found under the settings menu within the Amazon Macie console. These are content type, file extensions, themes, and regex. The classifications within these categories cannot be ordered or modified in any way, nor can you add additional entries within each of these classifications.

Content type. The content type classification allows Macie to detect the type of file that is being stored on S3, for example a binary file, a document, or a source code object. To do this, Amazon Macie uses an identifier that is embedded in the header of the file. If you look at the different content types available, you will notice that there is a long list of types and every entry has the following fields: name, description, classification, risk, and enabled. The first two fields are self-explanatory. The classification field specifies the content type of that type of file. For example, an Adobe Illustrator file is classified as a document and a Wireshark packet capture is classified as binary. The risk is a value between 1 and 10 and defines the business risk value of that type of content. Adobe Illustrator files have a risk value of one and Wireshark files have a risk value of six. Finally, you can choose to have the content type entry enabled or disabled. If it's enabled and active, the value will read yes. If it's disabled, the value will read no. This setting can be changed by selecting the entry and making the change. This is the only value that you can change on the content types.

File extensions. The file extension classification looks at the file extension of the object to ascertain its risk value. File extensions use the same fields as the content types we just discussed.

Themes. Themes operate differently to both content type and file extensions in that they assess the object based upon a series of keywords that are detected within the actual object itself. These keywords and their combinations determine the risk level assigned to the object. The fields for themes are theme title, minimum keyword combinations, risk, and enabled. Examples of these titles are 'American Express Credit Card Keywords' or 'Audit Keywords', allowing you to determine the types of words that are being assessed. You can click on any of these entries to look at the actual words that are being scanned for. In the case of 'Audit Keywords', these are audit, risk assessment, security, and evaluation. The minimum keyword combinations field is a numerical value showing how many of these keywords must be present in the object to dictate the risk level.
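To make the idea of minimum keyword combinations more concrete, here is a minimal Python sketch. It is not how Macie is implemented; the function names and the sample text are hypothetical, and only the four 'Audit Keywords' come from the lecture. It simply counts how many distinct keywords occur in a piece of text and compares that count with a minimum.

```python
import re

# Keywords quoted in the lecture for the 'Audit Keywords' theme.
AUDIT_KEYWORDS = ["audit", "risk assessment", "security", "evaluation"]


def count_theme_keywords(text, keywords):
    """Return how many distinct keywords appear in the text as whole words."""
    lowered = text.lower()
    return sum(
        1
        for keyword in keywords
        if re.search(r"\b" + re.escape(keyword) + r"\b", lowered)
    )


def theme_matches(text, keywords, minimum_combinations):
    """True when at least `minimum_combinations` distinct keywords are present."""
    return count_theme_keywords(text, keywords) >= minimum_combinations


# Hypothetical sample content, used only for illustration.
sample = "The annual security audit includes a risk assessment of each vendor."
print(count_theme_keywords(sample, AUDIT_KEYWORDS))
print(theme_matches(sample, AUDIT_KEYWORDS, 2))
```

In this sketch a theme with a minimum of two would match the sample text, while a theme requiring all four keywords would not, because 'evaluation' is absent.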

Regex. Like themes, regex, or regular expressions, classify content based on the actual content within the object. A regular expression is a text string that describes a specific search pattern, allowing Amazon Macie to look for specific data within the content to calculate its risk. If you would like to learn and understand more about regex, see the Regular Expressions (Regex) link listed under Resources Referenced. For each object stored on S3, Amazon Macie will assign a content type, file extension, theme, and regex value before defining its final risk value. The final risk value is the highest value detected across these categories. For example, you may have an Excel document containing UK passport numbers, and the risk values may be classified as follows: a content type, file extension, and theme all with a risk value of one, and a regex value of five. This would give a result of five, as that was the highest value obtained for that data object.
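The Excel example can be sketched in a few lines of Python. This is an illustration only: the pattern below is a hypothetical stand-in that treats any standalone run of nine digits as a candidate passport number, since the lecture does not show Macie's actual expressions, and the function names are made up for this sketch. The risk values of one and five are the ones quoted in the example.

```python
import re

# Hypothetical pattern: any standalone run of nine digits is treated as a
# candidate passport number. Macie's real expressions are not shown here.
CANDIDATE_PASSPORT = re.compile(r"\b\d{9}\b")


def regex_risk(text, pattern, risk_if_found):
    """Return the pattern's risk value if it matches the text, otherwise 0."""
    return risk_if_found if pattern.search(text) else 0


def final_risk(category_risks):
    """The final risk is the highest value across the four categories."""
    return max(category_risks.values())


# Hypothetical object content, used only for illustration.
text = "Passport no. 123456789 issued in London"
risks = {
    "content_type": 1,
    "file_extension": 1,
    "theme": 1,
    "regex": regex_risk(text, CANDIDATE_PASSPORT, 5),
}
print(final_risk(risks))  # 5
```

Taking the maximum rather than a sum or an average means that a single high-risk finding is enough to raise the risk of the whole object.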

During Amazon Macie's process to classify data, it also performs automatic PII classification. This uses a list of predefined metrics relating to PII, each of which is assigned a low, moderate, or high rating depending on the quantities found within the object. The PII data searched for includes full names, mailing addresses, email addresses, credit card numbers, IP addresses, driver license IDs, national identification numbers, and birth dates.

Amazon Macie uses two methods for protecting your data, both of which combine AWS CloudTrail data with artificial intelligence and machine learning to assess and review historical patterns of access. Using this historical data provided by CloudTrail, Macie can detect if there is unusual behavior occurring within your account that could potentially lead to your data being compromised. The two methods are AWS CloudTrail events and AWS CloudTrail errors. Both of these can be accessed from the settings menu within the Macie console, along with the data classification categories.

CloudTrail events. This provides a list of CloudTrail events along with the associated risk value of each API. The fields include name, description, classification, risk, and enabled. The classification field relates to the resource that a particular API is actionable against. It's also possible to search the events by name. As you can see in the image, I've searched for get, which has returned all events with get in the title. The image shows just three of the results returned. As expected, an API starting with get is likely to have a higher risk value than an API starting with list, as a get is generally a request to retrieve data, which could indicate an intrusion of some kind depending on the API in question. These three get events have a rating of eight, eight, and nine, which again is marked out of 10.

CloudTrail errors. This looks at the different errors that are generated and reported within CloudTrail. If you perform an action within AWS and receive an error back, it is generally because you did not have permission or access to the resource, you were using the wrong credentials, or you made some other kind of invalid request. These are all common errors that can occur when someone is trying to access or perform an action against something that they shouldn't. So from a security awareness and assessment point of view, these are crucial. As a result, the minimum risk value for these errors is five, with some reaching a risk value of 10, the highest possible value.
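As a rough illustration of why these errors matter, the Python sketch below picks out the failed calls from a handful of CloudTrail-style records. CloudTrail records carry an errorCode field when a request fails, which is what the sketch relies on. The risk table, the function name, and the sample records are all hypothetical; the lecture only states that error risk values range from five to 10, so the value given to AccessDenied here is an assumption and not Macie's published rating.

```python
# Hypothetical risk table. The lecture only says error risks run from 5 to 10,
# so the value assigned to AccessDenied is an assumption for this sketch.
ERROR_RISK = {"AccessDenied": 10}
DEFAULT_ERROR_RISK = 5  # minimum error risk value stated in the lecture


def failed_calls(records):
    """Return (eventName, errorCode, risk) for each record with an errorCode."""
    results = []
    for record in records:
        code = record.get("errorCode")
        if code:
            risk = ERROR_RISK.get(code, DEFAULT_ERROR_RISK)
            results.append((record.get("eventName"), code, risk))
    return results


# Hypothetical, heavily trimmed records; real CloudTrail records have many more fields.
records = [
    {"eventName": "GetObject", "eventSource": "s3.amazonaws.com", "errorCode": "AccessDenied"},
    {"eventName": "ListBuckets", "eventSource": "s3.amazonaws.com"},
    {"eventName": "PutObject", "eventSource": "s3.amazonaws.com", "errorCode": "NoSuchBucket"},
]

for name, code, risk in failed_calls(records):
    print(name, code, risk)
```

Successful calls have no errorCode and are skipped, so only the two failed requests are reported.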

That now brings me to the end of this lecture covering the classification and protection of data within Amazon Macie.

About the Author

Stuart has been working within the IT industry for two decades covering a huge range of topic areas and technologies, from data center and network infrastructure design, to cloud architecture and implementation.

To date, Stuart has created 150+ courses relating to Cloud reaching over 180,000 students, mostly within the AWS category and with a heavy focus on security and compliance.

Stuart is a member of the AWS Community Builders Program for his contributions towards AWS.

He is AWS certified and accredited in addition to being a published author covering topics across the AWS landscape.

In January 2016 Stuart was awarded ‘Expert of the Year Award 2015’ from Experts Exchange for his knowledge share within cloud services to the community.

Stuart enjoys writing about cloud technologies and you will find many of his articles within our blog pages.