Who is Athena again? Athena is the Greek goddess of wisdom, craft, and war. (But at least she had a calm temperament, and only fought for a just cause!) This post, however, is about Amazon Athena, specifically about using it to query S3 data for CloudTrail logs, and I trust it will bring some wisdom your way.
Amazon Athena is an interactive, serverless query service that allows you to query massive amounts of structured S3 data using standard structured query language (SQL) statements. Athena is fast, inexpensive, and easy to set up. There is certainly some wisdom in using Amazon Athena, and you can get started using Athena by:
- Pointing to your S3 data
- Defining the schema
- Issuing SQL queries
Athena will take care of the rest. No need for massive extract, transform, load (ETL) jobs, or data migrations. Your data remains on S3. Maybe some things really are as easy as 1-2-3!
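To make the 1-2-3 concrete before we get to CloudTrail, here is what the whole workflow might look like for a plain CSV file. Everything in this sketch (the bucket, path, and columns) is made up for illustration:

```sql
-- Steps 1 and 2: point at the S3 data and define its schema in one statement
CREATE EXTERNAL TABLE sales (
  order_id INT,
  amount   DOUBLE,
  region   STRING
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/sales/';

-- Step 3: issue standard SQL against the data, which never leaves S3
SELECT region, SUM(amount) AS total FROM sales GROUP BY region;
```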
Using Amazon Athena to query S3 data: CloudTrail Logs
Amazon CloudTrail allows you to get a history of AWS API calls and related events. Some Amazon services are integrated with CloudTrail but do not log directly to the CloudTrail API activity history. I recently bumped into two such services: AWS Key Management Service (KMS) and Amazon Simple Notification Service (SNS).
So how do you review activity for those services? Sifting through the raw log files by hand is, practically speaking, pretty much impossible. Once enabled, CloudTrail captures and logs an enormous amount of data to S3, especially if you use several AWS services in multiple regions. Amazon Athena is an interactive query service that allows you to issue standard SQL commands to analyze data on S3. As the SQL name itself implies, the data must be structured; it requires a defined schema. Fortunately, Amazon has a defined schema for CloudTrail logs that are stored in S3.
Defining the Schema for CloudTrail Logs
When you first start Athena, there are no databases or tables that you can query. The CREATE EXTERNAL TABLE command shown below essentially defines a schema based on the documented CloudTrail Record Contents.
Copy and paste the following SQL table creation statement into Athena’s Query Editor window (making the changes suggested below before running the query):
CREATE EXTERNAL TABLE IF NOT EXISTS cloudtrail_logs (
eventversion STRING,
userIdentity STRUCT<
  type:STRING,
  principalid:STRING,
  arn:STRING,
  accountid:STRING,
  invokedby:STRING,
  accesskeyid:STRING,
  userName:STRING,
  sessioncontext:STRUCT<
    attributes:STRUCT<
      mfaauthenticated:STRING,
      creationdate:STRING>,
    sessionissuer:STRUCT<
      type:STRING,
      principalId:STRING,
      arn:STRING,
      accountId:STRING,
      userName:STRING>>>,
eventtime STRING,
eventsource STRING,
eventname STRING,
awsregion STRING,
sourceipaddress STRING,
useragent STRING,
errorcode STRING,
errormessage STRING,
requestparameters STRING,
responseelements STRING,
additionaleventdata STRING,
requestid STRING,
eventid STRING,
resources ARRAY<STRUCT<
  ARN:STRING,
  accountId:STRING,
  type:STRING>>,
eventtype STRING,
apiversion STRING,
readonly STRING,
recipientaccountid STRING,
serviceeventdetails STRING,
sharedeventid STRING,
vpcendpointid STRING
)
ROW FORMAT SERDE 'com.amazon.emr.hive.serde.CloudTrailSerde'
STORED AS INPUTFORMAT 'com.amazon.emr.cloudtrail.CloudTrailInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
LOCATION 's3://YourS3BucketName/PrefixNameIfYouHaveOne/AWSLogs/YourAccountNumber/CloudTrail/RegionName/';
Important! Before running the above query, be sure to substitute the following critical information to customize the table for your environment. (All of the substitutions appear in the LOCATION clause at the end of the statement.)
- YourS3BucketName: The unique S3 bucket name you created when turning on CloudTrail logging.
- PrefixNameIfYouHaveOne: If you set up a prefix name to help organize your S3 structure, include that too.
- YourAccountNumber: Available within your S3 bucket path, or click My Account in the drop-down menu in the upper right of the AWS Management Console. The account number is a unique 12-digit number.
- RegionName: If you want to narrow the table and subsequent SQL queries to a single region, enter that region here. If you log all regions and have a massive amount of data, consider creating one table per region and querying each separately. Note: Queries that run longer than 30 minutes time out; this is a service limit imposed by AWS. You can either request an increase by contacting AWS Support or find ways to reduce the data each query scans (such as segmenting it by region, as suggested here).
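For example, with a hypothetical bucket named my-cloudtrail-bucket, no prefix, account number 123456789012, and the table narrowed to us-east-1, the LOCATION clause would read:

```sql
LOCATION 's3://my-cloudtrail-bucket/AWSLogs/123456789012/CloudTrail/us-east-1/'
```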
Click Run Query to create the table with the defined schema. The new table should appear in the left pane of the Athena console:
Click the table name to expand it and reveal the schema defined in the previous Create External Table command:
Click the Preview Data eyeball icon adjacent to the table name. The Preview Data icon is a shortcut for running the following SQL statement:
SELECT * FROM cloudtrail_logs limit 10;
Use the horizontal scroll bar to view the table columns and get a better idea of the type of data in your table.
The eventsource and eventname columns are particularly interesting. Eventsource shows the service responsible for the entry, such as S3, SNS, KMS, or CloudTrail itself. Eventname reveals the specific event logged by that service. Recall that some events from these services do not show up in the CloudTrail API activity history; see Amazon SNS Information in CloudTrail, for example. CloudTrail logs key events from the service and includes them in the data saved to S3, but the events are not readily available in the API activity history in CloudTrail.
Create External Table: A brief detour
The most challenging part of using Athena is defining the schema via the CREATE EXTERNAL TABLE command. In this post, we address the CloudTrail log file but realize that there are an infinite number of other use cases. Typing up the entire create table command might not flow from your fingertips, but Athena can help you there too. The Add Table command includes a four-step wizard. The Data Format and Columns steps can be very helpful.
Several common data formats are included in the wizard, including Apache Web Logs, comma separated values (CSV), tab separated values (TSV), text files with custom delimiters (such as a pipe “|”), and JSON.
You may have noticed the ROW FORMAT SERDE portion of the CREATE EXTERNAL TABLE command earlier. “SerDe” stands for serializer/deserializer. It is the secret sauce that simplifies the process of deserializing data from CloudTrail log files to create Athena tables. Each of the formats shown above has a SerDe. (See Supported Formats and SerDes for more information on other SerDes.)
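To get a feel for what the CloudTrail SerDe is doing for you, here is a small Python sketch (not Athena itself) that deserializes a raw log file body into flat rows with the same column names our table uses. The sample record is trimmed and hypothetical; real CloudTrail files on S3 are gzipped JSON documents with a top-level "Records" array carrying many more fields:

```python
import json

# A trimmed, hypothetical CloudTrail log file body (real files are gzipped
# and contain many more fields per record).
raw_log = json.dumps({
    "Records": [
        {
            "eventVersion": "1.05",
            "eventSource": "sns.amazonaws.com",
            "eventName": "CreateTopic",
            "awsRegion": "us-east-1",
            "errorCode": None,
        },
        {
            "eventVersion": "1.05",
            "eventSource": "kms.amazonaws.com",
            "eventName": "DeleteAlias",
            "awsRegion": "us-east-1",
            "errorCode": "AccessDenied",
        },
    ]
})

def deserialize(log_body):
    """Flatten each JSON record into the column names the Athena table uses."""
    records = json.loads(log_body)["Records"]
    return [
        {
            "eventsource": r.get("eventSource"),
            "eventname": r.get("eventName"),
            "awsregion": r.get("awsRegion"),
            "errorcode": r.get("errorCode"),
        }
        for r in records
    ]

rows = deserialize(raw_log)
# Rough equivalent of: SELECT * FROM cloudtrail_logs WHERE errorcode IS NOT NULL
errors = [row for row in rows if row["errorcode"] is not None]
print(errors)  # the single failed KMS call
```

The SerDe performs this kind of flattening at query time, across every log file under the table's LOCATION, so you never run code like this yourself.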
Add each column in the table along with its data type. The wizard includes common data types in the drop-down menu. For a smaller data set, defining each column name and type one by one is fine. For more extensive data sets, try the Bulk Add Columns feature, which lets you define columns as comma-separated name and type pairs.
For example, enter the following:
First string, Last string, CustID int, CustSince date, Support string, MTD float, YTD float
These settings build out the following column set for you:
From there, you can add or delete columns as needed before creating your table. Of course, the example above is a basic one, but a bulk add could include hundreds of columns and can save you a lot of time and effort. Usually, you can pull this information out of a CSV or similar file, making the bulk add even easier to use.
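Under the hood, the wizard turns those column definitions into the same kind of CREATE EXTERNAL TABLE statement we wrote by hand earlier. For the hypothetical customer columns above, stored as CSV in a made-up bucket, the generated statement would look roughly like this:

```sql
CREATE EXTERNAL TABLE IF NOT EXISTS customers (
  First     STRING,
  Last      STRING,
  CustID    INT,
  CustSince DATE,
  Support   STRING,
  MTD       FLOAT,
  YTD       FLOAT
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-example-bucket/customers/';
```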
Some easy sample queries to help get you started
Show the events where the subscriptions for a topic were listed:
SELECT * FROM cloudtrail_logs WHERE eventname = 'ListSubscriptionsByTopic';
Tip: Although the above query is relatively basic, some queries are lengthy and can prove difficult to read. Use the Format Query button to make the query more readable:
SNS events where topics were created or subscribed to:
SELECT * FROM cloudtrail_logs WHERE eventname='CreateTopic' OR eventname='Subscribe'
Check for errors in your CloudTrail logs on S3:
SELECT * FROM cloudtrail_logs WHERE errorcode IS NOT NULL
This may result in more errors than you are prepared to evaluate efficiently, and some are actually part of normal operation. It helps to know as much information about the error as possible to locate the information within your query results.
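To get your bearings before diving into individual records, it can help to see which error codes dominate. A simple aggregation (standard SQL, so it runs unchanged in Athena) does the trick:

```sql
SELECT errorcode, count(*) AS occurrences
FROM cloudtrail_logs
WHERE errorcode IS NOT NULL
GROUP BY errorcode
ORDER BY occurrences DESC;
```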
If there are too many errors as the result of the previous query, focus on the type of error message you are most interested in, and structure your query accordingly. For example, this query returns authorization errors:
SELECT * FROM cloudtrail_logs WHERE errorcode IS NOT NULL and errormessage LIKE 'Not %auth%'
Because CloudTrail receives events from many supported AWS services, if you are only interested in events from SNS, a query like this one can be very helpful:
SELECT * FROM cloudtrail_logs where eventsource LIKE '%sns%'
Of course, you can tailor the query to other services as well.
Check to see if there are any pending subscription requests:
SELECT * FROM cloudtrail_logs where responseelements LIKE '%pending%'
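Because responseelements is stored as a JSON string, matching with LIKE works, but you can also pull out individual fields with the json_extract_scalar function Athena supports. The JSON path below ($.topicArn) is an assumption based on typical SNS CreateTopic records; preview a real record first to confirm the field name:

```sql
SELECT eventname,
       -- '$.topicArn' is a hypothetical path; verify it against your data
       json_extract_scalar(responseelements, '$.topicArn') AS topic_arn
FROM cloudtrail_logs
WHERE eventname = 'CreateTopic';
```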
As previously mentioned, AWS Key Management Service (KMS) is another service that is integrated with CloudTrail but not surfaced in the CloudTrail API activity history, which makes it another great use case for running a few queries through Athena.
To see all KMS events:
SELECT * FROM cloudtrail_logs where eventsource='kms.amazonaws.com'
To see all KMS events where a key is either created or deleted:
SELECT * FROM cloudtrail_logs WHERE eventsource='kms.amazonaws.com' AND (eventname LIKE 'Delete%' OR eventname LIKE 'Create%')
The parentheses matter here: without them, AND binds more tightly than OR, and the query would also return Create events from every other service.
Once a table is defined in Athena, the data is only a SQL query away. Although CloudTrail logs and SNS or KMS were used as examples to pique your interest, the sky is the limit. Query for other AWS services, or create a table and query your own proprietary data on S3. Consider incorporating CloudWatch and Simple Notification Service (SNS) so that push notifications are sent to administrators when specific events occur. It just might make your life easier, and save you money.
I hope this post has given you some ideas for how your organization could use Athena. You’ll be an Athena ninja in no time!
You can check out our previous blog post on What is Amazon Athena?
To keep current, consult the AWS documentation for Amazon Athena and CloudTrail.