The less obvious, but really good to know part of Amazon Athena
Back in August when I wrote Using Amazon Athena to Query S3 data for CloudTrail logs, I didn’t originally intend for it to be a two-part post. You may have heard the proverb, “Necessity is the mother of invention.” Well, necessity is often the catalyst for exploring new Amazon Web Services as well. Specifically, there were two issues that inspired me to learn more about Amazon Athena in the first place:
- Several AWS Services that were integrated with CloudTrail did not show up in the API Event History (API logs).
- It became impractical to manually navigate and extract meaningful data from Massive amounts of data throughout an S3 hierarchy.
To solve these issues, I turned to Amazon Athena. As a refresher, Athena is an interactive, serverless query service that allows you to query massive amounts of structured S3 data using standard structured query language (SQL) statements.
In learning about the service’s various features and benefits, I also came across several things that required a second look to understand them better, namely: dynamic data, recursion, and cost savings. If these aspects of the service were unclear to me, they might be unclear to you as well. Let’s dig in to take a closer look.
Once you create a database table in Athena, the data is dynamic. Why is that important? The table in Athena is not a snapshot in time.
Using CloudTrail logs as an example, new events delivered after the CREATE EXTERNAL TABLE command to CloudTrail logs on S3 will appear in query results. There is no need to drop the old table and create a new one before you run additional queries. Of course, this is because your data remains on S3. The CREATE EXTERNAL TABLE essentially defines a schema, but data is not loaded from S3 to a new database table. Instead, once the schema is defined, you can query what is already on S3 using familiar SQL. Very cool.
Athena queries are recursive against the structure specified in S3. This is extremely powerful but beware: it can impact costs for large amounts of data.
Although I will use CloudTrail as an example again, remember that this also applies to data in general. CloudTrail S3 logs may include all regions. However, you might choose to create tables by region and query against just a subset of the data. This approach can save on query response time and costs. (More on costs coming up next.)
Conversely, if your dataset is not very large, when you create the table you can leave the region off the end of the statement when specifying the S3 bucket and folder. For example, you could omit a region name such as “us-west-2” off the end of the create table SQL statement. The table created would include all regions, and the queries you run would include data for all regions as well. Each has its advantages and disadvantages, but regardless, Amazon tackles the recursion through the S3 structure for you automatically.
Amazon Athena is not known for incurring additional storage costs. Athena pricing is based per terabyte of data scanned by queries. The original S3 storage rates do apply, however. (Consult Amazon Athena Pricing for details.)
Because pricing is based on the amount of data you run queries against, Athena’s support for querying compressed files saves money! Recall that CloudTrail logs are compressed JSON files, yet there is no need to uncompress the files prior to running queries in Athena. Sweet!
Although orthogonal to the costs savings topic at hand, the data could be encrypted as well. If using Server Side Encryption (SSE) with AWS managed keys, Athena will decrypt the data as well so your query results will be readable. Even better! If that interests you, check out our Query encrypted S3 data with Amazon Athena beginner lab. (And stay tuned for our new intermediate-level lab to learn how to Deploy a highly available serverless application using AWS services, coming soon on Cloud Academy.)
Athena supports partitioning and columnar-formatted data, both of which can reduce costs significantly. (Note this can impact performance, however.)
Athena uses what is known as “schema-on-read” technology. This basically means that your table definitions are applied to your data in S3 when queries are executed. Again, there is no need to load or transfer data, (i.e. no extract-transform-load or ETL required). In fact, if you issue a DROP TABLE command in Athena, the underlying data is untouched. The data remains in S3, unaffected.
OK, one last time using CloudTrail as an example! Even if the CloudTrail API Event History does begin support for the AWS service you are interested in, you may still want to use Athena to front-end the logs on S3. Why? Speed (and automation capabilities). CloudTrail delivers API logs to S3 quickly (typically in less than five minutes), where the CloudTrail API activity history can take up to 15 minutes to display key events. If your event is security or outage related, every minute matters.
I hope these less than obvious details will be helpful for you in using Athena to access critical data in a timely, cost-effective manner. In the meantime, try your hand at Athena with our hands-on lab for querying S3 data.