Amazon Athena - In-depth Review
The course is part of this learning path
We dive into the internals of the Amazon Athena service, describing in detail the individual parts that make up the service. We'll introduce you to the key features and core components of the Athena service. We then take an in-depth look at the following components of Amazon Athena: Databases, Schemas, Tables, Partitions, and SQL Queries.
Let's now dive in to the details of the Amazon Athena service. Amazon Athena is an interactive query service that makes it easy to directly query data stored in S3 using standard SQL. With this service, you can now store your data in S3 and have the ability to directly query that data. No infrastructure or administration is required. You simply create a table, load some data, and start querying. Athena is ideal for the following use cases. Ad hoc exploration of data and data discovery, anyone looking to analyze and mine data stored in Amazon S3, processing stored structured data from IoT devices and application logs, ability to reuse in-house SQL skills and competencies, ability to leverage SQL without a need for a full blown relational database and associated running costs. As already alluded to, Amazon Athena has been designed with simplicity in mind. The service is extremely easy to get set up, and once up and running, becomes a powerful and productive analytics tool. At a high level, the following steps are all that is required to get going with Athena. Log in to the AWS console. Create a table using Hive DDL. Start querying.
Amazon Athena is a serverless based service. Meaning that there is no infrastructure to manage, no servers, no data warehouses to implement, no capacity planning, et cetera. With zero spin up time, you simply create an Athena data table by specifying a schema, and then load it with your S3 data. Amazon Athena can process unstructured, semi-structured, and structured data sets. Amazon Athena supports multiple data formats including the following: text files including CSV, TSV, and raw logs, Apache web logs, JSON encoded files, both simple and nested, compressed files, columnar formats such as Apache Parquet and Apache ORC, AVRO formatted files. Athena has been designed and tuned with performance in mind. Athena will automatically parallelize running queries.
Results are streamed back to the AWS console as soon as the first matching records are found. Additional performance optimizations can be used to improve query time speed, data compression, data partitioning, and/or columnar formats. Amazon Athena costs are based on the number of bytes scanned by Amazon Athena, rounded up to the nearest megabyte with a 10 megabyte minimum per query. AWS charges at a rate of $5 per terabyte of data scanned at query time. To reduce and minimize accrued costs, compress your data. Partition and/or convert it to columnar storage formats. In doing so, you'll achieve significant costs savings and performance gains since Athena will scan less data. Converting data into a columnar format allows Athena to read only the columns it needs to process the query. In summary, always consider this, anything you can do to reduce the amount of data that's being scanned will help you reduce your Amazon Athena query costs. Since Athena is built on a Hive Metastore, HiveQL syntax is used to author and write DDL statements, those such as CREATE and ALTER TABLE statements.
On the other hand, at query time, Athena queries are authored in standard ANSI SQL compliant syntax. Athena is designed to process data using a schema on read technique. This allows you to project your schema onto your data at the time your query is evaluated. This approach eliminates any ETL requirements and/or data transformation pipelines, and as such, boosts productivity for exploring and mining data stored in S3. Let's now take a look at an example CREATE TABLE statement. As you can see, it consists of several key parts. The EXTERNAL keyword is always required. The CREATE statement will fail if the EXTERNAL keyword is missing. When supplied, it informS Athena to build a view over the data on S3. If an Athena table is dropped, the raw data remains unmodified in the S3 bucket. The SERDE keyword indicates the serialization/deserialization format to work with. In this example, we're using a regular expression parser as indicated by the reference to the RegexSerDe library. Here, the supplied RegEx pattern is used to extract out the data fields. Finally, the LOCATION keyword specifies the location of the data set as stored in S3. Athena currently supports the following data formats; Avro, CloudTrail, CSV, TSV, Parquet Serde, Grok, ORC, JSON. Note, unstructured files or custom files formats for which don't have a dedicated Serde can be attacked using either the RegEx or Grok Serdes. Athena currently supports several compression formats. They are Snappy, Zlib, GZIP, LZO. As mentioned earlier, partitioning your data within Athena allows you to restrict the amount of data scanned by each query, and therefore both increases the performance of the query, and more importantly, trims the overall cost of the query. Athena table partitions are defined and created as part of the CREATE TABLE statement.
The PARTITIONED BY keyword is used to define the keys you decide to partition your data on. Partitioning data is a two-step process. First, you need to change your schema definition to include partitions. Next, you'd load the partition metadata into Athena. Since partitioning is important in terms of reducing costs, we'll take a closer look at this. In this example, we'll be storing data in JSON format with the following structure. Our data is stored and categorized in S3 using the following scheme shown here. We start by defining our table schema. In doing so, we've decided to partition our data on the age attribute. The PARTITIONED BY keyword is used to define the attribute upon which the partitioning will be established on. Note, you can partition on multiple attributes. In fact, a common practice is to partition data based on time attributes of year, month, and day, which leads to a multilevel partitioning scheme. Finally, we need to load the partitions into the table before we can start querying the data. There are two ways to load your partitions. One, manually by adding each partition using an ALTER TABLE statement. Two, automatically by adding your partitions using the MSCK REPAIR TABLE statement. In order to do this, your object key names must conform to a specific pattern. In our example, we'll go with the first option. Let's see how this looks. We individually set up partitions, one per unique age folder.
Each ALTER TABLE statement needs to be executed in our selection. That is, they need to be executed one at a time. With everything now in place, we can run our query ensuring that the data we scan in S3 is reduced. For example, if we search for users who are younger than 21 years of age, then the execution of this query will result in fewer raw files and bytes being scanned. Additionally, the query execution time will be significantly faster comparatively had the table not been partitioned. Moving on, let's now talk about the Athena Metadata Store. Amazon Athena uses an internal data catalog which stores information about your databases and table schemas that you create for your data stored in Amazon S3. You can modify the catalog using DDL statements. Any schemas that you define are automatically saved unless you explicitly delete them. The Athena data catalog is itself stored in an S3 bucket. The catalog gets created at database creation time. As seen here, the LOCATION keyword specifies where the Metastore catalog within S3 will be stored. Now that we've covered off a lot of the underlying fundamentals of the Athena service, we'll now start to dive in to the AWS Athena service console and cover off each of the important features. First up is the Query Editor.
The Query Editor is where all the action takes place. Users can author DDL statements and perform SQL queries directly from within the Query Editor. The Query Editor provides several productivity features such as auto-formatting of SQL statements, Saved Queries, and History. We'll take a closer look at the Saved Queries and History features in the next two slides. Note, when you execute an SQL query, the results are written themselves back into an S3 bucket default named aws-athena-query-results-ACCOUNTID-REGION, where ACCOUNTID is your AWS account ID and REGION is the AWS region you're currently operating within. You can change the default location in the console and set encryption options via the Settings menu item. The Saved Queries features is self-explanatory. It saves you time and effort from having to reconstruct important queries. Clicking on a query within the Saved Queries view auto-recalls it back into the Query Editor.
The History feature maintains a list of previously executed queries. Each previously executed query is recorded with the following attributes; query SQL, query submission time, encryption type, state, run time, data scanned, and action. Worthy of extra mention are the run time and data scanned attributes. As these can be used to inform you of perhaps an unperformed and/or costly query. Finally, the action column allows you to download the result evaluated at the time of execution for any query that has a state of succeeded. You can connect to Athena with other SQL clients and/or third party business intelligence tools using standard JDBC based connectivity. For example, Tableau, a data analytics visualization client, can be configured to connect to Athena using an Athena provided JDBC driver. Finally, Athena can be controlled from the command line using the standard AWSCLI. For example, to create a named query, you would run something similar to the following.
About the Author
Jeremy is the DevOps Content Lead at Cloud Academy where he specializes in developing technical training documentation for DevOps.
He has a strong background in software engineering, and has been coding with various languages, frameworks, and systems for the past 20+ years. In recent times, Jeremy has been focused on DevOps, Cloud, Security, and Machine Learning.
Jeremy holds professional certifications for both the AWS and GCP cloud platforms.