What is Amazon Athena? The New Game Changer for Big Data

The 2016 edition of AWS re:Invent was an exciting week of announcements from Andy Jassy and Werner Vogels: pricing reductions, killer features, and plenty of new services.

The Cloud Academy team tried to catch every detail of this amazing week-long conference. We ran from one session to another, got lost in the maze of booths, and met many enthusiastic customers at the re:Invent pub crawl and at re:Play, one of the biggest, most fun parties that I’ve ever attended.

What is Amazon Athena: a complete overview

Amazon Athena is probably the most promising of the services announced last week in Las Vegas. In fact, big data was one of the main topics discussed at re:Invent 2016, together with AI and IoT. We gathered a lot of information on Athena at the special session led by Rahul Pathak, general manager of Amazon EMR at AWS. In this post, I will cover Athena’s main features, use cases, and pricing details.

What is Amazon Athena? It is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL. This means that you can store structured data on S3 and query it as you would with an SQL database. Athena is serverless: there is no infrastructure to manage, no setup, and no servers or data warehouses to provision. The new Athena query engine unlocks the data already sitting in S3 without adding any maintenance burden. You can just create a table over your data in S3 and start querying.

As mentioned during the session, Athena complements Amazon Redshift and Amazon EMR.

Amazon Athena Features

Athena is backed by Presto, an open source distributed SQL query engine that allows you to run interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. CREATE TABLE statements and other DDL (Data Definition Language) are written in the dialect of Apache Hive, which is meant to facilitate reading, writing, and managing large, distributed datasets. Hive supports SQL, but also adds concepts such as external tables and data partitioning. Your metadata (table definitions, column names, and so on) is stored in the Athena metadata store.
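
To make the Hive-style DDL more concrete, here is a minimal Python sketch that assembles a CREATE EXTERNAL TABLE statement for comma-delimited text files on S3. The table name, column names, and bucket are hypothetical placeholders, and the helper `build_create_table` is our own illustration, not part of any AWS library.

```python
def build_create_table(table, columns, location, delimiter=","):
    """Return a Hive-style CREATE EXTERNAL TABLE statement for delimited text on S3."""
    # One "`name` type" entry per column, using Hive type names such as string or int.
    column_sql = ",\n".join(f"  `{name}` {hive_type}" for name, hive_type in columns)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n"
        f"{column_sql}\n"
        ")\n"
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table over log files stored under an example bucket.
ddl = build_create_table(
    "access_logs",
    [("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    "s3://example-bucket/logs/",
)
print(ddl)
```

The table is external: the statement only records where the files live and how to parse them, so the data itself stays in S3.
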
As with any standard DBMS (Database Management System), Athena supports complex joins, nested queries, and window functions. Complex data types, such as arrays and structs, are also supported. You can partition data by any key, including custom date and time keys. Of course, you can connect to Athena with your favourite SQL client.
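
As a rough illustration of date-based partitioning, the sketch below builds a Hive-style partition prefix (`key=value` path segments) and the matching ALTER TABLE statement. It assumes a table that was declared with string partition columns named `year`, `month`, and `day`; those names, the table, and the bucket are hypothetical.

```python
from datetime import date


def partition_prefix(base, day):
    """Return the S3 prefix for one day, using Hive-style key=value segments."""
    return f"{base.rstrip('/')}/year={day.year}/month={day.month:02d}/day={day.day:02d}/"


def add_partition_sql(table, base, day):
    """Return a Hive-style statement that registers one daily partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION "
        f"(year='{day.year}', month='{day.month:02d}', day='{day.day:02d}') "
        f"LOCATION '{partition_prefix(base, day)}'"
    )


# Hypothetical example: register the logs for 1 December 2016.
print(add_partition_sql("access_logs", "s3://example-bucket/logs/", date(2016, 12, 1)))
```

A query that filters on the partition columns only needs to read the matching prefixes, which is what keeps the amount of scanned data down.
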
You can store data as S3 objects in several file formats:

  • Text files, CSV, raw logs
  • Apache web logs
  • JSON
  • Compressed files
  • Columnar formats, such as Apache Parquet or Apache ORC

Eventually, you may want to use Hive CTAS (CREATE TABLE AS SELECT) or Spark to convert your data to a columnar format such as ORC or Parquet.
As soon as you run a query, you get the results as a data stream directly from Amazon S3, just as if you were querying a real SQL database. Queries can be executed either through the API or from the AWS Console. The AWS Console also shows the query running time and the amount of data scanned, in bytes.
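
For the API route, a minimal sketch might look like the following. It assumes a client object that exposes `start_query_execution` and `get_query_execution` in the shape of the boto3 Athena client (for example `boto3.client("athena")`); the helper name `run_query`, the polling interval, and the bucket in the usage note are our own assumptions.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait until it finishes, and return (state, bytes_scanned, millis)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        # Athena writes the result files to this S3 location.
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    # The same figures the console reports: data scanned and running time.
    stats = execution.get("Statistics", {})
    return (
        state,
        stats.get("DataScannedInBytes", 0),
        stats.get("EngineExecutionTimeInMillis", 0),
    )
```

With boto3 installed and credentials configured, a call could look like `run_query(boto3.client("athena"), "SELECT 1", "default", "s3://example-bucket/results/")`. Passing the client in as a parameter also makes the function easy to exercise with a stub.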

With Amazon Athena, you won’t have to worry about scaling, performance, or maintenance. You will have enough compute resources to get fast, interactive query performance. Athena automatically executes queries in parallel, even over petabytes of data, so most results come back within seconds. This is possible because Athena uses warm compute pools across multiple Availability Zones.

As Rahul Pathak pointed out, Amazon Athena is really fast:

  • Athena is tuned for performance.
  • Queries are automatically parallelized.
  • You can get results streamed directly to the console.
  • You can store query results in Amazon S3.

In my personal opinion, performance is still an open concern, as no benchmarks for big datasets have been publicly released, although we saw very interesting performance results during the session. The presenter used the Apache Parquet format: with just 20 lines of PySpark code running on EMR, the demo converted 1 TB of text data into 130 GB of Parquet data. This conversion reduced both the storage footprint and the query time, resulting in much lower costs.

Finally, the built-in integration with Amazon QuickSight allows you to visualize your data.

Amazon Athena Use Cases

During the session, Rahul Pathak presented two common use cases where Athena could be a game changer:

  • Log storage and analysis
  • Data warehouse for events

In such scenarios, the need to store gigabytes or petabytes of structured data can be a real problem. Accessing that data in a fast, easy, and secure way is even more difficult, painful, and time-consuming. Athena aims to solve these problems by combining the power of Amazon S3 storage with the SQL query language. This allows you to operate on your data easily and without worrying about scaling. Indeed, you will get results within seconds, even on very large datasets.

What is Amazon Athena: pricing

Athena’s pricing is very simple: You pay only for the queries you run, at $5 per TB of data scanned from Amazon S3.

DDL statements (CREATE, ALTER, DROP), partitioning queries, and failed queries are completely free. If you cancel a query, you will be charged only for the data scanned up to that point. Of course, you can reduce costs by using compression, columnar formats, and partitions. With such techniques, Athena has to scan less data from Amazon S3.

In practice, there is no charge directly related to computation itself, so you can always estimate the total cost purely based on the amount of data that you need to work with.
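
As a back-of-the-envelope sketch of that estimate, the snippet below applies the $5 per TB price to the two dataset sizes from the session (1 TB of text versus 130 GB of Parquet), assuming a query that scans the whole dataset. It assumes decimal units (1 TB = 1000 GB) and ignores any per-query minimums or rounding that AWS may apply; the function name `query_cost_usd` is our own.

```python
PRICE_PER_TB_USD = 5.0    # price per TB scanned, as quoted in this post
BYTES_PER_TB = 1000 ** 4  # assumption: decimal terabytes
BYTES_PER_GB = 1000 ** 3  # assumption: decimal gigabytes


def query_cost_usd(bytes_scanned):
    """Estimate the cost of one query from the number of bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


raw_text = 1 * BYTES_PER_TB    # full scan of the original text data
parquet = 130 * BYTES_PER_GB   # full scan of the converted Parquet data

print(f"Text scan:    ${query_cost_usd(raw_text):.2f}")
print(f"Parquet scan: ${query_cost_usd(parquet):.2f}")
```

Under these assumptions the full scan drops from about $5 to well under $1, and a columnar format typically lets a query read only the columns it needs, so real queries may scan even less.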

Conclusion

We are building a world that requires ever faster communication and where information has a key role in controlling markets, economies, and business activities. This requires us to be able to store and retrieve huge amounts of data. Whether you are launching a new product or iterating on an established one, this is something that you cannot ignore.
Data storage and data analysis can drive product outcomes for both startups and large companies. As a result, the availability of easy, fast, and cheap tools for managing data is crucial for delivering and maintaining services.

Will Amazon Athena play a big role in this transformation? Atlassian is already using Amazon Athena, and I’m pretty sure that the number of adopters will increase over the next few months. The official information sounds really promising so far, and such an interesting technology backed by AWS infrastructure cannot go unnoticed.

Let us know what you like or dislike about Amazon Athena and how it will affect your next project. Many of you have brilliant ideas and application scenarios, and we can’t wait to hear about them.

Written by

Antonio Trapani

Computer Engineer, passionate software developer, eXtreme programmer, snowboarder. Music and beer addict. Despite his strong computer background, he constantly explores new and smart ways to create things, believing that code is creation, and creation is art. Hungry for new technologies, he likes to work in web startups.

