The Cloud Academy team tried to catch every detail of this amazing week-long conference. We ran from one session to another, got lost in the maze of booths, and met many enthusiastic customers at the re:Invent pub crawl and at re:Play, one of the biggest, most fun parties that I’ve ever attended.
What is Amazon Athena: a complete overview
Amazon Athena is probably the most promising of the services announced last week in Las Vegas. In fact, big data was one of the main topics discussed at re:Invent 2016, together with AI and IoT. We gathered a lot of information on Athena at the special session led by Rahul Pathak, general manager of Amazon EMR at AWS. In this post, I will cover Athena’s main features, use cases, and pricing details.
What is Amazon Athena? It is an interactive query service that makes it easy to directly analyze data on Amazon S3 using standard SQL. It means that you can store structured data on S3 and query that data as you’d do with an SQL database. Athena is serverless: there are no servers or data warehouses to set up, manage, or maintain. You can just create a table that points at your data in S3 and start querying.
As mentioned during the session, Athena complements Amazon Redshift and Amazon EMR.
Amazon Athena Features
Athena is backed by Presto, an open source distributed SQL query engine that allows you to run interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. CREATE TABLE statements and other DDL (Data Definition Language) are written in the Apache Hive dialect; Hive is meant to facilitate reading, writing, and managing large, distributed datasets. Hive supports SQL, but also adds concepts such as external tables and data partitioning. Your metadata (table definitions, column names, and so on) is stored in the Athena metadata store.
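To make the Hive-style DDL more concrete, here is a minimal sketch in Python that assembles a CREATE EXTERNAL TABLE statement as a string. The helper name `build_create_table`, the table name `web_logs`, the column list, and the bucket `s3://example-bucket/logs/` are all hypothetical and chosen only for illustration; the statement layout follows standard Hive DDL for delimited text files, which may not match the tables shown in the session.

```python
def build_create_table(table, columns, location, partitions=None, delimiter=","):
    """Return a Hive-style CREATE EXTERNAL TABLE statement as a string.

    columns and partitions are lists of (name, type) pairs. All names used
    in the example call below are hypothetical.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    lines = [f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (", f"  {cols}", ")"]
    if partitions:
        parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partitions)
        lines.append(f"PARTITIONED BY ({parts})")
    # Delimited text (for example CSV) stored under an S3 prefix.
    lines.append("ROW FORMAT DELIMITED")
    lines.append(f"FIELDS TERMINATED BY '{delimiter}'")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


ddl = build_create_table(
    "web_logs",  # hypothetical table name
    [("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    "s3://example-bucket/logs/",  # hypothetical bucket
    partitions=[("year", "string"), ("month", "string")],
)
print(ddl)
```

The table is external, so dropping it removes only the metadata; the objects in S3 stay where they are.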
As with any standard DBMS (Database Management System), Athena supports complex joins, nested queries, and window functions. Complex data types, such as arrays and structs, are also supported. You can partition your data by any key, including custom date and time keys. Of course, you can connect to Athena with your favourite SQL client.
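As an illustration of date-based partitioning, the sketch below builds an S3 prefix in the Hive-style `key=value` layout, which is one common way to organize partitioned data. The function name `partition_prefix`, the key names `year`, `month`, and `day`, and the bucket are assumptions made for this example and do not come from the session.

```python
from datetime import date


def partition_prefix(base, day):
    """Build a Hive-style partition prefix (key=value folders) for one day."""
    return (
        f"{base.rstrip('/')}/"
        f"year={day.year:04d}/month={day.month:02d}/day={day.day:02d}/"
    )


# Hypothetical bucket and date, used only for illustration.
prefix = partition_prefix("s3://example-bucket/logs", date(2016, 12, 5))
print(prefix)
```

A query that filters on these keys only needs to read the objects under the matching prefixes.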
You can store data in the form of objects with several file formats:
- Text files, CSV, raw logs
- Apache web logs
- Compressed files
- Columnar formats, such as Apache Parquet or Apache ORC
Over time, you may want to use Hive CTAS or Spark to convert your data to the ORC or Parquet format.
As soon as you run a query, you will obtain a data stream directly from Amazon S3, just as if you were querying a real SQL database. Queries can be executed either through the API or from the AWS Console. The AWS Console also shows the query running time and the amount of data scanned, in bytes.
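For readers who prefer the API route, here is a hedged sketch of submitting a query and reading back its running time and scanned bytes. It assumes a client object shaped like the boto3 Athena client (with `start_query_execution` and `get_query_execution`); the function name `run_query` and its parameters are hypothetical, and the client is passed in so the code itself has no AWS dependency.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Submit a query, wait for it to finish, and return basic statistics.

    `client` is assumed to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    stats = execution.get("Statistics", {})
    return {
        "query_id": query_id,
        "state": state,
        "bytes_scanned": stats.get("DataScannedInBytes"),
        "runtime_ms": stats.get("EngineExecutionTimeInMillis"),
    }
```

With boto3 installed and credentials configured, a call might look like `run_query(boto3.client("athena"), "SELECT 1", "default", "s3://example-bucket/results/")`, where the database and bucket names are placeholders.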
With Amazon Athena, you won’t have to worry about scaling, performance, and maintenance. You will have enough compute resources to get fast, interactive query performance. Athena will automatically execute queries in parallel over petabytes of data. Therefore, most results will come back within seconds. This is made possible because Athena uses warm compute pools across multiple Availability Zones.
As Rahul Pathak pointed out, Amazon Athena is really fast:
- Athena is tuned for performance.
- Queries are automatically parallelized.
- You can stream results directly to the console.
- You can store query results in Amazon S3.
In my personal opinion, performance is still an open concern, as no benchmarks for big datasets have been publicly released, although we saw very interesting performance results during the full session. The presenter used the Apache Parquet format and, with just 20 lines of PySpark code running on EMR, converted 1 TB of textual data into 130 GB of Apache Parquet data. This conversion reduced both the storage footprint and the query time, resulting in much lower costs.
Finally, the built-in integration with Amazon QuickSight allows you to visualize your data.
Amazon Athena Use Cases
During the session, Rahul Pathak presented two common use cases where Athena could be a game changer:
- Log storage and analysis
- Data warehouse for events
In such scenarios, the need to store gigabytes or petabytes of structured data can be a real problem. Accessing that data in a fast, easy, and secure way is even more difficult, painful, and time-consuming. Athena is focused on solving these problems by mixing together the power of Amazon S3 storage and the SQL query language. This allows you to operate on your data easily and without worrying about scaling. Indeed, you will get results within seconds, even on very large datasets.
What is Amazon Athena: pricing
Athena’s pricing is very simple: You pay only for the queries you run and you will be charged $5 per TB of scanned data from Amazon S3.
DDL statements (CREATE, ALTER, DROP), partitioning queries, and failed queries are completely free. If you cancel a query, you will be charged only for the data scanned up to that point. Of course, you can reduce costs by using compression, columnar formats, and partitions. With such techniques, Athena will have to scan less data from Amazon S3.
In practice, there is no charge directly related to computation itself, so you can always estimate the total cost purely based on the amount of data that you need to work with.
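As a rough worked example of this pricing model, the sketch below applies the $5 per TB rate to the two dataset sizes mentioned in the session: 1 TB of text and the 130 GB Parquet version of the same data. It assumes binary units (1 TB = 1024 GB) and a query that scans the whole dataset, and it ignores any per-query minimums or rounding that the actual bill may apply; the name `estimate_query_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.0     # rate quoted in this post
BYTES_PER_TB = 1024 ** 4   # assumption: binary terabytes
GB = 1024 ** 3


def estimate_query_cost(bytes_scanned):
    """Estimate the cost in USD of a query that scans bytes_scanned bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Full scans of the two dataset sizes mentioned in the session.
text_cost = estimate_query_cost(1024 * GB)
parquet_cost = estimate_query_cost(130 * GB)

print(f"Text scan:    ${text_cost:.2f}")
print(f"Parquet scan: ${parquet_cost:.2f}")
```

Under these assumptions a full scan of the text data costs $5.00, while the same scan over the Parquet copy costs about $0.63. Columnar formats also let a query read only the columns it needs, so the real figure for a typical query could be lower still.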
We are building a world that requires ever faster communication and where information has a key role in controlling markets, economies, and business activities. This requires us to be able to store and retrieve huge amounts of data. Whether you are launching a new product or iterating on an established one, this is something that you cannot ignore.
Data storage and data analysis can drive product outcomes for both startups and large companies. As a result, easy, fast, and cheap tools for managing data are crucial for delivering and maintaining services.
Will Amazon Athena play a big role in this transformation? Atlassian is already using Amazon Athena, and I’m pretty sure that the number of adopters will increase over the next few months. The official information sounds really promising so far, and such an interesting technology backed by AWS infrastructure cannot pass unnoticed.
Let us know what you like or dislike about Amazon Athena and how it will affect your next project. Many of you have brilliant ideas and application scenarios, and we can’t wait to hear about them.