The Cloud Academy team tried to catch every detail of this amazing week-long conference. We ran from one session to another, got lost in the maze of booths, and met many enthusiastic customers at the re:Invent pub crawl and at re:Play, one of the biggest, most fun parties that I’ve ever attended.
What is Amazon Athena: a complete overview
Amazon Athena is probably the most promising of the services announced last week in Las Vegas. In fact, big data was one of the main topics discussed at re:Invent 2016, together with AI and IoT. We gathered a lot of information on Athena at the special session led by Rahul Pathak, general manager of Amazon EMR at AWS. In this post, I will cover Athena’s main features, use cases, and pricing details.
What is Amazon Athena? It is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL. This means that you can store structured data on S3 and query it as you would with a SQL database. Athena is serverless: there is no infrastructure to manage, no setup, and no servers or data warehouses to maintain. You can just create a table that points at your data on S3 and start querying.
As mentioned during the session, Athena complements Amazon Redshift and Amazon EMR.
Amazon Athena Features
Athena is backed by Presto, an open source distributed SQL query engine that allows you to run interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. Table definitions are written as DDL (Data Definition Language) statements, such as CREATE TABLE, in the Apache Hive dialect. Hive is meant to facilitate reading, writing, and managing large and distributed datasets. It supports SQL, but also adds concepts such as external tables and data partitioning. Your metadata (table definitions, column names, and so on) is stored in the Athena metadata store.
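To make the Hive DDL part more concrete, here is a minimal sketch that assembles a CREATE EXTERNAL TABLE statement as a string. The helper name `build_create_table`, the table name `web_logs`, the columns, and the bucket `example-bucket` are all hypothetical and were not shown in the session; the statement follows the general shape of Hive DDL for delimited text files and may need adjusting for your own data.

```python
def build_create_table(table, columns, location, partition_columns=None, delimiter=","):
    """Return a Hive-style CREATE EXTERNAL TABLE statement as a string.

    columns and partition_columns are lists of (name, hive_type) pairs.
    All names used with this helper are illustrative assumptions.
    """
    column_sql = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    lines = [
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (",
        f"  {column_sql}",
        ")",
    ]
    if partition_columns:
        part_sql = ", ".join(f"`{name}` {dtype}" for name, dtype in partition_columns)
        lines.append(f"PARTITIONED BY ({part_sql})")
    # Delimited text (for example CSV) stored under an S3 prefix.
    lines.append("ROW FORMAT DELIMITED")
    lines.append(f"  FIELDS TERMINATED BY '{delimiter}'")
    lines.append(f"LOCATION '{location}'")
    return "\n".join(lines)


ddl = build_create_table(
    table="web_logs",  # hypothetical table name
    columns=[("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    location="s3://example-bucket/web_logs/",  # hypothetical bucket
    partition_columns=[("dt", "string")],
)
print(ddl)
```

Because the table is external, the statement only records where the data lives and how to read it; the objects on S3 are left untouched.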
As with any standard DBMS (Database Management System), Athena supports complex joins, nested queries, and window functions. Complex data types, such as arrays and structs, are also supported. You can partition your data by any key, including custom date and time keys. Of course, you can connect to Athena with your favourite SQL client.
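As a rough illustration of date-based partitioning, the sketch below builds a Hive-style `key=value` S3 prefix for one day of data and the matching ALTER TABLE ... ADD PARTITION statement. The partition key `dt`, the table `web_logs`, and the bucket name are assumptions made for the example, not details from the session.

```python
from datetime import date


def partition_location(base, day):
    """Build a Hive-style key=value prefix for one day of data."""
    return f"{base.rstrip('/')}/dt={day.isoformat()}/"


def add_partition_statement(table, base, day):
    """Return Hive DDL that registers one daily partition (names are illustrative)."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (dt='{day.isoformat()}') "
        f"LOCATION '{partition_location(base, day)}'"
    )


statement = add_partition_statement(
    "web_logs", "s3://example-bucket/web_logs/", date(2016, 12, 1)
)
print(statement)
```

A query that filters on `dt` then only needs to read the objects under the matching prefixes rather than the whole dataset.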
You can store data in the form of objects with several file formats:
- Text files, CSV, raw logs
- Apache web logs
- Compressed files
- Columnar formats, such as Apache Parquet or Apache ORC
You may also want to use Hive CTAS (CREATE TABLE AS SELECT) or Spark to convert your data to the ORC or Parquet formats.
As soon as you run a query, you will obtain a data stream directly from Amazon S3, just as if you were querying a real SQL database. Queries can be executed either through the API or from the AWS Console. In the AWS Console, you will also see the query running time and the amount of data scanned, in bytes.
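For the API route, a minimal sketch could look like the following. It assumes a client object that exposes `start_query_execution` and `get_query_execution` in the way the boto3 Athena client does; the helper name `run_query`, the polling interval, and the bucket and database names are hypothetical. It returns the same figures mentioned above: the final state, the bytes scanned, and the running time.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for a terminal state, and return
    (state, bytes_scanned, engine_millis).

    `client` is assumed to behave like boto3.client("athena").
    """
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]

    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            stats = execution.get("Statistics", {})
            return (
                state,
                stats.get("DataScannedInBytes", 0),
                stats.get("EngineExecutionTimeInMillis", 0),
            )
        # Still queued or running: wait before polling again.
        time.sleep(poll_seconds)
```

With AWS credentials configured, a call might look like `run_query(boto3.client("athena"), "SELECT COUNT(*) FROM web_logs", "default", "s3://example-bucket/results/")`. Passing the client in as an argument keeps the helper easy to exercise with a stub.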
With Amazon Athena, you won’t have to worry about scaling, performance, and maintenance. You will have enough compute resources to get fast, interactive query performance. Athena will automatically execute queries in parallel over petabytes of data. Therefore, most results will come back within seconds. This is made possible because Athena uses warm compute pools across multiple Availability Zones.
As Rahul Pathak pointed out, Amazon Athena is really fast:
- Athena is tuned for performance.
- Queries are automatically parallelized.
- Results are streamed directly to the console.
- You can store query results in Amazon S3.
In my opinion, performance is still an open concern, as no benchmarks for big datasets have been publicly released, although we saw very interesting performance results during the session. The presenter used the Apache Parquet format: with just 20 lines of PySpark code running on EMR, 1 TB of text data was converted into 130 GB of Parquet data. This reduced both the storage footprint and the query time, resulting in much lower costs.
Finally, the built-in integration with Amazon QuickSight allows you to visualize your data.
Amazon Athena Use Cases
During the session, Rahul Pathak presented two common use cases where Athena could be a game changer:
- Log storage and analysis
- Data warehouse for events
In such scenarios, the need to store gigabytes or petabytes of structured data can be a real problem. Accessing that data in a fast, easy, and secure way is even more difficult, painful, and time-consuming. Athena addresses these problems by combining the power of Amazon S3 storage with the SQL query language. This allows you to operate on your data easily and without worrying about scaling. Indeed, you will get results within seconds, even on very large datasets.
Amazon Athena Pricing
Athena’s pricing is very simple: You pay only for the queries you run, and you are charged $5 per TB of data scanned from Amazon S3.
DDL statements (CREATE, ALTER, DROP), partitioning queries, and failed queries are completely free. If you cancel a query, you will be charged only for the data scanned up to that point. Of course, you can reduce costs by using compression, columnar formats, and partitions. With such techniques, Athena will have to scan less data from Amazon S3.
In practice, there is no charge directly related to computation itself, so you can always estimate the total cost purely based on the amount of data that you need to work with.
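As a back-of-the-envelope illustration of that estimate, the sketch below applies the $5 per TB rate to the two dataset sizes from the session (1 TB of text versus 130 GB of Parquet). It assumes that 1 TB means 1024^4 bytes and that a query scans the whole dataset; any rounding or per-query minimums that AWS may apply are ignored. The function name `estimate_cost` is hypothetical.

```python
PRICE_PER_TB_USD = 5.0      # rate quoted in this post
BYTES_PER_GB = 1024 ** 3    # assumption: binary units
BYTES_PER_TB = 1024 ** 4    # assumption: binary units


def estimate_cost(bytes_scanned):
    """Estimate the cost in USD of a query that scans `bytes_scanned` bytes."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


# Full scan of the original text data versus the converted Parquet data.
text_cost = estimate_cost(1 * BYTES_PER_TB)
parquet_cost = estimate_cost(130 * BYTES_PER_GB)

print(f"1 TB of text:      ${text_cost:.2f}")     # $5.00
print(f"130 GB of Parquet: ${parquet_cost:.2f}")  # $0.63
```

In practice the gap can be wider, because a columnar format lets a query read only the columns it needs instead of the full 130 GB.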
We are building a world that requires ever faster communication and where information has a key role in controlling markets, economies, and business activities. This requires us to be able to store and retrieve huge amounts of data. Whether you are launching a new product or iterating on an established one, this is something that you cannot ignore.
Data storage and data analysis can drive product outcomes for both startups and large companies. As a result, the availability of easy, fast, and cheap tools for managing data is crucial to delivering and maintaining services.
Will Amazon Athena play a big role in this transformation? Atlassian is already using Amazon Athena, and I’m pretty sure that the number of adopters will increase over the next few months. The official information sounds really promising so far, and such an interesting technology backed by AWS infrastructure cannot pass unnoticed.
Let us know what you like or dislike about Amazon Athena and how it will affect your next project. Many of you have brilliant ideas and application scenarios, and we can’t wait to hear about them.