The Cloud Academy team tried to catch every detail of this amazing week-long conference. We ran from one session to another, got lost in the maze of booths, and met many enthusiastic customers at the re:Invent pub crawl and at re:Play, one of the biggest, most fun parties that I’ve ever attended.
What is Amazon Athena: a complete overview
Amazon Athena is probably the most promising of the services announced last week in Las Vegas. In fact, big data was one of the main topics discussed at re:Invent 2016, together with AI and IoT. We gathered a lot of information on Athena at the special session led by Rahul Pathak, general manager of Amazon EMR at AWS. In this post, I will cover Athena’s main features, use cases, and pricing details.
What is Amazon Athena? It is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL. You can store structured data on S3 and query it as you would with an SQL database. Athena is serverless: there is no infrastructure to manage, no setup, and no servers or data warehouses to maintain. The new Athena query engine works directly on S3 storage, so you can just create a table over your data and start querying.
As mentioned during the session, Athena complements Amazon Redshift and Amazon EMR.
Amazon Athena Features
Athena is backed by Presto, an open source distributed SQL query engine that allows you to run interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. CREATE TABLE statements and other DDL (Data Definition Language) are written in Apache Hive syntax. Hive is meant to facilitate reading, writing, and managing large, distributed datasets. It supports SQL, but also adds concepts such as external tables and data partitioning. Your metadata (table definitions, column names, and so on) is stored in the Athena metadata store.
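To make the Hive-style DDL more concrete, here is a minimal Python sketch that assembles a CREATE EXTERNAL TABLE statement for delimited text files on S3. The helper `build_create_table`, the table name, the column names, and the bucket path are all hypothetical and only serve as an illustration. The clause order follows standard Hive DDL.

```python
def build_create_table(table, columns, location, partition_keys=None, delimiter=","):
    """Return a Hive-style CREATE EXTERNAL TABLE statement for delimited text on S3.

    `columns` and `partition_keys` are lists of (name, type) pairs.
    """
    cols = ",\n  ".join(f"`{name}` {dtype}" for name, dtype in columns)
    ddl = f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
    if partition_keys:
        # Partition columns are declared separately from the data columns.
        parts = ", ".join(f"`{name}` {dtype}" for name, dtype in partition_keys)
        ddl += f"PARTITIONED BY ({parts})\n"
    ddl += (
        f"ROW FORMAT DELIMITED FIELDS TERMINATED BY '{delimiter}'\n"
        f"LOCATION '{location}'"
    )
    return ddl


# Hypothetical table over web logs, partitioned by a date key.
print(
    build_create_table(
        "web_logs",
        [("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
        "s3://example-bucket/logs/",  # assumed bucket, not from the session
        partition_keys=[("dt", "string")],
    )
)
```

The table is external, so the statement only describes files that already live under the S3 location. No data is copied or loaded.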
As with any standard DBMS (Database Management System), Athena supports complex joins, nested queries, and window functions. Complex data types, such as arrays and structs, are also supported. You can partition data by any key, including custom date and time keys. Of course, you can connect to Athena with your favourite SQL client.
You can store data in the form of objects with several file formats:
- Text files, CSV, raw logs
- Apache web logs
- Compressed files
- Columnar formats, such as Apache Parquet or Apache ORC
Eventually, you may want to use Hive CTAS or Spark to convert data to the ORC or Parquet format.
As soon as you perform a query, you will obtain a data stream directly from Amazon S3, just as if you were querying a real SQL database. Queries can be executed either through APIs or from the AWS Console. In the AWS Console, you will also see the query running time and the amount of data scanned, in bytes.
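As an illustration of the API route, the sketch below submits a query and polls until it finishes, then returns the final state together with the bytes scanned and the execution time. It assumes a client object shaped like the boto3 Athena client (`start_query_execution` and `get_query_execution`). The function name `run_query`, the database name, and the S3 output location are hypothetical.

```python
import time


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, wait for it to finish, and return (state, bytes_scanned, millis)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        execution = client.get_query_execution(QueryExecutionId=query_id)["QueryExecution"]
        state = execution["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            # Statistics may be missing for queries that never started scanning.
            stats = execution.get("Statistics", {})
            return state, stats.get("DataScannedInBytes"), stats.get("EngineExecutionTimeInMillis")
        time.sleep(poll_seconds)
```

With boto3 installed and credentials configured, a call could look like `run_query(boto3.client("athena"), "SELECT 1", "default", "s3://example-bucket/results/")`. Passing the client in as an argument keeps the function easy to try out against a stub before pointing it at a real account.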
With Amazon Athena, you won’t have to worry about scaling, performance, and maintenance. You will have enough compute resources to get fast, interactive query performance. Athena will automatically execute queries in parallel over petabytes of data. Therefore, most results will come back within seconds. This is made possible because Athena uses warm compute pools across multiple Availability Zones.
As Rahul Pathak pointed out, Amazon Athena is really fast:
- Athena is tuned for performance.
- Queries are automatically parallelized.
- You can stream results directly to the console.
- You can store query results in Amazon S3.
In my personal opinion, performance is still an open question, as no benchmarks for big datasets have been publicly released, although we saw very interesting performance results during the session. The presenter used the Apache Parquet format and, with just 20 lines of PySpark code running on EMR, converted 1 TB of textual data into 130 GB of Apache Parquet data. This approach reduced both the storage footprint and the query time, resulting in much lower costs.
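The presenter's code is not reproduced here, but a conversion of this kind can be sketched in a few lines of PySpark. The function below is an assumption-laden illustration, not the script shown in the session: it expects an existing `SparkSession`, delimited text input with a header row, and hypothetical S3 paths and partition column names.

```python
def convert_text_to_parquet(spark, source_path, target_path, delimiter=",", partition_by=None):
    """Read delimited text files and rewrite them as Snappy-compressed Parquet."""
    # Infer column types from the data; a fixed schema would be faster on large inputs.
    frame = spark.read.csv(source_path, sep=delimiter, header=True, inferSchema=True)
    writer = frame.write.mode("overwrite").option("compression", "snappy")
    if partition_by:
        # One output directory per distinct value of the partition columns.
        writer = writer.partitionBy(*partition_by)
    writer.parquet(target_path)
    return frame
```

On an EMR cluster, the session could be created with `SparkSession.builder.appName("to-parquet").getOrCreate()` and passed in as `spark`. Writing partitioned output lets a table defined over `target_path` use the same keys as partitions.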
Finally, the built-in integration with Amazon QuickSight allows you to visualize your data.
Amazon Athena Use Cases
During the session, Rahul Pathak presented two common use cases where Athena could be a game changer:
- Log storage and analysis
- Data warehouse for events
In such scenarios, the need to store gigabytes or petabytes of structured data can be a real problem. Accessing that data in a fast, easy, and secure way is even more difficult, painful, and time-consuming. Athena is focused on solving these problems by mixing together the power of Amazon S3 storage and the SQL query language. This allows you to operate on your data easily and without worrying about scaling. Indeed, you will get results within seconds, even on very large datasets.
What is Amazon Athena: pricing
Athena’s pricing is very simple: You pay only for the queries you run, at $5 per TB of data scanned from Amazon S3.
DDL statements (CREATE, ALTER, DROP), partitioning queries, and failed queries are completely free. If you cancel a query, you will be charged only for the data scanned up to that point. Of course, you can reduce costs by using compression, columnar formats, and partitions. With such techniques, Athena will have to scan less data from Amazon S3.
In practice, there is no charge directly related to computation itself, so you can always estimate the total cost purely based on the amount of data that you need to work with.
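As a rough worked example, the snippet below applies the $5 per TB price to the two dataset sizes mentioned earlier in this post (1 TB of text versus 130 GB of Parquet), assuming each query scans the whole dataset. It is only a sketch: it assumes decimal terabytes and ignores any per-query minimums or rounding rules that the official pricing may apply.

```python
PRICE_PER_TB_USD = 5.0   # price quoted in this post
BYTES_PER_TB = 10 ** 12  # assumption: decimal terabytes


def estimate_query_cost(bytes_scanned, price_per_tb=PRICE_PER_TB_USD):
    """Estimate the cost in USD of a query from the bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * price_per_tb


text_cost = estimate_query_cost(1 * 10 ** 12)      # full scan of 1 TB of text
parquet_cost = estimate_query_cost(130 * 10 ** 9)  # full scan of 130 GB of Parquet
print(f"text: ${text_cost:.2f}, parquet: ${parquet_cost:.2f}")
```

Under these assumptions, a full scan costs about $5.00 on the text data and about $0.65 on the Parquet data. Columnar formats also let a query read only the columns it needs, so the real difference is often larger.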
We are building a world that requires ever faster communication and where information has a key role in controlling markets, economies, and business activities. This requires us to be able to store and retrieve huge amounts of data. Whether you are launching a new product or iterating on an established one, this is something that you cannot ignore.
Data storage and data analysis can drive product outcomes for both startups and large companies. As a result, the availability of easy, fast, and cheap tools for managing data is crucial to service delivery and maintenance.
Will Amazon Athena play a big role in this transformation? Atlassian is already using Amazon Athena, and I’m pretty sure that the number of adopters will increase over the next few months. The official information sounds really promising so far, and such an interesting technology backed by AWS infrastructure cannot pass unnoticed.
Let us know what you like or dislike about Amazon Athena and how it will affect your next project. Many of you have brilliant ideas and application scenarios, and we can’t wait to hear about them.