The Cloud Academy team tried to catch every detail of this amazing week-long conference. We ran from one session to another, got lost in the maze of booths, and met many enthusiastic customers at the re:Invent pub crawl and at re:Play, one of the biggest, most fun parties that I’ve ever attended.
What is Amazon Athena: a complete overview
Amazon Athena is probably the most promising of the services announced last week in Las Vegas. In fact, big data was one of the main topics discussed at re:Invent 2016, together with AI and IoT. We gathered a lot of information on Athena at the special session led by Rahul Pathak, general manager of Amazon EMR at AWS. In this post, I will cover Athena’s main features, use cases, and pricing details.
What is Amazon Athena? It is an interactive query service that makes it easy to analyze data directly on Amazon S3 using standard SQL. This means that you can store structured data on S3 and query it as you would with an SQL database. Athena is serverless: there is no infrastructure to manage, no setup, and no servers or data warehouses to maintain. You can just create a table that points to your data on S3 and start querying.
As mentioned during the session, Athena complements Amazon Redshift and Amazon EMR.
Amazon Athena Features
Athena is backed by Presto, an open source distributed SQL query engine that allows you to run interactive analytic queries against data sources of all sizes, ranging from gigabytes to petabytes. CREATE TABLE statements and other DDL (Data Definition Language) are written in Apache Hive syntax. Hive is meant to facilitate reading, writing, and managing large, distributed datasets. It supports SQL, but also adds concepts such as external tables and data partitioning. Your metadata (table definitions, column names, and so on) is stored in the Athena metadata store.
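To make the Hive DDL idea more concrete, here is a minimal sketch that builds a CREATE EXTERNAL TABLE statement for comma-separated files on S3. The table name, columns, partition keys, and bucket are all hypothetical, and the helper function is only an illustration, not part of any Athena tooling.

```python
def create_external_table_ddl(table, columns, partition_keys, location):
    """Build a Hive-style CREATE EXTERNAL TABLE statement for CSV data on S3."""
    cols = ",\n  ".join(f"`{name}` {hive_type}" for name, hive_type in columns)
    parts = ", ".join(f"`{name}` {hive_type}" for name, hive_type in partition_keys)
    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {table} (\n  {cols}\n)\n"
        f"PARTITIONED BY ({parts})\n"
        "ROW FORMAT DELIMITED FIELDS TERMINATED BY ','\n"
        f"LOCATION '{location}'"
    )


# Hypothetical table, columns, and bucket, used only for illustration.
ddl = create_external_table_ddl(
    table="web_logs",
    columns=[("request_time", "string"), ("status", "int"), ("bytes_sent", "bigint")],
    partition_keys=[("year", "string"), ("month", "string"), ("day", "string")],
    location="s3://example-bucket/web-logs/",
)
print(ddl)
```

The table is external, so the statement only registers metadata: the files stay where they are on S3, and dropping the table does not delete them.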
As with any standard DBMS (Database Management System), Athena supports complex joins, nested queries, and window functions. Complex data types, such as arrays and structs, are also supported. You can partition data by any key, including custom date and time keys. Of course, you can connect to Athena with your favourite SQL client.
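As an illustration of date-based partitioning, the sketch below derives a Hive-style `key=value` S3 prefix from a date and builds the matching ALTER TABLE ... ADD PARTITION statement. The table name, bucket, and the year/month/day key layout are assumptions chosen for the example. Any key scheme would work the same way.

```python
from datetime import date


def partition_location(base, day):
    """Return a Hive-style prefix such as .../year=2016/month=12/day=01/."""
    return (
        f"{base.rstrip('/')}/year={day.year:04d}"
        f"/month={day.month:02d}/day={day.day:02d}/"
    )


def add_partition_ddl(table, base, day):
    """Build the DDL that registers one day's prefix as a table partition."""
    return (
        f"ALTER TABLE {table} ADD IF NOT EXISTS "
        f"PARTITION (year='{day.year:04d}', month='{day.month:02d}', "
        f"day='{day.day:02d}') "
        f"LOCATION '{partition_location(base, day)}'"
    )


# Hypothetical table and bucket names.
stmt = add_partition_ddl("web_logs", "s3://example-bucket/web-logs", date(2016, 12, 1))
print(stmt)
```

A query that filters on `year`, `month`, and `day` then only needs to read the objects under the matching prefixes.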
You can store data in the form of objects with several file formats:
- Text files, CSV, raw logs
- Apache web logs
- Compressed files
- Columnar formats, such as Apache Parquet or Apache ORC
You may also want to use Hive CTAS (CREATE TABLE AS SELECT) or Spark to convert your data to the ORC or Parquet format.
As soon as you perform a query, you will obtain a data stream directly from Amazon S3, just as if you were querying a real SQL database. Queries can be executed either through the API or from the AWS Console. In the AWS Console, you will also see the query running time and the amount of data scanned, in bytes.
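For the API route, the following sketch shows one way to start a query and wait for it to finish. It assumes a client object that exposes the `start_query_execution` and `get_query_execution` calls of the boto3 Athena client. The `run_query` helper, the polling interval, and the names passed to it are hypothetical.

```python
import time

TERMINAL_STATES = ("SUCCEEDED", "FAILED", "CANCELLED")


def run_query(client, sql, database, output_location, poll_seconds=1.0):
    """Start a query, poll until it reaches a terminal state, and return
    (state, bytes scanned, engine execution time in milliseconds)."""
    started = client.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    query_id = started["QueryExecutionId"]
    while True:
        response = client.get_query_execution(QueryExecutionId=query_id)
        execution = response["QueryExecution"]
        state = execution["Status"]["State"]
        if state in TERMINAL_STATES:
            stats = execution.get("Statistics", {})
            return (
                state,
                stats.get("DataScannedInBytes", 0),
                stats.get("EngineExecutionTimeInMillis", 0),
            )
        time.sleep(poll_seconds)
```

With boto3 installed and AWS credentials configured, you could pass `boto3.client("athena")` as the client. The returned statistics correspond to the running time and scanned bytes that the console displays.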
With Amazon Athena, you won’t have to worry about scaling, performance, and maintenance. You will have enough compute resources to get fast, interactive query performance. Athena will automatically execute queries in parallel over petabytes of data. Therefore, most results will come back within seconds. This is made possible because Athena uses warm compute pools across multiple Availability Zones.
As Rahul Pathak pointed out, Amazon Athena is really fast:
- Athena is tuned for performance.
- Queries are automatically parallelized.
- Results stream directly to the console.
- You can store query results in Amazon S3.
In my personal opinion, performance is still an open concern, as no benchmarks for big datasets have been publicly released, although we saw very interesting performance results during the session. The presenter used the Apache Parquet format: with just 20 lines of PySpark code running on EMR, 1 TB of text data was converted into 130 GB of Parquet data. This approach reduced both the storage footprint and the query time, resulting in much lower costs.
Finally, the built-in integration with Amazon QuickSight allows you to visualize your data.
Amazon Athena Use Cases
During the session, Rahul Pathak presented two common use cases where Athena could be a game changer:
- Log storage and analysis
- Data warehouse for events
In such scenarios, storing gigabytes to petabytes of structured data can be a real problem. Accessing that data in a fast, easy, and secure way is even more difficult, painful, and time-consuming. Athena focuses on solving these problems by combining the power of Amazon S3 storage with the SQL query language. This allows you to operate on your data easily and without worrying about scaling. Indeed, you will get results within seconds, even on very large datasets.
What is Amazon Athena: pricing
Athena’s pricing is very simple: You pay only for the queries you run and you will be charged $5 per TB of scanned data from Amazon S3.
DDL statements (CREATE, ALTER, DROP), partitioning queries, and failed queries are completely free. If you cancel a query, you will be charged only for the data scanned up to that point. Of course, you can reduce costs by using compression, columnar formats, and partitions. With such techniques, Athena will have to scan less data from Amazon S3.
In practice, there is no charge directly related to computation itself, so you can always estimate the total cost purely based on the amount of data that you need to work with.
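As a rough worked example of this pricing model, the sketch below estimates the cost of a full scan at $5 per TB. It uses the 1 TB of text and 130 GB of Parquet figures from the session. The binary definition of a terabyte is an assumption, and the calculation ignores any per-query rounding or minimum charge that AWS may apply.

```python
PRICE_PER_TB_USD = 5.0     # price per TB scanned, as quoted in this post
BYTES_PER_GB = 1024 ** 3   # assumption: binary units
BYTES_PER_TB = 1024 ** 4


def query_cost_usd(bytes_scanned):
    """Estimate the cost of a query from the number of bytes it scans."""
    return bytes_scanned / BYTES_PER_TB * PRICE_PER_TB_USD


text_scan = 1 * BYTES_PER_TB        # full scan of the original text data
parquet_scan = 130 * BYTES_PER_GB   # full scan of the converted Parquet data

print(f"Text:    ${query_cost_usd(text_scan):.2f}")     # Text:    $5.00
print(f"Parquet: ${query_cost_usd(parquet_scan):.2f}")  # Parquet: $0.63
```

A real query against a columnar format would typically read only the columns it needs, so the scanned volume, and therefore the cost, may be lower still.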
We are building a world that requires ever faster communication and where information has a key role in controlling markets, economies, and business activities. This requires us to be able to store and retrieve huge amounts of data. Whether you are launching a new product or iterating on an established one, this is something that you cannot ignore.
Data storage and data analysis can drive product outcomes for both startups and large companies. As a result, the availability of easy, fast, and cheap tools for managing data is crucial to delivering and maintaining services.
Will Amazon Athena play a big role in this transformation? Atlassian is already using Amazon Athena, and I’m pretty sure that the number of adopters will increase over the next few months. The official information sounds really promising so far, and such an interesting technology backed by AWS infrastructure cannot pass unnoticed.
Let us know what you like or dislike about Amazon Athena and how it will affect your next project. Many of you have brilliant ideas and application scenarios, and we can’t wait to hear about them.