Using AWS Glue for Athena Preparation Demo

This course explores how to visualize your costs in AWS and share them with others. We'll look at some of the tools you can use for this purpose, including the AWS Billing Service, AWS Cost and Usage Reports (CURs), AWS Glue, Amazon Athena, and Amazon QuickSight. You'll follow along with guided demonstrations from the AWS platform showing you how to use these services.

Learning Objectives

  • Learn how to visualize your AWS costs
  • Get a practical understanding of how to use the AWS Billing Service, AWS Cost and Usage Reports (CURs), AWS Glue, Amazon Athena, and Amazon QuickSight

Intended Audience

This course is intended for anyone who wants to understand the tools available for visualizing costs in AWS.


This course is beginner-friendly; however, to get the most out of it, it would be useful to have some knowledge of cost optimization theory, cloud and finance-related terminology, and a basic understanding of Amazon Athena, AWS Glue, Amazon QuickSight, and SQL commands.


Athena is a serverless service that allows you to analyze data stored directly in data sources like Amazon S3 using standard SQL. To use Athena, you need to have your expenses recorded in a Cost and Usage Report (CUR) and available in the Parquet file format. Athena needs a data source to be able to process any data. For us, this means that we need to prepare the data for the service first in order to use it properly.

The whole process can be wrapped up in three simple steps. Step number one, prepare the CUR data for Athena by using AWS Glue. Step two, confirm the data tables in Athena. Step three, run queries in Athena according to your needs.

So let's have a look at our data source. This should already be set up by default, but if for some reason it is not for you, this is how you can manually connect Athena to the data source. You basically choose the location, which is the S3 service, as this is where our CUR is located. We will leave it on the AWS Glue catalog and hit the next button. We want to use AWS Glue to automatically retrieve all the information needed, so we select this option and simply press connect to AWS Glue, which will lead us to the service itself.

In Glue, we need to prepare a database which is going to be used by Athena and other services. For this, we need to set up a crawler that will go through our CUR and get the data we need. So, first step, we basically want to use the AWS Glue crawler to prepare a database. For this, we first set up the crawler. Let's type in a name, for example, master crawler. There are some optional settings, but we don't really need them right now, so let's just move on.

We have to make sure that we're excluding some file formats so the crawler isn't busy with formats that don't really matter to us. And once this is done, we click on the next button. Oops, I forgot to select the S3 bucket, but we can do this real quick by clicking on the icon on the side and selecting the correct path. Click next and next again.

Next is to choose an IAM role and we will create a new one. And after that, we will choose the frequency, meaning how often the crawler will be used or run. In our example, the start time doesn't really matter. So we will just choose anything.

Next is to add the database. Here, we will just enter a name, and after that we're already finished. This is what my crawler settings look like. Then we simply click finish. The next thing is to run the crawler, and here we can watch the crawler create one or more tables. We can check our tables in the database or directly in the Tables tab right here. The database we just created has the name costmaster, and here we have our table og_test. It's the same name as in our CUR.

We can see that the crawler got us all the information we're interested in and excluded all the files ending with .json, .yml, and so on. The schema below gives us an impression of what columns we have in the table. But to make it actually readable and clear, we need to run additional queries in Athena. So let's do that now.

Okay, so after successfully verifying our CUR files and our S3 bucket, after using AWS Glue to crawl and process our CUR bucket and build a database from it, and after successfully extracting our tables to Athena, we are now ready to start actually using Athena.

So the first thing is to load your tables so that they will look just like mine here, with this partitioned tag. To do this, you simply click on the three-dot menu and click on Load partitions. In my case, this is already done. But for you, before you actually start using the table, you should do this. All right.

After this is done, we can use the database and run queries to analyze usage and cost in the form of a table. For this, we're going to use SQL to run queries according to our needs. I have prepared a few queries beforehand. So let me switch real quick to my scripts and I will show you what you can do. All right.

So what we're doing is using SQL to show all the content in the table. So just follow along with me. We select all from the costmaster database, from the og_test table, and we want to limit it to 10 entries, so it basically won't show us more than 10 results. Okay, then we simply run the query.
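As a reference, the show-all query from the demo looks roughly like this, using the costmaster database and og_test table names from this walkthrough:

```sql
-- Show the raw contents of the CUR table, capped at 10 rows
SELECT *
FROM costmaster.og_test
LIMIT 10;
```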

And now you can basically see the content of the table with all the fields. You can see there is a lot of different data, and it's rather hard to get an overview. With the data in the table like it is now, we can't really work with it, so I will show you some more specific queries that are more useful for real-life usage. But what could be interesting for now is that once you have created a query like this one, you may want to save it so you can reuse it later on. To do this, you simply click on save as and give it a name. For example, we will name it show all, the description will be show all content, and then we save it.

You see here query saved, and you can find it again here in the saved queries tab, under the name show all. And all the other queries I prepared beforehand you can see them here. And I will explain every single one to you so let's carry on with them. Let's start with something easy.

So I want to have a look at all the services that have been used in a certain time. So what I'm going to type in is the line item product code. And I want to see it as service, from our costmaster database and og_test table. And once again, I want to limit it to 10 entries. And now I can see all the services that have been recorded in the CUR.
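Sketched out, that query might look like the following. The column name line_item_product_code is an assumption based on how Glue typically maps the CUR's lineItem/ProductCode field, so check it against your own table's schema:

```sql
-- List the product codes recorded in the CUR, shown as "service"
SELECT line_item_product_code AS service
FROM costmaster.og_test
LIMIT 10;
```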

Additionally, I can say I want to group them by their names, so I won't have to deal with any duplicates. And now I don't have any duplicates left and I only see the services as they appear in the CUR file. Next, I want to show the top 10 cost drivers. For this, I just leave it as services and, in addition, I will add a sum, which will basically sum up all the unblended costs of each particular service. And additionally, I also want to round it to two digits. Right, like this. Okay.

So I'm still using the table og_test, and I still want to group them by their item product code, but in addition, I also want to order them by service and I want it to be descending. All right. So there's my SQL script, my query. So let's run it and see what we get.

Okay. So I can tell that from the CUR file, my biggest cost driver is the AWS Cost Explorer worth $6.90 and the top 10 of my cost drivers are all these ones here. But I'm seeing a little error here because I don't want this column to be called service, but instead I want to call it costs to be more accurate. Right okay. And this is how it should look for you. This is a pretty basic query, I would say.
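Putting the grouping, sum, rounding, renaming, and ordering together, the top-10 cost driver query could look something like this (column names assumed from the usual CUR-to-Glue mapping):

```sql
-- Top 10 cost drivers, summed and rounded to two digits
SELECT line_item_product_code AS service,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost
FROM costmaster.og_test
GROUP BY line_item_product_code
ORDER BY cost DESC
LIMIT 10;
```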

So let's now move on to something more interesting. Now let's add another column. I want to see the item description of my items that I have here. So I simply use line item line item description and paste it right behind my line item product code. I do this for the select command and I will do the same for group by. The rest stays the same. And let's run the query. And here we can now see the description of each item. This is especially interesting if you were using the same service, but with different instances or a different setting like here.
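With the description column added, a sketch of the query might be (again assuming the typical crawler column names, including the doubled line_item_line_item_description):

```sql
-- Cost per service and item description
SELECT line_item_product_code AS service,
       line_item_line_item_description AS description,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost
FROM costmaster.og_test
GROUP BY line_item_product_code, line_item_line_item_description
ORDER BY cost DESC
LIMIT 10;
```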

So this is an interesting real-life example that could be useful for you. You may also be interested to know about any costs that have been caused by a specific account. For this purpose, I will use the field line item usage account ID, and I will leave the rounding and the sum of the line item unblended costs as they are. But I have to replace the group by, so I don't have any duplicates, and everything else will stay as it is.

So in this case, I know I have three accounts, and as expected, I can see the account numbers and the cost associated with each one. All right.
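A possible shape for this per-account query, with the usual caveat that the column names depend on how the crawler mapped your CUR:

```sql
-- Total unblended cost per linked account
SELECT line_item_usage_account_id AS account,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost
FROM costmaster.og_test
GROUP BY line_item_usage_account_id
ORDER BY cost DESC;
```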

So for our next query, I want to have a look at EC2 instances. For this, I'm going to use the where command and compare my line item product code so it matches Amazon EC2, which is my keyword here. I leave everything else as it was, and now you can see I have all my EC2 items, but with different types of instances, as you can tell by their line item descriptions. And it even shows us the cost per hour, which can be useful for showback or chargeback purposes later on.
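The EC2 filter could be sketched like this; whether you match the product code with an exact 'AmazonEC2' comparison or a LIKE pattern depends on how the values appear in your CUR:

```sql
-- EC2-only costs, broken down by line item description
SELECT line_item_line_item_description AS description,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost
FROM costmaster.og_test
WHERE line_item_product_code = 'AmazonEC2'
GROUP BY line_item_line_item_description
ORDER BY cost DESC
LIMIT 10;
```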

The next query is a little bit more complicated and longer as well. What is this query? I want to view the hourly usage and costs of my EC2 instances. So again, I'll be starting by selecting my line item usage type, and I will add three columns this time, which will be usage, cost, and hourly rate, which is the average of the line item unblended costs divided by the line item usage amount. Just watch me type the script and try to understand what I'm doing here.

Again, we are taking from our costmaster.og_test. And to focus only on the EC2 service, we add a where clause: where line item product code is like Amazon EC2. In addition, the line item usage type should contain usage. We will group this by line item usage type, and we're going to reuse the line item usage type for the order. Let's cap it at 10 results. Oops, I made a little mistake there; instead of a dash, of course, I wanted an underscore right here. Okay.
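A rough reconstruction of this longer query. The '%Usage%' pattern and the column names are assumptions, and note that averaging cost divided by usage would fail on rows where the usage amount is zero, so those rows are filtered out here:

```sql
-- Hourly usage, cost, and average hourly rate per EC2 usage type
SELECT line_item_usage_type,
       ROUND(SUM(line_item_usage_amount), 2) AS usage_amount,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost,
       ROUND(AVG(line_item_unblended_cost / line_item_usage_amount), 4) AS hourly_rate
FROM costmaster.og_test
WHERE line_item_product_code LIKE '%AmazonEC2%'
  AND line_item_usage_type LIKE '%Usage%'
  AND line_item_usage_amount > 0
GROUP BY line_item_usage_type
ORDER BY usage_amount DESC
LIMIT 10;
```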

So we can see that we can sort by usage, by cost, or even by hourly rate. And if we sort by usage, we can tell that we have been using the T2 micro instance family for more than 900 hours and we've paid exactly $0 for it. I wonder, can you guess why that is? Well, this is because we never reached our limit of the free tier. So know that if you want to use EC2, you can comfortably use a large number of hours without paying a single cent for it, as long as you are only using resources that are covered by the free tier.

Now we've learned that with Amazon Athena, we can use a variety of SQL queries to sort out any details from CUR files that we might need. So, lastly, I want to give you a few ideas in case you want to present your files to your teammates or anyone else.

So another helpful idea would be to use queries to filter between tagged and untagged EC2 resources. The second point is to find out which accounts are actually using your available reserved instances. This will make it easier for you to allocate costs across teams more evenly. The next one is a very popular low-hanging fruit, which is to find reserved instances that are actually unused but are still reserved and therefore still cost you money on a regular basis.
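As one sketch for the first idea, finding untagged EC2 resources: this assumes you have activated a cost-allocation tag (a hypothetical "team" tag here, which the CUR would expose as resource_tags_user_team) and enabled resource IDs in your report, so treat both column names as placeholders for your own setup:

```sql
-- Hypothetical: EC2 cost per resource where the "team" tag is missing
SELECT line_item_resource_id AS resource,
       ROUND(SUM(line_item_unblended_cost), 2) AS cost
FROM costmaster.og_test
WHERE line_item_product_code = 'AmazonEC2'
  AND (resource_tags_user_team IS NULL OR resource_tags_user_team = '')
GROUP BY line_item_resource_id
ORDER BY cost DESC;
```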

And lastly, you can use a query to find out which instance family of EC2 instances is used the most and whether you could possibly optimize it to a newer instance family. So in any case, Athena gives you a lot of deeper insights into your CUR data. So go ahead and try queries that fit your needs.

About the Author

Heinrich Marks is a FinOps Consultant at kreuzwerker Frankfurt GmbH, Germany. His expertise lies in AWS cost optimization and FinOps methodology at the enterprise level. 
At kreuzwerker, we focus on AWS, Software Engineering, and Cloud Financial Management. We've faced many cases with our customers where costs in the cloud got out of hand and lost touch with the actual business values. This made us work harder on solutions and find ways to master financial management in the cloud. 
Over time, we have already been able to save our customers many millions of dollars. Moreover, we are also proud partners of the FinOps Foundation and the leading professionals in Germany. Today, we want to share this knowledge with you and spread more awareness on the topic!