Introduction to Prometheus
In this course, we take a look at the Tsar of monitoring tools - Prometheus. Prometheus was the second hosted project in the Cloud Native Computing Foundation, right next to the container orchestration software - Kubernetes. Prometheus is an open-source systems monitoring and alerting toolkit with additional capabilities in service discovery.
If you have any feedback relating to this course, feel free to contact us at firstname.lastname@example.org.
- Understand and define the Prometheus monitoring tool
- Learn the core features of the tool
- Break down and understand the core components of the service
- Learn how to set up node exporters and a Prometheus monitor
- DevOps engineers, site reliability engineers, and cloud engineers
- Anyone looking to up their monitoring expertise with an open-source monitoring tool
To get the most out of this course, you should have some familiarity with monitoring tools. Experience using a Terminal, Git, Bash, or Shell would be beneficial but not essential.
Welcome to the Components Overview. I hope you kept that previous diagram in mind, because we're going to be jumping straight into the PushGateway. Let's start things off with a scenario. Say you have an ever-growing abundance of jobs that run automated tasks for you. Your boss would like to evaluate their efficacy and how they're doing, to see if the automation is worth it. Perfect - you know just the component to use to get these metrics. But hold up just one minute: before we throw the PushGateway at every job we have to capture metrics from, let's have a quick pop quiz on some best practices for the PushGateway.
So what is the PushGateway best used for? Is it A, capturing system-level info of instances? B, capturing service-level jobs unrelated to systems? Or C, capturing the available inventory of the latest graphics cards near our local stores? Joking aside, the correct answer is B.
The reason the PushGateway is best suited for capturing service-level jobs is that an intermediary acting as the pusher of metrics can also become a single point of failure for those metrics. If we built our critical alerting on system-level information and the PushGateway failed, so would our alerting. That's why the PushGateway should act only as a gateway for short-lived, service-level jobs.
Furthermore, the native instance health monitoring that Prometheus generates is not available for the jobs behind a PushGateway. That's because those jobs are intended to push their metrics and then exit. This is another reason answer B is correct.
Metrics pushed to the PushGateway are also never automatically removed, unlike metrics Prometheus scrapes directly. They persist after the service-level jobs have exited. Our other instances have their metrics removed when they are deleted, and this is typically what we want. But keep in mind that metrics at the PushGateway are not removed this way, so they will be exposed to Prometheus forever until they are manually deleted via the PushGateway's API.
Essentially, the PushGateway acts as a cache for these jobs, which the main Prometheus server then scrapes. Great, we now know what the PushGateway is used for, so let's move on to the main Prometheus server in detail. I'd like to start talking about the Prometheus server diagram in front of us. It looks monolithic in nature, but it's actually comprised of several components that make it what it is. Let's start with the retrieval worker.
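As a concrete sketch, here's how a short-lived job might push metrics to the PushGateway over its HTTP API. This assumes a PushGateway listening on localhost:9091, and the metric names and values are purely illustrative:

```shell
#!/bin/sh
# Sketch: pushing metrics for a short-lived job to a PushGateway
# (assumes a PushGateway at localhost:9091; metric names are illustrative).

PUSHGATEWAY_URL="http://localhost:9091"
JOB_NAME="nightly_backup"

# Metrics in the Prometheus exposition format: "<name> <value>"
payload="backup_duration_seconds 42.7
backup_records_total 15000"

# Push the metrics; they are grouped under the job name in the URL path.
push_metrics() {
  printf '%s\n' "$payload" | curl --data-binary @- "$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
}

# Pushed metrics persist until explicitly deleted via the same API path.
delete_metrics() {
  curl -X DELETE "$PUSHGATEWAY_URL/metrics/job/$JOB_NAME"
}

# Echo the payload so the exposition format can be inspected.
printf '%s\n' "$payload"
```

With a live PushGateway you would call push_metrics, let Prometheus scrape the gateway, and later call delete_metrics to clean up once the metrics are no longer needed.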
The retrieval worker is what actually goes out and scrapes our endpoints for their metrics. Then we have our storage database, which writes to our local node storage. It's also known as a time series database, and its design was influenced by the Gorilla time series database.
Lastly, we have our HTTP server. This server is for our dashboarding and our API, and it can plug into our alert manager. Its responses are returned in JSON. But how is Prometheus configured? Let's check out an example YAML configuration now. We're going to break up this YAML configuration into four bits, starting with the global configuration.
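To make the JSON API concrete, here's a minimal sketch of querying the HTTP server, assuming it's running at localhost:9090; the query expression is just an example:

```shell
#!/bin/sh
# Sketch: calling the Prometheus HTTP API (assumes a server at localhost:9090).
PROM_URL="http://localhost:9090"
QUERY="up"   # any PromQL expression works here

# Build the instant-query endpoint URL.
API_CALL="$PROM_URL/api/v1/query?query=$QUERY"
printf '%s\n' "$API_CALL"

# With a live server you would fetch it (commented out here), and the
# response comes back as JSON with "status" and "data" fields:
# curl -s "$API_CALL"
```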
Here we have scrape_interval, scrape_timeout, and evaluation_interval, which sets how often the rules we've defined should be evaluated, as well as external_labels. These external labels are what other servers see when this Prometheus server communicates with them.
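A global section along these lines might look as follows; the values here are illustrative, not necessarily the ones used in the course:

```yaml
global:
  scrape_interval: 15s      # how often targets are scraped by default
  scrape_timeout: 10s       # how long a scrape may run before it fails
  evaluation_interval: 1m   # how often rule files are evaluated
  external_labels:
    monitor: 'example-monitor'   # label other servers see from this Prometheus
```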
Next, we have our rule files. Our rule files will be evaluated every minute and contain the rules we've seen earlier in the course. From there, we have our statically defined scrape configs, where we can specify the specific jobs that we would like metrics to be collected from.
Here, the first scrape config is just monitoring the Prometheus server itself every five seconds on the /metrics path, and we can see that it's located on localhost at port 9090. The second scrape config specifically targets three node exporters. We're going to get into further information about node exporters later, so don't worry about knowing them right now.
The configuration is almost the same, except the targets are on ports 8080 through 8082. Combining them together, we get the full configuration here. And what's extremely nice is that all of this configuration is visible in the web UI on our live Prometheus server. So let's explore that now.
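Putting the pieces together, a full configuration along these lines might look like the following sketch; the file name and intervals are illustrative, following the ports mentioned above:

```yaml
global:
  scrape_interval: 15s
  evaluation_interval: 1m

rule_files:
  - 'rules.yml'   # evaluated every evaluation_interval

scrape_configs:
  # The Prometheus server scraping itself every five seconds
  - job_name: 'prometheus'
    scrape_interval: 5s
    metrics_path: /metrics
    static_configs:
      - targets: ['localhost:9090']

  # The three node exporters on ports 8080 through 8082
  - job_name: 'node'
    scrape_interval: 5s
    static_configs:
      - targets: ['localhost:8080', 'localhost:8081', 'localhost:8082']
```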
We're going to be taking a quick tour of the Prometheus web UI now, so that you'll feel more comfortable when you explore it on your own at the end of this course. Let's jump onto the landing page. This landing page has all we need. This is it. There's no more.
Here, we have our top-level menu items of Alerts, Graph, and Status. Then, we have our PromQL expression bar, where we can execute PromQL statements against the metrics that we have collected. And lastly, we have our graph and console. Graph will give us a visual representation of the metrics that we're interested in, and console will give us a raw, plain-text format.
Let's take a look at some raw metrics that this Prometheus server has exposed for itself. Here's the raw, plain-text format that Prometheus is exposing on its /metrics URI. The Prometheus server exposes this endpoint for itself, which means that it can scrape itself and collect information about its own instance.
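The exposition format on /metrics looks like the lines below. The metric shown is one the server really exports about itself, but the label values and sample values here are made up for illustration:

```
# HELP prometheus_http_requests_total Counter of HTTP requests.
# TYPE prometheus_http_requests_total counter
prometheus_http_requests_total{code="200",handler="/metrics"} 57
prometheus_http_requests_total{code="200",handler="/graph"} 3
```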
Let's move on to some PromQL examples that we're going to see on the dashboard, starting with go_info. "go_info" just collects information about the Go version that's being run on all of our instances. You'll see that we have four elements: one for each node exporter, and one for the Prometheus server. You'll also see the labels for group, instance, job, and version, which together tell us the Go version, the job that we're calling, and the instance location.
If we have different metrics with the same dimensional labels, you can apply binary operators to them. For example, one such expression returns the total Go memory usage in mebibytes for these instances. You can also see the discovered labels and target labels under Service Discovery, which is under the Status page. We've seen this before, but these are for the three extra node exporters.
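One way to write such an expression, assuming the standard Go client metric go_memstats_alloc_bytes is present on every instance, is:

```promql
# Allocated Go heap per instance, converted from bytes to mebibytes
go_memstats_alloc_bytes / 1024 / 1024
```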
Here is the YAML configuration that we've previously discussed, shown under the Status page's Configuration view. It's a nice, easy way to see what is running on your live Prometheus server, and it's easy to copy to the clipboard. The second-to-last component of Prometheus is the alert manager, and it does exactly what it says: it handles alerts sent by client applications and routes them to the correct receiver, whether that be email, PagerDuty, or another application of your choice.
The alert manager has ways to group alerts into a single notification, so you're not alerted every single time an instance may be failing. It also has the ability to prevent alerts from triggering if another alert is already firing. For example, if alert X is firing, don't trigger alert Y. This could be useful if an entire cluster is down: we don't need to know about all the specific instances within that cluster and their alerts.
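The grouping and inhibition behavior just described lives in the Alertmanager's own YAML configuration. A minimal sketch, with illustrative receiver names, alert names, and label values:

```yaml
route:
  receiver: ops-email                 # default receiver (illustrative name)
  group_by: ['cluster', 'alertname']  # batch related alerts into one notification
  group_wait: 30s                     # wait before sending a group's first notification

receivers:
  - name: ops-email
    email_configs:
      - to: 'ops@example.com'

# If a whole cluster is down, suppress the per-instance alerts inside it.
inhibit_rules:
  - source_match:
      alertname: ClusterDown
    target_match:
      alertname: InstanceDown
    equal: ['cluster']
```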
There's also the ability to mute alerts for a given period of time, and there's high availability with multi-cluster support through command-line flags. Exporters are the cream of the crop for Prometheus. Exporters, as their name suggests, are generally open-source. They're tailored for Prometheus, and here's how they work: they grab metrics from the system they monitor, convert those metrics into the Prometheus format using a client library such as Go, Java, Python, or Ruby, and then typically start a web server for themselves to be scraped from.
In our upcoming example, we're going to see that the node exporters expose their metrics on the same /metrics URI that Prometheus uses. There's a very extensive list of exporters already created, which makes it easy to integrate Prometheus with your deployment.
Okay, that's enough lecturing. Let's actually see an example of Prometheus live and in action. If you want to follow along with me, go ahead and clone the repo that's been provided for this course and run the letsgo.bash script. This will get your Prometheus environment up and running with the Prometheus server and three node exporters, with those node exporters being scraped by Prometheus.
Jonathan Lewey is a DevOps Content Creator at Cloud Academy. With experience in the networking and operations side of the traditional information technology industry, he has also led the creation of applications for corporate integrations and served as a Cloud Engineer supporting developer teams. Jonathan has a number of specialties, including the Cisco Certified Network Associate (R&S / Sec), AWS Developer Associate, and AWS Solutions Architect certifications, and he is certified in Project Management.