In this course, I’ll start with an overview of Azure Monitor metrics and logs and how to configure them. Next, I’ll show you how to install virtual machine agents that send additional data from VMs to Azure Monitor. Finally, I’ll show you how to create alerts that will notify you if potential problems are detected in your Azure resources.
Learning Objectives
- Configure Azure Monitor metrics and logs
- Install Azure Virtual Machine monitoring agents
- Configure Azure Monitor alerts
Intended Audience
- Azure administrators, architects, security engineers, developers, and data engineers
Prerequisites
- Basic knowledge of Azure Storage, Azure Virtual Machines, and role-based access control
Suppose you wanted to be notified if a virtual machine’s CPU usage went above 80%. You could easily get Azure Monitor to do this by creating an alert.
Go to a virtual machine, and click “Alerts” in the menu. Then click “Create alert rule”. An alert rule consists of three parts: a resource, a monitor condition, and an action group. In our example, the resource is the virtual machine (which is called “monitordemo” in my case), the monitor condition is “if CPU percentage is greater than 80%”, and the action group says to notify you by email.
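To picture how the three parts fit together, here is a minimal sketch of an alert rule as a small data model. The class and field names are hypothetical (they are not the Azure SDK's types), and the email address is a placeholder.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Condition:
    signal: str        # e.g. "Percentage CPU"
    operator: str      # e.g. "GreaterThan"
    threshold: float   # e.g. 80 (percent)


@dataclass
class ActionGroup:
    name: str
    emails: List[str] = field(default_factory=list)


@dataclass
class AlertRule:
    resource: str              # what is being monitored
    condition: Condition       # when the alert fires
    action_group: ActionGroup  # who or what gets notified


# The example from this demo, expressed with the hypothetical classes above.
rule = AlertRule(
    resource="monitordemo",
    condition=Condition("Percentage CPU", "GreaterThan", 80),
    action_group=ActionGroup("demo", ["admin@example.com"]),
)
print(rule.resource, rule.condition.signal, rule.action_group.name)
```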
Since we’re creating this alert rule from the virtual machine, the resource has already been selected. If we were creating the alert rule from Azure Monitor, then we would have needed to specify the VM as the resource.
Now we’re creating the monitor condition. It’s asking us which signal to use in the condition, and it gives us a list of metrics and log entries we can get from this virtual machine. If we type “cpu” in the search field, it narrows the list down to three signals. “Percentage CPU” is the one we want. Now we need to specify all of the details of the condition.
First, we have to say whether we want the threshold to be static or dynamic. Since we want to set it to 80%, it should be static because the threshold that will trigger the alert will always be 80%. A dynamic threshold, on the other hand, is much more sophisticated. It uses artificial intelligence to determine the appropriate threshold based on the past behavior of the VM. And this threshold value can change over time. We’ll leave it set to “Static”.
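To make the difference concrete, here is a toy comparison. The static check always uses the fixed 80% value. The "dynamic" function is only a stand-in: it derives a threshold from past samples as the mean plus two standard deviations, which is an assumption made for illustration and not the model Azure Monitor actually uses. The function names and sample values are hypothetical.

```python
from statistics import mean, pstdev


def static_threshold() -> float:
    # A static threshold never changes, regardless of history.
    return 80.0


def toy_dynamic_threshold(history, k: float = 2.0) -> float:
    # Stand-in only: mean + k standard deviations of past values.
    # Azure Monitor's real dynamic thresholds use a different, learned model.
    return mean(history) + k * pstdev(history)


quiet_vm = [10, 12, 11, 9, 13, 10, 12]   # hypothetical CPU history (%)
busy_vm = [60, 65, 62, 70, 68, 64, 66]

for name, history in (("quiet_vm", quiet_vm), ("busy_vm", busy_vm)):
    print(name, static_threshold(), round(toy_dynamic_threshold(history), 1))
```

The static value is the same for both VMs, while the toy dynamic value follows each VM's own history, so a normally quiet VM gets a much lower threshold than a normally busy one.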
We’ll come back to the aggregation type in a minute. The operator should be “Greater than”, and the threshold value should be 80%.
In the “When to evaluate” section, it says to check every 1 minute, which means the condition is evaluated once a minute. The lookback period is set to 5 minutes, so each evaluation considers the data from the previous 5 minutes. In other words, every minute, it checks whether the threshold was exceeded over the last 5 minutes.
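This schedule can be sketched as a sliding window: once per minute, the rule looks at the five most recent one-minute samples. This is a simplified model, assuming one CPU sample per minute and an average over the window. The function name and the sample data are hypothetical.

```python
def evaluate_each_minute(samples, lookback=5, threshold=80.0):
    """Return (minute, window_average, fired) for each evaluation.

    Assumes one sample per minute. Evaluation starts once a full
    lookback window of samples is available.
    """
    results = []
    for minute in range(lookback, len(samples) + 1):
        window = samples[minute - lookback:minute]   # last `lookback` samples
        avg = sum(window) / len(window)
        results.append((minute, avg, avg > threshold))
    return results


cpu = [50, 60, 85, 90, 95, 92, 88, 40, 30]   # hypothetical CPU % per minute
for minute, avg, fired in evaluate_each_minute(cpu):
    print(f"minute {minute}: avg={avg:.1f} fired={fired}")
```

Each evaluation reuses four of the previous window's samples, which is why the condition can stay true for a few minutes after the CPU has already dropped.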
This is where the aggregation type becomes important. If we choose “Average”, then it will check whether the average CPU percentage over a 5-minute period was more than 80%. So, if the CPU was at 80% for one minute and at 70% for the rest of the 5-minute period, then that wouldn’t trigger the alert because the average over that period would only be 72%.
However, if we set the aggregation type to “Maximum”, then the alert would be triggered if the CPU was over 80% at any time during the 5-minute period. To choose between the two types, you have to decide when you should be notified. If the VM’s CPU spikes above 80% briefly and then goes back down to a reasonable level, do you really want to be notified, or would you just end up getting lots of notifications that aren’t important?
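The numbers in this example are easy to check with a few lines of code. This sketch assumes one sample per minute and a “Greater than” operator. For the “Maximum” case it uses a spike to 85%, since a value of exactly 80% would not be greater than the threshold. The function name is hypothetical.

```python
def condition_met(window, aggregation, threshold=80.0):
    """Apply the aggregation to the window, then compare with 'greater than'."""
    aggregators = {
        "Average": lambda w: sum(w) / len(w),
        "Maximum": max,
        "Minimum": min,
    }
    value = aggregators[aggregation](window)
    return value, value > threshold


# One minute at 80%, four minutes at 70%: the average is 72%.
print(condition_met([80, 70, 70, 70, 70], "Average"))   # (72.0, False)

# A brief spike to 85% fires with Maximum but not with Average.
spike = [85, 70, 70, 70, 70]
print(condition_met(spike, "Average"))   # (73.0, False)
print(condition_met(spike, "Maximum"))   # (85, True)
```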
This preview graph can be a big help when you’re trying to see the effect of choosing one type versus another. It shows whether the alert would have been triggered over the past 6 hours if it had been in place. You can also change it to look back over a longer time period than 6 hours if you want.
If we set the aggregation type to “Average” and the lookback period to 15 minutes, the graph shows that the alert wouldn’t have been triggered because the average never exceeded the threshold value of 80%. However, if we change it to “Maximum”, then it would have been triggered because there were brief moments when the CPU exceeded 80%. So, it’s usually better to set it to “Average”. But if we set the lookback period to 5 minutes again, the graph shows that the alert would have been triggered. So, you really have to think about what to set the aggregation type and the lookback period to.
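Here is a rough sketch of why the lookback period matters so much with “Average”. The same short burst of activity dominates a 5-minute window but gets diluted in a 15-minute one. The sample history is made up for illustration, and the function name is hypothetical.

```python
def window_average(samples, lookback):
    """Average of the most recent `lookback` one-minute samples."""
    window = samples[-lookback:]
    return sum(window) / len(window)


# Hypothetical history: 11 quiet minutes at 40%, then 4 busy minutes at 95%.
cpu = [40] * 11 + [95] * 4
threshold = 80.0

for lookback in (5, 15):
    avg = window_average(cpu, lookback)
    print(f"lookback={lookback} min: average={avg:.1f} fired={avg > threshold}")
```

With these numbers, the 5-minute average is 84%, which fires the alert, while the 15-minute average is under 55%, which does not.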
For some conditions, it might make the most sense to use “Maximum” or “Minimum”. For example, suppose you created an alert that checks if the VM has less than 100 MB of memory available, because running out of memory would cause serious problems for your application. You’d want to be notified if the available memory ever got that low, so you’d set the aggregation type to “Minimum”.
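The memory example can be sketched the same way. Here the operator is “Less than”, and the sample values are hypothetical. A single low reading is hidden by the average but caught by the minimum.

```python
available_mb = [512, 480, 95, 300, 450]   # hypothetical available memory per minute
threshold_mb = 100

average = sum(available_mb) / len(available_mb)
minimum = min(available_mb)

# The operator for this condition is "Less than".
fired_with_average = average < threshold_mb
fired_with_minimum = minimum < threshold_mb

print(f"Average: {average:.1f} MB, fired={fired_with_average}")
print(f"Minimum: {minimum} MB, fired={fired_with_minimum}")
```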
Okay, now click the “Next” button to go to the Actions tab. This is where we tell it what action to take when the alert is triggered. Click “Create action group”. Select a resource group. We can leave the region as “Global”. Let’s call the action group “demo”. It also automatically filled in “demo” for the display name. This is what you’ll see in notifications, so you can make it more descriptive than the action group name if you want. We’ll just leave it as “demo”.
Click “Next” to go to the Notifications tab. This is where we specify who gets notified. Under “Notification type”, there are two options. “Email Azure Resource Manager Role” simply means send an email to all users who have been assigned to a particular role. For example, if you wanted to send an email to all of the owners of the subscription, you’d select this and then choose “Owner” for the role.
Notice that there aren’t very many options here. A better choice than “Owner” might be “Monitoring Contributor” or “Monitoring Reader”, but you’d have to assign that role to everyone who should receive these notifications.
The other way to do this is to select “Email/SMS message/Push/Voice”. Then you can enter the email addresses or phone numbers of the people you want to notify. The disadvantage of doing it this way is that you’d have to enter this information manually, and you’d have to do this for every action group you created.
If everyone’s email address is already in Azure Active Directory, then it would be much easier to specify a role, assuming all of the correct people have been assigned to that role. However, you wouldn’t be able to send notifications to people’s phones that way, so it depends on what you want to do. For this demo, I’ll just put in my own email address here. Then click “OK”.
Now we need to give this a name. I’ll call it “Guy” since I’m the one who’s being emailed. If more people needed to be notified, we’d have to create a separate notification for each of them. That’s another advantage of specifying a role instead of an email address. If all of the people who need to be notified have the same role, then you could take care of all of them with one notification setting.
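The trade-off between the two notification types can be sketched like this. The role assignments, email addresses, and function names are all hypothetical. The point is only that a role-based notification works out its recipients from whoever holds the role, while an explicit list has to be maintained by hand.

```python
# Hypothetical role assignments: user email -> set of roles.
role_assignments = {
    "guy@example.com": {"Owner", "Monitoring Contributor"},
    "dana@example.com": {"Monitoring Contributor"},
    "lee@example.com": {"Reader"},
}


def recipients_for_role(role, assignments):
    """Everyone currently assigned the role receives the email."""
    return sorted(user for user, roles in assignments.items() if role in roles)


def recipients_for_list(emails):
    """Only the addresses typed into the action group receive the email."""
    return sorted(emails)


print(recipients_for_role("Monitoring Contributor", role_assignments))
print(recipients_for_list(["guy@example.com"]))

# Assigning the role to a new person updates the role-based notification
# without touching the action group.
role_assignments["sam@example.com"] = {"Monitoring Contributor"}
print(recipients_for_role("Monitoring Contributor", role_assignments))
```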
Click “Next” to go to the Actions tab. This gives you the option to launch an action, such as an Azure function or a logic app, every time this alert gets triggered.
The ITSM option allows you to send alerts to other software systems. If your organization uses an IT Service Management system, such as ServiceNow or Microsoft System Center Service Manager, then you can send your notifications there so you can track all of your incidents in one place. To do that, you’d need to install the IT Service Management Connector and then select the ITSM option in the action group.
We’re not going to initiate any actions, so click “Review + create” and then “Create”.
Great. Now we have one action group that will send one email. If we wanted to reuse this action group for other alerts, we could do that instead of creating another action group every time.
Okay, now remember that we’re still creating the alert for Percentage CPU exceeding 80%, so we need to click “Next” to go to the “Details” tab.
Let’s name it “High CPU Load”. We also need to decide which severity level to assign to this alert. It’s set to “Informational” by default, but we might want to change it to something higher, such as “Warning” or “Error”. In this case, we could leave it as “Informational” because it’s not necessarily urgent if a VM’s CPU goes above 80%.
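For reference, Azure Monitor alert severities are numbered from 0 (Critical) to 4 (Verbose), with lower numbers being more severe. The enum below is just a sketch to show the ordering. The class name is hypothetical and is not part of any Azure library.

```python
from enum import IntEnum


class Severity(IntEnum):
    # Lower number means more severe.
    CRITICAL = 0
    ERROR = 1
    WARNING = 2
    INFORMATIONAL = 3
    VERBOSE = 4


chosen = Severity.INFORMATIONAL
print(f"Sev{chosen.value}: {chosen.name.title()}")
print(Severity.WARNING < Severity.INFORMATIONAL)   # True: Warning is more severe
```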
Click “Review + create” and “Create”. We need to go back to the “Alert rules” tab to see it. There it is.
Now, to show you what happens when the alert gets triggered, I’m going to change the threshold from 80% to 10%. That way, it’ll be much easier to trigger it. Click on the alert rule, and then click “Edit”. Now click on the condition. Change the threshold value to 10. And change the lookback period to 1 minute, so the CPU only needs to be high for 1 minute to trigger the alert. Click “Done” and “Save”.
All right. To drive the CPU over 10%, I’ll restart the VM.
Here’s the email I received. In the subject line, it says, “High CPU Load”, which is what we named the alert. Down here, it gives you the details of why the alert was activated. The percentage CPU average over the last minute was greater than 10. Here, it tells you that the CPU was at 14.33%.
Do you remember when it asked us to come up with a display name for the action group? Well, this is the only place in the email where it shows the display name.
That’s it for this introduction to Azure Monitor metrics, logs, and alerts. Please give this course a rating, and if you have any questions or comments, please let us know. Thanks!
Guy launched his first training website in 1995 and he's been helping people learn IT technologies ever since. He has been a sysadmin, instructor, sales engineer, IT manager, and entrepreneur. In his most recent venture, he founded and led a cloud-based training infrastructure company that provided virtual labs for some of the largest software vendors in the world. Guy’s passion is making complex technology easy to understand. His activities outside of work have included riding an elephant and skydiving (although not at the same time).