Health Checks
1h 11m

Container orchestration is a popular topic at the moment because containers can help to solve problems faced by development and operations teams. However, running containers in production at scale is a non-trivial task. Even with the introduction of orchestration tools, container management isn’t without challenges. Container orchestration is a newer concept for most companies, which means the learning curve is going to be steep. And while the learning curve may be steep, the effort should pay off in the form of standardized deployments, application isolation, and more.

This course is designed to make the learning curve a bit less steep. You'll learn how to use Marathon, a popular orchestration tool, to manage containers with DC/OS.

Learning Objectives

  • You should be able to deploy Mesos and Docker containers
  • You should understand how to use constraints
  • You should understand how to use health checks
  • You should be familiar with App groups and Pods
  • You should be able to perform a rolling upgrade
  • You should understand service discovery and load balancing

Intended Audience

  • Sysadmins
  • Developers
  • DevOps Engineers
  • Site Reliability Engineers


To get the most from this course, you should already be familiar with DC/OS and containers and be comfortable with using the command line and with editing JSON.


Lecture            What you'll learn
Intro              What to expect from this course
Overview           A review of container orchestration
Mesos Containers   How to deploy Mesos containers
Docker Containers  How to deploy Docker containers
Constraints        How to constrain containers to certain agents
Health Checks      How to ensure services are healthy
App Groups         How to form app groups
Pods               How to share networking and storage
Rolling Upgrades   How to perform a rolling upgrade
Persistence        How to use persistent storage
Service Discovery  How to use service discovery
Load Balancing     How to distribute traffic
Scenario           Tie everything together
Summary            How to keep learning

If you have thoughts or suggestions for this course, please contact Cloud Academy at


Welcome back. Health checks are a crucial part of hyperscale applications. Imagine you're running an application that's serving nothing but HTTP 500 errors. Or, even worse, imagine that you learn about it from your customers. Health checks can ensure that this sort of thing doesn't become a problem. In fact, Marathon can restart a container if a health check fails.

Let's look at how to use health checks with Marathon. You may recall the MiniTwit application that we used earlier. Let's create an app based on that to show what it looks like without a health check. Okay. So this is deploying, and in just a moment it will be in a running state. Okay, so there it is with a status of running.

Notice what happens when I mouse over the status bar. It says "1 unknown task of 1". What that means is that the health of this app is unknown because there's no health check. Okay, let's destroy this app, and there we go. Now, let's launch this same app again, only this time with a health check. Here's the JSON for MiniTwit with the health check.

The top half of the file is the same as the MiniTwit example that we deployed earlier in the course. The health checks are added at the bottom with the healthChecks property, which is an array of checks. An individual health check consists of several properties. The first property here is the protocol, which can be set to MESOS_HTTP, MESOS_HTTPS, MESOS_TCP, or COMMAND.
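As a rough sketch of the shape being described, the snippet below builds a healthChecks array in Python and prints it as JSON. The 30-second grace period and 5-second timeout are the values mentioned in this lesson; the interval, the failure limit, and the use of portIndex are assumptions, since the lesson's actual file isn't reproduced here.

```python
import json

# Protocol values named in this lesson.
PROTOCOLS = {"MESOS_HTTP", "MESOS_HTTPS", "MESOS_TCP", "COMMAND"}

# One entry of the healthChecks array. intervalSeconds,
# maxConsecutiveFailures, and portIndex are assumed values.
tcp_check = {
    "protocol": "MESOS_TCP",
    "portIndex": 0,               # assumed: first port the app exposes
    "gracePeriodSeconds": 30,     # from the lesson
    "intervalSeconds": 10,        # assumed
    "timeoutSeconds": 5,          # from the lesson
    "maxConsecutiveFailures": 3,  # assumed
}

# Guard against a mistyped protocol name before sending anything to Marathon.
assert tcp_check["protocol"] in PROTOCOLS

# healthChecks is an array, so an app can carry more than one check.
app_fragment = {"healthChecks": [tcp_check]}
print(json.dumps(app_fragment, indent=2))
```

The printed JSON is only the health check portion; it would sit alongside the rest of the app definition (id, container, resources, and so on).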

MESOS_TCP lets you easily determine whether a service is running by making sure that its port is open. If you're running a web application, you can use MESOS_HTTP or MESOS_HTTPS. COMMAND lets you run a command on the agent where the container is running, and if the command exits with a status code of zero, then the check was successful.
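To make the COMMAND rule concrete, here is a small sketch. The check definition follows the command/value shape that Marathon uses for COMMAND checks; the /health endpoint in the curl command is a hypothetical example, not something from this lesson. The helper function is only an illustration of "exit status zero means success", not how Marathon or Mesos run the command.

```python
import json
import subprocess
import sys

# A COMMAND check: the command string is run, and its exit status decides
# the result. The /health endpoint is a hypothetical example.
command_check = {
    "protocol": "COMMAND",
    "command": {"value": "curl -f -X GET http://$HOST:$PORT0/health"},
}
print(json.dumps(command_check, indent=2))


def command_check_passes(argv):
    """Illustrative helper: run a command, pass only on exit status 0."""
    return subprocess.run(argv).returncode == 0


# Two stand-in commands: one exits with 0, the other with 1.
ok = command_check_passes([sys.executable, "-c", "raise SystemExit(0)"])
bad = command_check_passes([sys.executable, "-c", "raise SystemExit(1)"])
print("exit 0 passes:", ok)
print("exit 1 passes:", bad)
```

The -f flag makes curl exit with a non-zero status on an HTTP error response, which is what lets a plain curl call double as a health check.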

The next property here is gracePeriodSeconds, which gives the container time to start up before failed health checks count against it. So in the example here, failures are ignored for the first 30 seconds after the container starts running, and this is useful if you have a container that needs to initialize for a moment before it's healthy. The intervalSeconds property is how long to wait between health checks.

The timeoutSeconds property determines how long to wait for a response before considering the check failed. In this example, if the TCP check on port 80 doesn't respond within five seconds, that check is considered a failure. The maxConsecutiveFailures property determines how many failures in a row can occur before the scheduler considers this container unhealthy, and it's up to each scheduler to determine what unhealthy means.
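The way these timing properties interact can be sketched with a toy model. This is a simplification for illustration, not Marathon's implementation: it assumes one probe every interval seconds starting at time zero, treats a timed-out probe as a plain failure, and assumes failures inside the grace period are ignored until the first success.

```python
def first_unhealthy_time(results, grace, interval, max_failures):
    """Return the time in seconds at which the task is deemed unhealthy.

    results[i] is the outcome of the probe at time i * interval (True for
    a pass, False for a failure or timeout). Returns None if the task is
    never deemed unhealthy.
    """
    consecutive = 0
    seen_healthy = False
    for i, passed in enumerate(results):
        t = i * interval
        if passed:
            seen_healthy = True
            consecutive = 0  # any success resets the failure streak
            continue
        # Failures during the grace period are ignored until the task
        # has been healthy at least once.
        if t < grace and not seen_healthy:
            continue
        consecutive += 1
        if consecutive >= max_failures:
            return t
    return None


slow_start = [False, False, False, True, True]
crash_later = [True, False, False, False]

# A slow starter survives with a 30 s grace period, but not without one.
print(first_unhealthy_time(slow_start, grace=30, interval=10, max_failures=3))
print(first_unhealthy_time(slow_start, grace=0, interval=10, max_failures=3))
# A task that was healthy once and then fails three times in a row.
print(first_unhealthy_time(crash_later, grace=30, interval=10, max_failures=3))
```

In this model, raising the failure limit trades slower detection of a real outage for more tolerance of an occasional slow response.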

For Marathon, an unhealthy container is killed and restarted. The port name is the name of the port to check if you're using TCP. Notice that the port is named up in the port mappings here. If you're using MESOS_HTTP or MESOS_HTTPS, then in place of port name, you'll have a path option, which is the endpoint to check.
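Here is a hedged sketch of how a named port and an HTTP check with a path might fit together. The port name "http" and the timing values are assumptions, and port_index is a hypothetical helper written for this example. The sketch points the check at the port with portIndex, a field Marathon health checks accept, so the exact property used in the lesson's file may differ.

```python
import json

# Port mappings as they might appear in the app definition. "http" is a
# hypothetical port name, since the lesson's file isn't reproduced here.
port_mappings = [{"containerPort": 80, "name": "http"}]


def port_index(mappings, name):
    """Hypothetical helper: position of the named port in the mappings."""
    for index, mapping in enumerate(mappings):
        if mapping.get("name") == name:
            return index
    raise KeyError(name)


# An HTTP check adds a path, the endpoint that is requested on each probe.
http_check = {
    "protocol": "MESOS_HTTP",
    "path": "/",
    "portIndex": port_index(port_mappings, "http"),
    "gracePeriodSeconds": 30,     # from the lesson
    "intervalSeconds": 10,        # assumed
    "timeoutSeconds": 5,          # from the lesson
    "maxConsecutiveFailures": 3,  # assumed
}
print(json.dumps({"healthChecks": [http_check]}, indent=2))
```

Looking the port up by name keeps the check pointed at the right port even if the order of the port mappings changes.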

So, let's launch this app the same way we've launched all of the others. Okay. Now let's go check this out in the UI. Alright, this is going to take a moment to deploy, so I'm going to speed this up. And there it is in a running state. The bar here is also green because the status is healthy. Previously it was gray because the health was unknown.

Hovering the mouse over, you can see that it shows "1 Healthy Task of 1" and that's because the health check has worked. So you might have wondered, what happens if the grace period isn't long enough for a container to become ready? And that's a great question. When you don't know how long a container is going to take to really initialize, you can use readiness checks.

I have another version of MiniTwit here, and this one has a readiness check. The readinessChecks property should look very familiar to you, because it's similar to the health checks. In this example, it polls the root URL for the container on port 80, and if the response is either 200 or 302, then that container is considered ready.
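A readiness check along these lines might look like the sketch below. The root path and the 200 and 302 status codes come from this lesson; the check name, port name, interval, and timeout are assumed values. The is_ready function is a hypothetical helper that only illustrates the status code rule.

```python
import json

# A readiness check that polls the root URL and accepts 200 or 302.
# name, portName, intervalSeconds, and timeoutSeconds are assumed values.
readiness_check = {
    "name": "readinessCheck",
    "protocol": "HTTP",
    "path": "/",
    "portName": "http",  # hypothetical name from the port mappings
    "intervalSeconds": 30,
    "timeoutSeconds": 10,
    "httpStatusCodesForReady": [200, 302],
}
print(json.dumps({"readinessChecks": [readiness_check]}, indent=2))


def is_ready(status_code, check):
    """Hypothetical helper: is this response code one that counts as ready?"""
    return status_code in check["httpStatusCodesForReady"]


for code in (200, 302, 500):
    print(code, is_ready(code, readiness_check))
```

Including 302 is handy for apps whose root URL redirects, for example to a login or timeline page.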

Let's create this app. Okay. And after a moment this should become healthy, just like the previous example. And there it is. So, from our perspective as an end user just watching this, not much has changed, except this now doesn't have to wait a fixed amount of time before it's healthy. Now for me personally, code should be considered broken unless unit tests and integration tests prove that it's working.

Health checks are kinda the same for me. I consider every app that I deploy to be broken unless health checks prove me wrong. I know that's a bit pessimistic; however, this rather bleak perspective is going to save you a lot of headaches. So, hopefully you'll take full advantage of health checks, and maybe they'll spare you a 3 a.m. wake-up call because of a service outage.

Alright, let's wrap up here and in the next lesson we'll cover application groups and dependencies.

About the Author

Ben Lambert is a software engineer and was previously the lead author for DevOps and Microsoft Azure training content at Cloud Academy. His courses and learning paths covered Cloud Ecosystem technologies such as DC/OS, configuration management tools, and containers. As a software engineer, Ben’s experience includes building highly available web and mobile apps. When he’s not building software, he’s hiking, camping, or creating video games.