
Scheduling Pods



This Administering Kubernetes Clusters course covers the many networking and scheduling objectives of the Certified Kubernetes Administrator (CKA) exam curriculum.

You will learn a range of core practices such as Ninja kubectl skills, the ability to control where pods are scheduled, how to manage resources for long-lasting production environments, and controlling access to applications in a cluster.

This is a 6 part course made up of four lectures. If you are not familiar with Kubernetes, we recommend completing the Introduction to Kubernetes course and the Deploy a Stateless Application in a Kubernetes Cluster Lab before taking this course.

Learning Objectives 

  • Analyze some pro tips on how to effectively use kubectl. What you learn here will be useful for administering a cluster and using Kubernetes in general.
  • Learn to be able to attract or repel pods from nodes or other pods. You can ensure pods run on nodes where they are intended to run and achieve other objectives such as high-availability by distributing pods across nodes.
  • Learn to think about using Kubernetes for the long term when you need to consider how you’ll manage and update resources.
  • Learn how to control internal and external access to applications running in a Kubernetes cluster.

Intended Audience 

  • Anyone who is interested in Kubernetes cluster administration, although many parts of this course appeal to a broader audience of Kubernetes users.
  • Individuals who may benefit from taking this course include System Administrators, DevOps Engineers, Cluster Administrators, and Kubernetes Certification Examinees.


To get the most from this course, you should have:

  • Knowledge of the core Kubernetes resources, including pods and deployments.
  • Experience using the kubectl command-line tool to work with Kubernetes clusters.
  • An understanding of YAML and JSON file formats. You'll probably already have this skill if you have the prior two. When working with Kubernetes it won't take long until YAML files make an appearance.

Update - From kubectl version 1.18 the kubectl run command can no longer be used for creating deployments. kubectl create deployment or manifest files can be used as alternatives.


Configuring multiple schedulers in Kubernetes: https://kubernetes.io/docs/tasks/administer-cluster/configure-multiple-schedulers/

Speaker 1:          Kubernetes clusters can be made up of heterogeneous nodes where some nodes may have more resources than others. For example, some nodes may have blazing fast solid state drives while others may have the latest and greatest CPUs or GPUs. To make sure that your applications get the resources they need and meet your performance expectations, you need to control where the pods are scheduled in the cluster.

                    This lesson covers the different ways that you can control pod scheduling. It starts by explaining DaemonSets and helps you decide when to use them. Next, it shows how Taints and Tolerations are used to repel pods from nodes. Then NodeSelectors and Affinity are introduced as another way to control pod placement in the cluster. Lastly, we will briefly touch on a couple of special topics in the realm of pod scheduling. DaemonSets are a kind of resource in Kubernetes. They are similar to deployments in that they both create pods and are used for long-running processes. DaemonSets are different in that they ensure that one pod is running on each node in the cluster.

                    We will see later that you can control a DaemonSet to exclude a subset of nodes, but you can usually think of a DaemonSet as creating a pod on each node. This is how the kube-proxy Kubernetes cluster component is deployed to each node in the cluster. kube-proxy implements network rules for Kubernetes Services, and each node needs to be aware of the rules so services are reachable from each node. That makes a DaemonSet the ideal choice for deploying kube-proxy. Another example is with cluster log agents. In this diagram, the Kubernetes API server is in the middle while the nodes are represented on the right. Say that you wanted to use Fluentd as a log aggregation framework to collect all the logs in the cluster. You would need one agent running on each node.

                    An easy way to accomplish this is by writing a DaemonSet manifest file for Fluentd pods. You use the kubectl create command to request the API server to create the DaemonSet, and as a result each node gets one replica of the pod. When you need to schedule one copy of a pod on each node, use DaemonSets. The next scheduling topic we'll discuss is Taints and Tolerations. Taints are similar to node labels but they influence pod scheduling decisions. Taints repel pods from nodes. Any pod that is scheduled onto a tainted node must have a toleration for the taint. That is enough to understand the concept, but we'll see in a demo how to use Taints and Tolerations in practice. You use Taints and Tolerations together to ensure that pods are only scheduled onto appropriate nodes in a cluster.

                    One example that is automatically implemented in Kubernetes is tainted master nodes, which prevents pods you create from being scheduled on them. In this diagram, assume the top node is the master and it has a taint. If you create a pod without any tolerations, it will be repelled by the master and can only be scheduled on other nodes in the cluster. If you include a toleration for the taint, then the pod is eligible to be scheduled on the master. It does not require the pod to be scheduled on the master, only that it won't be repelled away from the master. It could still be scheduled onto the other nodes in the cluster. Let's take a look at these ideas in a demo. We'll review the example we just covered in my demo Kubernetes cluster and then I'll illustrate how to add taints and tolerations to influence scheduling of your own applications in Kubernetes.

                    First, let's describe the master node, which is the node with the name ending with 100. Near the top of the output, we can see a taint that is automatically applied to the master. Taints look similar to labels but they have an effect associated with them. Here the taint key is node-role.kubernetes.io/master, which is the same as a label on the node. Just like the label, there is no value defined for the key. You can assign values using an equal sign just like you would for labels. Using values can give you more granularity over scheduling. For example, you can tolerate specific values of a taint but not others. In the case of a master node, it is really a binary decision: is it a master or not? So no value is defined. Then at the end, after a colon, is the associated effect. In this case the effect is NoSchedule, which means do not schedule any new pods that don't tolerate this taint.

                    Other effects are PreferNoSchedule, which will allow scheduling to the node if there are no other nodes that can schedule the pod, and NoExecute, which will not allow pods to be scheduled and will also evict any pods that had already been scheduled onto the node. If we list the pods in the kube-system namespace with wide output, we can see several pods on the master node. Let's take a look at the CoreDNS pods as an example. The CoreDNS pods are created through a deployment, so I'll look at the deployment template to see if there are any tolerations and there it is, a toleration for the master NoSchedule taint. That is how the pod is able to schedule onto the master. If we now look at a pod for kube-proxy, which is deployed by a DaemonSet, we can see that there are several tolerations that ensure the pod will be eligible to schedule on all nodes, including when there are limited resources available.

                    That is what the memory and disk-pressure tolerations do. Kubernetes will automatically taint a node when it runs low on resources, and these tolerations allow DaemonSet pods to be scheduled despite that. DaemonSets are automatically created with these tolerations. Furthermore, in older Kubernetes versions DaemonSet pods were scheduled by the DaemonSet controller rather than the normal cluster scheduler by default, which allowed the pods to circumvent some other conditions that could otherwise prevent them from being scheduled onto nodes. Since Kubernetes 1.12 the default scheduler places DaemonSet pods instead. Now let's try it out for ourselves. Currently, neither of the worker nodes has any taints. We will taint the 102 worker node with a NoSchedule taint and then create a deployment without any tolerations and one with a toleration for the new taint.

                    To start, kubectl has a taint command for adding taints to nodes. The help page includes some examples and a quick rundown if you ever forget how to use taints. Note that the taint command, like many kubectl commands, supports the -l or label selector option to apply taints to multiple nodes with matching labels. Let's use the taint command to create a priority=high key-value pair and a NoSchedule effect. Now only pods that tolerate the high priority taint can be scheduled onto the node. This example could be used to reserve resources for high priority workloads. Now let's create an nginx deployment in a scheduling namespace with five replicas and see where the pods land. I won't use a manifest file this time but I will need one later to specify tolerations.

                    Check the output of get in the scheduling namespace to see where the pods are scheduled. Every one of them landed on the 101 node, the only node without any taints. Let's delete the deployment and see what happens with the high priority toleration. First, let's check the toleration explain output to see how tolerations are defined. Note that tolerations is an array, so you can have as many as you need. For our toleration, we need to set the key, value, and effect to match the taint. The operator field is used to control if the value is checked or not. The default is Equal, which requires the toleration value to match the taint value. That's what we want, since we want to match high priority, not any value of priority. There's also a tolerationSeconds field for setting how long to tolerate a NoExecute taint, but we won't need that.
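
As a rough illustration of the matching rules just described, the following Python sketch models how a toleration is compared against a taint. It is a simplified model, not the scheduler's actual code: the `Taint` and `Toleration` classes and the `can_schedule` helper are hypothetical names introduced here, and tolerationSeconds is ignored.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Taint:
    key: str
    effect: str          # "NoSchedule", "PreferNoSchedule", or "NoExecute"
    value: str = ""


@dataclass(frozen=True)
class Toleration:
    key: str = ""
    operator: str = "Equal"   # "Equal" (the default) or "Exists"
    value: str = ""
    effect: str = ""          # an empty effect matches every effect


def tolerates(toleration: Toleration, taint: Taint) -> bool:
    """Return True if this one toleration matches this one taint."""
    if toleration.effect and toleration.effect != taint.effect:
        return False
    if toleration.operator == "Exists":
        # The value is not checked; an empty key matches every taint key.
        return toleration.key in ("", taint.key)
    # "Equal": both the key and the value have to match.
    return toleration.key == taint.key and toleration.value == taint.value


def can_schedule(taints: Iterable[Taint], tolerations: Iterable[Toleration]) -> bool:
    """A pod is repelled if any blocking taint has no matching toleration."""
    tolerations = list(tolerations)
    for taint in taints:
        if taint.effect == "PreferNoSchedule":
            continue  # only a preference, so it never blocks scheduling
        if not any(tolerates(t, taint) for t in tolerations):
            return False
    return True


# The priority=high:NoSchedule taint from the demo.
high_priority = Taint(key="priority", effect="NoSchedule", value="high")
matching = Toleration(key="priority", operator="Equal", value="high", effect="NoSchedule")
print(can_schedule([high_priority], []))          # pod without tolerations
print(can_schedule([high_priority], [matching]))  # pod with the matching toleration
```

With the Equal operator a toleration for priority=low would not match the priority=high taint, whereas an Exists toleration on the priority key would match any value.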

                    I'll create a manifest by using the run command. In the manifest, I'll set the namespace and add the toleration in the template spec mapping, which defines the PodSpec. Now let's create the deployment, and let's check where the pods got scheduled this time. This time we see some pods landing on the tainted 102 node thanks to the toleration. I'll delete the deployment now and remove the taint from the 102 node. You remove a taint in a similar way to how you remove a label, by specifying the key and the effect with the minus sign appended at the end. That's all for this demo.
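
The manifest itself is not shown in the transcript, so here is a minimal sketch of what a deployment with the toleration might look like, built as a Python dictionary and printed as JSON. The deployment name, the `app: nginx` label, and the image tag are assumptions made for illustration; the namespace, replica count, and toleration values come from the demo.

```python
import json


def deployment_with_toleration() -> dict:
    """Build a Deployment manifest whose pods tolerate priority=high:NoSchedule."""
    labels = {"app": "nginx"}  # hypothetical label, not taken from the demo
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": "nginx", "namespace": "scheduling"},
        "spec": {
            "replicas": 5,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                # This mapping is the PodSpec; tolerations is an array inside it.
                "spec": {
                    "containers": [{"name": "nginx", "image": "nginx"}],
                    "tolerations": [
                        {
                            "key": "priority",
                            "operator": "Equal",
                            "value": "high",
                            "effect": "NoSchedule",
                        }
                    ],
                },
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(deployment_with_toleration(), indent=2))
```

Because kubectl accepts JSON as well as YAML manifests, the printed output could be saved to a file and passed to kubectl create -f.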

                    The next topic is the opposite of taints. Rather than repelling pods, these concepts attract pods to nodes. The original method for attracting pods to nodes is by using a nodeSelector. A nodeSelector is a list of labels included in a PodSpec that must match a node's labels for the pod to be scheduled. To visualize nodeSelector, imagine assigning a set of labels to one or more nodes. In this case only the middle node has the labels, but you could label as many nodes as you want to be included in the target group of nodes. Now, to ensure that a pod is scheduled onto one of the nodes, include the set of labels in the PodSpec's nodeSelector list. When the pod is created, the scheduler will only schedule the pod onto nodes matching the labels in the pod's nodeSelector list.
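
The exact-match behavior of a nodeSelector can be sketched in a few lines of Python. This is only a model of the rule described above; the node names and the `disk: ssd` label are hypothetical.

```python
from typing import Dict, List

Labels = Dict[str, str]


def matches_node_selector(node_selector: Labels, node_labels: Labels) -> bool:
    """Every key/value pair in the nodeSelector must be present on the node."""
    return all(node_labels.get(key) == value for key, value in node_selector.items())


def eligible_nodes(node_selector: Labels, nodes: Dict[str, Labels]) -> List[str]:
    """Return the names of the nodes the pod may be scheduled onto."""
    return [name for name, labels in nodes.items()
            if matches_node_selector(node_selector, labels)]


# Hypothetical cluster: only the middle node carries the label.
nodes = {
    "top": {},
    "middle": {"disk": "ssd"},
    "bottom": {},
}
print(eligible_nodes({"disk": "ssd"}, nodes))
```

Extra labels on a node do not matter; only the labels named in the nodeSelector are compared.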

                    A newer method of attracting pods to nodes, and one that will eventually deprecate nodeSelectors, is node affinity. Node affinity is more expressive than nodeSelectors. Instead of only allowing exact matches of all labels in a set, as nodeSelectors are limited to, node affinity can use a variety of operators. For example, you can require a node to have at least one label within a set using the In operator or enforce that a node does not have any labels in a set using the NotIn operator. The latter is actually a kind of anti-affinity and is similar to a taint in repelling pods from nodes. DoesNotExist also provides an anti-affinity. Node affinities can also express preferences rather than strict requirements. This allows you to prefer to schedule pods on a set of nodes, but if they are not able to, the pods can be scheduled on other nodes.

                    Node affinity will eventually deprecate nodeSelectors because node affinity can express everything that a nodeSelector can, and has additional flexibility. Let's consider an example. The nodes have each been labeled with different zones, and the top and bottom nodes are labeled as having SSDs. You can write a PodSpec that requires the pod to be scheduled in the orange or red zone, and within those two zones prefer to schedule the pod on nodes with solid state drives. When the pod is created, the scheduler will first limit the eligible nodes to the orange and red zones, and then as long as the node in the red zone has capacity, the pod will be scheduled to the SSD node in the red zone. If the red zone node couldn't accept the pod, it would be scheduled on the orange zone node. Now we will take a quick look at how you would create a deployment that specifies the affinity fields in the example we just considered.

                    For this demo I have written out the spec in advance. It is similar to the toleration deployment in the last demo but with a node affinity field added. Remember to use the explain command if you can't recall the lengthy field names involved with node affinity. Let's go through the different fields. The nodeAffinity field is in the affinity mapping. We'll see later that there are other kinds of affinity besides node affinity. Under nodeAffinity there are two fields: requiredDuringSchedulingIgnoredDuringExecution and preferredDuringSchedulingIgnoredDuringExecution. The lengthy names are fairly self-explanatory: the required conditions are put in the first one while the preferred conditions are placed in the second. There is a plan to have a requiredDuringSchedulingRequiredDuringExecution field, but currently node affinity will not evict pods after they're scheduled.

                    Under the required field there's always a single field named nodeSelectorTerms, which comprises a list of terms. One or more of the terms must be satisfied; that is, the terms are combined using a logical OR. It's important to note that not all the conditions need to be satisfied. As long as one is satisfied, the requirements are met. The terms are defined using matchExpressions. The matchExpressions are where you express the conditions on node labels. Each expression must have a key and an operator, while the values array is required and must not be empty for the In and NotIn operators. Since we require the pods to be scheduled in the orange or red zones, the In operator is used and the accepted values for the zone key are orange or red.
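
To make the evaluation order concrete, here is a small Python model of how required node affinity could be checked against a node's labels. It is a sketch under simplifying assumptions: only the In, NotIn, Exists, and DoesNotExist operators are handled, matchFields is ignored, and the function names are hypothetical. Terms are ORed together, while the expressions inside one term all have to hold.

```python
from typing import Dict, List

Labels = Dict[str, str]


def expression_matches(expr: dict, labels: Labels) -> bool:
    """Evaluate one matchExpression against a node's labels."""
    key, operator = expr["key"], expr["operator"]
    values = expr.get("values", [])
    if operator == "In":
        return key in labels and labels[key] in values
    if operator == "NotIn":
        return key not in labels or labels[key] not in values
    if operator == "Exists":
        return key in labels
    if operator == "DoesNotExist":
        return key not in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def term_matches(term: dict, labels: Labels) -> bool:
    """All expressions in a term must hold; an empty term matches nothing."""
    expressions = term.get("matchExpressions", [])
    return bool(expressions) and all(expression_matches(e, labels) for e in expressions)


def node_is_eligible(node_selector_terms: List[dict], labels: Labels) -> bool:
    """At least one nodeSelectorTerm must be satisfied (logical OR)."""
    return any(term_matches(term, labels) for term in node_selector_terms)


# The requirement from the example: the zone must be orange or red.
required_terms = [
    {"matchExpressions": [{"key": "zone", "operator": "In", "values": ["orange", "red"]}]}
]
print(node_is_eligible(required_terms, {"zone": "red"}))
print(node_is_eligible(required_terms, {"zone": "blue"}))
```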

                    If you use more than one matchExpression, each expression must be met. That is, the expressions are combined using a logical AND. That is different from the nodeSelectorTerms, which only require one of the array of terms to be satisfied. Switching our attention to the preferred node affinity conditions, we see that it is an array of preferences. Each preference has a weight that sets the relative importance of the preference. The scheduler computes a score for eligible nodes by adding up the weights and selects the node with the highest score to schedule the pod. Along with the weight, a preference field is given. The preference field consists of the same matchExpressions we saw before in the required conditions. In this case we prefer the node hardware to have an SSD.
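
The weighted preferences can be modeled the same way. The sketch below sums the weights of the preferences a node satisfies and picks the highest-scoring node among those that already passed the required conditions. It is a simplification: the real scheduler combines this score with other scoring criteria, only the In and Exists operators are handled, and the node names and the `hardware: ssd` label are assumptions.

```python
from typing import Dict, List

Labels = Dict[str, str]


def expression_matches(expr: dict, labels: Labels) -> bool:
    """Evaluate one matchExpression (only In and Exists in this sketch)."""
    key, operator = expr["key"], expr["operator"]
    if operator == "In":
        return labels.get(key) in expr.get("values", [])
    if operator == "Exists":
        return key in labels
    raise ValueError(f"operator not covered by this sketch: {operator}")


def preference_score(preferences: List[dict], labels: Labels) -> int:
    """Add up the weights of every preference whose expressions all match."""
    score = 0
    for pref in preferences:
        expressions = pref["preference"]["matchExpressions"]
        if all(expression_matches(e, labels) for e in expressions):
            score += pref["weight"]
    return score


def pick_node(preferences: List[dict], eligible: Dict[str, Labels]) -> str:
    """Choose the eligible node with the highest preference score."""
    return max(eligible, key=lambda name: preference_score(preferences, eligible[name]))


# Preference from the example: prefer nodes with an SSD.
preferences = [
    {"weight": 1,
     "preference": {"matchExpressions": [
         {"key": "hardware", "operator": "In", "values": ["ssd"]}]}}
]
# Hypothetical nodes that already satisfied the orange-or-red requirement.
eligible = {
    "orange-node": {"zone": "orange", "hardware": "hdd"},
    "red-node": {"zone": "red", "hardware": "ssd"},
}
print(pick_node(preferences, eligible))
```

If the red zone node is removed from the eligible set, the same call falls back to the orange zone node, since a preference never excludes a node.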

                    With the spec node affinity set like this, the pod would be scheduled as we saw in the visualization. Some labels allow you to require pods to be in different groups of data centers, but they can't ensure that pods are evenly distributed across zones to help achieve high availability. That is where the other kind of affinity comes in. Pod affinity is similar to node affinity in that you can express requirements and preferences for the nodes to schedule pods on. It also supports the same operators, like In and DoesNotExist, but there are several differences. First, the conditions are on pod labels: they are evaluated using the labels of pods running on each node. This makes pod affinity more computationally expensive than node affinity. Because of this, it is not recommended to use pod affinity for large clusters with a few hundred nodes or more.

                    Also, because pods are namespaced, their labels are implicitly namespaced, so the conditions include the namespaces where they apply. To allow maximum flexibility, there are separate top-level fields for pod affinity and pod anti-affinity. After the conditions are evaluated, a topology key is used to decide which node to schedule the pod to among the nodes that have pods with labels satisfying the specified conditions. The topology key usually corresponds to a physical domain such as a zone, server rack, or cloud region. This allows you to spread pods across zones or regions for high availability or ensure that pods are scheduled in the same zone or region for performance reasons. The hostname label that is automatically added by Kubernetes is also useful to control scheduling pods at the granularity of individual nodes.

                    One thing to remember is that every node in the cluster must define the topology key label or unexpected behavior can occur. Let's go through an example of how pod affinity works. The nodes in the cluster have labels for their zone, either red or orange, and another label for their hardware capabilities, represented with white and gray. There are also two pods already scheduled on the nodes. One pod has a green label and another has a purple label. The pod labels could refer to different applications. For example, say we require a pod to be scheduled in the same zone as a node running the green pod. For this condition we use pod affinity and set the topology key to zone. We also prefer that the pod is scheduled on a node with different hardware capabilities than nodes running purple pods. This could be to reserve those nodes' resources for the purple pods as long as possible.

                    For this we can use pod anti-affinity and set the topology key to hardware capability. Now when you create the pod, the scheduler evaluates the required conditions and finds a green pod in the red zone. So the new pod is eligible to be scheduled in the red zone. The scheduler then evaluates the preferences and finds a purple pod on white hardware, so the new pod will be preferred on non-white hardware in the red zone. The preferences and requirements can both be satisfied by scheduling the pod on the bottom node. If the bottom node wasn't available to schedule the pod, the pod would be scheduled to the top node to satisfy the requirement, although the preferences couldn't be fulfilled. We won't demo the use of pod affinity because the YAML is very similar to that of node affinity.
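
The walkthrough above can be modeled with a short Python sketch. It is only an approximation of the idea: namespaces and weights are ignored, selectors are plain exact-match labels, and the placement of the green and purple pods (green on the top node, purple on the middle node) is an assumption chosen to be consistent with the description.

```python
from typing import Dict, List, Set

Labels = Dict[str, str]


def selector_matches(wanted: Labels, pod_labels: Labels) -> bool:
    """Exact-match label selector for pods."""
    return all(pod_labels.get(k) == v for k, v in wanted.items())


def nodes_sharing_domain(nodes: Dict[str, Labels],
                         pods_by_node: Dict[str, List[Labels]],
                         wanted: Labels,
                         topology_key: str) -> Set[str]:
    """Nodes whose topology value equals that of a node running a matching pod."""
    domains = set()
    for node_name, pods in pods_by_node.items():
        domain = nodes[node_name].get(topology_key)
        if domain is not None and any(selector_matches(wanted, p) for p in pods):
            domains.add(domain)
    return {name for name, labels in nodes.items()
            if labels.get(topology_key) in domains}


def candidate_nodes(nodes, pods_by_node) -> Set[str]:
    # Required pod affinity: same zone as a green pod.
    required = nodes_sharing_domain(nodes, pods_by_node, {"color": "green"}, "zone")
    # Preferred pod anti-affinity: avoid hardware shared with a purple pod.
    avoided = nodes_sharing_domain(nodes, pods_by_node, {"color": "purple"}, "hardware")
    preferred = required - avoided
    # Fall back to the required set when the preference cannot be met.
    return preferred or required


NODES = {
    "top": {"zone": "red", "hardware": "white"},
    "middle": {"zone": "orange", "hardware": "white"},
    "bottom": {"zone": "red", "hardware": "gray"},
}
PODS = {"top": [{"color": "green"}], "middle": [{"color": "purple"}], "bottom": []}
print(candidate_nodes(NODES, PODS))
```

Dropping the bottom node from both dictionaries leaves only the top node as a candidate, which mirrors the fallback described in the example.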

                    Just remember to use the explain command if you need help remembering all of the fields involved. That brings us to the final grouping of special topics related to pod scheduling. We'll explore container resource requirements, static pods, and custom scheduling. It will be easier to explain these at the command line, so let's hop over to my Kubernetes cluster shell. Each container in a PodSpec can set resource requirements in terms of CPU and memory required. These requirements can be set as minimum and maximum values. Let's explain the pod.spec.containers.resources field to see this. The limits field is used to set the maximum resources a container is allowed to use, while the requests field sets the minimum required resources for the container. The requests are what's important for scheduling, since the scheduler won't schedule a pod unless a node has enough capacity to meet the resource requests.
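
The capacity check on requests can be sketched as follows. This is a simplified model with hypothetical helper names: it parses only a few quantity forms (CPU as whole or fractional cores or with the m suffix, memory in bytes or with Ki, Mi, Gi), and the node and pod figures are illustrative rather than taken from a real cluster.

```python
from typing import Dict

MEMORY_UNITS = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3}


def parse_cpu(quantity: str) -> int:
    """Convert a CPU quantity such as '100m' or '2' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '70Mi' to bytes."""
    for suffix, factor in MEMORY_UNITS.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # plain bytes


def fits(pod_requests: Dict[str, str],
         allocatable: Dict[str, str],
         already_requested: Dict[str, str]) -> bool:
    """A pod fits if its requests are within the node's unrequested capacity."""
    free_cpu = parse_cpu(allocatable["cpu"]) - parse_cpu(already_requested["cpu"])
    free_mem = parse_memory(allocatable["memory"]) - parse_memory(already_requested["memory"])
    return (parse_cpu(pod_requests["cpu"]) <= free_cpu
            and parse_memory(pod_requests["memory"]) <= free_mem)


# Illustrative numbers: a 2-CPU, 4Gi node with some capacity already requested.
node = {"cpu": "2", "memory": "4Gi"}
pod = {"cpu": "100m", "memory": "70Mi"}
print(fits(pod, node, {"cpu": "1000m", "memory": "1Gi"}))
print(fits(pod, node, {"cpu": "1950m", "memory": "1Gi"}))
```

Note that the check uses requests, not limits or actual usage, which is why setting requests helps the scheduler avoid overloading nodes.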

                    It is definitely a good practice to set the requests for pods so the scheduler can make better decisions and avoid overloading nodes. Let's take a look at the CoreDNS PodSpec to see how a resource map looks. For both limits and requests, you add CPU and memory key-value pairs if you want to put a constraint on the resource. The CoreDNS pod requires a minimum of 70 mebibytes of memory and a maximum of 170. It only puts a minimum request on the CPU of 100 millicpu, or one tenth of a CPU. The supported units are given in the explain output. The next special topic is static pods. Static pods are pods that are managed directly by the node's kubelet and not through the Kubernetes API server. This means that static pods are not scheduled by the Kubernetes scheduler and instead are always scheduled onto the node by the kubelet.

                    This is how several pods are started on the master node. The troubleshooting in Kubernetes lab here on Cloud Academy goes into more details about this. For now I'll just show you that the kubelet has a pod manifest path option to point to static pod manifest files; the pods they define will be created by the kubelet. This is usually reserved for running system pods to help bootstrap a cluster. For example, the default scheduler pod is initialized this way. Finally, I want to mention that if all these capabilities around pod scheduling still cannot meet your requirements, then you can create your own scheduler. How to make one is outside of the scope of this course, but I've put a link in the documentation in case you are interested. You can deploy custom schedulers alongside the default scheduler in the kube-system namespace and use two labels to inform Kubernetes that the deployed pod is a scheduler in the control plane.

                    Here you can see the labels are also used by the default scheduler. Once a custom scheduler is deployed, you can set the PodSpec schedulerName field to the name of the new scheduler to have the pod scheduled by the new scheduler. That brings us to the end of this lesson all about pod scheduling. We covered a lot of ground, so let's quickly recap what we discussed. DaemonSets can be used to schedule a pod onto each node in the cluster. Taints repel pods from nodes while tolerations allow a pod to be scheduled onto tainted nodes. NodeSelectors limit the nodes eligible to run a pod based on exact matches to a list of labels. Node affinity is a more expressive way to attract pods to nodes. Pod affinity allows scheduling decisions based on pods already scheduled to nodes.

                    We also covered a few Special Topics starting with container resource requirements. The request field is how the scheduler decides if a pod can be scheduled onto a node based on available CPU and memory. Static pods allow a pod to run as soon as the kubelet is brought up bypassing any scheduler. Lastly, know that you can use custom schedulers if the built-in scheduling capabilities don't meet your needs. In the next lesson, we'll look at the various ways that you can update resources running in a Kubernetes cluster. Continue on to the next lesson to learn all about resource updates.

About the Author

Logan has been involved in software development and research since 2007 and has been in the cloud since 2012. He is an AWS Certified DevOps Engineer - Professional, AWS Certified Solutions Architect - Professional, Microsoft Certified Azure Solutions Architect Expert, MCSE: Cloud Platform and Infrastructure, Google Cloud Certified Associate Cloud Engineer, Certified Kubernetes Security Specialist (CKS), Certified Kubernetes Administrator (CKA), Certified Kubernetes Application Developer (CKAD), and Certified OpenStack Administrator (COA). He earned his Ph.D. studying design automation and enjoys all things tech.