Kubernetes has truly revolutionized distributed computing. While it solves a number of genuinely hard problems, it also introduces new challenges. One of these challenges is ensuring your Kubernetes clusters are designed with failure domains in mind. Designing around failure domains includes things like provisioning infrastructure across availability zones, ensuring your physical servers are in different racks, and making sure the pods that support your application don’t wind up on the same physical Kubernetes worker.
Inter-pod affinity and anti-affinity rules can be used to address the last point, and the official Kubernetes documentation does a really good job of describing them:
“Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces; unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain; for example, see the label keys listed above in the section Interlude: built-in node labels.”
Affinities can be defined with an affinity statement in a deployment manifest. So given a cluster with three worker nodes:
```
kubectl get nodes

NAME                 STATUS   ROLES    AGE   VERSION
test-control-plane   Ready    master   22d   v1.18.2
test-worker          Ready    <none>   22d   v1.18.2
test-worker2         Ready    <none>   22d   v1.18.2
test-worker3         Ready    <none>   22d   v1.18.2
```
You can create an affinity rule by adding an affinity stanza to the pod template’s spec:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
```
There is a lot going on in the affinity section, so I will break it down piece by piece. The affinity field provides three types of scheduling constraints:
```
kubectl explain pod.spec.affinity

KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity  <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).
```
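For completeness, the first of the three constraints matches labels on nodes rather than on pods. As a sketch (not part of the example in this post, and the disktype=ssd label is purely hypothetical), a nodeAffinity rule pinning pods to SSD-backed nodes might look like:

```yaml
# Sketch only: disktype=ssd is a hypothetical node label,
# not one present on the cluster shown above.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
```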
In the example above I am using the podAntiAffinity rule, which can be used to avoid placing two similar pods together. The labelSelector map contains an expression that matches the pods the affinity rule will be applied to. And lastly, the topologyKey specifies the topology domain the affinity rule applies to. In this example I specified the hostname topology key, which prevents two pods matching the labelSelector from being placed on the same node.
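The topologyKey can reference any node label that denotes a topology domain. As a sketch, the same anti-affinity rule could spread pods across availability zones instead of individual hosts, assuming your nodes carry the standard topology.kubernetes.io/zone label (cloud providers typically set this automatically):

```yaml
# Same rule, but spreading across zones instead of nodes.
# Assumes nodes are labeled with topology.kubernetes.io/zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - nginx
      topologyKey: "topology.kubernetes.io/zone"
```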
Once this deployment is created, we can verify that each pod was scheduled to a unique worker node:
```
kubectl get po -o wide

NAME                     READY   STATUS    RESTARTS   AGE   IP          NODE           NOMINATED NODE   READINESS GATES
nginx-75db5d94dc-4w8q9   1/1     Running   0          72s   10.11.3.2   test-worker3   <none>           <none>
nginx-75db5d94dc-5wwm2   1/1     Running   0          72s   10.11.1.5   test-worker    <none>           <none>
nginx-75db5d94dc-cbxs5   1/1     Running   0          72s   10.11.2.2   test-worker2   <none>           <none>
```
But as with any affinity implementation, there are always subtleties you need to be aware of. In the example above, what happens if you need to scale the deployment to handle additional load? We can see what happens first hand:
```
kubectl scale deploy nginx --replicas 6
```
If we review the pod list:
```
kubectl get po

NAME                     READY   STATUS    RESTARTS   AGE
nginx-75db5d94dc-2sltl   0/1     Pending   0          21s
nginx-75db5d94dc-4w8q9   1/1     Running   0          14m
nginx-75db5d94dc-5wwm2   1/1     Running   0          14m
nginx-75db5d94dc-cbxs5   1/1     Running   0          14m
nginx-75db5d94dc-jxkqs   0/1     Pending   0          21s
nginx-75db5d94dc-qzxmb   0/1     Pending   0          21s
```
We see that the new pods are stuck in the Pending state. That’s because we only have three nodes, and the affinity rule will prevent two pods that are similar from being scheduled to the same node. The Kubernetes scheduler does a solid job out of the box, but sometimes you need a bit more control over where your pods wind up. This is especially the case when you are using multiple availability zones in the “cloud”, and need to ensure that pods get distributed between them. I will loop back around to this topic in a future post where I’ll discuss zone topology keys and spread priorities.
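In the meantime, if strict separation is not a hard requirement, one way to avoid pods getting stuck in Pending is the soft preferredDuringSchedulingIgnoredDuringExecution variant. This is a sketch of that approach: the scheduler will try to spread matching pods across nodes, but will still co-locate them when no other node is available:

```yaml
# Soft anti-affinity: prefer spreading across hosts, but allow
# co-location when the rule cannot be satisfied (e.g. more
# replicas than nodes).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
```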