Kubernetes has truly revolutionized distributed computing. While it solves a number of genuinely hard problems, it also introduces new challenges. One of these challenges is ensuring your Kubernetes clusters are designed with failure domains in mind. Designing around failure domains includes things like provisioning infrastructure across availability zones, ensuring your physical servers are in different racks, and making sure the pods that support your application don’t wind up on the same physical Kubernetes worker.
Inter-pod affinity and anti-affinity rules can be used to address the last point, and the official Kubernetes documentation does a really good job of describing them:
“Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces; unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain; for example, see the label keys listed above in the section Interlude: built-in node labels.”
Affinities can be defined with an affinity statement in a deployment manifest. So given a cluster with three worker nodes:
```
kubectl get nodes

NAME                 STATUS   ROLES    AGE   VERSION
test-control-plane   Ready    master   22d   v1.18.2
test-worker          Ready    <none>   22d   v1.18.2
test-worker2         Ready    <none>   22d   v1.18.2
test-worker3         Ready    <none>   22d   v1.18.2
```
You can create an affinity rule by adding an affinity stanza to the pod template’s spec:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
```
There is a lot going on in the affinity section, so I will break it down piece by piece. The affinity field provides three types of scheduling constraints:
```
kubectl explain pod.spec.affinity

KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity  <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).
```
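For completeness, the first of the three constraints matches labels on nodes rather than on pods. As a sketch (not part of the example in this post, and the disktype=ssd label is purely hypothetical), a nodeAffinity rule pinning pods to SSD-backed nodes might look like:

```yaml
# Sketch only: disktype=ssd is a hypothetical node label,
# not one present on the cluster shown above.
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: disktype
          operator: In
          values:
          - ssd
```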
In the example above I am using the podAntiAffinity rule, which can be used to avoid placing two similar pods together. The labelSelector map contains an expression that matches the pods the affinity rule will be applied to. And lastly, the topologyKey specifies the topology domain the affinity rule applies to. In this example I specified the hostname topology key, which prevents two pods matching the labelSelector from being placed on the same node.
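The topologyKey can reference any node label that denotes a topology domain. As a sketch, the same anti-affinity rule could spread pods across availability zones instead of individual hosts, assuming your nodes carry the standard topology.kubernetes.io/zone label (cloud providers typically set this automatically):

```yaml
# Same rule, but spreading across zones instead of nodes.
# Assumes nodes are labeled with topology.kubernetes.io/zone.
affinity:
  podAntiAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
    - labelSelector:
        matchExpressions:
        - key: "app"
          operator: In
          values:
          - nginx
      topologyKey: "topology.kubernetes.io/zone"
```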
Once this deployment is created, we can verify that each pod was scheduled to a unique worker node:
```
kubectl get po -o wide

NAME                     READY   STATUS    RESTARTS   AGE   IP          NODE           NOMINATED NODE   READINESS GATES
nginx-75db5d94dc-4w8q9   1/1     Running   0          72s   10.11.3.2   test-worker3   <none>           <none>
nginx-75db5d94dc-5wwm2   1/1     Running   0          72s   10.11.1.5   test-worker    <none>           <none>
nginx-75db5d94dc-cbxs5   1/1     Running   0          72s   10.11.2.2   test-worker2   <none>           <none>
```
But as with any affinity implementation, there are always subtleties you need to be aware of. In the example above, what happens if you need to scale the deployment to handle additional load? We can see what happens first hand:
```
kubectl scale deploy nginx --replicas 6
```
If we review the pod list:
```
kubectl get po

NAME                     READY   STATUS    RESTARTS   AGE
nginx-75db5d94dc-2sltl   0/1     Pending   0          21s
nginx-75db5d94dc-4w8q9   1/1     Running   0          14m
nginx-75db5d94dc-5wwm2   1/1     Running   0          14m
nginx-75db5d94dc-cbxs5   1/1     Running   0          14m
nginx-75db5d94dc-jxkqs   0/1     Pending   0          21s
nginx-75db5d94dc-qzxmb   0/1     Pending   0          21s
```
We see that the new pods are stuck in the Pending state. That’s because we only have three nodes, and the affinity rule will prevent two pods that are similar from being scheduled to the same node. The Kubernetes scheduler does a solid job out of the box, but sometimes you need a bit more control over where your pods wind up. This is especially the case when you are using multiple availability zones in the “cloud”, and need to ensure that pods get distributed between them. I will loop back around to this topic in a future post where I’ll discuss zone topology keys and spread priorities.
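In the meantime, if strict separation is not a hard requirement, one way to avoid pods getting stuck in Pending is the soft preferredDuringSchedulingIgnoredDuringExecution variant. This is a sketch of that approach: the scheduler will try to spread matching pods across nodes, but will still co-locate them when no other node is available:

```yaml
# Soft anti-affinity: prefer spreading across hosts, but allow
# co-location when the rule cannot be satisfied (e.g. more
# replicas than nodes).
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
```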