Kubernetes has truly revolutionized distributed computing. While it solves a number of super hard problems, it also introduces a number of new challenges. One of these challenges is ensuring your Kubernetes clusters are designed with failure domains in mind. Designing around failure domains includes things like provisioning infrastructure across availability zones, ensuring your physical servers are in different racks, or making sure the pods that support your application don’t wind up on the same physical Kubernetes worker.
Inter-pod affinity and anti-affinity rules can be used to address the last point, and the official Kubernetes documentation does a really good job of describing them:
“Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces; unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain; for example, see the label keys listed above in the section Interlude: built-in node labels."
Affinities can be defined with an affinity statement in a deployment manifest. So given a cluster with three worker nodes:
$ kubectl get nodes
NAME                 STATUS   ROLES    AGE   VERSION
test-control-plane   Ready    master   22d   v1.18.2
test-worker          Ready    <none>   22d   v1.18.2
test-worker2         Ready    <none>   22d   v1.18.2
test-worker3         Ready    <none>   22d   v1.18.2
You can create an affinity rule by adding an affinity stanza to the pod's spec:
$ cat nginx.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"
There is a lot going on in the affinity section, so I will break down each piece. The affinity field supports the following three types of scheduling constraints:
$ kubectl explain pod.spec.affinity
KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).
In the example above I am using a podAntiAffinity rule, which can be used to keep similar pods apart. The labelSelector map contains an expression that matches the pods the affinity rule will be applied to. And lastly, the topologyKey specifies the topology domain the rule operates on. In this example I specified the hostname topology key, which prevents two pods that match the labelSelector from being placed on the same node.
Once this deployment is created, we can verify that each pod was scheduled to a unique worker node:
$ kubectl get po -o wide
NAME                     READY   STATUS    RESTARTS   AGE   IP          NODE           NOMINATED NODE   READINESS GATES
nginx-75db5d94dc-4w8q9   1/1     Running   0          72s   10.11.3.2   test-worker3   <none>           <none>
nginx-75db5d94dc-5wwm2   1/1     Running   0          72s   10.11.1.5   test-worker    <none>           <none>
nginx-75db5d94dc-cbxs5   1/1     Running   0          72s   10.11.2.2   test-worker2   <none>           <none>
But as with any affinity implementation, there are always subtleties you need to be aware of. In the example above, what happens if you need to scale the deployment to handle additional load? We can see what happens first hand:
$ kubectl scale deploy nginx --replicas 6
If we review the pod list:
$ kubectl get po
NAME                     READY   STATUS    RESTARTS   AGE
nginx-75db5d94dc-2sltl   0/1     Pending   0          21s
nginx-75db5d94dc-4w8q9   1/1     Running   0          14m
nginx-75db5d94dc-5wwm2   1/1     Running   0          14m
nginx-75db5d94dc-cbxs5   1/1     Running   0          14m
nginx-75db5d94dc-jxkqs   0/1     Pending   0          21s
nginx-75db5d94dc-qzxmb   0/1     Pending   0          21s
We see that the new pods are stuck in the Pending state. That’s because we only have three worker nodes, and the anti-affinity rule prevents two pods that match the labelSelector from being scheduled to the same node. The Kubernetes scheduler does a solid job out of the box, but sometimes you need a bit more control over where your pods wind up. This is especially the case when you are using multiple availability zones in the “cloud”, and need to ensure that pods get distributed between them. I will loop back around to this topic in a future post where I’ll discuss zone topology keys and spread priorities.
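In the meantime, if strict separation is a nice-to-have rather than a hard requirement, one option is to soften the rule with preferredDuringSchedulingIgnoredDuringExecution. The following is a minimal sketch of the affinity stanza from the same pod template (not something I would blindly drop into production); it tells the scheduler to spread the pods when it can, but to still schedule them when it can't:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
    - weight: 100
      podAffinityTerm:
        labelSelector:
          matchExpressions:
          - key: "app"
            operator: In
            values:
            - nginx
        topologyKey: "kubernetes.io/hostname"
With the preferred form, the three extra replicas above would land on nodes that already run nginx instead of sitting in Pending, at the cost of the strict one-pod-per-node guarantee.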
Ansible has amazing support for testing services during playbook execution. This is super useful for validating that your services are working after a set of changes takes place, and when combined with serial you can stop execution if a change negatively impacts one or more servers in your fleet. Ansible has a number of modules that can be used to test services, including the uri module.
The uri module allows Ansible to interact with a web endpoint, and provides numerous options to control its behavior. When I apply OS updates to my kubelets, I typically use the reboot module along with uri to verify that the kubelet healthz endpoint is returning a 200 status code after updates are applied:
- name: Reboot the server to pick up a kernel update
  reboot:
    reboot_timeout: 600

- name: Wait for the kubelet healthz endpoint to return a 200
  uri:
    url: "http://{{ inventory_hostname }}:10256/healthz"
    method: GET
    status_code: 200
    return_content: no
  register: result
  until: result.status == 200
  retries: 60
  delay: 1
In the example above, the uri module issues an HTTP GET to the kubelet’s healthz endpoint, and checks the response for a 200 status code. It will also retry the GET operation 60 times, waiting one second between each request. This allows you to update one or more hosts, test them after the change is made, then move on to the next system if the service running on that system is healthy. If the update breaks a service, you can fail the playbook run immediately. Good stuff!
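To tie this back to the serial behavior mentioned at the top of this section, here is a minimal play-level sketch wrapping the tasks above; the kubelets group name is a placeholder for whatever your inventory uses. With serial set to 1 and any_errors_fatal enabled, Ansible walks the fleet one host at a time and aborts the entire run as soon as a host fails its health check:
- hosts: kubelets           # placeholder inventory group
  become: true
  serial: 1                 # update one host at a time
  any_errors_fatal: true    # abort the whole run on the first failure
  tasks:
    - name: Reboot the server to pick up a kernel update
      reboot:
        reboot_timeout: 600

    - name: Wait for the kubelet healthz endpoint to return a 200
      uri:
        url: "http://{{ inventory_hostname }}:10256/healthz"
        status_code: 200
      register: result
      until: result.status == 200
      retries: 60
      delay: 1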
As a Kubernetes administrator I frequently find myself needing to debug application and system issues. Most of the issues I encounter can be solved with Grafana dashboards and Prometheus metrics, or by running one or more Elasticsearch queries to examine logs. But there are times when I need to go deeper and actually inspect activity inside a running pod. A lot of debugging guides use the kubectl exec command to run one or more commands inside a container:
$ kubectl exec -it container-XXXX -- dig @10.10.0.1 google.com
But what happens if you don’t have a shell installed in the container? Or what if your container runs as an unprivileged user (which it should), and the tools you need to debug the issue aren’t installed? Kinda hard to install utilities if you don’t have root, and it defeats the whole point of ephemeral infrastructure. In these situations the Linux nsenter command will become your best friend!
If you aren’t familiar with nsenter, it allows you to run a program in a given namespace. So let’s say you have a microservice running in your Kubernetes cluster, and your developers tell you that DNS resolution isn’t working correctly. To debug this issue with nsenter, you can access the host the service is running on, and execute nsenter with the “-t” (process to target) and “-n” (enter the network namespace) options. The final argument is the command to run in the process’s network namespace:
$ nsenter -t 1294 -n dig +short @10.11.2.2 *.*.svc.cluster.local
10.10.0.10
10.10.0.1
In the example above, nsenter ran the dig command against the cluster DNS service IP. It also used the dig binary that resides on the host’s file system, not the container’s. Nsenter is also super helpful when you need to capture traffic going in and out of a container:
$ nsenter -t 1294 -n tcpdump -i eth0 port 80 and "tcp[tcpflags] & tcp-syn != 0"
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:04:20.424938 IP 10.11.2.1.39168 > 10.11.2.3.80: Flags [S], seq 1491800784, win 29200, options [mss 1460,sackOK,TS val 59669904 ecr 0,nop,wscale 7], length 0
06:04:20.425000 IP 10.11.2.3.80 > 10.11.2.1.39168: Flags [S.], seq 3823341284, ack 1491800785, win 28960, options [mss 1460,sackOK,TS val 59669904 ecr 59669904,nop,wscale 7], length 0
In the example above, nsenter executed the tcpdump utility inside process ID 1294’s namespace. What makes this super powerful is the fact that you can run your containers with the minimum number of bits needed to run your application, and your application can also run as an unprivileged user. When you need to debug issues you don’t need to touch the container. You just fire up the binary on your Kubernetes worker and debug away.
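One thing the examples above assume is that you already know the PID of a process inside the target container. If the node is running the Docker engine (adjust accordingly for other runtimes), one way to look it up is to ask the runtime for the container's init process:
$ docker inspect --format '{{.State.Pid}}' <container-id>
The PID it prints is the value you pass to nsenter with the “-t” option.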
This week I was updating some Ansible application and OS update playbooks. By default, when you run ansible-playbook it will apply your desired configuration to hosts in the order they are listed in the inventory file (or in the order they are returned by a dynamic inventory script). But what if you want to process hosts in a random order? Or by their sorted or reverse sorted names? I recently came across the order option, and was surprised that I didn’t notice this before.
The order option allows you to control the order Ansible operates on the hosts in your inventory. It currently has five options:
inventory: The default. The order is ‘as provided’ by the inventory.
reverse_inventory: As the name implies, this reverses the order ‘as provided’ by the inventory.
sorted: Hosts are alphabetically sorted by name.
reverse_sorted: Hosts are sorted by name in reverse alphabetical order.
shuffle: Hosts are randomly ordered each run.
So if you want to process your hosts in a random order, you can pass “shuffle” to the order option:
---
- hosts: "{{ server_list }}"
  become: true
  serial: 1
  order: shuffle

  tasks:
    - name: Upgrade Operating System packages
      yum:
        name: '*'
        state: latest
      register: yum_updates_applied
...
This is super useful for large clusters, especially those that have hosts grouped by functional purpose. Nifty option!
I talked previously about needing to decode docker HTTP headers to debug a registry issue. That debugging session was super fun, but I had a few questions about how that interaction actually works. So I started to decode all of the HTTP requests and responses from a $(docker pull), which truly helped me solidify how the docker daemon (dockerd) talks to a container registry. I figured I would share my notes here so I (as well as anyone else on the ‘net) can reference them in the future.
Here are the commands I ran prior to reviewing the client / server interactions:
$ docker login harbor
$ docker pull harbor/nginx/ingress:v1.0.0
There are a couple of interesting bits in these commands. First, the docker CLI utility doesn’t actually retrieve a container image. That job is delegated to the docker server daemon (dockerd). Second, when you type docker login, it will authenticate to the registry and cache your credentials in $HOME/.docker/config.json by default. Those are then used in future requests to the container registry.
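For reference, the cached entry in $HOME/.docker/config.json looks roughly like the following (assuming no credential helper is configured; the auth value is just a base64-encoded username:password pair and is elided here):
{
  "auths": {
    "harbor": {
      "auth": "<base64 username:password>"
    }
  }
}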
Now on to the HTTP requests and responses. The first GET issued by dockerd is to the /v2/ registry API endpoint:
GET /v2/ HTTP/1.1
Host: harbor
User-Agent: docker/19.03.12
Accept-Encoding: gzip
Connection: close
The Harbor registry responds with a 401 unauthorized when we try to retrieve the URI /v2/. It also adds a Www-Authenticate: header with the path to the registry’s token server:
HTTP/1.1 401 Unauthorized
Server: nginx
Date: Thu, 16 Jul 2020 18:59:45 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 87
Connection: close
Docker-Distribution-Api-Version: registry/2.0
Set-Cookie: beegosessionID=XYZ; Path=/; HttpOnly
Www-Authenticate: Bearer realm="https://harbor/service/token",service="harbor-registry"
{"errors":[{"code":"UNAUTHORIZED","message":"authentication required","detail":null}]}
Next, we try to retrieve an access token (a JWT in this case) from the token server:
GET /service/token?scope=repository%3Anginx%2Fingress%3Apull&service=harbor-registry HTTP/1.1
Host: harbor
User-Agent: docker/19.03.12
Accept-Encoding: gzip
Connection: close
The server responds with a 200 and an entity body (not included below) containing the access token (a JWT):
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 16 Jul 2020 18:59:45 GMT
Content-Type: application/json; charset=utf-8
Content-Length: 959
Connection: close
Content-Encoding: gzip
Set-Cookie: beegosessionID=XYZ; Path=/; HttpOnly
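The token body isn’t shown above, but if you base64-decode the JWT’s payload you will find, alongside the standard JWT claims, an access claim that follows the Docker registry token specification. The values below are illustrative, reconstructed from the scope and service parameters in the request above rather than copied from a real token:
{
  "iss": "<token issuer>",
  "sub": "<registry username>",
  "aud": "harbor-registry",
  "access": [
    {
      "type": "repository",
      "name": "nginx/ingress",
      "actions": ["pull"]
    }
  ]
}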
The access scope embedded in the JWT controls which operations you can perform, and against which repositories. So now that we have an access token, we can retrieve the manifest for the container image:
GET /v2/nginx/ingress/manifests/v1.0.0 HTTP/1.1
Host: harbor
User-Agent: docker/19.03.12
Accept: application/vnd.docker.distribution.manifest.v1+prettyjws
Accept: application/json
Accept: application/vnd.docker.distribution.manifest.v2+json
Accept: application/vnd.docker.distribution.manifest.list.v2+json
Accept: application/vnd.oci.image.index.v1+json
Accept: application/vnd.oci.image.manifest.v1+json
Authorization: Bearer JWT.JWT.JWT
Accept-Encoding: gzip
Connection: close
If you aren’t familiar with manifests, they are JSON files that describe the container and the layers that make up the image. There is a schema which defines the manifest, and the following response shows an actual manifest sent back from the container registry:
HTTP/1.1 200 OK
Server: nginx
Date: Thu, 16 Jul 2020 18:59:45 GMT
Content-Type: application/vnd.docker.distribution.manifest.v2+json
Content-Length: 1154
Connection: close
Docker-Content-Digest: sha256:a7425073232ed3fb26b45ec6b26482e53984692ce6265b64f85c6c68b72c3cc5
Docker-Distribution-Api-Version: registry/2.0
Etag: "sha256:a7425073232ed3fb26b45ec6b26482e53984692ce6265b64f85c6c68b72c3cc5"
Set-Cookie: beegosessionID=XYZ; Path=/; HttpOnly
{
  "schemaVersion": 2,
  "mediaType": "application/vnd.docker.distribution.manifest.v2+json",
  "config": {
    "mediaType": "application/vnd.docker.container.image.v1+json",
    "size": 2556,
    "digest": "sha256:53a19cd1924db72bd427b3792cf8ee5be6f969caa33c7a32ed104a1561f37bb2"
  },
  "layers": [
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 701123,
      "digest": "sha256:2ea20e1f93179438e0f481d2f291580b0fd6808ce2c716e5f9fc961b2b038e4e"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 32,
      "digest": "sha256:4f4fb700ef54461cfa02571ae0db9a0dc1e0cdb5577484a6d75e68dc38e8acc1"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 525239,
      "digest": "sha256:5c59e002a478e367ed6aa3c1d0b22b98abbb8091378ef4c273dbadb368b735b1"
    },
    {
      "mediaType": "application/vnd.docker.image.rootfs.diff.tar.gzip",
      "size": 2087201,
      "digest": "sha256:d1d2b2330ff21c60d278fb02908491f59868b942c6433e8699e405764bca5645"
    }
  ]
}
In the manifest above, you can see the manifest version, the media type, and the image layers that make up the container image. This is all described in the official documentation. Now that dockerd knows the container image layout, it will retrieve one or more image layers in parallel (how many layers to download in parallel is controlled with the "--max-concurrent-downloads" option):
GET /v2/nginx/ingress/blobs/sha256:2ea20e1f93179438e0f481d2f291580b0fd6808ce2c716e5f9fc961b2b038e4e HTTP/1.1
Host: harbor
User-Agent: docker/19.03.12
Accept-Encoding: identity
Authorization: Bearer JWT.JWT.JWT
Connection: close
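As an aside, the download parallelism mentioned above doesn’t have to be passed on the dockerd command line; it can also live in the daemon configuration. A minimal /etc/docker/daemon.json sketch (3 is the documented default):
{
  "max-concurrent-downloads": 3
}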
And that’s it. After all of the image layers are pulled, the next step would typically be to start a docker container from the image. Had a blast looking into this!