Blog O' Matty


Using the Kubernetes K14S kapp utility to view deployment manifest changes prior to applying them

This article was posted by on 2020-08-14 01:00:00 -0500 -0500

If you’ve worked with Kubernetes for any length of time, you are probably intimately familiar with deployment manifests. If this concept is new to you, deployment manifests are used to add resources to a cluster in a declarative manner. Some of the larger projects in the Kubernetes ecosystem (cert-manager, Istio, CNI plug-ins, etc.) provide manifests to deploy the resources that make their applications work. These manifests often run to thousands of lines, and if you are security conscious you don’t want to deploy anything to a cluster without validating what it is.

The K14S project took this issue to heart when they released the kapp utility. Before it changes anything, kapp deploy shows you which resources it would create, update, or delete, and asks for confirmation; answering N leaves the cluster untouched. To show how useful this is, let’s say you wanted to see which resources Istio would deploy. You can see this with kapp deploy:

$ kapp deploy -a istio -f <(kustomize build)

Target cluster 'https://127.0.0.1:33783' (nodes: test-control-plane, 3+)

Changes

Namespace       Name                             Kind                      Conds.  Age  Op      Op st.  Wait to    Rs  Ri  
(cluster)       istio-operator                   ClusterRole               -       -    create  -       reconcile  -   -  
^               istio-operator                   ClusterRoleBinding        -       -    create  -       reconcile  -   -  
^               istio-operator                   Namespace                 -       -    create  -       reconcile  -   -  
^               istio-system                     Namespace                 -       -    create  -       reconcile  -   -  
^               istiooperators.install.istio.io  CustomResourceDefinition  -       -    create  -       reconcile  -   -  
istio-operator  istio-operator                   Deployment                -       -    create  -       reconcile  -   -  
^               istio-operator                   ServiceAccount            -       -    create  -       reconcile  -   -  
^               istio-operator-metrics           Service                   -       -    create  -       reconcile  -   -  

Op:      8 create, 0 delete, 0 update, 0 noop
Wait to: 8 reconcile, 0 delete, 0 noop

Continue? [yN]: N

The output lists each resource, its type, and the operation that will be performed. In the example above kapp is going to create 8 resources and associate each of them with the application name “istio”. kapp deploy also accepts the “--diff-changes” option to display a diff between the manifests and the current cluster state, “--allow-ns” to restrict the namespaces the app is allowed to deploy resources into, and “--into-ns” to map the namespaces in the manifests to one of your choosing. kapp labels the resources it deploys so it can track them as part of an app, and the “list” subcommand shows the apps kapp manages:
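The change summary kapp prints boils down to comparing what the manifests describe with what the app already owns in the cluster. The sketch below is a deliberately simplified model of that comparison, not kapp's implementation: resources are keyed by (namespace, kind, name), and the function names `plan_changes` and `summarize` are hypothetical. Real kapp diffing also ignores server-populated fields, tracks ownership through its labels, and more.

```python
from collections import Counter

# Keys are (namespace, kind, name); "(cluster)" marks cluster-scoped objects,
# mirroring how kapp displays them. Taken from the Istio example above.
ISTIO_RESOURCES = [
    ("(cluster)", "ClusterRole", "istio-operator"),
    ("(cluster)", "ClusterRoleBinding", "istio-operator"),
    ("(cluster)", "Namespace", "istio-operator"),
    ("(cluster)", "Namespace", "istio-system"),
    ("(cluster)", "CustomResourceDefinition", "istiooperators.install.istio.io"),
    ("istio-operator", "Deployment", "istio-operator"),
    ("istio-operator", "ServiceAccount", "istio-operator"),
    ("istio-operator", "Service", "istio-operator-metrics"),
]


def plan_changes(desired, live):
    """Classify every resource as create, update, delete or noop.

    desired: resources rendered from the manifests.
    live: resources the app already owns in the cluster.
    Both map a (namespace, kind, name) key to a comparable resource body.
    """
    ops = {}
    for key, body in desired.items():
        if key not in live:
            ops[key] = "create"
        elif live[key] != body:
            ops[key] = "update"
        else:
            ops[key] = "noop"
    for key in live:
        if key not in desired:
            ops[key] = "delete"  # owned by the app but gone from the manifests
    return ops


def summarize(ops):
    """Format the counts in the same order kapp's "Op:" line uses."""
    counts = Counter(ops.values())
    return ", ".join(f"{counts[op]} {op}" for op in ("create", "delete", "update", "noop"))


desired = {key: {"spec": {}} for key in ISTIO_RESOURCES}
print(summarize(plan_changes(desired, live={})))  # 8 create, 0 delete, 0 update, 0 noop
```

Running the same comparison against an empty cluster reproduces the "8 create" line from the kapp output above; feeding it the app's live state instead is what turns creates into updates or noops.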

$ kapp list

Target cluster 'https://127.0.0.1:33783' (nodes: test-control-plane, 3+)

Apps in namespace 'default'

Name   Namespaces                Lcs   Lca  
istio  (cluster),istio-operator  true  4d  
nginx  -                         -     -  

Lcs: Last Change Successful
Lca: Last Change Age

2 apps

Succeeded

Another super useful feature of kapp is its ability to inspect an application that was previously deployed:

$ kapp inspect -a istio --tree

Target cluster 'https://127.0.0.1:33783' (nodes: test-control-plane, 3+)

Resources in app 'istio'

Namespace       Name                                  Kind                      Owner    Conds.  Rs  Ri  Age  
(cluster)       istio-operator                        ClusterRole               kapp     -       ok  -   4d  
istio-operator  istio-operator                        ServiceAccount            kapp     -       ok  -   4d  
(cluster)       istiooperators.install.istio.io       CustomResourceDefinition  kapp     2/2 t   ok  -   4d  
istio-operator  istio-operator-metrics                Service                   kapp     -       ok  -   4d  
istio-operator   L istio-operator-metrics             Endpoints                 cluster  -       ok  -   4d  
(cluster)       istio-operator                        ClusterRoleBinding        kapp     -       ok  -   4d  
(cluster)       istio-system                          Namespace                 kapp     -       ok  -   4d  
(cluster)       istio-operator                        Namespace                 kapp     -       ok  -   4d  
istio-operator  istio-operator                        Deployment                kapp     2/2 t   ok  -   4d  
istio-operator   L istio-operator-77d57c5c57          ReplicaSet                cluster  -       ok  -   4d  
istio-operator   L.. istio-operator-77d57c5c57-dkl8b  Pod                       cluster  4/4 t   ok  -   4d  

Rs: Reconcile state
Ri: Reconcile information

11 resources

Succeeded

In the output above you can see the resource relationships in tree form, along with each object’s type, owner, and reconcile state. This is a crazy useful utility, and one I’ve started to use almost daily, both for observing the state of a cluster and for debugging problems. Thanks K14S for this amazing piece of software!
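The Deployment → ReplicaSet → Pod chain in the tree above is recorded in each object's `metadata.ownerReferences`. As a rough illustration of how such a tree can be rendered, here is a sketch that groups trimmed-down resources by a hypothetical `owner` field (standing in for the owner's UID) and prints them with prefixes loosely mimicking kapp's `L` / `L..` markers. The UIDs and the `render_tree` helper are made up for the example; this is not how kapp itself is written.

```python
from collections import defaultdict

# Hypothetical, trimmed-down objects: "owner" holds the UID of the controlling
# owner, or None for objects that kapp created directly.
RESOURCES = [
    {"uid": "d1", "label": "Deployment/istio-operator", "owner": None},
    {"uid": "rs1", "label": "ReplicaSet/istio-operator-77d57c5c57", "owner": "d1"},
    {"uid": "p1", "label": "Pod/istio-operator-77d57c5c57-dkl8b", "owner": "rs1"},
    {"uid": "s1", "label": "Service/istio-operator-metrics", "owner": None},
]


def render_tree(resources):
    """Return one line per resource, children indented under their owner."""
    children = defaultdict(list)
    roots = []
    for res in resources:
        if res["owner"] is None:
            roots.append(res)
        else:
            children[res["owner"]].append(res)

    lines = []

    def walk(res, depth):
        marker = "L" + ".." * (depth - 1)  # depth 1 -> "L", depth 2 -> "L.."
        prefix = "" if depth == 0 else f" {marker} "
        lines.append(prefix + res["label"])
        for child in children[res["uid"]]:
            walk(child, depth + 1)

    for root in roots:
        walk(root, 0)
    return lines


for line in render_tree(RESOURCES):
    print(line)
```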

Upgrading an RPM to a specific version with yum

This article was posted by on 2020-08-14 00:00:00 -0500 -0500

This past week I got to spend some time upgrading my CI/CD systems. The GitLab upgrade process requires stepping through specific versions when you upgrade across major versions, which can be a problem if the upgrade scripts don’t support jumping straight to the latest version. In these types of situations, you can tell yum to upgrade to a specific version. To list the versions of a package that are available, you can use the search command’s “--showduplicates” option:

$ yum search --showduplicates gitlab-ee | grep 13.0

gitlab-ee-13.0.0-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.1-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.3-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.4-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.5-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.6-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.7-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.8-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.9-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.10-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.12-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,

Once you spot the version you want, pass the full package name to yum install:

$ yum install gitlab-ee-13.0.12-ee.0.el7.x86_64

This can also be useful if you want to stick to a minor version vs. upgrading to a new major release.
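One trap when eyeballing that list: package names do not sort correctly as plain strings, because "13.0.9" compares greater than "13.0.12" character by character. The sketch below picks the newest package in a given release series from `yum search --showduplicates` style output by comparing versions numerically. It assumes purely numeric, dot-separated versions (rpm's own comparison handles many more cases), and the helper names are hypothetical.

```python
# A subset of the yum output above (yum truncates the summary column).
SAMPLE = """\
gitlab-ee-13.0.0-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.1-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.9-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.10-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
gitlab-ee-13.0.12-ee.0.el7.x86_64 : GitLab Enterprise Edition (including NGINX,
"""


def parse_package(line):
    """Split 'name-version-release.arch : summary' into (name, version, package)."""
    package = line.split(" : ", 1)[0].strip()
    nvr, _arch = package.rsplit(".", 1)           # drop the architecture
    name, version, _release = nvr.rsplit("-", 2)  # the name itself may contain '-'
    return name, tuple(int(part) for part in version.split(".")), package


def latest_in_series(lines, name, series):
    """Return the newest package of `name` whose version starts with `series`."""
    candidates = []
    for line in lines:
        if " : " not in line:
            continue
        pkg_name, version, package = parse_package(line)
        if pkg_name == name and version[: len(series)] == series:
            candidates.append((version, package))
    return max(candidates)[1] if candidates else None


print(latest_in_series(SAMPLE.splitlines(), "gitlab-ee", (13, 0)))
# gitlab-ee-13.0.12-ee.0.el7.x86_64
```

Comparing tuples of integers is what makes 13.0.12 win over 13.0.9 here; a plain `max()` over the package strings would have returned 13.0.9.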

Using Kubernetes affinity rules to control where your pods are scheduled

This article was posted by on 2020-08-03 02:00:00 -0500 -0500

Kubernetes has truly revolutionized distributed computing. While it solves a number of super hard problems, it also adds a number of new challenges. One of these challenges is ensuring your Kubernetes clusters are designed with failure domains in mind. Designing around failure domains includes things like provisioning infrastructure across availability zones, ensuring your physical servers are in different racks, or making sure the pods that support your application don’t wind up on the same physical Kubernetes worker.

Inter-pod affinity and anti-affinity rules can be used to address the last point, and the official Kubernetes documentation does a really good job of describing them:

“Inter-pod affinity and anti-affinity allow you to constrain which nodes your pod is eligible to be scheduled based on labels on pods that are already running on the node rather than based on labels on nodes. The rules are of the form “this pod should (or, in the case of anti-affinity, should not) run in an X if that X is already running one or more pods that meet rule Y”. Y is expressed as a LabelSelector with an optional associated list of namespaces; unlike nodes, because pods are namespaced (and therefore the labels on pods are implicitly namespaced), a label selector over pod labels must specify which namespaces the selector should apply to. Conceptually X is a topology domain like node, rack, cloud provider zone, cloud provider region, etc. You express it using a topologyKey which is the key for the node label that the system uses to denote such a topology domain; for example, see the label keys listed above in the section Interlude: built-in node labels."

Affinity rules are defined in the affinity section of a pod spec, which for a Deployment lives in the pod template. So given a cluster with three worker nodes:

$ kubectl get nodes

NAME                 STATUS   ROLES    AGE   VERSION
test-control-plane   Ready    master   22d   v1.18.2
test-worker          Ready    <none>   22d   v1.18.2
test-worker2         Ready    <none>   22d   v1.18.2
test-worker3         Ready    <none>   22d   v1.18.2

You can create an affinity rule by adding an affinity stanza to the pod spec:

$ cat nginx.yaml

apiVersion: apps/v1
kind: Deployment
metadata:
  labels:
    app: nginx
  name: nginx
spec:
  replicas: 3
  selector:
    matchLabels:
      app: nginx
  template:
    metadata:
      labels:
        app: nginx
    spec:
      containers:
      - image: nginx
        name: nginx
      affinity:
        podAntiAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
          - labelSelector:
              matchExpressions:
              - key: "app"
                operator: In
                values:
                - nginx
            topologyKey: "kubernetes.io/hostname"

There is a lot going on in the affinity section, so I will break down each piece. The affinity field supports the following three types of scheduling constraints:

$ kubectl explain pod.spec.affinity

KIND:     Pod
VERSION:  v1

RESOURCE: affinity <Object>

DESCRIPTION:
     If specified, the pod's scheduling constraints

     Affinity is a group of affinity scheduling rules.

FIELDS:
   nodeAffinity <Object>
     Describes node affinity scheduling rules for the pod.

   podAffinity  <Object>
     Describes pod affinity scheduling rules (e.g. co-locate this pod in the
     same node, zone, etc. as some other pod(s)).

   podAntiAffinity  <Object>
     Describes pod anti-affinity scheduling rules (e.g. avoid putting this pod
     in the same node, zone, etc. as some other pod(s)).

In the example above I am using podAntiAffinity, which keeps pods that match a selector apart from each other. The labelSelector identifies the pods the rule is evaluated against: here, any pod with the label app=nginx. And lastly, the topologyKey names the node label that defines a topology domain. In this example I specified the kubernetes.io/hostname topology key, which makes every node its own domain and therefore prevents two pods that match the labelSelector from being placed on the same node.
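To make the matchExpressions semantics concrete, here is a small Python model of how a pod's labels are tested against a list of requirements. Kubernetes label selectors support the In, NotIn, Exists and DoesNotExist operators, and all expressions in the list must match. The function names are hypothetical and this is an illustration of the rules, not the scheduler's code.

```python
def requirement_matches(labels, req):
    """Evaluate one matchExpressions entry against a pod's labels."""
    key, op = req["key"], req["operator"]
    values = set(req.get("values", []))
    if op == "In":
        return key in labels and labels[key] in values
    if op == "NotIn":
        # A pod without the key also satisfies NotIn.
        return key not in labels or labels[key] not in values
    if op == "Exists":
        return key in labels
    if op == "DoesNotExist":
        return key not in labels
    raise ValueError(f"unsupported operator: {op}")


def selector_matches(labels, match_expressions):
    """All expressions are ANDed together."""
    return all(requirement_matches(labels, req) for req in match_expressions)


# The selector from nginx.yaml.
NGINX_SELECTOR = [{"key": "app", "operator": "In", "values": ["nginx"]}]

print(selector_matches({"app": "nginx", "pod-template-hash": "75db5d94dc"}, NGINX_SELECTOR))  # True
print(selector_matches({"app": "redis"}, NGINX_SELECTOR))  # False
```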

Once this deployment is created, we can verify that each pod was scheduled to a unique worker node:

$ kubectl get po -o wide

NAME                     READY   STATUS    RESTARTS   AGE   IP          NODE           NOMINATED NODE   READINESS GATES
nginx-75db5d94dc-4w8q9   1/1     Running   0          72s   10.11.3.2   test-worker3   <none>           <none>
nginx-75db5d94dc-5wwm2   1/1     Running   0          72s   10.11.1.5   test-worker    <none>           <none>
nginx-75db5d94dc-cbxs5   1/1     Running   0          72s   10.11.2.2   test-worker2   <none>           <none>

But as with any affinity implementation, there are always subtleties you need to be aware of. In the example above, what happens if you need to scale the deployment to handle additional load? We can see what happens first hand:

$ kubectl scale deploy nginx --replicas 6

If we review the pod list:

$ kubectl get po

NAME                     READY   STATUS    RESTARTS   AGE
nginx-75db5d94dc-2sltl   0/1     Pending   0          21s
nginx-75db5d94dc-4w8q9   1/1     Running   0          14m
nginx-75db5d94dc-5wwm2   1/1     Running   0          14m
nginx-75db5d94dc-cbxs5   1/1     Running   0          14m
nginx-75db5d94dc-jxkqs   0/1     Pending   0          21s
nginx-75db5d94dc-qzxmb   0/1     Pending   0          21s

We see that the new pods are stuck in the Pending state. That’s because only the three worker nodes are running these pods (none landed on the control plane node), and the anti-affinity rule prevents two matching pods from being scheduled to the same node. The Kubernetes scheduler does a solid job out of the box, but sometimes you need a bit more control over where your pods wind up. This is especially the case when you are using multiple availability zones in the “cloud” and need to ensure that pods get distributed between them. I will loop back around to this topic in a future post where I’ll discuss zone topology keys and spread priorities.
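The arithmetic behind those Pending pods can be modelled in a few lines. This sketch treats each distinct value of the topology key as one domain and lets at most one matching replica into each domain, which is what a required anti-affinity rule against the pod's own label amounts to. Node selection is simply alphabetical here, unlike the real scheduler's scoring, and the node list (control plane excluded, since no nginx pods landed on it) plus the `zone-a` value are assumptions for illustration.

```python
# The three workers from the post, each labelled with its own hostname.
WORKERS = {
    "test-worker": {"kubernetes.io/hostname": "test-worker"},
    "test-worker2": {"kubernetes.io/hostname": "test-worker2"},
    "test-worker3": {"kubernetes.io/hostname": "test-worker3"},
}


def schedule_with_anti_affinity(replicas, nodes, topology_key):
    """Place identical app=nginx replicas under a required anti-affinity rule.

    A replica may only land on a node whose topology domain (the value of
    nodes[node][topology_key]) does not already hold a matching replica.
    Every node is assumed to carry the topology_key label.
    Returns (list of chosen nodes, number of pending replicas).
    """
    used_domains = set()
    placements = []
    pending = 0
    for _ in range(replicas):
        for node in sorted(nodes):
            domain = nodes[node][topology_key]
            if domain not in used_domains:
                used_domains.add(domain)
                placements.append(node)
                break
        else:
            pending += 1  # no eligible node: the pod stays Pending
    return placements, pending


placements, pending = schedule_with_anti_affinity(6, WORKERS, "kubernetes.io/hostname")
print(placements, pending)  # ['test-worker', 'test-worker2', 'test-worker3'] 3
```

Swapping the topology key for a zone label shrinks the number of domains, so if all three workers shared one zone only a single replica could run. If you would rather have extra replicas scheduled anyway, the softer preferredDuringSchedulingIgnoredDuringExecution form treats the rule as a preference instead of a hard constraint.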

Using the Ansible uri module to test web services during playbook execution

This article was posted by on 2020-08-03 01:00:00 -0500 -0500

Ansible has amazing support for testing services during playbook execution. This is super useful for validating that your services are working after a set of changes take place, and when combined with serial you can stop execution if a change negatively impacts one or more servers in your fleet. Ansible has a number of modules that can be used to test services, including the uri module.

The uri module allows Ansible to interact with a web endpoint, and provides numerous options to control its behavior. When I apply OS updates to my Kubernetes workers, I typically use the reboot module along with uri to verify that the kubelet healthz endpoint is returning a 200 status code after the updates are applied:

- name: Reboot the server to pick up a kernel update
  reboot:
    reboot_timeout: 600

- name: Wait for the kubelet healthz endpoint to return a 200
  uri:
    url: "http://{{ inventory_hostname }}:10256/healthz"
    method: GET
    status_code: 200
    return_content: no
  register: result
  until: result.status == 200
  retries: 60
  delay: 1

In the example above, the uri module issues an HTTP GET to the healthz endpoint on port 10256 and checks the response for a 200 status code. (Port 10256 is kube-proxy’s default healthz port; the kubelet’s own healthz endpoint listens on 127.0.0.1:10248 by default, so point the URL at whichever component you want to verify.) It will also retry the GET operation 60 times, waiting one second between each request. This allows you to update one or more hosts, test them after the change is made, then move on to the next system if the service running on that system is healthy. If the update breaks a service, you can fail the playbook run immediately. Good stuff!
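For readers who want to see the polling logic outside of Ansible, here is a rough Python equivalent of the until/retries/delay loop: issue a GET, treat connection errors as a failed attempt, and give up after a fixed number of tries. It is a sketch, not how the uri module is implemented, and Ansible's exact retry counting may differ slightly. The `fetch` parameter is a hypothetical hook added so the loop can be exercised without a real endpoint.

```python
import time
import urllib.error
import urllib.request


def http_status(url, timeout=5.0):
    """Return the HTTP status code for a GET, or -1 if the server is unreachable."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:  # non-2xx responses still carry a code
        return err.code
    except (urllib.error.URLError, OSError):  # connection refused, DNS failure, ...
        return -1


def wait_for_status(url, expected=200, retries=60, delay=1.0, fetch=http_status):
    """Poll url until it returns `expected`; return the attempt that succeeded."""
    for attempt in range(1, retries + 1):
        if fetch(url) == expected:
            return attempt
        if attempt < retries:
            time.sleep(delay)
    raise TimeoutError(f"{url} did not return {expected} after {retries} attempts")


# Example (hypothetical host name):
# wait_for_status("http://worker1.example.com:10256/healthz", retries=60, delay=1)
```

Raising an exception once the attempts run out plays the same role as the failed task in the playbook: the caller can stop rolling the change out to the rest of the fleet.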

Debugging Kubernetes network issues with nsenter, dig and tcpdump

This article was posted by on 2020-08-03 00:00:00 -0500 -0500

As a Kubernetes administrator I frequently find myself needing to debug application and system issues. Most of the issues I encounter can be solved with Grafana dashboards and Prometheus metrics, or by running one or more Elasticsearch queries to examine logs. But there are times when I need to go deeper and actually inspect activity inside a running pod. A lot of debugging guides use the kubectl exec command to run one or more commands inside a container:

$ kubectl exec -it container-XXXX -- dig @10.10.0.1 google.com

But what happens if you don’t have a shell installed in the container? Or what if your container runs as an unprivileged user (which it should), and the tools you need to debug the issue aren’t installed? Kinda hard to install utilities if you don’t have root, and it defeats the whole point of ephemeral infrastructure. In these situations the Linux nsenter command will become your best friend!

If you aren’t familiar with nsenter, it allows you to run a program inside the namespaces of another process. So let’s say you have a microservice running in your Kubernetes cluster, and your developers tell you that DNS resolution isn’t working correctly. To debug this issue with nsenter, you can log in to the host the service is running on, and execute nsenter with the “-t” (the PID of the process whose namespaces you want to enter) and “-n” (enter the network namespace) options. The final argument is the command to run in that process’s network namespace:

$ nsenter -t 1294 -n dig +short @10.11.2.2 '*.*.svc.cluster.local'

10.10.0.10
10.10.0.1

In the example above, nsenter ran the dig command against the DNS server at 10.11.2.2 from inside the target process’s network namespace. It also used the dig binary that resides on the host’s file system, not the container’s. nsenter is also super helpful when you need to capture traffic going in and out of a container:
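Before pointing nsenter at a PID, it can help to confirm which processes actually share that network namespace. On Linux each process exposes its namespace as a symlink at /proc/PID/ns/net (this is also where nsenter finds it), and two processes share a namespace when the links resolve to the same `net:[inode]` value. The sketch below uses that to list every PID in the same network namespace as a target; the helper names are hypothetical, it is Linux-only, and it needs root to inspect other users' processes.

```python
import os


def net_namespace(pid):
    """Return the network namespace link for pid (e.g. 'net:[4026531992]'), or None."""
    try:
        return os.readlink(f"/proc/{pid}/ns/net")
    except OSError:  # the process exited, or we may not inspect it
        return None


def pids_sharing_netns(target_pid):
    """List the PIDs whose network namespace matches target_pid's."""
    target = net_namespace(target_pid)
    if target is None:
        raise ValueError(f"cannot read the network namespace of PID {target_pid}")
    matches = []
    for entry in os.listdir("/proc"):
        if entry.isdigit() and net_namespace(int(entry)) == target:
            matches.append(int(entry))
    return sorted(matches)


# Example, using the PID from the post:
# print(pids_sharing_netns(1294))
```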

$ nsenter -t 1294 -n tcpdump -i eth0 port 80 and "tcp[tcpflags] & tcp-syn != 0"

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on eth0, link-type EN10MB (Ethernet), capture size 262144 bytes
06:04:20.424938 IP 10.11.2.1.39168 > 10.11.2.3.80: Flags [S], seq 1491800784, win 29200, options [mss 1460,sackOK,TS val 59669904 ecr 0,nop,wscale 7], length 0
06:04:20.425000 IP 10.11.2.3.80 > 10.11.2.1.39168: Flags [S.], seq 3823341284, ack 1491800785, win 28960, options [mss 1460,sackOK,TS val 59669904 ecr 59669904,nop,wscale 7], length 0

In the example above, nsenter executed the tcpdump utility inside the network namespace of process ID 1294. What makes this super powerful is the fact that you can run your containers with the minimum number of bits needed to run your application, and your application can also run as an unprivileged user. When you need to debug issues you don’t need to touch the container. You just fire up the binary on your Kubernetes worker and debug away.
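The capture filter `tcp[tcpflags] & tcp-syn != 0` reads byte 13 of the TCP header, where the flag bits live, and keeps packets whose SYN bit (0x02) is set. That is why both the initial SYN ([S]) and the SYN-ACK ([S.]) show up in the output. The sketch below applies the same test to a raw 20-byte TCP header; `build_header` is a hypothetical helper for constructing test headers, not part of any library, and the port, sequence and window values are taken from the capture above.

```python
import struct

TCP_SYN = 0x02  # value of tcpdump's tcp-syn constant
TCP_ACK = 0x10  # value of tcpdump's tcp-ack constant


def build_header(src_port, dst_port, seq, ack, flags, window):
    """Pack a minimal 20-byte TCP header (data offset 5, no options)."""
    data_offset = 5 << 4  # header length in 32-bit words, in the high nibble
    return struct.pack("!HHIIBBHHH", src_port, dst_port, seq, ack,
                       data_offset, flags, window, 0, 0)  # checksum, urgent = 0


def is_syn(header):
    """Equivalent of the BPF test tcp[tcpflags] & tcp-syn != 0."""
    if len(header) < 20:
        raise ValueError("a TCP header is at least 20 bytes long")
    return (header[13] & TCP_SYN) != 0


syn = build_header(39168, 80, 1491800784, 0, TCP_SYN, 29200)
syn_ack = build_header(80, 39168, 3823341284, 1491800785, TCP_SYN | TCP_ACK, 28960)
print(is_syn(syn), is_syn(syn_ack))  # True True
```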