Using Terraform taint and Kubernetes cordon to rebuild nodes with no service interruption


While digging through the Kubernetes networking stack I needed to install a number of tools to make analyzing the system a bit easier. I run all of my systems with a minimal image and haven’t had time to study debugging sidecars in depth (it’s definitely on the list of things to learn). Once I completed my work I wanted to return my system to the state it was in before the additional packages were installed. While I could have run yum remove to purge the packages, I thought this would be an excellent opportunity to learn more about the Kubernetes node add and removal process.

Terraform has a really useful taint option which marks a resource so it will be destroyed and recreated during the next apply. The Kubernetes kubectl command has the drain, cordon and uncordon commands which allow you to disable resource scheduling on a node and evict the pods running there. When I stitched these two together I was able to rebuild the node with zero impact to my cluster.
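If you only want to stop new pods from being scheduled on a node (without evicting anything), cordon can also be run by itself, and uncordon reverses it. A quick sketch using one of the workers from this cluster:

$ kubectl cordon kubworker1.prefetch.net

$ kubectl uncordon kubworker1.prefetch.net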

To move pods off of the worker I wanted to rebuild, I used the kubectl drain command:

$ kubectl drain kubworker1.prefetch.net

node "kubworker1.prefetch.net" cordoned
pod "kuard-lpzlr" evicted
node "kubworker1.prefetch.net" drained

This will first cordon the node (disable resource scheduling) and then evict any pods that are running on it. Once the operation completes you can list your nodes to verify that scheduling was disabled:

$ kubectl get nodes

NAME                       STATUS                     ROLES     AGE       VERSION
kubworker1.prefetch.net   Ready,SchedulingDisabled   <none>    22h       v1.9.2
kubworker2.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker3.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker4.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker5.prefetch.net   Ready                      <none>    22h       v1.9.2
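One caveat worth mentioning: if the node is running DaemonSet-managed pods, or pods using emptyDir storage, drain will refuse to proceed unless you pass additional flags. The flags below existed in the kubectl releases of this era, but verify them against your version (--delete-local-data has since been renamed --delete-emptydir-data):

$ kubectl drain kubworker1.prefetch.net --ignore-daemonsets --delete-local-data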

To remove the node from the cluster I used the kubectl delete command:

$ kubectl delete node kubworker1.prefetch.net
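To confirm the removal you can run kubectl get nodes again; the node you deleted should no longer show up in the listing:

$ kubectl get nodes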

Now that the node was purged from Kubernetes I needed to taint it with Terraform to force a rebuild. The terraform taint option needs to be used with extreme care! One mistake and you can say goodbye to resources you didn’t intend to destroy (so use this information at your own risk). To locate the resource I wanted to taint, I ran terraform show and grepped for the keyword referring to the node:

$ terraform show | grep vsphere_virtual_machine.kubernetes_worker

vsphere_virtual_machine.kubernetes_worker.0:
vsphere_virtual_machine.kubernetes_worker.1:
vsphere_virtual_machine.kubernetes_worker.2:
vsphere_virtual_machine.kubernetes_worker.3:
vsphere_virtual_machine.kubernetes_worker.4:
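The index in the resource name doesn’t necessarily line up with the hostname, so it pays to inspect each candidate before tainting anything. One way to do this on the Terraform releases of this era is terraform state show, which dumps a single resource’s attributes (the exact attribute names depend on the provider, so treat this as a sketch):

$ terraform state show vsphere_virtual_machine.kubernetes_worker.0 | grep name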

To find the right node I compared the name field in the terraform show output with the name of the Kubernetes node. Once I was 100% certain I had the right node, I tainted the resource:

$ terraform taint vsphere_virtual_machine.kubernetes_worker.0

The resource vsphere_virtual_machine.kubernetes_worker.0 in the module root has been marked as tainted!

Once the resource was tainted I ran terraform plan to verify the steps that would be performed during the next apply. This is a CRITICAL step and I would highly suggest getting a second set of eyes to review your plan. Everything looked good so I applied my changes, which caused the server to begin rebuilding and eventually add itself back to the cluster. I loves me some Terraform and Kubernetes!
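If you want to limit the blast radius even further, plan and apply both accept a -target flag that restricts the run to a single resource. A hedged sketch using the resource address from above (the indexing syntax for -target differs slightly from taint on older Terraform releases, so double-check it against your version):

$ terraform plan -target=vsphere_virtual_machine.kubernetes_worker[0]

$ terraform apply -target=vsphere_virtual_machine.kubernetes_worker[0]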

This article was posted by Matty on 2018-02-10 11:32:08 -0500