Blog O' Matty


The Kubernetes 110 pod limit per node

This article was posted by Matty on 2018-02-10 17:03:24 -0500 EST

This afternoon I decided to dig into the Kubernetes pod autoscaler and how workloads can be scaled by adding more replicas. To get started I created a kuard replica set with 800 pods. While this is a useless test, I did it for one reason: to see if anything broke and to understand where (command output, logs, etc.) to look to find out why it broke. After a minute or two kubectl get rs stalled at 544 pods:

$ kubectl get rs kuard

NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       544       1h
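For reference, here is roughly the kind of manifest used to create the replica set above. This is a minimal sketch rather than the exact file I used; the API version and the kuard image tag are assumptions:

$ cat <<EOF | kubectl create -f -
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: kuard
spec:
  replicas: 800
  selector:
    matchLabels:
      run: kuard
  template:
    metadata:
      labels:
        run: kuard
    spec:
      containers:
      - name: kuard
        image: gcr.io/kuar-demo/kuard-amd64:1
        ports:
        - containerPort: 8080
EOF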

The remaining pods were all in a Pending state. The load on my systems was minimal (plenty of memory, CPU, disk and IOPS) and kubectl get nodes showed plenty of available resources. When I described one of the pending pods I saw the following:

$ kubectl describe pod kuard-zxr4r

Name:           kuard-zxr4r
Namespace:      default
Node:           <none>
Labels:         run=kuard
Annotations:    <none>
Status:         Pending
......
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  56m (x17 over 1h)  default-scheduler  0/5 nodes are available: 1 NodeNotReady, 5 Insufficient pods.
  Warning  FailedScheduling  1m (x222 over 1h)  default-scheduler  0/5 nodes are available: 5 Insufficient pods.

The scheduler was reporting "Insufficient pods" on every node. After an hour of reading through the Kubernetes code I came across a comment about there being a maximum number of pods per node. A bit of googling turned up the Building Large Clusters page, which mentioned a 100 pod per node limit:

No more than 100 pods per node

If that was accurate the maximum number of pods I could schedule would be 500. But I had 550 pods running, so I wasn't sure if that value was still accurate. A bit more digging led me to GitHub issue 23349 (Increase maximum pods per node). @jeremyeder (thanks!) mentioned there is actually a 110 pod limit per node. This made a ton of sense. My workers run 1 DNS pod (I need to research running one DNS pod per worker) and 5 flannel pods. If you subtract those 6 pods from 550 (5 nodes * 110 pods per node) you get the 544 ready pods listed above.

I haven't been able to find the rationale behind the value of 110. It's common to get a machine with 40 - 80 CPUs and half a terabyte of RAM for a modest price. I'm curious if the Kubernetes folks have thought about using the cluster size and current system metrics to derive this number on the fly. I'm sure there are pros and cons to increasing the maximum number of pods per node, but I haven't been able to find anything authoritative that explains the 110 pod limit or future plans to increase it. Now back to the autoscaler!

*** UPDATE ***

The kubfather mentioned on Twitter that the maximum number of pods per node can be increased with the kubelet "--max-pods" option. Not sure how I missed that when I was reviewing the kubelet options.
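For anyone who wants to try this, here is a minimal sketch of one way to pass the flag to a systemd-managed kubelet. The drop-in path is an assumption, and it assumes your kubelet unit already references $KUBELET_EXTRA_ARGS (the kubeadm drop-in does); adjust for however your kubelet is launched:

# Hypothetical drop-in: /etc/systemd/system/kubelet.service.d/20-max-pods.conf
[Service]
Environment="KUBELET_EXTRA_ARGS=--max-pods=250"

# Reload systemd and restart the kubelet to pick up the change
$ sudo systemctl daemon-reload
$ sudo systemctl restart kubelet

# Verify the node now reports the higher pod capacity
$ kubectl describe node kubworker1.prefetch.net | grep -i pods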

*** UPDATE 2 ***

Changing the kubelet "--max-pods" option to 250 on each worker addressed my problem:

$ kubectl get rs

NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       800       3h

Using Terraform taint and Kubernetes cordon to rebuild nodes with no service interruption

This article was posted by Matty on 2018-02-10 11:32:08 -0500 EST

While digging through the Kubernetes networking stack I needed to install a number of tools to make analyzing the system a bit easier. I run all of my systems with a minimal image and haven't had time to study debugging sidecars in depth (it's definitely on the list of things to learn). Once I completed my work I wanted to return my system to the state it was in prior to the additional packages being installed. While I could have run yum remove to purge the packages, I thought this would be an excellent opportunity to learn more about the Kubernetes node add and removal process.

Terraform has a really interesting taint option which forces a resource to be destroyed and recreated. The Kubernetes kubectl command has the drain, cordon and uncordon commands which allow you to disable resource scheduling and evict pods from a node. When I stitched these two together I was able to rebuild the node with zero impact to my cluster.

To move pods off of the worker I wanted to rebuild I used the kubectl drain command:

$ kubectl drain kubworker1.prefetch.net

node "kubworker1.prefetch.net" cordoned
pod "kuard-lpzlr" evicted
node "kubworker1.prefetch.net" drained

This will first cordon the node (disable resource scheduling) and then evict any pods that are running. Once the operation completes you can list your nodes to verify that scheduling was disabled:

$ kubectl get nodes

NAME                       STATUS                     ROLES     AGE       VERSION
kubworker1.prefetch.net   Ready,SchedulingDisabled   <none>    22h       v1.9.2
kubworker2.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker3.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker4.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker5.prefetch.net   Ready                      <none>    22h       v1.9.2

To remove the node from the cluster I used the kubectl delete command:

$ kubectl delete node kubworker1.prefetch.net

Now that the node was purged from Kubernetes I needed to taint it with Terraform to force a rebuild. The terraform taint option needs to be used with extreme care! One mistake and you can say goodbye to resources you didn't intend to destroy (so use this information at your own risk). To locate the resource I wanted to taint I ran terraform show and grepped for the keyword referring to the node:

$ terraform show | grep vsphere_virtual_machine.kubernetes_worker

vsphere_virtual_machine.kubernetes_worker.0:
vsphere_virtual_machine.kubernetes_worker.1:
vsphere_virtual_machine.kubernetes_worker.2:
vsphere_virtual_machine.kubernetes_worker.3:
vsphere_virtual_machine.kubernetes_worker.4:
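Each worker's detailed state includes a name attribute. Here is a rough sketch of pulling it out for the first worker; the grep patterns are assumptions about the terraform show output format in my version of Terraform, and the exact attribute names depend on the provider:

$ terraform show | grep -A 40 'vsphere_virtual_machine.kubernetes_worker.0:' | grep ' name '

# should print something like: name = kubworker1.prefetch.net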

To find the right node I compared that name field from the terraform show output with the name of the Kubernetes node I had deleted. Once I was 100% certain I had the right node I tainted the resource:

$ terraform taint vsphere_virtual_machine.kubernetes_worker.0

The resource vsphere_virtual_machine.kubernetes_worker.0 in the module root has been marked as tainted!

Once the resource was tainted I ran terraform plan to verify the steps that would be performed during the next apply. This is a CRITICAL step and I would highly suggest getting a second set of eyes to review your plan. Everything looked good, so I applied my changes, which caused the server to begin rebuilding and eventually add itself back to the cluster. I loves me some Terraform and Kubernetes!
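For completeness, here is a minimal sketch of that plan and apply sequence. The plan file name is arbitrary; reviewing the plan output before applying is the part that matters:

$ terraform plan -out=rebuild-worker.plan

# Review the plan carefully: only the tainted worker should show as destroy/create

$ terraform apply rebuild-worker.plan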

Understanding the network plumbing that makes Kubernetes pods and services work

This article was posted by Matty on 2018-02-09 18:00:00 -0500 EST

This morning I wanted to better understand how requests to ClusterIPs get routed to Kubernetes pods. Properly functioning networking is critical to Kubernetes and having a solid understanding of what happens under the covers makes debugging problems much, much easier. To get started with my studies I fired up five kuard pods:

$ kubectl create -f kuard.yaml

replicaset "kuard" created

$ kubectl get pods -o wide

NAME          READY     STATUS    RESTARTS   AGE       IP         NODE
kuard-8xwx7   1/1       Running   0          36s       10.1.4.3   kubworker4.prefetch.net
kuard-bd4cj   1/1       Running   0          36s       10.1.1.3   kubworker2.prefetch.net
kuard-hfkgd   1/1       Running   0          36s       10.1.2.4   kubworker5.prefetch.net
kuard-j9fks   1/1       Running   0          36s       10.1.0.3   kubworker3.prefetch.net
kuard-lpzlr   1/1       Running   0          36s       10.1.3.3   kubworker1.prefetch.net

I created 5 pods so one would hopefully be placed on each worker node. Once the pods were up and running I exposed them to the cluster with the kubectl expose command:

$ kubectl expose rs kuard --port=8080 --target-port=8080

$ kubectl get svc -o wide kuard

NAME      TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE       SELECTOR
kuard     ClusterIP   10.2.21.155   <none>        8080/TCP   20s       run=kuard
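Before going further it helps to be able to see the rules kube-proxy programs on each node. Here is a minimal sketch of dumping the relevant nat table rules on one of the workers; it assumes kube-proxy is running in its default iptables mode and that you have root access on the node:

$ sudo iptables-save -t nat | grep kuard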

Behind the scenes kube-proxy uses iptables-save and iptables-restore to add rules. Here is the first rule that applies to the kuard service I exposed above:

-A KUBE-SERVICES -d 10.2.21.155/32 -p tcp -m comment --comment "default/kuard: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-CUXC5A3HHHVSSN62

This rule checks if the destination (argument to "-d") matches the cluster IP, the destination port (argument to "--dport") is 8080, and the protocol (argument to "-p") is tcp. If that check passes the rule will jump to the KUBE-SVC-CUXC5A3HHHVSSN62 target. Here are the rules in the KUBE-SVC-CUXC5A3HHHVSSN62 chain:

-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-CA6TP3H7ZVLC3JFW
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-ZHHZWPGVXXVHUF5F
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-H2VR42IC623XBWYH
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AXZRC2VTEV7ZDZ2C
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -j KUBE-SEP-5NFQVOMYN3PVBXGK

This chain contains one rule per pod. Each rule is assigned a probability and the iptables statistic module is used to randomly pick one of the endpoints. The probabilities look uneven (0.2, 0.25, 0.33, 0.5, then a catch-all), but the rules are evaluated in order, so each rule only sees the traffic the rules above it passed over; the net effect is that each of the five pods receives roughly one fifth of the connections. Once an endpoint is selected iptables will jump to the target passed to "-j". Here are the chains it will jump to:

-A KUBE-SEP-5NFQVOMYN3PVBXGK -s 10.1.4.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-5NFQVOMYN3PVBXGK -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.4.3:8080
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -s 10.1.3.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.3.3:8080
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -s 10.1.0.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.0.3:8080
-A KUBE-SEP-H2VR42IC623XBWYH -s 10.1.2.4/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-H2VR42IC623XBWYH -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.2.4:8080
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -s 10.1.1.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.1.3:8080

Now here's where the magic occurs! Once a chain is picked, the service IP will be DNAT'ed to the selected pod's IP via the "--to-destination" option. Traffic will then traverse the host's public network interface and arrive at the destination, where it can be funneled to the pod (it's pretty amazing and scary how this works behind the scenes). If I curl the service IP on port 8080:

$ curl 10.2.21.155:8080 > /dev/null

We can see the initial SYN and the translated destination (the IP of the pod to send the request to) with tcpdump:

$ tcpdump -n -i ens192 port 8080 and 'tcp[tcpflags] == tcp-syn'

tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
13:02:09.129502 IP 192.168.2.44.48102 > 10.1.2.4.webcache: Flags [S], seq 811928755, win 29200, options [mss 1460,sackOK,TS val 3048081500 ecr 0,nop,wscale 7], length 0
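Another way to see the translation is the kernel's connection tracking table. This is a minimal sketch and assumes the conntrack userspace tool is installed on the worker:

$ sudo conntrack -L | grep 10.2.21.155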

The KUBE-MARK-MASQ rules use the iptables MARK target to mark packets. I'm not 100% sure how this works (or its purpose) and will need to do some digging this weekend to see what the deal is. I learned a lot digging through packet captures and iptables and definitely have a MUCH better understanding of how pods and service IPs play with each other.
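A starting point for that digging is to dump the KUBE-MARK-MASQ chain and the nat POSTROUTING chain to see where the mark gets set and where it is acted on. A minimal sketch, run as root on one of the workers:

$ sudo iptables -t nat -S KUBE-MARK-MASQ
$ sudo iptables -t nat -S POSTROUTING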

Notes from episode 6 of TGIK: Kubeadm

This article was posted by Matty on 2018-02-08 20:00:00 -0500 EST

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode six Joe talks about kubeadm. You can watch it here:

Here are some of my takeaways from the episode:

Things I need to learn more about:

Notes from episode 5 of TGIK: Pod Params and Probes

This article was posted by Matty on 2018-02-08 18:00:00 -0500 EST

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode five Joe talks about pod parameters and probes. You can watch it here:

Here are some of my takeaways from the episode:

Things I need to learn more about: