This afternoon I decided to dig into the Kubernetes pod autoscaler and how workloads can be scaled by adding more replicas. To get started I created a kuard replica set with 800 pods. While this was an artificial test, I did it for one reason: to see if anything broke, and to understand where (command output, logs, etc.) to look to find the reason it broke. After a minute or two kubectl get rs stalled at 544 pods:
kubectl get rs kuard
NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       544       1h
The remaining pods were all in a Pending state. The load on my systems was minimal (plenty of memory, CPU, disk and IOPS) and kubectl get nodes showed plenty of available resources. When I described one of the pending pods:
kubectl describe pod kuard-zxr4r
Name:         kuard-zxr4r
Namespace:    default
Node:         <none>
Labels:       run=kuard
Annotations:  <none>
Status:       Pending
......
Type     Reason            Age                From               Message
----     ------            ----               ----               -------
Warning  FailedScheduling  56m (x17 over 1h)  default-scheduler  0/5 nodes are available: 1 NodeNotReady, 5 Insufficient pods.
Warning  FailedScheduling  1m (x222 over 1h)  default-scheduler  0/5 nodes are available: 5 Insufficient pods.
It reported that there was an insufficient number of pods. After an hour reading through the Kubernetes code I came across a comment about a maximum pods per node limit. A bit of googling turned up the Building Large Clusters page, which mentioned a 100 pod per node limit:
No more than 100 pods per node
If that was accurate, the maximum number of pods I could schedule would be 500. But I had more than 500 pods running, so I wasn't sure that value was still accurate. A bit more digging led me to GitHub issue 23349 (Increase maximum pods per node). @jeremyeder (thanks!) mentioned there was actually a 110 pod limit per node. This made a ton of sense. My workers run 1 DNS pod (I need to research running one DNS pod per worker) and 5 flannel pods. If you subtract those 6 pods from 550 (5 nodes * 110 pods per node) you get the 544 ready pods listed above.
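The arithmetic works out as a quick sanity check (the 6 system pods being the 1 DNS pod plus the 5 flannel pods from my setup):

```shell
# 5 workers * 110 max pods per node, minus the 6 system pods (1 DNS + 5 flannel)
echo $((5 * 110 - 6))
# prints 544
```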
I haven't been able to find the rationale behind the value of 110. It's common to get a machine with 40-80 CPUs and half a terabyte of RAM for a modest price. I'm curious if the Kubernetes folks have thought about using the cluster size and current system metrics to derive this number on the fly. I'm sure there are pros and cons to increasing the maximum number of pods per node, but I haven't been able to find anything authoritative explaining the 110 pod limit or future plans to increase it. Now back to the autoscaler!
*** UPDATE ***
The kubfather mentioned on Twitter that the maximum number of pods per node can be increased with the kubelet "--max-pods" option. Not sure how I missed that when I was reviewing the kubelet options.
*** UPDATE 2 ***
Changing the kubelet "--max-pods" option to 250 on each worker addressed my problem:
kubectl get rs
NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       800       3h
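For reference, here is a rough sketch of how the flag can be set, assuming a systemd-managed kubelet; the drop-in path, unit name and environment variable are assumptions that will vary by distribution and install method:

```shell
# hypothetical systemd drop-in; adjust the unit name and path for your install
sudo mkdir -p /etc/systemd/system/kubelet.service.d
sudo tee /etc/systemd/system/kubelet.service.d/20-max-pods.conf <<'EOF'
[Service]
Environment="KUBELET_EXTRA_ARGS=--max-pods=250"
EOF
sudo systemctl daemon-reload && sudo systemctl restart kubelet
```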
While digging through the Kubernetes networking stack I needed to install a number of tools to make analyzing the system a bit easier. I run all of my systems with a minimal image and haven't had time to study debugging sidecars in depth (it's definitely on the list of things to learn). Once I completed my work I wanted to return my system to the state it was in prior to the additional packages being installed. While I could have run yum remove to purge the packages, I thought this would be an excellent opportunity to learn more about the Kubernetes node add and removal process.
Terraform has a really interesting taint option which forces a resource to be destroyed and recreated. The Kubernetes kubectl command has the drain, cordon and uncordon commands which allow you to disable resource scheduling and evict pods from a node. When I stitched these two together I was able to rebuild the node with zero impact to my cluster.
To move pods off of the worker I wanted to rebuild I used the kubectl drain command:
kubectl drain kubworker1.prefetch.net
node "kubworker1.prefetch.net" cordoned
pod "kuard-lpzlr" evicted
node "kubworker1.prefetch.net" drained
This will first cordon the node (disable resource scheduling) and then evict any pods that are running. Once the operation completes you can list your nodes to verify that scheduling was disabled:
kubectl get nodes
NAME                      STATUS                     ROLES     AGE       VERSION
kubworker1.prefetch.net   Ready,SchedulingDisabled   <none>    22h       v1.9.2
kubworker2.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker3.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker4.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker5.prefetch.net   Ready                      <none>    22h       v1.9.2
To remove the node from the cluster I used the kubectl delete command:
kubectl delete node kubworker1.prefetch.net
Now that the node was purged from Kubernetes I needed to taint it with Terraform to force a rebuild. The terraform taint option needs to be used with extreme care! One mistake and you can say goodbye to resources you didn't intend to destroy (so use this information at your own risk). To locate the resource I wanted to taint I ran terraform show and grepped for the keyword referring to the node:
terraform show | grep vsphere_virtual_machine.kubernetes_worker
vsphere_virtual_machine.kubernetes_worker.0:
vsphere_virtual_machine.kubernetes_worker.1:
vsphere_virtual_machine.kubernetes_worker.2:
vsphere_virtual_machine.kubernetes_worker.3:
vsphere_virtual_machine.kubernetes_worker.4:
To find the right node I compared the name field in the terraform show output with the name of the Kubernetes node. Once I was 100% certain I had the right node I tainted the resource:
terraform taint vsphere_virtual_machine.kubernetes_worker.0
The resource vsphere_virtual_machine.kubernetes_worker.0 in the module root has been marked as tainted!
Once the resource was tainted I ran terraform plan to verify the steps that would be performed during the next apply. This is a CRITICAL step and I would highly suggest getting a second set of eyes to review your plan. Everything looked good, so I applied my changes, which caused the server to begin rebuilding and eventually add itself back to the cluster. I loves me some Terraform and Kubernetes!
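For the record, the plan/apply sequence looked roughly like this; saving the plan to a file (the rebuild.tfplan name is just an example) guarantees that what you reviewed is exactly what gets applied:

```shell
# review the plan first; it should show exactly 1 resource to destroy and 1 to add
terraform plan -out=rebuild.tfplan
# apply the saved plan, not a freshly computed one
terraform apply rebuild.tfplan
```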
This morning I wanted to better understand how requests to ClusterIPs get routed to Kubernetes pods. Properly functioning networking is critical to Kubernetes and having a solid understanding of what happens under the covers makes debugging problems much, much easier. To get started with my studies I fired up five kuard pods:
kubectl create -f kuard.yaml
replicaset "kuard" created
kubectl get pods -o wide
NAME          READY     STATUS    RESTARTS   AGE       IP         NODE
kuard-8xwx7   1/1       Running   0          36s       10.1.4.3   kubworker4.prefetch.net
kuard-bd4cj   1/1       Running   0          36s       10.1.1.3   kubworker2.prefetch.net
kuard-hfkgd   1/1       Running   0          36s       10.1.2.4   kubworker5.prefetch.net
kuard-j9fks   1/1       Running   0          36s       10.1.0.3   kubworker3.prefetch.net
kuard-lpzlr   1/1       Running   0          36s       10.1.3.3   kubworker1.prefetch.net
I created 5 pods so one would hopefully be placed on each worker node. Once the pods finished creating I exposed the pods to the cluster with the kubectl expose command:
kubectl expose rs kuard --port=8080 --target-port=8080
kubectl get svc -o wide kuard
NAME      TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE       SELECTOR
kuard     ClusterIP   10.2.21.155   <none>        8080/TCP   20s       run=kuard
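To see what kube-proxy programmed for the service, the nat table rules can be dumped on any worker (this needs root):

```shell
# dump the nat table and pull out the rules tagged with the kuard service comment
iptables-save -t nat | grep -w kuard
```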
Behind the scenes kube-proxy uses iptables-save and iptables-restore to add rules. Here is the first rule that applies to the kuard service I exposed above:
-A KUBE-SERVICES -d 10.2.21.155/32 -p tcp -m comment --comment "default/kuard: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-CUXC5A3HHHVSSN62
This rule checks if the destination (argument to "-d") matches the cluster IP, the destination port (argument to "--dport") is 8080 and the protocol (argument to "-p") is tcp. If that check passes, the rule will jump to the KUBE-SVC-CUXC5A3HHHVSSN62 target. Here are the rules in the KUBE-SVC-CUXC5A3HHHVSSN62 chain:
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-CA6TP3H7ZVLC3JFW
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-ZHHZWPGVXXVHUF5F
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-H2VR42IC623XBWYH
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AXZRC2VTEV7ZDZ2C
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -j KUBE-SEP-5NFQVOMYN3PVBXGK
This chain contains one rule per pod. Each rule is assigned a probability, and the iptables-extensions statistic module is used to randomly pick one of the endpoints. Once an endpoint is selected, iptables will jump to the target passed to "-j". Here are the chains it will jump to:
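The odd-looking probabilities make sense once you account for the rules being evaluated in order: rule i is only reached if the first i rules didn't match, so giving rule i a probability of 1/(n-i) makes each of the n endpoints equally likely overall. A quick sketch with n=5, matching the chain above (the tiny differences from kube-proxy's printed values are just float rounding):

```shell
# rules are tried in order; rule i matches with probability 1/(n-i),
# which works out to a uniform 1/n chance per endpoint overall
awk 'BEGIN {
    n = 5; remaining = 1.0
    for (i = 0; i < n; i++) {
        p = 1.0 / (n - i)
        printf "rule %d: --probability %.11f (overall %.2f)\n", i, p, remaining * p
        remaining *= (1 - p)
    }
}'
```

Every line prints an overall chance of 0.20, i.e. each of the five pods gets an equal share of the traffic.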
-A KUBE-SEP-5NFQVOMYN3PVBXGK -s 10.1.4.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-5NFQVOMYN3PVBXGK -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.4.3:8080
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -s 10.1.3.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.3.3:8080
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -s 10.1.0.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.0.3:8080
-A KUBE-SEP-H2VR42IC623XBWYH -s 10.1.2.4/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-H2VR42IC623XBWYH -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.2.4:8080
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -s 10.1.1.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.1.3:8080
Now here's where the magic occurs! Once a chain is picked, the destination will be NAT'ed from the service IP to the selected pod's IP via the "--to-destination" option. Traffic will then traverse the host's public network interface and arrive at the destination, where it can be funneled to the pod (it's pretty amazing and scary how this works behind the scenes). If I curl the service IP on port 8080:
curl 10.2.21.155:8080 > /dev/null
We can see the initial SYN and the translated destination (the IP of the pod to send the request to) with tcpdump:
tcpdump -n -i ens192 port 8080 and 'tcp[tcpflags] == tcp-syn'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
13:02:09.129502 IP 192.168.2.44.48102 > 10.1.2.4.webcache: Flags [S], seq 811928755, win 29200, options [mss 1460,sackOK,TS val 3048081500 ecr 0,nop,wscale 7], length 0
The rules utilize the connmark mark option to mark packets. I'm not 100% sure how this works (or its purpose) and will need to do some digging this weekend to see what the deal is. I learned a lot digging through packet captures and iptables, and I definitely have a MUCH better understanding of how pods and service IPs play with each other.
Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode six Joe talks about kubeadm. You can watch it here:
Here are some of my takeaways from the episode:
kubeadm init can be used to initialize the control plane
kubeadm token list
kubeadm join --token xxxx.xxxxxxx APISERVERIP:6443
Things I need to learn more about:
Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode five Joe talks about pod parameters and probes. You can watch it here:
Here are some of my takeaways from the episode:
kubectl get pod NAME -o yaml won't work out of the box for re-creating a pod due to system-specific information being present (e.g., resourceVersion).
kubectl get pods ubuntu -o yaml --export
kubectl run ubuntu --image=ubuntu -o yaml --dry-run
kubectl get pods ubuntu -w -o wide
kubectl get endpoints
Things I need to learn more about: