This afternoon I decided to dig into the Kubernetes pod autoscaler and how workloads can be scaled by adding more replicas. To get started I created a kuard replica set with 800 pods. While this is a useless test, I did it for one reason: to see if anything broke and to understand where (command output, logs, etc.) to look to find the reason it broke. After a minute or two, kubectl get rs stalled at 544 pods:
$ kubectl get rs kuard
NAME    DESIRED   CURRENT   READY   AGE
kuard   800       800       544     1h
The remaining pods were all in a Pending state. The load on my systems was minimal (plenty of memory, CPU, disk and IOPS) and kubectl get nodes showed plenty of available resources. So I described one of the pending pods:
$ kubectl describe pod kuard-zxr4r
Name: kuard-zxr4r
Namespace: default
Node: <none>
Labels: run=kuard
Annotations: <none>
Status: Pending
......
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 56m (x17 over 1h) default-scheduler 0/5 nodes are available: 1 NodeNotReady, 5 Insufficient pods.
Warning FailedScheduling 1m (x222 over 1h) default-scheduler 0/5 nodes are available: 5 Insufficient pods.
The scheduler reported “Insufficient pods” on all five nodes. After an hour reading through the Kubernetes code I came across a comment about a maximum pods per node limit. A bit of googling turned up the Building Large Clusters page, which mentioned a 100 pod per node limit:
No more than 100 pods per node
If that was accurate, the maximum number of pods I could schedule would be 500. But I had 550 pods running, so I wasn’t sure if that value was still accurate. A bit more digging led me to GitHub issue 23349 (Increase maximum pods per node). @jeremyeder (thanks!) mentioned there was actually a 110 pod limit per node. This made a ton of sense. My workers run 1 DNS pod (I need to research running one DNS pod per worker) and 5 flannel pods. If you subtract those 6 pods from 550 (5 nodes * 110 pods per node) you get the 544 listed above.
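The arithmetic is simple enough to write down. This is only a sketch of the subtraction above, not how the scheduler actually counts; the function name and parameters are mine:

```python
def schedulable_pods(nodes, max_pods_per_node, system_pods):
    """Pods left for my own workloads once system pods count against the limit."""
    return nodes * max_pods_per_node - system_pods


# Values from this post: 5 workers, 110 pods per node, 1 DNS pod + 5 flannel pods.
print(schedulable_pods(nodes=5, max_pods_per_node=110, system_pods=1 + 5))  # 544
```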
I haven’t been able to find the rationale behind the value of 110. It’s common to get a machine with 40 to 80 CPUs and half a terabyte of RAM for a modest price. I’m curious whether the Kubernetes folks have thought about using the cluster size and current system metrics to derive this number on the fly. I’m sure there are pros and cons to increasing the maximum number of pods per node, but I haven’t been able to find anything authoritative that explains the 110 pod limit or future plans to increase it. Now back to the autoscaler!
*** UPDATE ***
The kubfather mentioned on Twitter that the maximum number of pods per node can be increased with the kubelet “--max-pods” option. Not sure how I missed that when I was reviewing the kubelet options.
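Each node advertises its pod limit in the pods field under status.allocatable, which shows up in the output of kubectl get nodes -o json. Here is a minimal sketch that pulls that field out per node. The sample JSON and the worker names are made up and trimmed to the fields the function reads; real output contains far more:

```python
import json


def allocatable_pods(nodes_json):
    """Map each node name to its allocatable pod count.

    Expects the JSON text printed by `kubectl get nodes -o json`.
    """
    doc = json.loads(nodes_json)
    return {
        item["metadata"]["name"]: int(item["status"]["allocatable"]["pods"])
        for item in doc["items"]
    }


# Hypothetical sample: five workers, each with the default limit of 110.
SAMPLE = json.dumps({
    "items": [
        {"metadata": {"name": f"worker{i}"},
         "status": {"allocatable": {"pods": "110"}}}
        for i in range(1, 6)
    ]
})

limits = allocatable_pods(SAMPLE)
print(sum(limits.values()))  # 550
```

Against a real cluster you would feed it the actual kubectl output instead of SAMPLE.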
*** UPDATE 2 ***
Changing the kubelet “--max-pods” option to 250 on each worker addressed my problem:
$ kubectl get rs
NAME    DESIRED   CURRENT   READY   AGE
kuard   800       800       800     3h
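For what it’s worth, 250 is well above what this test needed. A rough sketch of the smallest per-node limit that would fit, assuming the pods spread more or less evenly across the workers (the helper name is mine):

```python
import math


def min_max_pods(desired_pods, system_pods, nodes):
    """Smallest per-node pod limit that fits everything, assuming an even spread."""
    return math.ceil((desired_pods + system_pods) / nodes)


# 800 kuard pods plus 1 DNS pod and 5 flannel pods across 5 workers.
print(min_max_pods(desired_pods=800, system_pods=6, nodes=5))  # 162
```

With the limit at 250, the five workers have room for 1250 pods in total, so there is plenty of headroom left.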