If you work with Kubernetes you have most likely worked with deployment manifests. These files define the objects (kind:) you want to create and the version (apiVersion:) of the API to interact with. Here is a sample:
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-system
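As a rough sketch of how those two fields can be pulled out of a manifest, the helper below reads a single-document manifest on stdin and prints its apiVersion and kind. The function name manifest_type is made up for this example, and the approach assumes the top-level keys are unindented, which is why the nested kind: under roleRef is ignored.

```shell
# Print "<apiVersion> <kind>" for a single-document manifest on stdin.
# Only top-level (unindented) keys match, so nested keys such as the
# "kind: ClusterRole" under roleRef are skipped.
manifest_type() {
  awk '/^apiVersion:/ { v = $2 } /^kind:/ { k = $2 } END { print v, k }'
}

manifest_type <<'EOF'
kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
EOF
```

For the sample above this prints rbac.authorization.k8s.io/v1beta1 ClusterRoleBinding.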
Kubernetes is a work in progress so the APIs you interact with are versioned to reflect their current state. The official API documentation lists three possible states: alpha, beta and stable.
What is super interesting is that Kubernetes supports multiple API versions, and these versions can be made more granular by versioning at specific API paths:
To make it easier to eliminate fields or restructure resource representations, Kubernetes supports multiple API versions, each at a different API path, such as /api/v1 or /apis/extensions/v1beta1.
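To make the path layout concrete, here is a small sketch that turns an apiVersion string into the REST path prefix that serves it. The function name api_path is hypothetical. The rule it encodes is that the core group (plain v1) lives under /api while every named group lives under /apis.

```shell
# Map an apiVersion string to the REST path prefix that serves it.
# The core group has no group name and is served under /api; every
# named group is served under /apis/<group>/<version>.
api_path() {
  case "$1" in
    */*) printf '/apis/%s\n' "$1" ;;  # named group, e.g. extensions/v1beta1
    *)   printf '/api/%s\n' "$1" ;;   # core group, e.g. v1
  esac
}

api_path v1                   # prints /api/v1
api_path extensions/v1beta1   # prints /apis/extensions/v1beta1
```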
Another interesting factoid (at least it is to me) is that versions are applied at the API level:
We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs.
This is super cool but you often want to know WHICH versions are supported by a given Kubernetes release. Luckily for us the kubectl utility has an “api-versions” command to list the versions supported by a given release:
$ kubectl api-versions
apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1beta1
apps/v1beta1
apps/v1beta2
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
batch/v1
batch/v1beta1
certificates.k8s.io/v1beta1
extensions/v1beta1
networking.k8s.io/v1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1
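If you want to check for one specific version before applying a manifest, a sketch like the following can help. The helper name api_version_supported is an assumption of this example. It reads a version list on stdin, so it works with live kubectl api-versions output or with a saved copy.

```shell
# Succeed when the exact group/version appears in a list read from stdin.
# -F: fixed string, -x: match the whole line, -q: no output.
api_version_supported() {
  grep -Fxq -- "$1"
}

# Against a live cluster (requires kubectl and a kubeconfig):
#   kubectl api-versions | api_version_supported apps/v1beta2 && echo supported

# Offline demonstration using a few lines from the listing above.
sample='apps/v1beta1
apps/v1beta2
rbac.authorization.k8s.io/v1
v1'
if printf '%s\n' "$sample" | api_version_supported apps/v1; then
  echo "apps/v1 is served"
else
  echo "apps/v1 is not served"
fi
```

The whole-line match matters here: without -x, a search for v1 would also match apps/v1beta1.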
As I read more about the API I’m blown away by how well thought out it is. Viva la K8S!
I recently watched Joe Beda’s TGIK on kubeadm. This was an excellent introduction to kubeadm and I found some time Sunday night to play around with it. The kubeadm installation guide was pretty straightforward but I hit a few gotchas getting my cluster working. Below are the steps I took to get a kubeadm cluster running with Vagrant and Fedora 27. To get started I installed VirtualBox and Vagrant on my laptop. Once both packages were installed I ran my bootstrap script to create 3 cluster nodes:
$ git clone https://github.com/Matty9191/kubernetes.git
$ kubernetes/scripts/create-kubeadm-cluster
The script will create 3 Fedora 27 Vagrant boxes and install the latest version of Kubernetes (a version can be passed to the script as arg1 as well). It will also update the hosts file, add a couple of sysctl values and create the Kubernetes yum repository file. To prepare the cluster you will need to pick one node to run the control plane functions (etcd, API server, scheduler, controller). Once you identify a node you can log in and fire up the kubelet:
$ systemctl enable kubelet && systemctl start kubelet
To provision the control plane you can run kubeadm init:
$ kubeadm init --pod-network-cidr=10.244.0.0/16 --service-cidr=10.10.0.0/16 --apiserver-advertise-address=10.10.10.101
[init] Using Kubernetes version: v1.9.3
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
[WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Starting the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [kub1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.10.0.1 10.10.10.101]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
[init] This might take a minute or longer if the control plane images have to be pulled.
[apiclient] All control plane components are healthy after 83.004315 seconds
[uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[markmaster] Will mark node kub1 as master by adding a label and a taint
[markmaster] Master kub1 tainted and labelled with key/value: node-role.kubernetes.io/master=""
[bootstraptoken] Using token: 078ce3.486bf3405ee8b160
[bootstraptoken] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: kube-dns
[addons] Applied essential addon: kube-proxy
Your Kubernetes master has initialized successfully!
To start using your cluster, you need to run the following as a regular user:
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config
You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
https://kubernetes.io/docs/concepts/cluster-administration/addons/
You can now join any number of machines by running the following on each node
as root:
kubeadm join --token XXXXXXX 10.10.10.101:6443 --discovery-token-ca-cert-hash sha256:388761e7df7b5026d1194254b710ed6de9e83da8a0600314dc22353b90bc9b31
The kubeadm command line listed above specifies the pod and service CIDR ranges as well as the IP address the API server should advertise. If you encounter a warning or a kubeadm error you can use kubeadm reset to revert the changes that were made. This will return your system to the state it was in prior to init running. To finish the installation you need to set up a networking solution. I’m using flannel’s host-gw backend but there are several other options available. To deploy flannel I created a kubeconfig using the “To start using your cluster” steps listed above:
$ mkdir -p $HOME/.kube
$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
Next I ran curl to retrieve the deployment manifest (you should review the YAML file before applying it to the cluster) and kubectl create to apply the host-gw flannel configuration:
$ curl https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml | sed 's/vxlan/host-gw/g' | sed 's/kube-subnet-mgr\"/&\, \"-iface\"\, \"eth1\"/' > flannel.yml
$ kubectl create -f flannel.yml
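To see what the two sed substitutions do before touching a real manifest, here is a sketch that applies them to a couple of sample lines. The input lines are assumed to resemble the v0.9.1 manifest and are not copied from it, and patch_flannel is a name invented for this example. The escaping is also simplified compared to the original command, but the substitutions are meant to be equivalent.

```shell
# Apply the same two substitutions as the pipeline above to text on stdin:
#   1. switch the backend type from vxlan to host-gw
#   2. append "-iface", "eth1" after the --kube-subnet-mgr argument
patch_flannel() {
  sed -e 's/vxlan/host-gw/g' \
      -e 's/kube-subnet-mgr"/&, "-iface", "eth1"/'
}

# Illustrative input (assumed shape, not the real manifest).
patch_flannel <<'EOF'
        "Type": "vxlan"
        command: [ "/opt/bin/flanneld", "--ip-masq", "--kube-subnet-mgr" ]
EOF
```

Running the result through diff against the downloaded file is a cheap way to confirm that only those two spots changed.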
If no errors were generated the remaining nodes can be joined to the cluster with kubeadm join:
$ kubeadm join --token XXXXXXXXX 10.10.10.101:6443 --discovery-token-ca-cert-hash sha256:f4df190b26844aecabd03510f79ef48a2501f48b810da739f9380616fb83369d
[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[preflight] Running pre-flight checks
[discovery] Trying to connect to API Server "10.10.10.101:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.10.10.101:6443"
[discovery] Requesting info from "https://10.10.10.101:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.10.10.101:6443"
[discovery] Successfully established connection with API Server "10.10.10.101:6443"
[bootstrap] Detected server version: v1.8.8
[bootstrap] The server supports the Certificates API (certificates.k8s.io/v1beta1)
Node join complete:
* Certificate signing request sent to master and response
received.
* Kubelet informed of new secure connection details.
Run 'kubectl get nodes' on the master to see this machine join.
If everything went smoothly you should have a 3-node cluster:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kub1 Ready master 3m v1.9.3
kub2 Ready <none> 36s v1.9.3
kub3 Ready <none> 20s v1.9.3
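If you are scripting the build, a check like the one sketched below can confirm that every node reached the Ready state. The helper name ready_nodes is hypothetical. It counts rows whose STATUS column is exactly Ready in output shaped like kubectl get nodes --no-headers.

```shell
# Count nodes whose STATUS column is exactly "Ready", reading
# `kubectl get nodes --no-headers` style output from stdin.
ready_nodes() {
  awk '$2 == "Ready" { n++ } END { print n + 0 }'
}

# Against a live cluster:
#   kubectl get nodes --no-headers | ready_nodes

# Offline demonstration with the listing above.
ready_nodes <<'EOF'
kub1 Ready master 3m v1.9.3
kub2 Ready <none> 36s v1.9.3
kub3 Ready <none> 20s v1.9.3
EOF
```

For the listing above this prints 3. A node shown as NotReady or Ready,SchedulingDisabled is deliberately not counted.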
And a number of pods should be running in the kube-system namespace:
$ kubectl get pods -n kube-system
NAME READY STATUS RESTARTS AGE
etcd-kub1 1/1 Running 0 15m
kube-apiserver-kub1 1/1 Running 0 15m
kube-controller-manager-kub1 1/1 Running 0 15m
kube-dns-545bc4bfd4-lqwm2 3/3 Running 0 16m
kube-flannel-ds-25zfb 1/1 Running 0 14m
kube-flannel-ds-46r98 1/1 Running 0 5m
kube-flannel-ds-fnmf5 1/1 Running 0 5m
kube-proxy-czhb5 1/1 Running 0 5m
kube-proxy-dv4fb 1/1 Running 0 16m
kube-proxy-mw2x8 1/1 Running 0 5m
kube-scheduler-kub1 1/1 Running 0 15m
While the installation process was relatively straightforward there were a few gotchas:
This afternoon I decided to dig into the Kubernetes pod autoscaler and how workloads can be scaled by adding more replicas. To get started I created a kuard replica set with 800 pods. While this is a useless test I did it for one reason: to see if anything broke and to understand where (command output, logs, etc.) to look to find the reason it broke. After a minute or two kubectl get rs stalled at 544 pods:
$ kubectl get rs kuard
NAME DESIRED CURRENT READY AGE
kuard 800 800 544 1h
The remaining pods were all in a Pending state. The load on my systems was minimal (plenty of memory, CPU, disk and IOPs) and kubectl get nodes showed plenty of available resources. When I described one of the pending pods:
$ kubectl describe pod kuard-zxr4r
Name: kuard-zxr4r
Namespace: default
Node: <none>
Labels: run=kuard
Annotations: <none>
Status: Pending
......
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 56m (x17 over 1h) default-scheduler 0/5 nodes are available: 1 NodeNotReady, 5 Insufficient pods.
Warning FailedScheduling 1m (x222 over 1h) default-scheduler 0/5 nodes are available: 5 Insufficient pods.
The scheduler reported “Insufficient pods” on all five nodes. After an hour reading through the Kubernetes code I came across a comment about there being a maximum pods per node limit. A bit of googling turned up the Building Large Clusters page which mentioned a 100 pod per node limit:
No more than 100 pods per node
If that was accurate the maximum number of pods I could schedule would be 500 (5 nodes * 100 pods per node). But I had 544 kuard pods running so I wasn’t sure if that value was still accurate. A bit more digging led me to GitHub issue 23349 (Increase maximum pods per node). @jeremyeder (thanks!) mentioned there was actually a 110 pod limit per node. This made a ton of sense. My workers run 1 DNS pod (I need to research running one DNS pod per worker) and 5 flannel pods. If you subtract those 6 pods from 550 (5 nodes * 110 pods per node) you get 544, the value listed above.
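The arithmetic is simple enough to sketch in shell. The function name schedulable_pods is made up for this example. It takes the node count, the per-node pod limit and the number of system pods already running.

```shell
# Pods left for a workload once cluster-wide system pods are counted.
# Arguments: node count, per-node pod limit, system pods already running.
schedulable_pods() {
  echo $(( $1 * $2 - $3 ))
}

schedulable_pods 5 110 6   # 5 * 110 - (1 DNS + 5 flannel) = 544
```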
I haven’t been able to find the rationale behind the value of 110. It’s common to get a machine with 40 - 80 CPUs and half a terabyte of RAM for a modest price. I’m curious if the Kubernetes folks have thought about using the cluster size and current system metrics to derive this number on the fly. I’m sure there are pros and cons to increasing the maximum number of pods per node but I haven’t been able to find anything authoritative to explain the 110 pod limit or future plans to increase this. Now back to the autoscaler!
*** UPDATE ***
The kubfather mentioned on Twitter that the maximum number of pods per node can be increased with the kubelet “--max-pods” option. Not sure how I missed that when I was reviewing the kubelet options.
*** UPDATE 2 ***
Changing the kubelet “--max-pods” option to 250 on each worker addressed my problem:
$ kubectl get rs
NAME DESIRED CURRENT READY AGE
kuard 800 800 800 3h
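For sizing the option, here is a rough sketch that computes the smallest per-node limit that would fit a workload. It assumes the scheduler spreads pods evenly across nodes, which will not hold exactly in practice, and min_max_pods is a name invented for this example.

```shell
# Smallest per-node limit that fits a workload, assuming pods spread
# evenly: ceil((workload + system) / nodes) using integer arithmetic.
# Arguments: workload pods, system pods, node count.
min_max_pods() {
  total=$(( $1 + $2 ))
  echo $(( (total + $3 - 1) / $3 ))
}

min_max_pods 800 6 5   # (800 + 6) / 5 rounded up = 162

# To see the limit a node currently advertises (requires a live cluster):
#   kubectl get node NODE_NAME -o jsonpath='{.status.capacity.pods}'
```

By that estimate anything from 162 up would have worked for this test, so 250 leaves comfortable headroom.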
While digging through the Kubernetes networking stack I needed to install a number of tools to make analyzing the system a bit easier. I run all of my systems with a minimal image and haven’t had time to study debugging sidecars in depth (it’s definitely on the list of things to learn). Once I completed my work I wanted to return my system to the state it was in prior to the additional packages being installed. While I could have run yum remove to purge the packages I thought this would be an excellent opportunity to learn more about the Kubernetes node add and removal process.
Terraform has a really interesting taint option which forces a resource to be destroyed and recreated. The Kubernetes kubectl command has the drain, cordon and uncordon commands which allow you to disable resource scheduling and evict pods from a node. When I stitched these two together I was able to rebuild the node with zero impact to my cluster.
To move pods off of the worker I wanted to rebuild I used the kubectl drain command:
$ kubectl drain kubworker1.prefetch.net
node "kubworker1.prefetch.net" cordoned
pod "kuard-lpzlr" evicted
node "kubworker1.prefetch.net" drained
This will first cordon (disable resource scheduling) the node then evict any pods that are running. Once the operation completes you can list your nodes to verify that scheduling was disabled:
$ kubectl get nodes
NAME STATUS ROLES AGE VERSION
kubworker1.prefetch.net Ready,SchedulingDisabled <none> 22h v1.9.2
kubworker2.prefetch.net Ready <none> 22h v1.9.2
kubworker3.prefetch.net Ready <none> 22h v1.9.2
kubworker4.prefetch.net Ready <none> 22h v1.9.2
kubworker5.prefetch.net Ready <none> 22h v1.9.2
To remove the node from the cluster I used the kubectl delete command:
$ kubectl delete node kubworker1.prefetch.net
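The drain and delete steps can be wrapped in a small script. The sketch below is a dry run by default and only prints the commands it would execute. The run and retire_node names are hypothetical, and the --ignore-daemonsets flag is an addition the commands above did not use. It is typically needed when DaemonSet-managed pods such as flannel run on the node.

```shell
# Print each command; execute it only when DRY_RUN=0.
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-1}" = "0" ]; then "$@"; fi
}

# Cordon and drain a node, then remove it from the cluster.
# --ignore-daemonsets lets the drain proceed when DaemonSet-managed
# pods are present (an assumption about the cluster, not part of the
# original command).
retire_node() {
  run kubectl drain "$1" --ignore-daemonsets &&
  run kubectl delete node "$1"
}

retire_node kubworker1.prefetch.net   # dry run by default: prints only
```

Setting DRY_RUN=0 makes the same call execute for real, and the && ensures the node is only deleted if the drain succeeded.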
Now that the node was purged from Kubernetes I needed to taint it with Terraform to force a rebuild. The terraform taint option needs to be used with extreme care! One mistake and you can say goodbye to resources you didn’t intend to destroy (so use this information at your own risk). To locate the resource I wanted to taint I ran terraform show and grep’ed for the keyword referring to the node:
$ terraform show | grep vsphere_virtual_machine.kubernetes_worker
vsphere_virtual_machine.kubernetes_worker.0:
vsphere_virtual_machine.kubernetes_worker.1:
vsphere_virtual_machine.kubernetes_worker.2:
vsphere_virtual_machine.kubernetes_worker.3:
vsphere_virtual_machine.kubernetes_worker.4:
To find the right node I compared the name field in the terraform show output with the name of the Kubernetes node. Once I was 100% certain I had the right node I tainted the resource:
$ terraform taint vsphere_virtual_machine.kubernetes_worker.0
The resource vsphere_virtual_machine.kubernetes_worker.0 in the module root has been marked as tainted!
Once the resource was tainted I ran terraform plan to verify the steps that would be performed during the next apply. This is a CRITICAL step and I would highly suggest getting a second set of eyes to review your plan. Everything looked good so I applied my changes which caused the server to begin rebuilding and eventually add itself back to the cluster. I loves me some Terraform and Kubernetes!
Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode six Joe talks about kubeadm. You can watch it here:
Here are some of my takeaways from the episode:
kubeadm init can be used to initialize the control plane
kubeadm token list
kubeadm join --token xxxx.xxxxxxx APISERVERIP:6443
curl https://api_server_ip/api/v1/namespaces/kube-public/configmaps/cluster-info
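One detail that helps when working with the token commands is the token format: six characters, a dot, then sixteen characters, all lowercase letters or digits. The sketch below checks a string against that shape before it is handed to kubeadm join. The function name is_bootstrap_token is made up for this example, and it validates the format only, not whether the token exists or has expired.

```shell
# Bootstrap tokens have the form <6 chars>.<16 chars>, using only
# lowercase letters and digits. This checks the shape only.
is_bootstrap_token() {
  printf '%s\n' "$1" | grep -Eq '^[a-z0-9]{6}\.[a-z0-9]{16}$'
}

if is_bootstrap_token 078ce3.486bf3405ee8b160; then
  echo "token format looks valid"
fi
```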
Things I need to learn more about: