Locating Kubernetes API Versions

This article was posted by Matty on 2018-02-14 21:00:00 -0500 -0500

If you work with Kubernetes you have most likely worked with deployment manifests. These files define the objects (kind:) you want to create and the version (apiVersion:) of the API to interact with. Here is a sample:

kind: ClusterRoleBinding
apiVersion: rbac.authorization.k8s.io/v1beta1
metadata:
  name: flannel
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: flannel
subjects:
- kind: ServiceAccount
  name: flannel
  namespace: kube-system

Kubernetes is a work in progress so the APIs you interact with are versioned to reflect their current sate. The official API documentation lists three possible states:

Alpha level - Work in progress w/o any guarantees.
Beta level - Well tested, features won’t be dropped but details may change.
Stable level - Production ready and will be available for the known future.

What is super interesting is that Kubernetes supports multiple API versions, and these versions can be made more granular by versioning at specific API paths:

To make it easier to eliminate fields or restructure resource representations, Kubernetes supports multiple API versions, each at a different API path, such as /api/v1 or /apis/extensions/v1beta1.

Another interesting factoid (at least it is to me) is that versions are applied at the API level:

We chose to version at the API level rather than at the resource or field level to ensure that the API presents a clear, consistent view of system resources and behavior, and to enable controlling access to end-of-lifed and/or experimental APIs.

This is super cool but you often want to know WHICH versions are supported by a given Kubernetes release. Luckily for us the kubectl utility has an “api-versions” command to list the versions supported by a given release:

$ kubectl api-versions

apiextensions.k8s.io/v1beta1
apiregistration.k8s.io/v1beta1
apps/v1beta1
apps/v1beta2
authentication.k8s.io/v1
authentication.k8s.io/v1beta1
authorization.k8s.io/v1
authorization.k8s.io/v1beta1
autoscaling/v1
autoscaling/v2beta1
batch/v1
batch/v1beta1
certificates.k8s.io/v1beta1
extensions/v1beta1
networking.k8s.io/v1
policy/v1beta1
rbac.authorization.k8s.io/v1
rbac.authorization.k8s.io/v1beta1
storage.k8s.io/v1
storage.k8s.io/v1beta1
v1

As I read more about the API I’m blown away by how well thought out it is. Viva la K8S!

Creating Kubernetes clusters with kubeadm on Fedora Linux servers

This article was posted by Matty on 2018-02-14 19:00:00 -0500 -0500

I recently watched Joe Beda’s TGIk on kubeadm. This was an excellent introduction to kubeadm and I found some time Sunday night to play around with it. The kubeadm installation guide was pretty straight forward but I hit a few gotchas getting my cluster working. Below are the steps I took to get a kubeadm cluster running with Vagrant and Fedora 27. To get started I installed virtual box and vagrant on my laptop. Once both packages were installed I ran my bootstrap script to create 3 cluster nodes:

$ git clone https://github.com/Matty9191/kubernetes.git

$ kubernetes/scripts/create-kubeadm-cluster

The script will create 3 Fedora 27 Vagrant boxes and install the latest version of Kubernetes (a version can be passed to the script as arg1 as well). It will also update the hosts file, add a couple of sysctl values and create the Kubernetes yum repository file. To prepare the cluster you will need to pick one node to run the control plane functions (etcd, API server, scheduler, controller). Once you identify a node you can log in and fire up the kubelet:

$ systemctl enable kubelet && systemctl start kubelet

To provision the control plane you can can run kubeadm init:

$ kubeadm init --pod-network-cidr=10.244.0.0/16 --service-cidr=10.10.0.0/16 --apiserver-advertise-address=10.10.10.101

1.30
[init] Using Kubernetes version: v1.9.3
[init] Using Authorization modes: [Node RBAC]
[preflight] Running pre-flight checks.
	[WARNING FileExisting-crictl]: crictl not found in system path
[preflight] Starting the kubelet service
[certificates] Generated ca certificate and key.
[certificates] Generated apiserver certificate and key.
[certificates] apiserver serving cert is signed for DNS names [kub1 kubernetes kubernetes.default kubernetes.default.svc kubernetes.default.svc.cluster.local] and IPs [10.10.0.1 10.10.10.101]
[certificates] Generated apiserver-kubelet-client certificate and key.
[certificates] Generated sa key and public key.
[certificates] Generated front-proxy-ca certificate and key.
[certificates] Generated front-proxy-client certificate and key.
[certificates] Valid certificates and keys now exist in "/etc/kubernetes/pki"
[kubeconfig] Wrote KubeConfig file to disk: "admin.conf"
[kubeconfig] Wrote KubeConfig file to disk: "kubelet.conf"
[kubeconfig] Wrote KubeConfig file to disk: "controller-manager.conf"
[kubeconfig] Wrote KubeConfig file to disk: "scheduler.conf"
[controlplane] Wrote Static Pod manifest for component kube-apiserver to "/etc/kubernetes/manifests/kube-apiserver.yaml"
[controlplane] Wrote Static Pod manifest for component kube-controller-manager to "/etc/kubernetes/manifests/kube-controller-manager.yaml"
[controlplane] Wrote Static Pod manifest for component kube-scheduler to "/etc/kubernetes/manifests/kube-scheduler.yaml"
[etcd] Wrote Static Pod manifest for a local etcd instance to "/etc/kubernetes/manifests/etcd.yaml"
[init] Waiting for the kubelet to boot up the control plane as Static Pods from directory "/etc/kubernetes/manifests".
[init] This might take a minute or longer if the control plane images have to be pulled.
[apiclient] All control plane components are healthy after 83.004315 seconds
[uploadconfig] Storing the configuration used in ConfigMap "kubeadm-config" in the "kube-system" Namespace
[markmaster] Will mark node kub1 as master by adding a label and a taint
[markmaster] Master kub1 tainted and labelled with key/value: node-role.kubernetes.io/master=""
[bootstraptoken] Using token: 078ce3.486bf3405ee8b160
[bootstraptoken] Configured RBAC rules to allow Node Bootstrap tokens to post CSRs in order for nodes to get long term certificate credentials
[bootstraptoken] Configured RBAC rules to allow the csrapprover controller automatically approve CSRs from a Node Bootstrap Token
[bootstraptoken] Configured RBAC rules to allow certificate rotation for all node client certificates in the cluster
[bootstraptoken] Creating the "cluster-info" ConfigMap in the "kube-public" namespace
[addons] Applied essential addon: kube-dns
[addons] Applied essential addon: kube-proxy

Your Kubernetes master has initialized successfully!

To start using your cluster, you need to run the following as a regular user:

  mkdir -p $HOME/.kube
  sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
  sudo chown $(id -u):$(id -g) $HOME/.kube/config

You should now deploy a pod network to the cluster.
Run "kubectl apply -f [podnetwork].yaml" with one of the options listed at:
  https://kubernetes.io/docs/concepts/cluster-administration/addons/

You can now join any number of machines by running the following on each node
as root:

  kubeadm join --token XXXXXXX 10.10.10.101:6443 --discovery-token-ca-cert-hash sha256:388761e7df7b5026d1194254b710ed6de9e83da8a0600314dc22353b90bc9b31

The kubeadm command line listed above specifies the pod and service CIDR ranges as well as the IP the API server should listen for requests on. If you encounter a warning or a kubeadm error you can use kubeadm reset to revert the changes that were made. This will return your system to the state it was in prior to init running. To finish the installation you need to set up a networking solution. I’m using flannel’s host-gw but there are several other options available. To deploy flannel I created a kubeconfig using the “To start using your cluster” steps listed above:

$ mkdir -p $HOME/.kube

$ sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config

$ sudo chown $(id -u):$(id -g) $HOME/.kube/config

Next I ran curl to retrieve the deployment manifest (you should review the YAML file before applying it to the cluster) and kubectl create to apply the host-gw flannel configuration:

$ curl https://raw.githubusercontent.com/coreos/flannel/v0.9.1/Documentation/kube-flannel.yml | sed 's/vxlan/host-gw/g' | sed 's/kube-subnet-mgr\"/&\, \"-iface\"\, \"eth1\"/' > flannel.yml

$ kubectl create -f flannel.yml

If no errors were generated the remaining nodes can be joined to the cluster with kubeadm join:

$ kubeadm join --token XXXXXXXXX 10.10.10.101:6443 --discovery-token-ca-cert-hash sha256:f4df190b26844aecabd03510f79ef48a2501f48b810da739f9380616fb83369d

[kubeadm] WARNING: kubeadm is in beta, please do not use it for production clusters.
[preflight] Running pre-flight checks
[discovery] Trying to connect to API Server "10.10.10.101:6443"
[discovery] Created cluster-info discovery client, requesting info from "https://10.10.10.101:6443"
[discovery] Requesting info from "https://10.10.10.101:6443" again to validate TLS against the pinned public key
[discovery] Cluster info signature and contents are valid and TLS certificate validates against pinned roots, will use API Server "10.10.10.101:6443"
[discovery] Successfully established connection with API Server "10.10.10.101:6443"
[bootstrap] Detected server version: v1.8.8
[bootstrap] The server supports the Certificates API (certificates.k8s.io/v1beta1)

Node join complete:
* Certificate signing request sent to master and response
  received.
* Kubelet informed of new secure connection details.

Run 'kubectl get nodes' on the master to see this machine join.

If everything went smotthly you should have a 3-node cluster:

$ kubectl get nodes

NAME      STATUS    ROLES     AGE       VERSION
kub1      Ready     master    3m        v1.9.3
kub2      Ready     <none>    36s       v1.9.3
kub3      Ready     <none>    20s       v1.9.3

And a number of pods should be running in the kube-system namespace:

$ kubectl get pods -n kube-system

NAME                           READY     STATUS    RESTARTS   AGE
etcd-kub1                      1/1       Running   0          15m
kube-apiserver-kub1            1/1       Running   0          15m
kube-controller-manager-kub1   1/1       Running   0          15m
kube-dns-545bc4bfd4-lqwm2      3/3       Running   0          16m
kube-flannel-ds-25zfb          1/1       Running   0          14m
kube-flannel-ds-46r98          1/1       Running   0          5m
kube-flannel-ds-fnmf5          1/1       Running   0          5m
kube-proxy-czhb5               1/1       Running   0          5m
kube-proxy-dv4fb               1/1       Running   0          16m
kube-proxy-mw2x8               1/1       Running   0          5m
kube-scheduler-kub1            1/1       Running   0          15m

While the installation process was relatively straight forward there were a few gotchas:

Make sure SELinux is disabled. Kubeadm doesn’t play well with it (this is noted in the docs). :(
Double and triple check the network sysctl values.
Make sure your node names can resolve to IPs. You can stick them in /etc/hosts if you like.

The Kubernetes 110 pod limit per node

This article was posted by Matty on 2018-02-10 17:03:24 -0500 -0500

This afternoon I decided to dig into the Kubernetes pod autoscaler and how workloads can be scaled by adding more replicas. To get started I created a kuard replica set with 800 pods. While this is a useless test I did it for one reason. To see if anything broke and to understand where (command output, logs, etc.) to look to find the reason it broke. After a minute or two kubectl get rs stauled at 544 pods:

$ kubectl get rs kuard

NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       544       1h

The remaining pods were all in a Pending state. The load on my systems was minimal (plenty of memory, CPU, disk and IOPs) and kubectl get nodes showed plenty of available resources. When I described one of the pending pods:

$ kubectl describe pod kuard-zxr4r

Name:           kuard-zxr4r
Namespace:      default
Node:           <none>
Labels:         run=kuard
Annotations:    <none>
Status:         Pending
......
  Type     Reason            Age                From               Message
  ----     ------            ----               ----               -------
  Warning  FailedScheduling  56m (x17 over 1h)  default-scheduler  0/5 nodes are available: 1 NodeNotReady, 5 Insufficient pods.
  Warning  FailedScheduling  1m (x222 over 1h)  default-scheduler  0/5 nodes are available: 5 Insufficient pods.

It reported that there was an insufficient number of pods. After an hour reading through the Kubernetes code I came across a comment about there being a maximum pods per node limit. A bit of google’ing turned up the Building Large Clusters page which mentioned a 100 pod per node limit:

No more than 100 pods per node

If that was accurate the maximum number of pods I could schedule would be 500. But I had 550 PODs running so I wasn’t sure if that value was still accurate. A bit more digging led me to guthub issue 23349 (Increase maximum pods per node). @jeremyeder (thanks!) mentioned there was actually a 110 pod limit per node. This made a ton of sense. My workers run 1 DNS pod (I need to research running one DNS pod per worker) and 5 flannel pods. If you subtract that number from 550 (5 nodes * 110 pods per node) you get the value listed above.

I haven’t been able to find the rationale behind the value of 110. It’s common to get a machine with 40 - 80 CPUs and a half a terrabyte of RAM for a modest price. Curious if the Kubernetes folks have though about using the cluster size and current system metrics to derive this number on the fly? I’m sure there are pros and cons to increasing the maximum number of pods per node but I haven’t been able to find anything authoritative to explain the 110 pod limit or future plans to increase this. Now back to the autoscaler!

*** UPDATE ***

The kubfather mentioned on twitter that the maximum number of pods per node can be increased with the kubelet “–max-pods” option. Not sure how I missed that when I was reviewing the kubelet options.

*** UPDATE 2 ***

Changing the kubelet “–max-pods” option to 250 on each worker addressed my problem:

$ kubectl get rs

NAME      DESIRED   CURRENT   READY     AGE
kuard     800       800       800       3h

Using Terraform taint and Kubernetes cordon To rebuild nodes with no service interruption

This article was posted by Matty on 2018-02-10 11:32:08 -0500 -0500

While digging through the Kubernetes networking stack I needed to install a number of tools to make analyzing the system a bit easier. I run all of my systems with a minimal image and haven’t had time to study debugging side cars in depth (it’s definitely on the list of things to learn). Once I completed my work I wanted to return my system to the state it was in prior to the additional packages being installed. While I could have run yum remove to purge the packages I thought this would be an excellent opportunity to learn more about the Kubernetes node add and removal process.

Terraform has a really interesting taint option which forces a resource to be destroyed and recreated. The Kubernetes kubectl command has the drain, cordon and uncordon commands which allow you to disable resource scheduling and evict pods from a node. When I stitched these two together I was able to rebuild the node with zero impact to my cluster.

To move pods off of the worker I wanted to rebuild I used the kubectl drain command:

$ kubectl drain kubworker1.prefetch.net

node "kubworker1.prefetch.net" cordoned
pod "kuard-lpzlr" evicted
node "kubworker1.prefetch.net" drained

This will first cordon (disable resource scheduling) the node then evict any pods that are running. Once the opration completes you can list your nodes to verify that scheduling was disabled:

$ kubectl get nodes

NAME                       STATUS                     ROLES     AGE       VERSION
kubworker1.prefetch.net   Ready,SchedulingDisabled   <none>    22h       v1.9.2
kubworker2.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker3.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker4.prefetch.net   Ready                      <none>    22h       v1.9.2
kubworker5.prefetch.net   Ready                      <none>    22h       v1.9.2

To remove the node from the cluster I used the kubectl delete command:

$ kubectl delete node kubworker1.prefetch.net

Now that the node was purged from Kubernetes I needed to taint it with Terraform to force a rebuild. The terraform taint option needs to be used with extreme care! One mistake and you can say goodbye to resources you didn’t intend to destroy (so use this information at your own risk). To locate the resource I wanted to taint I ran terraform show and grep’ed for the keyword referring to the node:

$ terraform show | grep vsphere_virtual_machine.kubernetes_worker

vsphere_virtual_machine.kubernetes_worker.0:
vsphere_virtual_machine.kubernetes_worker.1:
vsphere_virtual_machine.kubernetes_worker.2:
vsphere_virtual_machine.kubernetes_worker.3:
vsphere_virtual_machine.kubernetes_worker.4:

To find the right node I checked the name field in the terraform show output with the name of the Kubernetes node. Once I was 100% certain I had the right node I tainted the resource:

$ terraform taint vsphere_virtual_machine.kubernetes_worker.0

The resource vsphere_virtual_machine.kubernetes_worker.0 in the module root has been marked as tainted!

Once the resource was tainted I ran Terraform plan to verify the steps that would be performed during the next apply. This is a CRITICAL step and I would highly suggest getting a second set of eyes to review your plan. Everything looked good so I applied my changes which caused the server to begin rebuilding and eventually add itself back to the cluster. I loves me some Terraform and Kubernetes!

Notes from episode 6 of TGIK: Kubeadm

This article was posted by Matty on 2018-02-08 20:00:00 -0500 -0500

Over the past few months I’ve been trying to learn everything there is to know about Kubernetes. Kubernetes is an amazing technology for deploying and scaling containers though it comes with a cost. It’s an incredibly complex piece of software and there are a ton of bells and whistles to become familiar with. One way that I’ve found for coming up to speed is Joe Beda’s weekly TGIK live broadcast. This occurs each Friday at 4PM EST and is CHOCK full of fantastic information. In episode six Joe talks about kubeadm. You can watch it here:

Here are some of my takeways from the episode:

Heptio opensourced Ark which allows you to backup & recover a Kubernetes cluster.
Heptio opensourced sonobuoy which allows you to perform cluster conformance tests. Super cool!
AWS IAM roles can be attached to instances and updated. There was a bit of confusion on this point.
At the time this video was shot kubeadm didn’t support HA. It looks like kubeadm can build HA clusters now.
Kubeadm cluster installation process
- Install docker on each host
- Install Kubernetes repo on each host
- Install kubernetes on each node
- kubeadm init can be used to initialize the control plane
- Join the workers to the control place with kubeadm join
- Install a networking solution
Kubeadm enables the node and RBAC authorizers by default.
When a node joins the cluster it presents a bootstrap token to the API server’s authenticator service. These tokens can be configured to expire at periodic intervals as of Kubernetes 1.8.
Kubeadm configures all of the client and server certificates. It also utilizes the built-in CA.
The node authorizer requires consistent naming to work. This bit me in the past!.
You can talk to the API server from inside the cluster using it’s well known ClusterIP (Uses x.x.x.1).
Kubeadm assigns one certificate to each service to improve security.
The API server “-apiserver-cert-extra-sans” option allows you to extend the list of names your API server can be known as. This will add the additional names to the Subject Alternate Name section of the certificate.
Kubeadm runs all of the control plane services (api server, schduler and controller) as Kubernetes pods. There are pros and cons to this.
The kubelet has two modes of operation:
- Kubelet reaches out to the API server to get its instructions.
- Kubelet uses files on disk to control what it will run (static pods).
The only service managed by systemd on a kubeadm cluster is the kubelet that starts the other services from static files.
Kubeadm adds the kube-proxy and kube-dns server add-ons.
Two interesting projects
- bootkube is a tool for launching self-hosted Kubernetes clusters.
- Kubelet checkpointing is a way for the kubelet to save the current state for future restarts.
You can retrieve boostrap tokens with kubeadm:
$ kubeadm token list
To join a node to the cluster you can use the kubeadm join option:
$ kubeadm join --token xxxx.xxxxxxx APISERVERIP:6443
ConfigMap in kube-public cotnains the cluster information needed to bootstrap a worker. Uses the token to verify the information is from a trusted location.
Kubeadm uses the Kubernetes certificate API on each worker to create a CSR and to get a signed certificate so it can talk to the API server.
Certificate rotation is coming in to a Kubernetes release near you.
To view the cluster configuration kubeadm will use you can retrieve it from the kube-public namespace with curl:
$ curl https://api_server_ip/api/v1/namedpsaces/kube-public/configmaps/cluster-info
Joe has super powers.

Things I need to learn more about:

Need to build a cluster with kubeadm vs. using Kelsey Hightower’s KTHW method.
Provision a cluster on Azure with the credits I have there.
Play with sonobuoy and check to see if reporting can be automated (possible w/ promethus and a metrics interface?).
Poke around the control plane to see how static pod configs look.