This morning I wanted to better understand how requests to ClusterIPs get routed to Kubernetes pods. Properly functioning networking is critical to Kubernetes, and having a solid understanding of what happens under the covers makes debugging problems much, much easier. To get started with my studies I fired up five kuard pods:
$ kubectl create -f kuard.yaml
replicaset "kuard" created
$ kubectl get pods -o wide
NAME          READY     STATUS    RESTARTS   AGE       IP         NODE
kuard-8xwx7   1/1       Running   0          36s       10.1.4.3   kubworker4.prefetch.net
kuard-bd4cj   1/1       Running   0          36s       10.1.1.3   kubworker2.prefetch.net
kuard-hfkgd   1/1       Running   0          36s       10.1.2.4   kubworker5.prefetch.net
kuard-j9fks   1/1       Running   0          36s       10.1.0.3   kubworker3.prefetch.net
kuard-lpzlr   1/1       Running   0          36s       10.1.3.3   kubworker1.prefetch.net
I created five pods in the hope that one would be placed on each worker node. Once the pods were up and running, I exposed them to the cluster with the kubectl expose command:
$ kubectl expose rs kuard --port=8080 --target-port=8080
$ kubectl get svc -o wide kuard
NAME      TYPE        CLUSTER-IP    EXTERNAL-IP   PORT(S)    AGE       SELECTOR
kuard     ClusterIP   10.2.21.155   <none>        8080/TCP   20s       run=kuard
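The NAT rules kube-proxy creates are visible on each worker node. A quick way to pull out the ones for a given service is to dump the nat table and grep for the comments kube-proxy attaches to each rule (a rough sketch; this assumes kube-proxy is running in its default iptables mode and that you have root on the node):
$ sudo iptables-save -t nat | grep 'default/kuard'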
Behind the scenes kube-proxy uses iptables-save and iptables-restore to install and update these rules. Here is the first rule that applies to the kuard service I exposed above:
-A KUBE-SERVICES -d 10.2.21.155/32 -p tcp -m comment --comment "default/kuard: cluster IP" -m tcp --dport 8080 -j KUBE-SVC-CUXC5A3HHHVSSN62
This rule checks if the destination (the argument to “-d”) matches the cluster IP, the destination port (the argument to “--dport”) is 8080 and the protocol (the argument to “-p”) is tcp. If that check passes, the rule jumps to the KUBE-SVC-CUXC5A3HHHVSSN62 target. Here are the rules in the KUBE-SVC-CUXC5A3HHHVSSN62 chain:
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.20000000019 -j KUBE-SEP-CA6TP3H7ZVLC3JFW
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.25000000000 -j KUBE-SEP-ZHHZWPGVXXVHUF5F
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.33332999982 -j KUBE-SEP-H2VR42IC623XBWYH
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -m statistic --mode random --probability 0.50000000000 -j KUBE-SEP-AXZRC2VTEV7ZDZ2C
-A KUBE-SVC-CUXC5A3HHHVSSN62 -m comment --comment "default/kuard:" -j KUBE-SEP-5NFQVOMYN3PVBXGK
This chain contains one rule per pod. Each rule is assigned a probability, and the iptables statistic extension is used to pick one of the endpoints at random. The probabilities look odd at first glance, but they work out to an even split: the first rule matches 1/5 of new connections; if it doesn’t match, the second rule sees the remaining 4/5 and matches 1/4 of those (1/5 of the total again), and so on down the chain, with the last rule catching everything that falls through. Once an endpoint is selected iptables will jump to the target passed to “-j”. Here are the chains it will jump to:
-A KUBE-SEP-5NFQVOMYN3PVBXGK -s 10.1.4.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-5NFQVOMYN3PVBXGK -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.4.3:8080
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -s 10.1.3.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-AXZRC2VTEV7ZDZ2C -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.3.3:8080
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -s 10.1.0.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-CA6TP3H7ZVLC3JFW -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.0.3:8080
-A KUBE-SEP-H2VR42IC623XBWYH -s 10.1.2.4/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-H2VR42IC623XBWYH -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.2.4:8080
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -s 10.1.1.3/32 -m comment --comment "default/kuard:" -j KUBE-MARK-MASQ
-A KUBE-SEP-ZHHZWPGVXXVHUF5F -p tcp -m comment --comment "default/kuard:" -m tcp -j DNAT --to-destination 10.1.1.3:8080
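You can watch the statistic rules doing their work by generating a batch of connections and then checking the per-rule packet counters on the service chain. Only the first packet of each connection traverses these NAT rules (conntrack handles the rest), so the counters roughly track how new connections were spread across the endpoints. A quick sketch, with the chain name copied from above:
$ for i in $(seq 1 100); do curl -s 10.2.21.155:8080 > /dev/null; done
$ sudo iptables -t nat -L KUBE-SVC-CUXC5A3HHHVSSN62 -n -v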
Now here’s where the magic occurs! Once a chain is picked the service IP is NAT’ed to the selected pod’s IP via the “--to-destination” option. Traffic then traverses the host’s public network interface and arrives at the destination node, where it is funneled to the pod (it’s pretty amazing and scary how this works behind the scenes). If I curl the service IP on port 8080:
$ curl 10.2.21.155:8080 > /dev/null
We can see the initial SYN and the translated destination (the IP of the pod the request was sent to) with tcpdump:
$ tcpdump -n -i ens192 port 8080 and 'tcp[tcpflags] == tcp-syn'
tcpdump: verbose output suppressed, use -v or -vv for full protocol decode
listening on ens192, link-type EN10MB (Ethernet), capture size 262144 bytes
13:02:09.129502 IP 192.168.2.44.48102 > 10.1.2.4.webcache: Flags [S], seq 811928755, win 29200, options [mss 1460,sackOK,TS val 3048081500 ecr 0,nop,wscale 7], length 0
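The same translation shows up in the kernel’s connection tracking table. If the conntrack utility from conntrack-tools is installed on the node, filtering on the service IP should show the original tuple (destination 10.2.21.155) next to the reply tuple coming from the pod:
$ sudo conntrack -L -d 10.2.21.155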
The rules above also jump to the KUBE-MARK-MASQ chain, which uses iptables’ mark feature to tag certain packets. I’m not 100% sure how this works (or what the mark is used for) and will need to do some digging this weekend to see what the deal is.
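For anyone who wants to start that digging, the marking and masquerading chains are easy to pull out of the ruleset with the same iptables-save trick used above:
$ sudo iptables-save -t nat | grep -E 'KUBE-MARK-MASQ|KUBE-POSTROUTING'
I learned a lot digging through packet captures and iptables rules and definitely have a MUCH better understanding of how pods and service IPs play with each other.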