Containers are pretty amazing. What’s more amazing is the technology in the Linux kernel that makes them a reality. As Jessie Frazelle mentioned in her "containers aka crazy user space fun" keynote, containers aren’t actual things. They are a combination of Linux kernel technologies: specifically cgroups, namespaces and LSMs. To get a better understanding of how namespaces work, I sat down last weekend, read a bunch of code and dug into some containers running on one of my Kubernetes workers. I will share what I found in this blog post.
So what exactly is a Linux namespace? The manual page for namespaces(7) gives an excellent overview:
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.
There are currently seven types of namespaces in the Linux kernel: cgroup, IPC, network, mount, PID, user and UTS.
To see the namespaces on a server you can run the lsns utility:
$ lsns
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 122 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531836 pid 114 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531837 user 122 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531838 uts 115 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531839 ipc 114 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531840 mnt 110 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026532803 mnt 1 1491 root /pause
4026532804 uts 1 1491 root /pause
4026532805 ipc 2 1491 root /pause
4026532806 pid 1 1491 root /pause
4026532810 net 2 1491 root /pause
4026532884 mnt 1 3075 root /usr/local/bin/kube-router
4026532885 pid 1 3075 root /usr/local/bin/kube-router
4026532889 mnt 1 1768 nobody /app1
4026532890 uts 1 1768 nobody /app2
4026532891 pid 1 1768 nobody /app3
.......
Processes can be assigned to a namespace in two ways. The first is via one or more CLONE_NEWXXXX flags passed to the clone(2) system call. These are all documented in namespaces(7). The second way is through the setns(2) system call. clone allows new namespaces to be created and assigned to a process, while setns adds a process to an existing namespace (this is how sidecars work). We can observe how setns works with the nsenter and bcc trace utilities:
$ nsenter -t 29221 --mount --uts --ipc --net --pid sh
$ trace 'sys_setns "setns was called: FD: %d NSTYPE: 0x%x", arg1, arg2'
PID TID COMM FUNC -
30465 30465 nsenter sys_setns setns was called: FD: 3 NSTYPE: 0x8000000
30465 30465 nsenter sys_setns setns was called: FD: 4 NSTYPE: 0x4000000
30465 30465 nsenter sys_setns setns was called: FD: 5 NSTYPE: 0x40000000
30465 30465 nsenter sys_setns setns was called: FD: 6 NSTYPE: 0x20000000
30465 30465 nsenter sys_setns setns was called: FD: 7 NSTYPE: 0x20000
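To make the setns() calls in the trace a little more concrete, here is a minimal Python sketch of the same idea: open the namespace files under /proc/<pid>/ns and hand each descriptor to setns(2) through ctypes. The helper names ns_paths and enter_namespaces are hypothetical. Unlike nsenter in the trace, the sketch passes 0 as the nstype, which tells the kernel to accept whatever namespace type the descriptor refers to. It is not how nsenter itself is implemented, and it needs root to do anything useful.

```python
import ctypes
import os

# Namespace names as they appear under /proc/<pid>/ns, in the same order
# as the setns() calls in the trace output (ipc, uts, net, pid, mnt).
DEFAULT_TYPES = ("ipc", "uts", "net", "pid", "mnt")


def ns_paths(pid, types=DEFAULT_TYPES):
    """Return the /proc paths that identify the namespaces of `pid`."""
    return [f"/proc/{pid}/ns/{name}" for name in types]


def enter_namespaces(pid, types=DEFAULT_TYPES):
    """Join the namespaces of `pid` (hypothetical helper; needs root)."""
    libc = ctypes.CDLL(None, use_errno=True)
    # Open every descriptor before joining anything: once the mount
    # namespace changes, the original /proc may no longer be visible.
    fds = [os.open(path, os.O_RDONLY) for path in ns_paths(pid, types)]
    try:
        for fd in fds:
            # nstype 0 means "any namespace type" (an assumption of this
            # sketch; nsenter passes the explicit CLONE_NEW* value).
            if libc.setns(fd, 0) != 0:
                err = ctypes.get_errno()
                raise OSError(err, os.strerror(err))
    finally:
        for fd in fds:
            os.close(fd)
```

After enter_namespaces(29221) returns you would typically fork and exec a shell, because a PID namespace joined with setns only takes effect for children of the caller.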
In the nsenter example above I’m using the nsenter utility to enter the mount, uts, ipc, net and pid namespaces of process id 29221. Once the setns() system calls complete, execve() will start the program passed as an argument in those namespaces. NSTYPE is displayed as a hexadecimal number, but the actual namespace type is passed to setns as an int. The kernel headers list the namespace types in /usr/include/linux/sched.h, which we can view with grep:
$ grep CLONE_NEW /usr/include/linux/sched.h
#define CLONE_NEWNS 0x00020000 /* New mount namespace group */
#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname namespace */
#define CLONE_NEWIPC 0x08000000 /* New ipc namespace */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
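If you don't feel like matching hex values by eye, a few lines of Python can do the lookup. This is a small sketch: the flag table is copied from the grep output above, and decode_nstype is a hypothetical helper name.

```python
# Values copied from the grep output of /usr/include/linux/sched.h.
CLONE_FLAGS = {
    0x00020000: "CLONE_NEWNS",
    0x02000000: "CLONE_NEWCGROUP",
    0x04000000: "CLONE_NEWUTS",
    0x08000000: "CLONE_NEWIPC",
    0x10000000: "CLONE_NEWUSER",
    0x20000000: "CLONE_NEWPID",
    0x40000000: "CLONE_NEWNET",
}


def decode_nstype(value):
    """Return the names of every CLONE_NEW* bit set in `value`."""
    return [name for bit, name in sorted(CLONE_FLAGS.items()) if value & bit]


if __name__ == "__main__":
    # NSTYPE values in the order they appear in the trace output.
    for raw in ("0x8000000", "0x4000000", "0x40000000", "0x20000000", "0x20000"):
        print(raw, "->", ", ".join(decode_nstype(int(raw, 16))))
```

The five values from the trace decode to CLONE_NEWIPC, CLONE_NEWUTS, CLONE_NEWNET, CLONE_NEWPID and CLONE_NEWNS, which lines up with the --ipc, --uts, --net, --pid and --mount options given to nsenter.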
The third field of each #define line is the bit pattern for the CLONE flag and can be matched up to the NSTYPE values in the trace output. Each process on a system has its process id registered in /proc. Inside that pid directory is an ns directory which lists all of the namespaces (and their ids) associated with the process. Let’s take a look at the ns entries for a docker shim process running on a Kubernetes worker:
$ cd /proc/3878/ns
$ ls -la
total 0
dr-x--x--x. 2 root root 0 Feb 15 20:54 .
dr-xr-xr-x. 9 root root 0 Feb 15 20:54 ..
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 net -> net:[4026531993]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid_for_children -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 uts -> uts:[4026531838]
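The same information is easy to read programmatically, since every entry in the ns directory is a symlink whose target has the form type:[id]. The following is a rough sketch with hypothetical helper names (parse_ns_link, read_namespaces), assuming a Linux /proc and enough permissions to read the target process:

```python
import os
import re

# Link targets look like "net:[4026531993]".
LINK_RE = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target):
    """Split a link target into (namespace type, namespace id)."""
    match = LINK_RE.match(target)
    if match is None:
        raise ValueError(f"unexpected namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def read_namespaces(pid="self"):
    """Map each entry in /proc/<pid>/ns to its namespace id."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    for entry in sorted(os.listdir(ns_dir)):
        _, ns_id = parse_ns_link(os.readlink(os.path.join(ns_dir, entry)))
        result[entry] = ns_id
    return result


if __name__ == "__main__":
    # Print the namespaces of the current process, one per line.
    for name, ns_id in read_namespaces().items():
        print(f"{name:<18}{ns_id}")
```

Calling read_namespaces(3878) as root on the worker above would return the ids shown in the ls listing, keyed by entry name.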
In the output above we can see that there is one entry per namespace type (plus pid_for_children). Now you may be asking yourself: how does all of this relate to Kubernetes? Well, I’m glad you are a kickass reader and posed that question! When the kubelet is asked to run a single-container pod it will actually spin up two containers. One container will be used to execute your application and the second will be the pause container. If you want to learn all about the differences between the two types of containers you can check out the amazing pod and pause container posts from Ian Lewis.
Now what makes this interesting is the way the namespaces are shared between containers in a pod. To better understand the relationships between the containers in a pod I fired up a pod with kubectl:
$ kubectl run -it --image=centos centos bash
Next I located the pod’s containers with the docker ps command:
$ docker ps -a | grep centos
53856330119b docker.io/centos@sha256:2671f7a3eea36ce43609e9fe7435ade83094291055f1c96d9d1d1d7c0b986a5d "bash" 5 minutes ago Up 5 minutes k8s_centos_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0
aac7ecdfe5e4 gcr.io/google_containers/pause-amd64:3.0 "/pause" 5 minutes ago Up 5 minutes k8s_POD_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0
This turned up two containers: one for the pause container and another for the application I am trying to run. To further understand the relationship between both containers I grabbed their PIDs:
$ docker inspect $(docker ps -a | awk '/centos/ { print $1}') -f '{{ .State.Pid }}'
2663
2589
I then spent a few minutes developing a Python script to list namespace ids and the process ids that are associated with them. Running the script and grep’ing for the PIDs above produced the following:
$ sharedns.py | egrep '2663|2589|Name'
Namespace ID Namespace Pids Sharing Namespace
4026532249 mnt 2589
4026531835 cgroup 1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026531837 user 1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026532254 net 2589,2663
4026532314 mnt 2663
4026532315 uts 2663
4026532316 pid 2663
4026532251 ipc 2589,2663
4026532250 uts 2589
4026532252 pid 2589
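The sharedns.py script itself isn't shown here, so the following is only a guess at the general approach, not the original code: read the ns links of every process in /proc and invert them into a map from namespace to the PIDs that share it. The function names group_by_namespace and scan_proc are hypothetical.

```python
import os
from collections import defaultdict


def group_by_namespace(links_by_pid):
    """Invert {pid: {entry: "type:[id]"}} into {(id, type): [pids]}."""
    shared = defaultdict(set)
    for pid, links in links_by_pid.items():
        for target in links.values():
            ns_type, _, rest = target.partition(":[")
            shared[(int(rest.rstrip("]")), ns_type)].add(pid)
    # Sets drop the duplicate created by pid and pid_for_children.
    return {key: sorted(pids) for key, pids in shared.items()}


def scan_proc():
    """Collect the ns link targets of every process we may inspect."""
    links = {}
    for name in os.listdir("/proc"):
        if not name.isdigit():
            continue
        ns_dir = f"/proc/{name}/ns"
        try:
            links[int(name)] = {
                entry: os.readlink(f"{ns_dir}/{entry}")
                for entry in os.listdir(ns_dir)
            }
        except OSError:
            continue  # process exited or permission was denied
    return links


if __name__ == "__main__":
    print(f"{'Namespace ID':<14}{'Namespace':<11}Pids Sharing Namespace")
    groups = group_by_namespace(scan_proc())
    for (ns_id, ns_type), pids in sorted(groups.items()):
        print(f"{ns_id:<14}{ns_type:<11}{','.join(map(str, pids))}")
```

Run as root so that the ns directories of other users' processes are readable; without it those processes are silently skipped.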
This helped me visualize the namespace associations and walk away with the following notes:
If you are using docker and your container needs access to a namespace outside of the default namespaces it was assigned (net & ipc), you can use the following options to add it to additional namespaces:
$ docker run --help | grep namespace
--ipc string IPC namespace to use
--pid string PID namespace to use
--userns string User namespace to use
--uts string UTS namespace to use
This was an amazing journey and I feel like I have a solid grasp of how namespaces work now. If you want to learn more about the actual implementation inside the Linux kernel you can check out LWN’s two-part series (part one and part two) on this super cool topic! Hit me up on twitter if you have any feedback!