Containers are pretty amazing. What’s more amazing is the technology in the Linux kernel that makes them a reality. As Jessie Frazelle mentioned in her "containers aka crazy user space fun" keynote, containers aren’t actual things. They are a combination of Linux kernel technologies: specifically, cgroups, namespaces and LSMs. To get a better understanding of how namespaces work I sat down last weekend, read a bunch of code and dug into some containers running on one of my Kubernetes workers. I will share what I found in this blog post.
So what exactly is a Linux namespace? The manual page for namespaces(7) gives an excellent overview:
A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.
There are currently seven types of namespaces in the Linux kernel: cgroup, IPC, mount, network, PID, user, and UTS.
To see the namespaces on a server you can run the lsns utility:
$ lsns
NS TYPE NPROCS PID USER COMMAND
4026531835 cgroup 122 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531836 pid 114 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531837 user 122 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531838 uts 115 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531839 ipc 114 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531840 mnt 110 1 root /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026532803 mnt 1 1491 root /pause
4026532804 uts 1 1491 root /pause
4026532805 ipc 2 1491 root /pause
4026532806 pid 1 1491 root /pause
4026532810 net 2 1491 root /pause
4026532884 mnt 1 3075 root /usr/local/bin/kube-router
4026532885 pid 1 3075 root /usr/local/bin/kube-router
4026532889 mnt 1 1768 nobody /app1
4026532890 uts 1 1768 nobody /app2
4026532891 pid 1 1768 nobody /app3
.......
Processes can be assigned to a namespace in two ways. The first is via one or more CLONE_NEWXXXX flags passed to the clone system call; these are all documented in namespaces(7). The second is through the setns(2) system call. Clone allows new namespaces to be created and assigned to a process, while setns adds a process to an existing namespace (this is how sidecars work). We can observe how setns works with the nsenter and bcc trace utilities:
$ nsenter -t 29221 --mount --uts --ipc --net --pid sh
$ trace 'sys_setns "setns was called: FD: %d NSTYPE: 0x%x", arg1, arg2'
PID TID COMM FUNC -
30465 30465 nsenter sys_setns setns was called: FD: 3 NSTYPE: 0x8000000
30465 30465 nsenter sys_setns setns was called: FD: 4 NSTYPE: 0x4000000
30465 30465 nsenter sys_setns setns was called: FD: 5 NSTYPE: 0x40000000
30465 30465 nsenter sys_setns setns was called: FD: 6 NSTYPE: 0x20000000
30465 30465 nsenter sys_setns setns was called: FD: 7 NSTYPE: 0x20000
In the example above I’m using the nsenter utility to enter the mount, uts, ipc, net and pid namespaces of process id 29221. Once the setns() calls complete, execve() starts the program passed as an argument inside those namespaces. NSTYPE is displayed as a hexadecimal number, but the actual namespace type is passed to setns as an int. The CLONE_NEW* namespace flags are defined in the kernel header /usr/include/linux/sched.h, which we can view with grep:
$ grep CLONE_NEW /usr/include/linux/sched.h
#define CLONE_NEWNS 0x00020000 /* New mount namespace group */
#define CLONE_NEWCGROUP 0x02000000 /* New cgroup namespace */
#define CLONE_NEWUTS 0x04000000 /* New utsname namespace */
#define CLONE_NEWIPC 0x08000000 /* New ipc namespace */
#define CLONE_NEWUSER 0x10000000 /* New user namespace */
#define CLONE_NEWPID 0x20000000 /* New pid namespace */
#define CLONE_NEWNET 0x40000000 /* New network namespace */
The third field is the bit pattern for each CLONE flag and can be matched up to the NSTYPE values listed above. Each process on a system has its process id registered in /proc. Inside that pid directory is a ns directory which lists all of the namespaces (and their ids) associated with the process. Let’s take a look at the ns entries for a docker shim process running on a Kubernetes worker:
$ cd /proc/3878/ns
$ ls -la
total 0
dr-x--x--x. 2 root root 0 Feb 15 20:54 .
dr-xr-xr-x. 9 root root 0 Feb 15 20:54 ..
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 net -> net:[4026531993]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid_for_children -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 uts -> uts:[4026531838]
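Each of these entries is a file that can be opened and handed to setns(2), which is exactly the mechanism nsenter uses. Here is a minimal Python sketch of that mechanism; it is purely my own illustration (the ctypes wrapper and helper function are not part of any tool mentioned in this post):

import ctypes
import os

# CLONE_NEWNET from linux/sched.h (see the grep output earlier in the post).
CLONE_NEWNET = 0x40000000

libc = ctypes.CDLL("libc.so.6", use_errno=True)

def join_net_namespace(pid):
    """Join the network namespace of another process (hypothetical helper)."""
    fd = os.open("/proc/%d/ns/net" % pid, os.O_RDONLY)
    try:
        # The second argument lets the kernel verify the fd really refers to
        # a namespace of this type; passing 0 skips the check.
        if libc.setns(fd, CLONE_NEWNET) != 0:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
    finally:
        os.close(fd)

# Example: join the pause container's network namespace (PID 1491 in the
# lsns output above) and then exec something like "ip addr show" inside it.
# join_net_namespace(1491)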
There is one entry per namespace type in the listing above. Now you may be asking yourself how all of this relates to Kubernetes. Well I’m glad you are a kickass reader and posed that question! When the kubelet is asked to create a new pod it will actually spin up two containers: one to execute your application and a second to run the pause container. If you want to learn all about the differences between the two you can check out the amazing pod and pause container posts from Ian Lewis.
Now what makes this interesting is the way namespaces are shared between containers in a pod. To better understand the namespace relationships between the containers in a pod I fired up a pod with kubectl:
$ kubectl run -it --image=centos centos bash
Next I located the containers backing my pod with the docker ps command:
$ docker ps -a | grep centos
53856330119b docker.io/centos@sha256:2671f7a3eea36ce43609e9fe7435ade83094291055f1c96d9d1d1d7c0b986a5d "bash" 5 minutes ago Up 5 minutes k8s_centos_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0
aac7ecdfe5e4 gcr.io/google_containers/pause-amd64:3.0 "/pause" 5 minutes ago Up 5 minutes k8s_POD_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0
This turned up two containers: the pause container and the container for the application I am trying to run. To further understand the relationship between the two I grabbed their PIDs:
$ docker inspect $(docker ps -a | awk '/centos/ { print $1}') -f '{{ .State.Pid }}'
2663
2589
I then spent a few minutes developing a Python script to list namespace ids and the process ids associated with them (a rough sketch of the idea follows the output below). Running the script and grep’ing for the PIDs above produced the following:
$ sharedns.py | egrep '2663|2589|Name'
Namespace ID Namespace Pids Sharing Namespace
4026532249 mnt 2589
4026531835 cgroup 1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026531837 user 1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026532254 net 2589,2663
4026532314 mnt 2663
4026532315 uts 2663
4026532316 pid 2663
4026532251 ipc 2589,2663
4026532250 uts 2589
4026532252 pid 2589
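Here is that rough sketch: it just walks /proc and groups processes by the namespace ids their ns symlinks point to. It is a minimal stand-in for my sharedns.py, not the actual script:

#!/usr/bin/env python3
"""Group process ids by the namespaces they share (a minimal stand-in for
the sharedns.py script mentioned above, not the original)."""
import os
from collections import defaultdict

def collect_namespaces():
    namespaces = defaultdict(list)  # (namespace id, namespace type) -> [pids]
    for pid in filter(str.isdigit, os.listdir("/proc")):
        ns_dir = "/proc/%s/ns" % pid
        try:
            for ns_type in os.listdir(ns_dir):
                # Each entry is a symlink that looks like "mnt:[4026531840]".
                target = os.readlink(os.path.join(ns_dir, ns_type))
                ns_id = target.split("[")[1].rstrip("]")
                namespaces[(ns_id, ns_type)].append(pid)
        except OSError:
            continue  # the process exited or we lack permission
    return namespaces

if __name__ == "__main__":
    print("%-14s %-10s %s" % ("Namespace ID", "Namespace", "Pids Sharing Namespace"))
    for (ns_id, ns_type), pids in sorted(collect_namespaces().items()):
        print("%-14s %-10s %s" % (ns_id, ns_type, ",".join(pids)))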
This helped me visualize the namespace associations and walk away with the following notes: the pause container and the application container share the net and ipc namespaces, each container has its own mnt, uts and pid namespaces, and the user and cgroup namespaces are shared with every other process on the host.
If you are using docker and your container needs access to namespaces beyond the ones it is assigned by default (net and ipc in the pod above), you can use the following options to place it in additional namespaces (an example follows the option list):
$ docker run --help | grep namespace
--ipc string IPC namespace to use
--pid string PID namespace to use
--userns string User namespace to use
--uts string UTS namespace to use
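As an example, the following command (the container name myapp is made up) starts a debugging container that joins the PID and IPC namespaces of an existing container:
$ docker run --rm -it --pid=container:myapp --ipc=container:myapp centos bash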
This was an amazing journey and I feel like I have a solid grasp of how namespaces work now. If you want to learn more about the actual implementation inside the Linux kernel you can check out LWN's two-part series (part one and part two) on this super cool topic! Hit me up on twitter if you have any feedback!
I run a number of Digital Ocean droplets and like to keep on top of security updates. My droplets run CentOS, which provides the yum-cron utility; it can be configured to send an e-mail when security updates are available. Configuring yum-cron is a snap. First, you will need to install the package:
$ yum -y install yum-cron
Once the package is installed you can edit /etc/yum/yum-cron.conf to fit your needs. I typically modify update_cmd, system_name, emit_via, email_from and email_to. The update_cmd option controls which updates you are notified about: you can get notifications for all updates or just for security updates. The system_name option contains the hostname you want displayed in the e-mail, and email_from and email_to contain the addresses used in the From: and To: fields (a sketch of these settings is included after the systemctl command below). To enable the service you can use systemctl:
$ systemctl enable yum-cron.service && systemctl start yum-cron.service
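For reference, here is a sketch of how the settings mentioned above map to sections in /etc/yum/yum-cron.conf. The values are just an example of a security-only notification setup and the e-mail addresses are placeholders:

[commands]
# Only consider security updates; flip apply_updates to yes if you want
# yum-cron to install them automatically instead of just notifying.
update_cmd = security
download_updates = yes
apply_updates = no

[emitters]
system_name = flip.prefetch.net
emit_via = email

[email]
email_from = yum-cron@flip.prefetch.net
email_to = admin@example.com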
Once enabled yum-cron will check for updates at the interval defined in the configuration file and produce an e-mail similar to the following if it detects updates:
The following updates will be downloaded on flip.prefetch.net:
================================================================================
Package Arch Version Repository
Size
================================================================================
Installing:
kernel x86_64 3.10.0-693.17.1.el7 updates 43 M
Updating:
binutils x86_64 2.25.1-32.base.el7_4.2 updates 5.4 M
dhclient x86_64 12:4.2.5-58.el7.centos.1 updates 282 k
dhcp-common x86_64 12:4.2.5-58.el7.centos.1 updates 174 k
dhcp-libs x86_64 12:4.2.5-58.el7.centos.1 updates 130 k
initscripts x86_64 9.49.39-1.el7_4.1 updates 435 k
kernel-tools x86_64 3.10.0-693.17.1.el7 updates 5.1 M
kernel-tools-libs x86_64 3.10.0-693.17.1.el7 updates 5.1 M
kmod x86_64 20-15.el7_4.7 updates 121 k
kmod-libs x86_64 20-15.el7_4.7 updates 50 k
kpartx x86_64 0.4.9-111.el7_4.2 updates 73 k
Installing for dependencies:
linux-firmware noarch 20170606-58.gitc990aae.el7_4 updates 35 M
Transaction Summary
================================================================================
Install 1 Package (+1 Dependent package)
Upgrade 20 Packages
Updates downloaded successfully.
While this solution isn’t suited for large environments it definitely fits a need for personal cloud instances. If you are feeling frisky you can also configure yum-cron to automatically apply the updates it finds.
I’ve been looking into solutions to advertise pod CIDR blocks to devices outside of my Kubernetes cluster network. The Calico and kube-router projects both have working solutions to solve this problem so I’m evaluating both products. While watching the kube-router BGP demo the instructor used kubectl and jq to display the CIDR blocks assigned to each Kubernetes worker:
$ kubectl get nodes -o json | jq '.items[] | .spec'
{
"externalID": "kubworker1.prefetch.net",
"podCIDR": "10.1.0.0/24"
}
{
"externalID": "kubworker2.prefetch.net",
"podCIDR": "10.1.4.0/24"
}
{
"externalID": "kubworker3.prefetch.net",
"podCIDR": "10.1.1.0/24"
}
{
"externalID": "kubworker4.prefetch.net",
"podCIDR": "10.1.2.0/24"
}
{
"externalID": "kubworker5.prefetch.net",
"podCIDR": "10.1.3.0/24"
}
This is a cool use of the kubectl JSON output option and jq. Stashing this away here for safe keeping. :)
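If jq isn't installed on the box you're working from, the same summary can be pulled out with a few lines of Python. This is just a rough equivalent of the jq filter above; note that .spec.externalID may not be present on newer Kubernetes releases, so the node name is used as a fallback:

import json
import subprocess

# Grab the node objects as JSON and print each node's id and podCIDR.
nodes = json.loads(subprocess.check_output(
    ["kubectl", "get", "nodes", "-o", "json"]))

for item in nodes["items"]:
    spec = item["spec"]
    # Older releases expose .spec.externalID; fall back to the node name.
    name = spec.get("externalID", item["metadata"]["name"])
    print(name, spec["podCIDR"])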
When I first started learning how the Kubernetes networking model works I wanted to configure everything manually to see how the pieces fit together. This was a great learning experience and was easy to automate with ansible. This solution has a couple of downsides. If a machine is rebooted it loses the PodCIDR routes since they aren’t persisted to disk. It also doesn’t add or remove routes for hosts as they are added and removed from the cluster. I wanted a more permanent and dynamic solution so I started looking at the flannel host-gw and vxlan backends.
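For context, the manual approach boils down to looking up each node's podCIDR and its host address, then adding a route for it. Here is a rough Python sketch of that idea (not my actual ansible playbook; it assumes you run it as root on each worker and that node names match each worker's FQDN):

import json
import socket
import subprocess

# Look up every node's podCIDR and InternalIP from the API server and add a
# host route for it (a conceptual sketch of what I automated with ansible).
nodes = json.loads(subprocess.check_output(
    ["kubectl", "get", "nodes", "-o", "json"]))

for node in nodes["items"]:
    name = node["metadata"]["name"]
    if name == socket.getfqdn():
        continue  # no route needed for the local node's own podCIDR
    pod_cidr = node["spec"]["podCIDR"]
    node_ip = next(addr["address"] for addr in node["status"]["addresses"]
                   if addr["type"] == "InternalIP")
    # Equivalent to: ip route replace <podCIDR> via <InternalIP>
    subprocess.run(["ip", "route", "replace", pod_cidr, "via", node_ip],
                   check=True)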
The flannel host-gw option was the first solution I evaluated. This backend takes the PodCIDR addresses assigned to each node and creates routing table entries so the workers can reach each other's pods through the cluster IP range. In addition, flanneld will NAT the cluster IPs to the host IP if a pod needs to contact a host outside of the local broadcast domain. The flannel daemon (flanneld) runs as a DaemonSet, so one pod (and one flanneld daemon) will be created on each worker. Setting up the flannel host-gw backend is ridiculously easy. To begin, you will need to download the deployment manifest from GitHub:
$ wget https://raw.githubusercontent.com/coreos/flannel/master/Documentation/kube-flannel.yml
Once you retrieve the manifest you will need to change the Type in the net-conf.json YAML block from vxlan to host-gw. We can use our good buddy sed to make the change:
$ sed 's/vxlan/host-gw/' -i kube-flannel.yml
To apply the configuration to your cluster you can use the kubectl create command:
$ kubectl create -f kube-flannel.yml
This will create several Kubernetes objects, including the ConfigMap that holds the net-conf.json configuration and the DaemonSet that runs a flanneld pod on each worker.
To verify the pods were created and are currently running we can use the kubectl get command:
$ kubectl get pods -n kube-system -o wide
NAME READY STATUS RESTARTS AGE IP NODE
kube-flannel-ds-42nwn 1/1 Running 0 5m 192.168.2.45 kubworker2.prefetch.net
kube-flannel-ds-49zvp 1/1 Running 0 5m 192.168.2.48 kubworker5.prefetch.net
kube-flannel-ds-t8g9f 1/1 Running 0 5m 192.168.2.44 kubworker1.prefetch.net
kube-flannel-ds-v6kdr 1/1 Running 0 5m 192.168.2.46 kubworker3.prefetch.net
kube-flannel-ds-xnlzc 1/1 Running 0 5m 192.168.2.47 kubworker4.prefetch.net
We can also use the kubectl logs command to review the flanneld logs that were produced when it was initialized:
$ kubectl logs -n kube-system kube-flannel-ds-t8g9f
I0220 14:31:23.347252 1 main.go:475] Determining IP address of default interface
I0220 14:31:23.347435 1 main.go:488] Using interface with name ens192 and address 192.168.2.44
I0220 14:31:23.347446 1 main.go:505] Defaulting external address to interface address (192.168.2.44)
I0220 14:31:23.357568 1 kube.go:131] Waiting 10m0s for node controller to sync
I0220 14:31:23.357622 1 kube.go:294] Starting kube subnet manager
I0220 14:31:24.357751 1 kube.go:138] Node controller sync successful
I0220 14:31:24.357771 1 main.go:235] Created subnet manager: Kubernetes Subnet Manager - kubworker1.prefetch.net
I0220 14:31:24.357774 1 main.go:238] Installing signal handlers
I0220 14:31:24.357869 1 main.go:353] Found network config - Backend type: host-gw
I0220 14:31:24.357984 1 main.go:300] Wrote subnet file to /run/flannel/subnet.env
I0220 14:31:24.357988 1 main.go:304] Running backend.
I0220 14:31:24.358007 1 main.go:322] Waiting for all goroutines to exit
I0220 14:31:24.358044 1 route_network.go:53] Watching for new subnet leases
I0220 14:31:24.443807 1 route_network.go:85] Subnet added: 10.1.4.0/24 via 192.168.2.45
I0220 14:31:24.444040 1 route_network.go:85] Subnet added: 10.1.1.0/24 via 192.168.2.46
I0220 14:31:24.444798 1 route_network.go:85] Subnet added: 10.1.2.0/24 via 192.168.2.47
I0220 14:31:24.444883 1 route_network.go:85] Subnet added: 10.1.3.0/24 via 192.168.2.48
To verify the PodCIDR routes were created we can log into one of the workers and run ip route show:
$ ip route show
default via 192.168.2.254 dev ens192 proto static metric 100
10.1.1.0/24 via 192.168.2.46 dev ens192
10.1.2.0/24 via 192.168.2.47 dev ens192
10.1.3.0/24 via 192.168.2.48 dev ens192
10.1.4.0/24 via 192.168.2.45 dev ens192
172.17.0.0/16 dev docker0 proto kernel scope link src 172.17.0.1 linkdown
192.168.2.0/24 dev ens192 proto kernel scope link src 192.168.2.44 metric 100
Sweet! My cluster IP range is 10.1.0.0/16 and the output above shows the routes this worker will take to reach cluster IPs on other workers. Now if you’re like me you may be wondering how flannel creates routes on the host when it's running in a container. Here’s where the power of DaemonSets shines. Inside the deployment manifest flannel sets hostNetwork to true:
spec:
hostNetwork: true
nodeSelector:
beta.kubernetes.io/arch: amd64
tolerations:
- key: node-role.kubernetes.io/master
operator: Exists
effect: NoSchedule
This allows the pod to access the host's network namespace. There are a couple of items you should be aware of. First, the flannel manifest I downloaded from GitHub uses a flannel image from the quay.io repository, and I’m always nervous about using images I didn't build from scratch (and validate with digital signatures) using automated build tooling. Second, if we log into one of the flannel containers and run ps:
$ kubectl exec -i -t -n kube-system kube-flannel-ds-t8g9f ash
/ # ps auxwww
PID USER TIME COMMAND
1 root 0:00 /opt/bin/flanneld --ip-masq --kube-subnet-mgr
1236 root 0:00 ash
1670 root 0:00 ash
1679 root 0:00 ps auxwww
You will notice that flanneld is started with the "--kube-subnet-mgr" option. This option tells flanneld to contact the API server to retrieve the subnet assignments. It will also cause flanneld to watch for network changes (host additions and removals) and adjust the host routes accordingly. In a follow-up post I’ll dig into vxlan and some techniques I found useful for debugging node-to-node communications.
Kubectl has some pretty amazing features built in. One useful option is the ability to copy files into and out of containers. This functionality is provided through the cp command. To copy the file app.log from a container in the pod myapp-4dpjr to the local directory you can use the following cp syntax:
$ kubectl cp myapp-4dpjr:/app/app.log .
To copy a file into a container you can reverse the cp arguments:
$ kubectl cp debugger myapp-4dpjr:/bin
In order to copy files into or out of a container, the tar executable needs to be installed inside it (kubectl cp is implemented on top of tar). Cool stuff.