Blog O' Matty


Running an ansible task on one node in a group

This article was posted by Matty on 2018-03-03 11:15:24 -0500 EST

I’ve been using Ansible to provision and upgrade my Kubernetes clusters. As part of bootstrapping my hosts Ansible installs flannel, kube-router, kube-dns and in some cases kured. The deployment manifests that are used to create these resources need to be kubectl create'ed on a single node. When I was reasoning through the best way to approach this problem, two ideas came to mind:

Both options work, but the second one brings up an interesting question. If my inventory contains a list of controllers:

[kubcontrollers]
kubcontroller1.homefetch.net
kubcontroller2.homefetch.net
kubcontroller3.homefetch.net

How do I ensure that my kubectl create command runs on just one node? I did some experimenting and this is actually pretty easy to do. First, I created a new group with the first node in the kubcontrollers group:

[kubmaster]
kubcontroller1.homefetch.net

Then in my playbook I checked to see if the name in inventory_hostname is in the kubmaster group. If so, I run kubectl create on just that node. Here is the YAML I created to get this working:

- name: Check to see if the flannel deployment manifest exists
  stat:
    path: "{{ kubernetes_directory }}/{{ flannel_deployment_manifest }}"
  register: flannel_config_exists
  tags: flannel

- name: Create the flannel deployment manifest if it doesn't exist
  template:
    src: "{{ flannel_deployment_manifest_template }}"
    dest: "{{ kubernetes_directory }}/{{ flannel_deployment_manifest }}"
    owner: root
    group: root
    mode: 0600
  register: flannel_config_changed
  tags: flannel

- name: Creating the initial flannel pods with kubectl create
  shell: "{{ KUBECTL_BINARY }} create -f {{ kubernetes_directory }}/{{ flannel_deployment_manifest }}"
  args:
    chdir: "{{ kubernetes_directory }}"
  when: >
        inventory_hostname in groups['kubmaster'] and
        flannel_config_exists.stat.exists == False and
        flannel_config_changed.changed
  tags: flannel
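
Another way to accomplish the same thing, if you would rather not maintain a separate kubmaster group, is to combine run_once with delegate_to so Ansible picks the first host in the kubcontrollers group. Here is a rough sketch of that approach:

- name: Creating the initial flannel pods with kubectl create
  shell: "{{ KUBECTL_BINARY }} create -f {{ kubernetes_directory }}/{{ flannel_deployment_manifest }}"
  run_once: true
  delegate_to: "{{ groups['kubcontrollers'][0] }}"
  tags: flannel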

Jesse Keating’s Mastering Ansible and Jeff Geerling’s Ansible for DevOps sure have come in handy during the development of my Kubernetes installation and upgrade playbooks. Loves me some ansible!

Commenting out large blocks of code with vim

This article was posted by Matty on 2018-03-03 10:00:00 -0500 EST

Last night I was helping a friend debug a really cool problem. During our debugging session he saw me use vim shortcuts to comment and uncomment large chunks of code and append strings to numerous lines. He thought this was pretty cool and asked me to show him how to do this. I love teaching folks and sharing my knowledge, so I put him on the keyboard and walked him through each item. I also told him I would note this in my blog for future reference.

I don’t recall where I originally read about visual blocks or the mass comment shortcut. Most likely I came across it when I read the vim documentation. To illustrate this useful feature let’s say you are debugging a problem and want to comment out the following section of code (thanks kured for the snippet):

func commandReboot(nodeID string) {
        log.Infof("Commanding reboot")

        if slackHookURL != "" {
                if err := slack.NotifyReboot(slackHookURL, slackUsername, nodeID); err != nil {
                        log.Warnf("Error notifying slack: %v", err)
                }
        }

        // Relies on /var/run/dbus/system_bus_socket bind mount to talk to systemd
        rebootCmd := exec.Command("/bin/systemctl", "reboot")
        if err := rebootCmd.Run(); err != nil {
                log.Fatalf("Error invoking reboot command: %v", err)
        }
}

You can type j or k to move to the func line and then hit Ctrl-V to enter visual block mode. You can then use j to extend the block down to the end of the section you want to comment. Once all of the lines you want to comment are marked, hit Shift+I to insert and type a comment character. When you hit escape the comment character will be propagated to all of the lines in your visual block. To remove the comments you can type u to undo the change. If you already closed out of the file you can reverse the steps above: highlight the comment characters with Ctrl-V and type x to delete them from every line in the block.
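
The whole sequence looks something like this (the motion count is just an example, adjust it for the size of your block):

Ctrl-V       enter visual block mode on the func line
14j          extend the block down to the closing brace
Shift-I      insert at the beginning of every selected line
//           type the comment characters
Esc          propagate the comment to every line in the block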

One other nifty shortcut is the ability to append text to multiple lines. Say you are working on the following systemd unit file:

ExecStart=/usr/local/bin/kubelet.v1.9.2 
  --allow-privileged=true 
  --anonymous-auth=false 
  --client-ca-file=/kubernetes/rootca.pem 
  --cluster-dns=10.2.0.10 

To allow options to span lines you need to append a \ character to each line. To do this quickly with vim you can use Ctrl-V to highlight the ExecStart block. Once the block is highlighted you can type $ to extend the selection to the end of each line. To add a backslash to all of the marked lines you can type Shift+A to append, type a \, and then hit escape. The entire block will now have a \ appended to it. Vim is crazy powerful and I learn a new tip or trick just about every day.
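
The append case looks something like this (again, adjust the motion for the number of option lines):

Ctrl-V       enter visual block mode on the ExecStart line
4j           extend the block over the option lines
$            extend the selection to the end of each line
Shift-A      append at the end of every selected line
\            type the backslash
Esc          apply the append to every line in the block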

Getting the bcc tools working on Fedora 27

This article was posted by Matty on 2018-02-25 09:37:34 -0500 EST

While digging through the Ansible gluster_volume module I wanted to fire up the bcc execsnoop utility to see what commands were being run. When I ran it from the shell I got the following error:

$ execsnoop | grep gluster

Traceback (most recent call last):
  File "./execsnoop", line 21, in <module>
    from bcc.utils import ArgString, printb
ImportError: cannot import name 'ArgString'

I started poking around /usr/lib/python3.6/site-packages/bcc and noticed that the ArgString and printb functions weren’t present in utils.py. The version of python3-bcc that ships with Fedora 27 is a bit dated and doesn’t contain these functions. As a quick fix I downloaded the latest version of utils.py from GitHub and replaced the version on my debug host. That fixed the problem and I was bcc’ing away. I’m planning to file a bug later today (if one doesn’t exist) to get the latest version added.
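
The quick fix looked something like this (the path comes from the iovisor/bcc repository and may change over time):

$ sudo cp /usr/lib/python3.6/site-packages/bcc/utils.py /usr/lib/python3.6/site-packages/bcc/utils.py.orig
$ sudo curl -o /usr/lib/python3.6/site-packages/bcc/utils.py \
    https://raw.githubusercontent.com/iovisor/bcc/master/src/python/bcc/utils.py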

Learning how Ansible modules work with execsnoop

This article was posted by Matty on 2018-02-25 09:02:18 -0500 EST

I’ve been digging into Kubernetes persistent volumes and storage classes and wanted to create a new cluster to store my volume claims. Ansible version 1.9 ships with the gluster_volume module, which makes creating a new cluster crazy easy. The module is written in Python and you can find it in /usr/lib/python2.7/site-packages/ansible/modules/system/gluster_volume.py on Fedora-derived distributions. To create a new cluster you need to create a new file system and then define a gluster_volume task to actually create the cluster:

- name: create gluster volume 
  gluster_volume:
    state: present
    name: "{{ gluster_volume_name }}"
    bricks: "{{ gluster_filesystem }}/{{ brick_name }}"
    rebalance: no
    start_on_create: yes
    replicas: "{{ gluster_replica_count }}"
    transport: tcp
    options:
      auth.allow: "{{ gluster_acl }}"
    cluster: "{{ groups.gluster | join(',') }}"
    host: "{{ inventory_hostname }}"
  run_once: true
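
The file system piece will vary from host to host, but here is a rough sketch using the filesystem and mount modules (the device name is made up, and the /bits mount point matches the brick paths in the execsnoop output below):

- name: create a filesystem for the gluster brick
  filesystem:
    fstype: xfs
    dev: /dev/sdb
- name: mount the brick filesystem
  mount:
    path: /bits
    src: /dev/sdb
    fstype: xfs
    state: mounted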

Jeff Geerling describes how to build a cluster in gory detail and I would definitely recommend checking that out if you want to learn more about the module options listed above (his Ansible for DevOps book is also a must have). Gluster is a complex piece of software and I wanted to learn more about what was happening under the covers. While I could have spent a couple of hours reading the source code to see how the module worked I decided to fast track my learning with execsnoop. Prior to running my playbook I ran the bcc execsnoop utility and grep’ed for the string gluster:

$ execsnoop | grep gluster

This produced 25 lines of output:

chmod            8851   8844     0 /usr/bin/chmod u+x /root/.ansible/tmp/ansible-tmp-1519562230.06-193658562566672//root/.ansible/tmp/ansible-tmp-1519562230.06-193658562566672/gluster_volume.py
bash             8852   6539     0 /bin/bash -c /bin/sh -c '/usr/bin/python /root/.ansible/tmp/ansible-tmp-1519562230.06-193658562566672/gluster_volume.py; rm -rf "/root/.ansib
sh               8852   6539     0 /bin/sh -c /usr/bin/python /root/.ansible/tmp/ansible-tmp-1519562230.06-193658562566672/gluster_volume.py; rm -rf "/root/.ansible/tmp/ansib
python           8865   8852     0 /usr/bin/python /root/.ansible/tmp/ansible-tmp-1519562230.06-193658562566672/gluster_volume.py
python           8866   8865     0 /usr/bin/python /tmp/ansible_2rTonm/ansible_module_gluster_volume.py
gluster          8867   8866     0 /usr/sbin/gluster --mode=script peer status
gluster          8873   8866     0 /usr/sbin/gluster --mode=script volume info
gluster          8879   8866     0 /usr/sbin/gluster --mode=script peer probe gluster01.prefetch.net
gluster          8885   8866     0 /usr/sbin/gluster --mode=script volume create kubernetes-volume1 replica 3 transport tcp gluster01.prefetch.net:/bits/kub1 gluster02.prefetch.net:/bits/kub1 gluster03.prefetch.net:/bits/kub1
S10selinux-labe  8891   6387     0 /var/lib/glusterd/hooks/1/create/post/S10selinux-label-brick.sh --volname=kubernetes-volume1
gluster          8892   8866     0 /usr/sbin/gluster --mode=script volume info
gluster          8898   8866     0 /usr/sbin/gluster --mode=script volume start kubernetes-volume1
glusterfsd       8904   6387     0 /usr/sbin/glusterfsd -s gluster01.prefetch.net --volfile-id kubernetes-volume1.gluster01.prefetch.net.bits-kub1 -p /var/run/gluster/vols/kubernetes-volume1/gluster01.prefetch.net-bits-kub1.pid -S /var/run/gluster/e2eaf5fee7c4e7d6647dc6ef94f8a342.socket --brick-name /bits/kub1 -l /var/log/glusterfs/bricks/bits-kub1.log --xlator-option *-posix.glusterd-uuid=6aa04dc6-b578-46a5-9b9e-4394c431ce42 --brick-port 49152 --xlator-option kubernetes-volume1-server.listen-port=49152
glusterfs        8925   8924     0 /usr/sbin/glusterfs -s localhost --volfile-id gluster/glustershd -p /var/run/gluster/glustershd/glustershd.pid -l /var/log/glusterfs/glustershd.log -S /var/run/gluster/6eb5edea939911f91690c1181fb07f5d.socket --xlator-option *replicate*.node-uuid=6aa04dc6-b578-46a5-9b9e-4394c431ce42
S29CTDBsetup.sh  8935   6387     0 /var/lib/glusterd/hooks/1/start/post/S29CTDBsetup.sh --volname=kubernetes-volume1 --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
getopt           8937   8935     0 /usr/bin/getopt -l volname: -name ctdb --volname=kubernetes-volume1 --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
S30samba-start.  8938   6387     0 /var/lib/glusterd/hooks/1/start/post/S30samba-start.sh --volname=kubernetes-volume1 --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
getopt           8939   8938     0 /usr/bin/getopt -l volname:,gd-workdir: -name Ssamba-start --volname=kubernetes-volume1 --first=yes --version=1 --volume-op=start --gd-workdir=/var/lib/glusterd
grep             8945   8944     0 /usr/bin/grep user.smb /var/lib/glusterd/vols/kubernetes-volume1/info
gluster          8947   8866     0 /usr/sbin/gluster --mode=script volume set kubernetes-volume1 auth.allow 192.168.2.*
S30samba-set.sh  8956   6387     0 /var/lib/glusterd/hooks/1/set/post/S30samba-set.sh --volname=kubernetes-volume1 -o auth.allow=192.168.2.* --gd-workdir=/var/lib/glusterd
getopt           8959   8956     0 /usr/bin/getopt -l volname:,gd-workdir: --name Ssamba-set -o o: -- --volname=kubernetes-volume1 -o auth.allow=192.168.2.* --gd-workdir=/var/lib/glusterd
grep             8965   8964     0 /usr/bin/grep status /var/lib/glusterd/vols/kubernetes-volume1/info
S32gluster_enab  8967   6387     0 /var/lib/glusterd/hooks/1/set/post/S32gluster_enable_shared_storage.sh --volname=kubernetes-volume1 -o auth.allow=192.168.2.* --gd-workdir=/var/lib/glusterd
gluster          8974   8866     0 /usr/sbin/gluster --mode=script volume info

The first five lines are used to set up the AnsiballZ module and the remaining lines contain the commands Ansible runs to create the cluster on one of my gluster nodes. While I love reading code, sometimes you just don’t have the time, and this was one of those times.

Making sense of Linux namespaces

This article was posted by Matty on 2018-02-22 20:00:00 -0500 EST

Containers are pretty amazing. What’s more amazing is the technology in the Linux kernel that makes them a reality. As Jessie Frazelle mentioned in her containers aka crazy user space fun keynote, containers aren’t actual things. They are a combination of Linux kernel technologies: specifically, cgroups, namespaces and LSMs. To get a better understanding of how namespaces work I sat down last weekend and read a bunch of code and dug into some containers running on one of my Kubernetes workers. I will share what I found in this blog post.

So what exactly is a Linux namespace? The manual page for namespaces(7) gives an excellent overview:

A namespace wraps a global system resource in an abstraction that makes it appear to the processes within the namespace that they have their own isolated instance of the global resource. Changes to the global resource are visible to other processes that are members of the namespace, but are invisible to other processes. One use of namespaces is to implement containers.

There are currently seven types of namespaces in the Linux kernel: cgroup, ipc, mnt, net, pid, user and uts.

To see the namespaces on a server you can run the lsns utility:

$ lsns

        NS TYPE   NPROCS   PID USER   COMMAND
4026531835 cgroup    122     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531836 pid       114     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531837 user      122     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531838 uts       115     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531839 ipc       114     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026531840 mnt       110     1 root   /usr/lib/systemd/systemd --switched-root --system --deserialize 24
4026532803 mnt         1  1491 root   /pause
4026532804 uts         1  1491 root   /pause
4026532805 ipc         2  1491 root   /pause
4026532806 pid         1  1491 root   /pause
4026532810 net         2  1491 root   /pause
4026532884 mnt         1  3075 root   /usr/local/bin/kube-router
4026532885 pid         1  3075 root   /usr/local/bin/kube-router
4026532889 mnt         1  1768 nobody /app1
4026532890 uts         1  1768 nobody /app2
4026532891 pid         1  1768 nobody /app3
.......

Processes can be assigned to a namespace in two ways. The first is via one or more CLONE_NEWXXXX flags passed to the clone system call. These are all documented in namespaces(7). The second way is through the setns(2) system call. Clone allows new namespaces to be created and assigned to a process while setns adds a process to an existing namespace (this is how sidecars work). We can observe how setns works with the nsenter and bcc trace utilities:

$ nsenter -t 29221 --mount --uts --ipc --net --pid sh

$ trace 'sys_setns "setns was called: FD: %d NSTYPE: 0x%x", arg1, arg2'

PID     TID     COMM            FUNC             -
30465   30465   nsenter         sys_setns        setns was called: FD: 3 NSTYPE: 0x8000000
30465   30465   nsenter         sys_setns        setns was called: FD: 4 NSTYPE: 0x4000000
30465   30465   nsenter         sys_setns        setns was called: FD: 5 NSTYPE: 0x40000000
30465   30465   nsenter         sys_setns        setns was called: FD: 6 NSTYPE: 0x20000000
30465   30465   nsenter         sys_setns        setns was called: FD: 7 NSTYPE: 0x20000

In the example above I’m using the nsenter utility to enter the mount, uts, ipc, net and pid namespaces for process id 29221. Once the setns() calls complete, execve() will start the program passed as an argument inside those namespaces. NSTYPE is displayed as a hexadecimal number but the actual namespace type is passed to setns as an int. The Linux kernel maintains a list of namespace types in /usr/include/linux/sched.h which we can view with grep:

$ grep CLONE_NEW /usr/include/linux/sched.h

#define CLONE_NEWNS	0x00020000	/* New mount namespace group */
#define CLONE_NEWCGROUP		0x02000000	/* New cgroup namespace */
#define CLONE_NEWUTS		0x04000000	/* New utsname namespace */
#define CLONE_NEWIPC		0x08000000	/* New ipc namespace */
#define CLONE_NEWUSER		0x10000000	/* New user namespace */
#define CLONE_NEWPID		0x20000000	/* New pid namespace */
#define CLONE_NEWNET		0x40000000	/* New network namespace */

The third field represents the bit pattern for the CLONE flag and can be matched up to the NSTYPE values listed above (0x8000000 in the trace output, for example, lines up with CLONE_NEWIPC). Each process on a system has its process id registered in /proc. Inside that pid directory is a ns directory which lists all of the namespaces (and their ids) associated with the process. Let’s take a look at the ns entries for a docker shim process running on a Kubernetes worker:

$ cd /proc/3878/ns

$ ls -la

total 0
dr-x--x--x. 2 root root 0 Feb 15 20:54 .
dr-xr-xr-x. 9 root root 0 Feb 15 20:54 ..
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 cgroup -> cgroup:[4026531835]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 ipc -> ipc:[4026531839]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 mnt -> mnt:[4026531840]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 net -> net:[4026531993]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 pid_for_children -> pid:[4026531836]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 user -> user:[4026531837]
lrwxrwxrwx. 1 root root 0 Feb 15 23:56 uts -> uts:[4026531838]

In the output above we can see that there is one entry per namespace type. Now you may be asking yourself how does all of this relate to Kubernetes? Well I’m glad you are a kickass reader and posed that question! When the kubelet is asked to create a new container it will actually spin up two. One container will be used to execute your application and the second container will be used to run a pause container. If you want to learn all about the differences between the two types of containers you can check out the amazing pod and pause container posts from Ian Lewis.

Now what makes this interesting is the way the namespaces are shared between containers in a pod. To better understand the relationships between the containers in a pod I fired up one with kubectl:

$ kubectl run -it --image=centos centos bash

Next I located my pods with the docker ps command:

$ docker ps -a | grep centos

53856330119b        docker.io/centos@sha256:2671f7a3eea36ce43609e9fe7435ade83094291055f1c96d9d1d1d7c0b986a5d                            "bash"                   5 minutes ago       Up 5 minutes                                    k8s_centos_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0
aac7ecdfe5e4        gcr.io/google_containers/pause-amd64:3.0                                                                            "/pause"                 5 minutes ago       Up 5 minutes                                    k8s_POD_centos-6b448dc49b-h89dp_default_d594cab9-12ba-11e8-97d5-525400b9b561_0

This turned up two containers. One for the pause container and another for the application I am trying to run. To further understand the relationship between both containers I grabbed their PIDs:

$ docker inspect $(docker ps -a | awk '/centos/ { print $1}') -f '{{ .State.Pid }}'

2663
2589

I then spent a few minutes developing a Python script to list namespace ids and the process ids that are associated with them. Running the script and grep’ing for the PIDs above produced the following:

$ sharedns.py | egrep '2663|2589|Name'

Namespace ID          Namespace   Pids Sharing Namespace
4026532249            mnt         2589                
4026531835            cgroup      1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026531837            user        1,2,4,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,32,49,81,82,83,84,85,86,87,88,89,96,275,292,293,350,365,373,413,432,435,440,466,538,541,642,780,781,784,785,791,909,913,916,918,923,924,946,947,1004,1094,1095,1129,1132,1183,1201,1937,1952,2576,2589,2649,2663,2769,5195,5196
4026532254            net         2589,2663           
4026532314            mnt         2663                
4026532315            uts         2663                
4026532316            pid         2663                
4026532251            ipc         2589,2663           
4026532250            uts         2589                
4026532252            pid         2589   
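
For reference, here is a rough sketch of how a script like this can be put together (it walks /proc and groups PIDs by namespace id; it isn't the exact sharedns.py I ran above):

#!/usr/bin/env python
# Rough sketch of a namespace-to-PID mapper. Walks /proc, reads each
# process's ns symlinks and groups PIDs by namespace id. Needs to run
# as root to read the ns entries of other users' processes.
import os
from collections import defaultdict

namespaces = defaultdict(list)  # (ns_id, ns_type) -> [pids]

for pid in filter(str.isdigit, os.listdir("/proc")):
    ns_dir = os.path.join("/proc", pid, "ns")
    try:
        for ns_type in os.listdir(ns_dir):
            # Symlink targets look like "mnt:[4026531840]"
            target = os.readlink(os.path.join(ns_dir, ns_type))
            ns_id = target.split("[")[1].rstrip("]")
            namespaces[(ns_id, ns_type)].append(pid)
    except OSError:
        continue  # process exited or we lack permission

print("%-21s %-11s %s" % ("Namespace ID", "Namespace", "Pids Sharing Namespace"))
for (ns_id, ns_type), pids in sorted(namespaces.items()):
    print("%-21s %-11s %s" % (ns_id, ns_type, ",".join(pids)))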

This helped me visualize the namespace associations and walk away with a few notes: the pause container (PID 2589) owns the pod’s net and ipc namespaces, which my application container (PID 2663) joins; each container gets its own mnt, uts and pid namespaces; and both containers still share the host’s user and cgroup namespaces.

If you are using docker and your container needs access to a namespace outside of the default namespaces it was assigned (net & ipc), you can use the following options to add it to additional namespaces:

$ docker run --help | grep namespace

      --ipc string                            IPC namespace to use
      --pid string                            PID namespace to use
      --userns string                         User namespace to use
      --uts string                            UTS namespace to use
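
For example, to start a throwaway debugging container that joins the pause container’s PID and IPC namespaces (the container id is the one from the docker ps output above):

$ docker run -it --pid=container:aac7ecdfe5e4 --ipc=container:aac7ecdfe5e4 centos bash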

This was an amazing journey and I feel like I have a solid grasp of how namespaces work now. If you want to learn more about the actual implementation inside the Linux kernel you can check out LWN’s two-part series (part one and part two) on this super cool topic! Hit me up on Twitter if you have any feedback!